I find this incredibly fascinating. After almost 20 years of carefully planning and executing website releases, Netflix’s process both makes sense and scares the hell out of me.
Netflix, the popular movie streaming site, deploys a hundred times per day, without the use of Chef or Puppet, without a quality assurance department and without release engineers. To do this, Netflix built an advanced in-house PaaS (Platform as a Service) that allows each team to deploy their own part of the infrastructure whenever they want, however many times they require.
Additionally, they purposely introduce failures into their infrastructure to test resiliancy:
Failure happens continuously in the Netflix infrastructure. Software needs to be able to deal with failing hardware, failing network connectivity and many other types of failure. Even if failure doesn’t occur naturally, it is induced forcefully using The Simian Army. The Simian Army consists of a number of (software) “monkeys” that randomly introduce failure. For instance, the Chaos Monkey randomly brings servers down and the Latency Monkey randomly introduces latency in the network. Ensuring that failure happens constantly makes it impossible for the team to ignore the problem and creates a culture that has failure resilience as a top priority.