Netflix. Does it ring any bells? It is 2018 and who doesn't know about Netflix. Netflix has made traditional TV viewing redundant as people watch shows, movies, live stream of events, sports and reality shows; all on the go. Who would have thought that one could stream all content on any device—be it a television set, mobile, laptop, desktop, tablet, or even a gaming console? But, have you ever stopped to think about what goes on on the back end? What kind of infrastructure, technology stack, analytics, data lakes, etc., are running in the background to help you stream content seamlessly? And how do they ensure that the service is always available? Let’s take a quick look at the virtual world of Netflix and how they handle failure.
Netflix moved from its traditional controlled, physical infrastructure to the cloud provided by Amazon Web Services (AWS) in 2010. By its very nature, the cloud is dynamic, vulnerable, and unpredictable as it depends on microservices and serverless methodologies. It leads to dependency on services outside one’s control and with that more complexity with many variables. It also makes failures and outages inevitable. For Netflix, it became critical to ensure that loss of an Amazon service didn't affect the streaming experience. So they created Chaos Monkey. Its core job was to disable production services randomly to ensure continuous transmission during common failures.
Chaos Monkey, and subsequently the Simian Army—the entire Netflix tool suite to handle failure was based on the then-emerging concept of "Chaos Engineering" which was designed to dramatically increase the quality and reliability of delivered services. With the application of DevOps principles that focus on quality attributes to meet business needs and leverage automated processes to achieve consistency and efficiency, Netflix engineers set about the journey of automating failures.
It seems contradictory to induce failure in order to create reliable and sustainable systems, but that’s what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems. Chaos engineering takes the complexity of the system to be tested as a given and tests it holistically by simulating extreme, turbulent, or novel conditions and observing how the system responds and performs.
Chaos engineering can be used in many situations, such as handling large traffic spikes, byzantine failures, race conditions, disk server crashes and other unpredictable circumstances, that could potentially lead to service outages. The goal of chaos engineering is to generate new information about how systems (as a whole) react when individual components fail and then use that information to redesign the system to be more resilient. Simulating failure allows IT teams to verify that cloud systems are behaving as expected. Containers, microservices, and distributed systems are becoming a staple for cloud computing. While all these are incredibly useful, they are just as vulnerable to downtime.
Investing in tools for chaos engineering can help set-up different scenarios and run simulations of different scenarios to beat unpredictability. In other words, it is designing a “chaosification” of the platform in a controlled manner, which could consist of actions like disabling firewall rules, modifying files at random, crashing systems intentionally, injecting malicious traffic into the Virtual Private Cloud (VPC), randomly killing processes while they are taking place, etc. Basically, it can be anything and everything that the chaos engineer can ever imagine to cause a glitch. Chaos engineers can set-up a myriad of scenarios with full freedom, run their simulations and revert them if a system shows degrading signs faster than expected. The intent is to offer exact control over every step of the simulation.
Therefore, chaos engineering involves running thoughtful, planned experiments to teach us how our systems will behave in the face of failure. These experiments follow three steps:
These experiments have the added benefit of building muscle memory in resolving outages, akin to a fire drill. Breaking things on purpose makes it possible to identify unknown issues that could impact systems and customers. Implementing a strategy based on chaos engineering helps to work the antifragility of a platform, including meeting the control objectives and requirements of PCI-DSS compliance in case of audits.
As times change and technology advances and empowers the masses, reliability, and resiliency of products and services has become imperative. In fact, any company could benefit greatly from implementing a chaos engineering tool in its security strategy. Besides technology, chaos engineering is expected to benefit sectors like transport, retail, construction, finance, health, education.
This practice benefits the customer by providing uninterrupted service. Moreover, it can help prevent significant losses in revenue and maintenance costs, create happier and more engaged engineers, improve on-call training for engineering teams, and improve incident management for the entire company. And finally, all this is done to enable the company to build more resilient systems, upskill the team, and retain/acquire customers.