The year 2021 witnessed several outages that significantly affected users across the globe. Remember the network outage that took down Facebook, along with its associated services WhatsApp and Instagram, for six hours? It was the second major outage to hit the social media giant that year; the first, on 19th March, lasted 45 minutes and affected the same services.
As per ITIC’s 12th annual 2021 Hourly Cost of Downtime Report, 44% of the surveyed firms indicated that hourly downtime costs range from $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Additionally, 91% of organizations said a single hour of downtime that takes mission-critical server hardware and applications offline averages over $300,000 in costs due to lost business, productivity disruptions, and remediation efforts. Meanwhile, only 1% of organizations, mainly very small businesses with 50 or fewer employees, estimate that hourly downtime costs less than $100,000.
Looking at these outages and their impact, we can deduce that disasters are unexpected and extraordinary, which is why you need a well-defined and documented disaster recovery plan that helps recover lost data and accelerates a company’s return to normal business operations.
What is a Disaster Recovery plan? Why do you need it?
Disaster recovery (DR) is the practice of recovering an organization’s IT operations after a disruptive event or disaster. A DR plan comprises the policies and procedures for responding to a major outage and bringing infrastructure and applications back online as soon as possible.
A DR plan and procedure includes the following steps:
- Recognizing different ways that systems can fail unexpectedly by identifying potential threats to IT operations
- Having a well-planned process to respond to and resolve outages
- Mitigating the outage’s risk and impact on the operations
In essence, disaster recovery focuses on restoring IT operations and forms part of a business continuity plan (BCP), which focuses on restoring an organization’s essential functions as determined by a business impact analysis (BIA).
So, is your system prepared to bounce back when the unexpected happens?
If not, read on to find out how.
But first, when is the right time to initiate a DR plan?
It’s crucial to be ready and able to respond in an emergency. Consider a disaster recovery plan like an insurance policy: you need to have one, but you hope you never have to use it. Here too, you need a well-defined ‘plan activation.’
The decision to invoke DR is largely based on three categories of assessment:
- Technical
- Business
- Operations
The Recovery Time Objective (RTO) clock starts ticking the moment an incident is reported. From that point, the maximum time available for the decision is:
D = RTO – MTTR (Mean Time to Recover)
Assuming a decision is made based on these assessments, DR must be invoked before D elapses.
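As a quick worked illustration (the RTO and MTTR values below are hypothetical, not taken from any report or real system), the decision window D can be computed directly:

```python
from datetime import timedelta

# Hypothetical values for illustration only:
# the business has agreed to a 4-hour RTO, and past drills suggest
# the DR procedure itself takes about 3 hours (MTTR).
rto = timedelta(hours=4)
mttr = timedelta(hours=3)

# Maximum time available to decide on invoking DR
# after the incident is reported: D = RTO - MTTR
decision_window = rto - mttr

print(f"DR must be invoked within {decision_window} of the incident being reported.")
```

In this example, the team has one hour from the moment the incident is reported to decide whether to invoke DR.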
In a crisis, it’s tempting to wait a little longer to avoid invoking DR, often due to a lack of confidence in the outcome of the DR process. Regular DR mock drills prepare teams to make that decision confidently when the time comes.
How can chaos engineering help in DR planning?
Chaos engineering is about finding weaknesses in a system through controlled experiments to improve the system’s reliability. It helps identify and fix failure modes before they can cause any real damage to the system. This is done by running chaos experiments to inject harm into a system, application, or service. For instance, adding latency to network calls, shutting down a server, etc.
Just like Site Reliability Engineers (SREs) have runbooks, disaster recovery plans (DRPs) also contain instructions and steps to restore service after an incident. These runbooks are validated by performing fire drills: intentionally created outages. GameDays (days dedicated to chaos experiments) are also conducted to test the resilience of the systems.
We can use chaos engineering to recreate or simulate a black swan event for disaster recovery. This gives us the opportunity to test the DRP and response procedures in a controlled scenario instead of recreating disaster-like conditions manually or waiting for a real disaster.
In one of my previous blogs on resilience, I discussed the top 5 pitfalls of chaos engineering and how to avoid them.
How do we inject faults to simulate a DR scenario?
You can start by defining specific DR scenarios tied to a set of business functions. Based on the RTOs and Recovery Point Objectives (RPOs) of the associated services and their dependencies, teams must then identify the what, where, and when of fault injection: what types of faults to inject, where exactly to inject them, and when. The sequence of injections, the severity of the faults, and similar parameters are additional levers for staging the disaster scenario being tested.
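As a sketch of how such a scenario could be captured, assuming a hypothetical schema and example values (none of this is a prescribed format):

```python
from dataclasses import dataclass, field

@dataclass
class FaultInjection:
    fault_type: str        # what: e.g. "network-blackhole", "instance-termination"
    target: str            # where: the service, host, or zone to inject into
    start_offset_min: int  # when: minutes after the scenario begins
    severity: str          # additional lever: e.g. "100% packet loss"

@dataclass
class DRScenario:
    name: str
    business_function: str
    rto_minutes: int
    rpo_minutes: int
    injections: list = field(default_factory=list)

# Hypothetical example: simulate loss of the primary database region
# for the "order processing" business function.
scenario = DRScenario(
    name="primary-db-region-loss",
    business_function="order processing",
    rto_minutes=60,
    rpo_minutes=5,
    injections=[
        FaultInjection("network-blackhole", "db-primary.eu-west-1", 0, "100% packet loss"),
        FaultInjection("instance-termination", "app-servers.eu-west-1", 10, "terminate all"),
    ],
)
```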
There are several types of faults and ways to inject them, as listed in the table below (a short scripted example follows the table):
| Category | Type | Tools/Techniques |
|---|---|---|
| Resource exhaustion | CPU, memory, and IO | Stress-ng can help you inject failure by stressing the various physical subsystems of a computer and the operating system kernel interfaces using stressors. Readily available stressors include CPU, CPU-cache, device, IO, interrupt, filesystem, memory, network, OS, pipe, scheduler, and VM. |
| Resource exhaustion | Disk space | The standard command-line utility dd can read and/or write from special device files such as /dev/random. You can use this behavior to obtain a fixed amount of random data, and it can also simulate a disk filling up. |
| Resource exhaustion | Application APIs | Load testing is a great technique to test an API before it reaches production. You can also use it for DR-related stress testing with tools such as Wrk, JMeter, and Gatling. |
| Network and dependency-level failure | Latency and packet loss | You can delay or drop UDP or TCP packets (or limit the bandwidth usage of a particular service to simulate internet traffic conditions) using low-level command-line tools like tc (traffic control) or comparable capabilities in GCP’s Traffic Director. |
| Network and dependency-level failure | Network corruption | You can simulate the chaotic network and system conditions of real-life systems by configuring toxics between different architecture components. Tools like Toxiproxy can be leveraged for this. |
| Application, process, and service-level failure | Application process kill | You can kill the application process, or the VM/container running it, using native platform techniques. |
| Application, process, and service-level failure | Database failure | Injecting all possible types of database failure is difficult unless the database platform natively supports fault injection, as Amazon Aurora does; otherwise, you need custom techniques for specific scenarios. Failure types include instance failure, replica failure, disk failure, and cluster failure, to name a few. |
| Application, process, and service-level failure | FaaS (serverless) failure | Serverless functions use a default safety throttle for concurrent executions across all instances of a function. You can use this throttle to inject failure by setting the lowest possible concurrency limit of 1 or 0 (depending on the cloud-native implementation). |
| Infrastructure-level failure | Instance, availability zone, or data center failure | The simplest of all is a random or planned termination of an instance or a group of instances (simulating the loss of an entire availability zone or data center). You can use one of several available tools, from cloud-native mechanisms to purpose-built ones such as Chaos Monkey. |
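As a minimal sketch of how a couple of the lower-level techniques from the table could be scripted, assuming stress-ng and tc are installed on the target host and that eth0 is the interface of interest (both are assumptions for illustration):

```python
import subprocess

def stress_cpu(workers: int = 2, duration_s: int = 60) -> None:
    """Exhaust CPU using stress-ng (must be installed on the target host)."""
    subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--timeout", f"{duration_s}s"],
        check=True,
    )

def add_network_latency(interface: str = "eth0", delay_ms: int = 200) -> None:
    """Add latency to all outbound traffic on an interface using tc/netem.
    Requires root privileges; remember to remove the qdisc afterwards."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True,
    )

def remove_network_latency(interface: str = "eth0") -> None:
    """Roll back the latency injection."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", interface, "root", "netem"],
        check=True,
    )
```

Wrapping the raw commands like this makes it easier to pair every injection with an explicit rollback, which matters when a drill runs against shared environments.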
Depending on the systems and platforms in use, working directly with these low-level tools may not always be feasible or practical. Dedicated fault injection and orchestration tools package most of the techniques discussed above (or alternatives to them) in an easy-to-use manner. These include the following (a sketch of a simple experiment definition follows the list):
- Chaos Toolkit
- Litmus
- Gremlin
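For instance, a Chaos Toolkit experiment is described declaratively. The sketch below assembles a hypothetical experiment in Python (the title, health-check URL, and stressor arguments are all assumptions) and writes it to a JSON file that could then be executed with the chaos run CLI:

```python
import json

# Hypothetical experiment: stress CPU on a host while verifying that the
# service's health endpoint keeps answering with HTTP 200 (the steady state).
experiment = {
    "version": "1.0.0",
    "title": "Service stays healthy under CPU exhaustion",
    "description": "Inject CPU stress and check the health endpoint.",
    "steady-state-hypothesis": {
        "title": "Health endpoint responds",
        "probes": [
            {
                "type": "probe",
                "name": "health-check",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    # Hypothetical URL for illustration
                    "url": "http://my-service.example.com/health",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stress-cpu",
            "provider": {
                "type": "process",
                "path": "stress-ng",
                "arguments": "--cpu 2 --timeout 60s",
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as handle:
    json.dump(experiment, handle, indent=2)

# Run with: chaos run experiment.json
```

The orchestration layer then takes care of verifying the steady-state hypothesis before and after the injection, which is exactly the discipline a DR drill needs.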
Best practices to run a DR plan for highly complex scenarios
How can you run a DR plan confidently while reducing the risk that an experiment harms your systems? And how should you handle scenarios at scale? For instance, if triggering DR requires spinning up 500 servers, how can you orchestrate those operations for maximum efficiency rather than running them sequentially?
Here are the key considerations for optimizing the recovery process:
- Identifying the relative criticality of functions (which typically have different RTOs/RPOs)
- Identifying the interdependencies of functions, including the common functions that everything else relies on
With a clearly defined dependency tree, you can initiate the recovery of independent nodes following two rules (see the sketch after this list):
- All common systems needed by one or more independent functions must be recovered first
- All independent functions can then be initiated for recovery in parallel
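As a minimal sketch of that ordering, assuming a hypothetical dependency tree and a placeholder recover() step (neither is taken from a real runbook), recovery can proceed level by level, with each level executed in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency tree: each function lists the common
# systems/functions it depends on.
dependencies = {
    "identity-service": [],   # common system, needed by everything
    "shared-database": [],    # common system
    "order-processing": ["identity-service", "shared-database"],
    "payments": ["identity-service", "shared-database"],
    "reporting": ["shared-database"],
}

def recover(function_name: str) -> None:
    # Placeholder for the actual runbook step (restore backups,
    # spin up servers, repoint DNS, etc.).
    print(f"recovering {function_name}...")

def recovery_levels(deps):
    """Group functions into levels: a function is recovered only after
    everything it depends on has already been recovered."""
    remaining = dict(deps)
    done = set()
    while remaining:
        ready = [f for f, d in remaining.items() if set(d) <= done]
        if not ready:
            raise ValueError("circular dependency in the recovery tree")
        yield ready
        done.update(ready)
        for f in ready:
            del remaining[f]

for level in recovery_levels(dependencies):
    # Common systems come out in the first level; the independent
    # functions in later levels are recovered in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(recover, level))
```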
In this blog, my fellow experts and I list the best practices of chaos engineering for successful implementation.
How Nagarro can help
In a 24/7 digital world, where disaster recovery is more important than ever, we at Nagarro can help you leverage chaos engineering to be better prepared for any disaster and minimize disruptions.
Connect with us today! For more on our offerings in this space, check out our Business Resilience page.