Authors
Neharika Gianchandani

The year 2021 witnessed several outages that significantly affected users across the globe. Remember the network outage that took down Facebook and its associated services, WhatsApp and Instagram, for six hours? It was the second major outage to hit the social media giant that year; the first, on 19th March, lasted 45 minutes and affected the same services.

As per ITIC’s 12th annual 2021 Hourly Cost of Downtime Report, 44% of the surveyed firms indicated that hourly downtime costs range from over $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Additionally, 91% of organizations said that a single hour of downtime that takes mission-critical server hardware and applications offline costs over $300,000 on average, due to lost business, productivity disruptions, and remediation efforts. Meanwhile, only 1% of organizations (mainly very small businesses with 50 or fewer employees) estimate that hourly downtime costs less than $100,000.

Looking at these outages and their impact, we can deduce that disasters are unexpected and extraordinary, which is why you need a well-defined and documented disaster recovery plan that helps recover lost data and accelerates the company’s return to normal business operations.

 

What is a Disaster Recovery plan? Why do you need it?

Disaster recovery (DR) is the practice of recovering an organization’s IT operations after a disruptive event or disaster. A DR plan comprises the policies and procedures for responding to a major outage and bringing the infrastructure and applications back online as soon as possible.

A DR plan and its procedures include the following steps:

  • Recognizing different ways that systems can fail unexpectedly by identifying potential threats to IT operations
  • Having a well-planned process to respond to and resolve outages
  • Mitigating the outage’s risk and impact on the operations

In essence, disaster recovery focuses on restoring IT operations and is part of a broader business continuity plan (BCP), which focuses on restoring an organization's essential functions as determined by a business impact analysis (BIA).

So, is your system prepared to bounce back when the unexpected happens?

If not, read on to find out how.

But first, when is the right time to initiate a DR plan?

It’s crucial to be ready and able to respond in an emergency. Think of a disaster recovery plan like an insurance policy: you need it, but you hope you never have to use it. Here too, you need a well-defined ‘plan activation.’

The decision to invoke DR is largely based on three categories of assessment:

  • Technical
  • Business
  • Operations
Disaster recovery is based on three categories of assessment

From the moment an incident is reported, the Recovery Time Objective (RTO) clock starts ticking. From then on, the maximum time available to decide is:

D = RTO – MTTR (Mean Time to Recover)

Assuming a decision is taken on this basis, DR must be invoked before D is reached.
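
As a purely hypothetical illustration: if a critical service has an RTO of 4 hours and your mock drills show an MTTR of about 3 hours, then D = 4 – 3 = 1 hour, so the call to invoke DR must be made within an hour of the incident being reported; waiting any longer makes the RTO unachievable.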

During a crisis, it’s tempting to wait a little longer to avoid invoking DR, often because of a lack of confidence in the outcome of the DR process. Regular DR mock drills prepare teams to make that decision confidently when the time comes.

How can chaos engineering help in DR planning?

Chaos engineering is about finding weaknesses in a system through controlled experiments to improve the system’s reliability. It helps identify and fix failure modes before they can cause any real damage to the system. This is done by running chaos experiments to inject harm into a system, application, or service. For instance, adding latency to network calls, shutting down a server, etc.
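
Here is a minimal sketch of what such an experiment can look like in practice, assuming a hypothetical health endpoint, a Linux host with the tc utility, and root privileges (the URL, interface name, and latency value are illustrative assumptions, not any specific product's setup): check the steady state, inject the fault, observe, and always roll back.

```python
import subprocess
import urllib.request

SERVICE_URL = "http://my-service.internal:8080/health"  # hypothetical health endpoint
INTERFACE = "eth0"                                       # network interface to disturb


def steady_state_ok(timeout_s: float = 2.0) -> bool:
    """Probe the service and treat an HTTP 200 within the timeout as 'healthy'."""
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False


def inject_latency(delay_ms: int = 300) -> None:
    """Add artificial latency to outgoing traffic using Linux tc/netem (needs root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )


def rollback() -> None:
    """Remove the netem qdisc so the system returns to its normal state."""
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"], check=False)


if __name__ == "__main__":
    assert steady_state_ok(), "System is not healthy; abort the experiment."
    try:
        inject_latency(300)               # the controlled 'harm'
        degraded = not steady_state_ok()  # observe how the system copes
        print("Service degraded under latency:", degraded)
    finally:
        rollback()                        # always restore normal conditions
```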

Just as Site Reliability Engineers (SREs) have runbooks, DR plans also contain instructions and steps to restore service after an incident. Fire drills, which are intentionally created outages, are used to validate these runbooks. GameDays (days dedicated to chaos experiments) are also conducted to test the resilience of the systems.

We can use chaos engineering to recreate or simulate a black swan event for disaster recovery. This gives us the opportunity to test the DRP and response procedures in a controlled scenario instead of recreating disaster-like conditions manually or waiting for a real disaster.

In one of my previous blogs on resilience, I discussed the top 5 pitfalls of chaos engineering and how to avoid them.

How do we inject faults to simulate a DR scenario?

You can start by defining specific DR scenarios related to a set of business functionalities. Based on the RTOs and RPOs (Recovery Point Objectives) of the associated services and their dependencies, teams must also identify the what, where, and when of fault injection: what type of faults to inject, where exactly to inject them, and when. The sequence of injections, the severity of the faults, and similar parameters are additional levers for staging the disaster scenario being tested.
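
To make this concrete, here is one hypothetical way to capture the what, where, and when of a scenario as a simple data structure; the field names, targets, and values below are illustrative assumptions rather than any tool's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultInjection:
    """One fault to be injected as part of a DR scenario."""
    what: str      # type of fault, e.g. "az-failure", "latency", "disk-full"
    where: str     # target, e.g. a service, host group, or availability zone
    when_s: int    # offset in seconds from the start of the scenario
    severity: str  # e.g. "low", "medium", "high"


@dataclass
class DrScenario:
    """A DR scenario tied to a business functionality and its objectives."""
    name: str
    business_function: str
    rto_minutes: int  # Recovery Time Objective
    rpo_minutes: int  # Recovery Point Objective
    faults: List[FaultInjection] = field(default_factory=list)


# Illustrative scenario: lose the primary availability zone, then add latency
# to a downstream dependency five minutes later.
checkout_dr = DrScenario(
    name="checkout-az-loss",
    business_function="online checkout",
    rto_minutes=60,
    rpo_minutes=5,
    faults=[
        FaultInjection(what="az-failure", where="eu-west-1a", when_s=0, severity="high"),
        FaultInjection(what="latency", where="payments-gateway", when_s=300, severity="medium"),
    ],
)
```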

There are several types of faults and ways to inject them, as mentioned below:

Resource exhaustion

  • CPU, memory, and IO: stress-ng can help you inject failure by stressing the various physical subsystems of a computer and the operating system kernel interfaces using stressors. Readily available stressors include CPU, CPU-cache, device, IO, interrupt, filesystem, memory, network, OS, pipe, scheduler, and VM.
  • Disk space: The standard command-line utility dd can read and/or write from special device files such as /dev/random. You can use this behavior to obtain a fixed amount of random data, and it can also be used to simulate a disk filling up.
  • Application APIs: Load testing is a great technique to test an API before it reaches production. You can also apply it to DR-related stress testing with the help of Wrk, JMeter, and Gatling.

Network and dependency-level failures

  • Latency and packet loss: You can delay or drop UDP or TCP packets (or limit the bandwidth of a particular service to simulate internet traffic conditions) using low-level command-line tools like tc (traffic control) or similar tools from the GCP toolkit, such as Traffic Director.
  • Network corruption: You can simulate the chaotic network and system conditions of real-life systems by configuring the available toxics between different architecture components. Tools like Toxiproxy can be leveraged for this.

Application, process, and service-level failures

  • Application process kill: You can kill the application process, or the VM/container running it, using native platform techniques.
  • Database failure: Injecting all possible types of database failure is difficult unless the database platform natively supports fault injection, as Amazon Aurora does. You need to apply custom techniques for specific scenarios. Types of failure include instance failure, replica failure, disk failure, and cluster failure, to name a few.
  • FaaS (serverless) failure: Serverless functions use a default safety throttle for concurrent executions across all instances of a function. You can use this to inject failure by setting the lowest possible concurrency limit of 1 or 0 (depending on the cloud-native implementation).

Infrastructure-level failures

  • Instance, availability zone, or data center failure: The simplest of all is a random or planned termination of an instance or a group of instances (simulating everything going down in an availability zone or data center). You can use one of the several tools available, from cloud-native approaches to more organized ones such as Chaos Monkey.
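
As a small, hypothetical illustration of the resource-exhaustion techniques above (the paths, sizes, and durations are assumptions, dd reads from /dev/zero here for speed, and both commands should only be run on disposable test machines), stress-ng and dd can be driven from a simple wrapper:

```python
import subprocess


def stress_cpu(workers: int = 4, seconds: int = 60) -> None:
    """Spin up CPU stressors for a fixed duration using stress-ng."""
    subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--timeout", f"{seconds}s"],
        check=True,
    )


def fill_disk(path: str = "/tmp/dr_fill.bin", megabytes: int = 1024) -> None:
    """Write a fixed amount of data with dd to simulate a disk filling up."""
    subprocess.run(
        ["dd", "if=/dev/zero", f"of={path}", "bs=1M", f"count={megabytes}"],
        check=True,
    )


def cleanup(path: str = "/tmp/dr_fill.bin") -> None:
    """Remove the filler file so the experiment leaves no residue behind."""
    subprocess.run(["rm", "-f", path], check=False)


if __name__ == "__main__":
    stress_cpu(workers=2, seconds=30)  # controlled CPU pressure
    fill_disk(megabytes=512)           # controlled disk pressure
    cleanup()
```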
 

Depending on the systems and platforms in use, working directly with one or the other of these low-level tools is not always feasible or practical. Dedicated fault-injection and orchestration tools package most of the techniques discussed above, or alternatives to them, in an easy-to-use manner. These include:

  • Chaos Toolkit
  • Litmus
  • Gremlin
Irrespective of the tools you choose, it is crucial to define your scenarios clearly, including the close-to-real values that need to be achieved for each type of fault you inject.

Best practices to run a DR plan for highly complex scenarios

How can you run a DR plan confidently while reducing the risk that an experiment harms your systems? And how should you handle scenarios at scale? For instance, if triggering DR requires spinning up 500 servers, how do you orchestrate the operations for maximum efficiency rather than running them sequentially?

Here are the key considerations for optimizing the recovery process:

  • Identification of relative criticality of functions (typically having different RTOs/RPOs)
  • Identification of inter-dependencies of functions, including those common functions which are omnipresent

With a clearly defined dependency tree, you can initiate the recovery of independent nodes based on the following two rules (see the sketch after this list):

  • All common systems needed by one or more independent functions must be recovered first
  • All independent functions can then be initiated for recovery in parallel
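
A minimal sketch of that ordering, assuming hypothetical component names and stand-in recovery functions (real DR tooling would drive actual provisioning, failover, and data-restore steps):

```python
from concurrent.futures import ThreadPoolExecutor
import time


def recover(component: str) -> str:
    """Stand-in for the real recovery work (provisioning, failover, restores)."""
    time.sleep(1)
    return f"{component} recovered"


# Common systems that one or more independent functions depend on.
COMMON_SYSTEMS = ["identity-provider", "shared-database", "message-bus"]

# Independent business functions that can be recovered in parallel
# once the common systems are back.
INDEPENDENT_FUNCTIONS = ["checkout", "search", "reporting"]

if __name__ == "__main__":
    # Step 1: recover the shared dependencies first (shown sequentially for
    # clarity; they could also run in parallel if independent of each other).
    for system in COMMON_SYSTEMS:
        print(recover(system))

    # Step 2: recover the independent functions in parallel.
    with ThreadPoolExecutor(max_workers=len(INDEPENDENT_FUNCTIONS)) as pool:
        for result in pool.map(recover, INDEPENDENT_FUNCTIONS):
            print(result)
```

Real orchestration would also need to handle per-node failures, retries, and verification, but the ordering principle stays the same.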

In this blog, my fellow experts and I list the best practices for implementing chaos engineering successfully.

How Nagarro can help

In a 24/7 digital world where disaster recovery is more important than ever, we at Nagarro can help you leverage chaos engineering to be better prepared for any disaster and minimize disruptions.


Connect with us today! For more on our offerings in this space, check out our Business Resilience page.