In today's hyperconnected digital landscape, the complexities of our interdependencies have never been more evident. The recent global IT outage caused by a faulty CrowdStrike software update is a stark reminder of how intricately linked our systems are and how vulnerable we can be to unforeseen disruptions. Because CrowdStrike's software is so widely deployed, the outage affected businesses worldwide and exposed weaknesses across industries that rely on seamless digital operations.
Aviation: As CNN reported, major airlines like Delta, United, and British Airways faced disruptions that grounded flights and caused delays. Airlines cumulatively reported over 46,000 delays and 5,171 cancellations. The U.S. carrier Delta Air Lines suffered the most from the outage, with over 700 cancellations and 400 flight delays (per a FlightAware report). With most of Delta’s applications linked to the affected Microsoft Windows operating system, the airline had to cancel more than 4,800 flights over the next three days and experienced numerous delays.
Healthcare: Hospitals experienced data access issues that delayed surgeries and appointments, notably affecting the NHS. The cumulative direct financial loss in the healthcare sector was $1.94 billion.
Financial Services: The global tech failure disrupted banks and card payment networks, preventing many customers from using online banking. Stock trading was interrupted on markets like the London Stock Exchange. According to Guidewire Cyence, the financial losses resulting from this outage are estimated to be between $1 billion and $3 billion USD, with a most likely estimate of $1.7 billion USD [source].
First responders: Due to coordination challenges, emergency services, including 911 dispatch centers in several states, had to rely on manual processes.
In light of such incidents, enterprises must rethink their approach to resilience, focusing on building robust systems that can quickly adapt and recover in the face of adversity. As we delve into the lessons from this outage, the emphasis on resilience in the digital age becomes increasingly critical for business continuity and future-proofing operations.
We at Nagarro believe that every aspect of the digital ecosystem has a role in improving system resilience. In this case:
1. CrowdStrike, as the software provider, bears primary responsibility for the outage and must prioritize building systems that are secure and resistant to failures. A more robust development lifecycle is imperative, encompassing rigorous testing, phased rollouts such as canary deployment (which could have surfaced the problem before it became widespread), and comprehensive incident response planning. Crucially, CrowdStrike should have prioritized independent third-party audits and vulnerability assessments to identify potential problems before release. A culture of continuous improvement, including regular security code reviews and post-incident analysis, is vital for preventing future occurrences.
2. Microsoft, as the underlying platform provider, shares significant responsibility. The key here is recognizing third-party dependency and designing systems to isolate issues and ensure the integrity of the overall platform. Although Microsoft did not directly cause the outage, the incident highlighted opportunities for improvement. Enhanced platform stability through redundancy and failover mechanisms could have mitigated the impact. Proactive monitoring and alerting for anomalous software behavior would have enabled earlier detection of potential issues. Investing in advanced threat detection and response capabilities could have helped contain the incident. Additionally, fostering strong partnerships with software providers is essential for a resilient ecosystem.
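To make the canary idea concrete, here is a minimal sketch of a ring-by-ring rollout that halts as soon as one ring looks unhealthy. It is an illustration only, not a description of how CrowdStrike or Microsoft ship updates. The ring names, host names, the `apply_update` callback, and the 1% failure threshold are all hypothetical assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    rings: Sequence[Tuple[str, List[str]]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, tune per system
) -> Tuple[bool, List[str]]:
    """Apply an update ring by ring, smallest ring first.

    Returns (completed, touched_hosts). The rollout stops after the first
    ring whose failure rate exceeds max_failure_rate, so later (larger)
    rings never receive the update.
    """
    touched: List[str] = []
    for _ring_name, hosts in rings:
        failures = 0
        for host in hosts:
            healthy = apply_update(host)  # True if the host is healthy afterwards
            touched.append(host)
            if not healthy:
                failures += 1
        failure_rate = failures / len(hosts) if hosts else 0.0
        if failure_rate > max_failure_rate:
            return False, touched
    return True, touched


# Hypothetical rings: a tiny canary group, an early-adopter group, then everyone.
rings = [
    ("canary", ["c1", "c2"]),
    ("early", ["e1", "e2", "e3"]),
    ("broad", [f"b{i}" for i in range(10)]),
]

# A faulty update that breaks every host it touches.
completed_bad, touched_bad = staged_rollout(rings, lambda host: False)

# A healthy update that succeeds everywhere.
completed_ok, touched_ok = staged_rollout(rings, lambda host: True)
```

With the faulty update, only the two canary hosts are touched before the rollout stops. The other thirteen hosts never receive it, which is the containment that a phased deployment is meant to provide.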
Building a Resilient Enterprise
To keep mission-critical software resilient, enterprises need to adopt a multi-faceted approach. This means taking charge by diversifying vendor relationships, establishing robust business continuity plans, conducting regular audits, and investing in employee training.
Here are a few best practices we recommend organizations prioritize:
- Develop comprehensive business continuity and incident response plans, outlining clear steps for recovery and responsibilities.
- Conduct regular supply chain risk assessments to identify and mitigate potential risks and ensure the reliability of third-party tools.
- Map the dependencies and critical relationships between systems to identify potential vulnerabilities.
- Adopt phased deployment rings to control the pace and minimize risks of updates.
- Implement granular control over updates and utilize safe deployment policies to stage updates.
- Test in controlled environments, prioritizing critical systems for lower update frequencies.
- Establish rapid device recovery with system restore points and utilize cloud-based snapshots for virtual machines.
- Build rollback procedures with contingency plans for reverting to previous states, while exercising caution.
- Regularly train employees to handle disruptions effectively.
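Dependency mapping, one of the practices listed above, can start as something very small. The sketch below assumes a hand-maintained map of which system depends on which, and computes the "blast radius" of a single failing component. The system names are hypothetical, and a real inventory would usually come from a CMDB or service catalog rather than a hard-coded dictionary.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every system that directly or transitively depends on `failed`."""
    # Invert the map: for each component, which systems rely on it?
    dependents: Dict[str, List[str]] = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    # Breadth-first walk outward from the failed component.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in impacted:
                impacted.add(system)
                queue.append(system)

    impacted.discard(failed)  # only relevant if the map contains a cycle
    return impacted


# Hypothetical dependency map: system -> components it depends on.
depends_on = {
    "check_in": ["windows_hosts"],
    "crew_scheduling": ["windows_hosts", "database"],
    "windows_hosts": ["endpoint_agent"],
    "database": [],
    "website": ["cdn"],
}

impacted = blast_radius(depends_on, "endpoint_agent")
```

In this made-up map, a failure in `endpoint_agent` reaches `windows_hosts`, `check_in`, and `crew_scheduling`, while `website` and `database` are unaffected. A result like this helps decide which systems need slower update rings or a manual fallback.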
Production environments demand specific attention due to their critical role in business operations. Interferences and dependencies within production systems can have far-reaching consequences, so organizations must prioritize resilience in these environments.
Beyond technology, operational resilience is paramount; this encompasses people, processes, and technology working in harmony. Independent assessments can help organizations identify weaknesses and build a culture of preparedness.
Nagarro specializes in helping organizations build resilient systems.
Nagarro has a proven track record in resilience engineering, helping organizations prevent and recover from system disruptions. Nagarro’s resilience engineering services are designed to fortify your business against such events. We offer resiliency assessments to identify vulnerabilities, business continuity planning to ensure your operations stay functional during crises, and disaster recovery solutions to restore critical systems swiftly. Our incident management frameworks help organizations respond effectively to unexpected challenges, while continuous resilience testing ensures your systems are always prepared for real-world stresses. With these comprehensive services, we help you build stronger, more reliable systems that can adapt and recover from disruptions efficiently.
For example, in one of our projects, we assisted a global financial services company in achieving reliable performance of over 2,000 payment transactions per second through robust resiliency testing. This involved simulating real-world failures, testing under extreme load conditions, and designing systems that could handle unpredictable disruptions without impacting operations.
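As a generic illustration of what simulating a failure can look like in a test, the sketch below injects a fixed number of connection errors into a stand-in payment dependency and checks that the caller retries and then degrades gracefully. It is not the test suite from the project described above. `FlakyDependency`, `authorize_with_retry`, and the `"queued_for_later"` fallback are hypothetical names chosen for the example.

```python
class FlakyDependency:
    """Test double that fails its first `failures` calls, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures
        self.calls = 0

    def authorize(self, amount: float) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "approved"


def authorize_with_retry(dep: FlakyDependency, amount: float, max_attempts: int = 3) -> str:
    """Retry a failing dependency a bounded number of times, then fall back."""
    for _ in range(max_attempts):
        try:
            return dep.authorize(amount)
        except ConnectionError:
            continue  # a production system would usually back off before retrying
    # Degrade gracefully instead of surfacing the outage to the customer.
    return "queued_for_later"


# Short blip: two injected failures, the third attempt succeeds.
blip = FlakyDependency(failures=2)
blip_result = authorize_with_retry(blip, 25.0)

# Sustained outage: more failures than the retry budget allows.
outage = FlakyDependency(failures=5)
outage_result = authorize_with_retry(outage, 25.0)
```

The injected failures are deterministic, so the test gives the same result on every run. Load testing and random fault injection would normally be layered on top of checks like this one.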
In another case, Nagarro helped a major retail organization ensure business continuity by creating a comprehensive resilience engineering solution during a significant outage. This included outage management processes, failover mechanisms, and post-incident reviews, allowing the company to minimize downtime and maintain critical services during unforeseen interruptions.
Learn more about our Resilience Engineering Services and how we’ve helped other organizations build resilient ecosystems in the case studies below. The time for reactive fixes is past. Embrace a proactive approach to resilience, enhancing every part of your operation and supply chain.