How a Single Application Update Took Down 8.5 Million Windows Machines All Over the World
C-00000291*.sys. This tiny file of roughly 40 KB, a channel (content) file used by CrowdStrike's Falcon endpoint security sensor despite its driver-like .sys extension, took down over 8.5 million Windows computers worldwide. This single software component caused a global outage affecting airlines, airports, banks, hotels, hospitals, manufacturing, stock markets, broadcasting, gas stations, retail stores, government services, and Fortune 500 companies. Hundreds of millions of people worldwide missed doctor's appointments, airline flights, and other critical services, and multiple insurance firms, including cyber insurance firms, estimated the resulting financial loss at no less than USD 10 billion. As I discussed in an earlier article I wrote in February, compromising any one of the three parts of the CIA triad, namely Confidentiality, Integrity, and Availability, can negatively affect human lives, health, jobs, businesses, and economies worldwide. Though not the result of or related to a cyberattack, the CrowdStrike outage of Friday, July 19, 2024 broke the availability part of the CIA triad, compromising the availability of critical services for millions of people.
So, how did this massive outage that affected over 8.5 million Windows systems worldwide happen? To start, let's identify the symptom, which in this case was the Blue Screen of Death (BSOD) on Windows machines running CrowdStrike's Falcon endpoint security platform, which is like anti-malware software on steroids. The BSOD is a critical system error in Windows operating systems that crashes the system, leading to data loss and, potentially, hardware damage. A computer generally runs code in two areas: Ring 0, or kernel mode, and Ring 3, or user mode. Kernel mode is where the operating system and kernel-mode drivers run, with direct access to and control of the computer's hardware. User mode, on the other hand, is where applications such as Microsoft Office or your favorite Windows-based game run on top of the operating system. When an application crashes, its blast radius is isolated to Ring 3 and doesn't affect the underlying operating system running in Ring 0. However, when an issue occurs in Ring 0, or kernel mode, the blast radius covers the entire system, bringing down all the applications running on top of it. The Falcon sensor runs in Ring 0, and when it processed the faulty C-00000291*.sys channel file, it performed an out-of-bounds memory read; the resulting kernel-mode fault produced the Blue Screen of Death on Windows machines running the software.
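To make the failure mode concrete, here is a minimal sketch in C, entirely hypothetical and not CrowdStrike's code, of an out-of-bounds read: a parser assumes a content file supplies more fields than it actually does and reads past the end of its buffer. The field counts are invented for illustration. In user mode this is undefined behavior confined to one process; inside a kernel-mode driver, the same kind of invalid access triggers a bugcheck, which is exactly what a BSOD is.

```c
/*
 * Hypothetical, simplified illustration of an out-of-bounds read.
 * This is NOT CrowdStrike's code; it only sketches the general failure
 * pattern: a parser trusts a content file to contain more entries than
 * it actually does and reads past the end of its buffer. In user mode
 * this is undefined behavior confined to one process; in a kernel-mode
 * driver, the same invalid access triggers a bugcheck, i.e., a BSOD.
 */
#include <stdio.h>
#include <stddef.h>

#define FIELDS_EXPECTED 12   /* what the parser assumes the file provides */
#define FIELDS_PROVIDED 10   /* what the (faulty) content file actually holds */

int main(void)
{
    unsigned int fields[FIELDS_PROVIDED] = {0};
    unsigned long long sum = 0;

    /* The loop trusts FIELDS_EXPECTED instead of the real buffer size,
     * so the last two iterations read memory past the end of `fields`. */
    for (size_t i = 0; i < FIELDS_EXPECTED; i++) {
        sum += fields[i];   /* out-of-bounds read once i >= FIELDS_PROVIDED */
    }

    printf("sum of fields: %llu\n", sum);
    return 0;
}
```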
As you can see, any issue occurring in kernel mode can bring down the entire system. It wasn't a Falcon sensor update itself, one of the two kinds of updates CrowdStrike delivers, that caused the issue; it was a sensor configuration update, content meant to give the sensor instructions for detecting malware, that caused the problem. Isn't it fascinating that CrowdStrike's Falcon sensor earned Microsoft's Windows Hardware Quality Labs (WHQL) certification? That certification means the sensor driver cleared Microsoft's rigorous testing process and met its compatibility and reliability standards for the Windows operating system. However, it's crucial to recognize that even with WHQL certification, issues can still arise and create chaos, because the certification covers the sensor driver itself, not the configuration content the driver later downloads and interprets.
The breakdown in CrowdStrike's CI/CD pipeline may have occurred during the continuous integration phase, specifically in the quality control process for the software update: the faulty update was not properly validated before deployment, resulting in the outage. Like many CI/CD pipelines, CrowdStrike's involves several automated steps, including committing code, building and compiling it, running automated tests, and performing code quality checks. The defect in the software update, however, went undetected by these validation checks, highlighting a gap in the automated testing and quality assurance processes within the pipeline.
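To make the gating idea concrete, here is a minimal, hypothetical sketch in C (real pipelines are defined in CI configuration rather than C, and every stage name and pass/fail result below is invented): each automated stage runs in order, and an artifact is promoted only if every stage passes. A validation stage that is too permissive, as in this sketch, is precisely the kind of gap that lets a defective update reach deployment.

```c
/* Hypothetical sketch of CI/CD gating logic: run each automated stage in
 * order and promote the artifact only if every stage passes. Stage names
 * and results are invented for illustration only. */
#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    bool (*run)(void);   /* returns true if the stage passes */
} stage_t;

/* Stand-ins for real pipeline stages. */
static bool build_and_compile(void)   { return true; }
static bool run_automated_tests(void) { return true; }
static bool code_quality_checks(void) { return true; }
static bool validate_content(void)    { return true; }  /* too permissive: a
                                           defective content file still passes */

int main(void)
{
    stage_t pipeline[] = {
        { "build and compile",   build_and_compile },
        { "automated tests",     run_automated_tests },
        { "code quality checks", code_quality_checks },
        { "content validation",  validate_content },
    };

    for (size_t i = 0; i < sizeof(pipeline) / sizeof(pipeline[0]); i++) {
        if (!pipeline[i].run()) {
            printf("stage '%s' failed: artifact not promoted\n", pipeline[i].name);
            return 1;   /* nonzero exit blocks deployment */
        }
        printf("stage '%s' passed\n", pipeline[i].name);
    }

    /* With a weak validation stage, even a defective update reaches here. */
    printf("all stages passed: artifact promoted to deployment\n");
    return 0;
}
```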
While the problem may have been introduced somewhere in the change management process, the specific failure appears to be isolated to a content validator, a software tool that verifies the accuracy, integrity, and security of the content being deployed. The content validator ensures that updates, configuration changes, or anything else pushed through the CI/CD pipeline doesn't contain vulnerabilities, errors, or other harmful elements that could disrupt the system. In the CrowdStrike incident, such a validator would be expected to catch code defects, verify that updates are free of malicious code and vulnerabilities, confirm that updates work smoothly with the existing system and other software components, and check compliance with the regulatory requirements specific to the organization and industry. Given those functions, its failure to catch the error in the update before deployment is what allowed BSODs to spread to Windows machines worldwide, underscoring the critical role of effective content validation in keeping CI/CD pipelines secure and stable. Strong validation mechanisms could have prevented many of the consequences described earlier that were felt by millions worldwide.
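What might those checks look like in practice? Below is a minimal sketch, assuming a made-up content-file layout (a magic number, a version, a record count, then fixed-size records). None of this is CrowdStrike's actual format or tooling; it only shows the kind of structural and bounds checking a content validator performs before an update is allowed to ship.

```c
/* Hypothetical content validator: checks a byte buffer against a made-up
 * content-file layout (magic, version, record count, fixed-size records).
 * The format and constants are invented; the point is the kind of
 * structural and bounds checks a validator runs before release. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define CONTENT_MAGIC 0xC0DEFACEu
#define RECORD_SIZE   16u
#define HEADER_SIZE   12u   /* magic (4) + version (4) + record count (4) */

static bool validate_content(const uint8_t *buf, size_t len)
{
    uint32_t magic, version, records;

    if (len < HEADER_SIZE) {
        fprintf(stderr, "reject: file smaller than header\n");
        return false;
    }
    memcpy(&magic,   buf + 0, 4);
    memcpy(&version, buf + 4, 4);
    memcpy(&records, buf + 8, 4);

    if (magic != CONTENT_MAGIC) {
        fprintf(stderr, "reject: bad magic number\n");
        return false;
    }
    if (version == 0) {
        fprintf(stderr, "reject: unsupported version\n");
        return false;
    }
    /* Bounds check: the declared record count must match the bytes actually
     * present, or a kernel-mode consumer could read past the end of the data. */
    if ((uint64_t)records * RECORD_SIZE != len - HEADER_SIZE) {
        fprintf(stderr, "reject: record count does not match file size\n");
        return false;
    }
    return true;
}

int main(void)
{
    /* A truncated file: the header claims 2 records but only 1 is present. */
    uint8_t bad[HEADER_SIZE + RECORD_SIZE] = {0};
    uint32_t magic = CONTENT_MAGIC, version = 1, records = 2;
    memcpy(bad + 0, &magic,   4);
    memcpy(bad + 4, &version, 4);
    memcpy(bad + 8, &records, 4);

    printf("valid: %s\n", validate_content(bad, sizeof(bad)) ? "yes" : "no");
    return 0;
}
```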
CrowdStrike hasn't revealed the specific content validator tool it uses, but it is evidently a vital part of the company's internal validation and testing process. The failure underscored the importance of strengthening validation mechanisms, improving testing procedures, and enhancing error-handling capabilities within CI/CD pipelines.
In response to the incident, CrowdStrike has committed to enhancing its content update testing, improving error handling, and implementing staggered deployments to prevent future occurrences. These measures, combined with its existing release process, which already includes extensive automated testing and a canary deployment strategy, show a commitment to preventing similar incidents. A canary deployment strategy rolls a new software update out to a small subset of users or systems first, allowing issues to be detected early, before a full deployment. Still, this incident serves as a stark reminder of the potential vulnerabilities in cybersecurity systems and the high stakes for companies that rely on these protections, and it emphasizes the urgent need for more rigorous safeguards, particularly in error handling and deployment strategies.
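As a rough sketch of how a staggered, canary-style rollout can work (the hashing scheme, host names, and percentages below are invented, not CrowdStrike's mechanism), a deployment system can deterministically assign each host to one of 100 buckets and ship the update only to hosts whose bucket falls under the current rollout percentage, widening that percentage as confidence grows.

```c
/* Hypothetical canary rollout: hash each host ID into one of 100 buckets
 * and deploy the update only to hosts whose bucket is below the current
 * rollout percentage. Widening the percentage over time (1% -> 10% ->
 * 50% -> 100%) limits the blast radius of a bad update. All names and
 * numbers here are invented for illustration. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* FNV-1a hash: a simple, stable way to spread host IDs across buckets. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

static int should_deploy(const char *host_id, unsigned rollout_percent)
{
    return (fnv1a(host_id) % 100u) < rollout_percent;
}

int main(void)
{
    const char *hosts[] = { "host-0001", "host-0002", "host-0003", "host-0004" };
    unsigned waves[]    = { 1, 10, 50, 100 };   /* staged rollout percentages */

    for (size_t w = 0; w < sizeof(waves) / sizeof(waves[0]); w++) {
        printf("wave at %u%%:", waves[w]);
        for (size_t i = 0; i < sizeof(hosts) / sizeof(hosts[0]); i++) {
            if (should_deploy(hosts[i], waves[w]))
                printf(" %s", hosts[i]);
        }
        printf("\n");
    }
    return 0;
}
```

Because the bucket assignment is deterministic, each widening of the rollout percentage only adds hosts to the deployed set, so an issue detected in an early wave can halt the rollout before most machines ever receive the update.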