The CrowdStrike Incident: A Shared Responsibility
SenseOn is a direct competitor to CrowdStrike.
On 19th July 2024 (BST), an update to CrowdStrike endpoint software caused worldwide IT outages, disabling over 8 million Windows devices. This caused major disruption to organisations in a range of industries, including aviation and healthcare.
Quality assurance gaps and deployment processes were not the only factors, or even the most significant factors, in the widespread disruption. Another factor played a critical role: Microsoft’s failure to modernise Windows.
What happened?
Based on what is known as of 22nd July 2024, CrowdStrike pushed a “Channel File” update which triggered a bug in one of its kernel drivers. This resulted in an invalid memory access, causing affected systems to crash. Unfortunately, the issue recurred on reboot (likely because the CrowdStrike kernel driver was loaded on startup), sending devices into a “reboot loop”.
The term “Channel File” is specific to CrowdStrike, and appears to refer to files containing information for identifying malicious activity within data from the endpoint. Some vendors refer to their equivalents as “definition files” or “virus definitions”. While not “software” in the traditional sense, these are effectively data processing instructions that get interpreted by the antivirus software at runtime.
These instructions might take the form of known malicious file hashes for comparison with files as users access them; regular expression patterns that are matched in real time against certain system artefacts (such as command-line arguments after a command is executed); or neural network weights, trained on the behaviour of historical malware, that are used to identify likely malicious sequences of system calls.
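As a rough illustration of the first two kinds of instruction, here is a minimal Python sketch. The rule contents and the names `KNOWN_BAD_SHA256`, `SUSPICIOUS_CMDLINE_PATTERNS`, `file_is_known_bad` and `cmdline_is_suspicious` are all hypothetical, invented for this example; they do not reflect any vendor's actual format or detection logic.

```python
import hashlib
import re

# Hypothetical rule data, standing in for what a definition file might carry.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"pretend this is malware").hexdigest(),
}
SUSPICIOUS_CMDLINE_PATTERNS = [
    # Flags PowerShell launched with an encoded command argument.
    re.compile(r"powershell(\.exe)?\s+.*-enc(odedcommand)?\s", re.IGNORECASE),
]


def file_is_known_bad(content: bytes) -> bool:
    """Compare the file's SHA-256 digest against the known-bad set."""
    return hashlib.sha256(content).hexdigest() in KNOWN_BAD_SHA256


def cmdline_is_suspicious(cmdline: str) -> bool:
    """Return True if any rule pattern matches the command line."""
    return any(p.search(cmdline) for p in SUSPICIOUS_CMDLINE_PATTERNS)
```

The point of the sketch is that the rules are data: shipping a new hash or pattern changes what the software does without shipping new executable code.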
Contrary to some reports, CrowdStrike does not appear to have pushed a kernel driver update in the lead-up to the incident. However, the Channel File data that was updated appears to have been used in some way by their kernel driver.[1] It may be that their driver was responsible for parsing and loading Channel File content, and that the content was corrupted in a way that the driver’s input validation missed. This could have resulted in a data structure being populated with a pointer to an invalid memory address, which the driver subsequently attempted to use, resulting in the crash.
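To make the idea of input validation concrete, the following Python sketch parses a made-up binary definition format and rejects any file whose offsets point outside the data. Everything here is an assumption for illustration: the `DEF1` magic value, the header and entry layouts, and the function names are invented, and nothing is known publicly about how CrowdStrike's driver actually parses Channel Files. In Python an out-of-range offset is harmless; in a kernel driver written in C, the equivalent unchecked offset could become a pointer to invalid memory.

```python
import struct

# Hypothetical file layout: a header, then a table of (offset, length) entries,
# then the rule payloads those entries point at.
HEADER = struct.Struct("<4sI")  # magic bytes, number of entries
ENTRY = struct.Struct("<II")    # payload offset, payload length
MAGIC = b"DEF1"                 # invented magic value


class InvalidDefinitionFile(ValueError):
    """Raised when a definition file fails validation."""


def parse_definitions(blob: bytes) -> list[bytes]:
    if len(blob) < HEADER.size:
        raise InvalidDefinitionFile("file is shorter than the header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidDefinitionFile("unexpected magic value")
    table_end = HEADER.size + count * ENTRY.size
    if table_end > len(blob):
        raise InvalidDefinitionFile("entry table runs past end of file")
    rules = []
    for i in range(count):
        offset, length = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        # Reject any entry that points outside the payload area.
        if offset < table_end or offset + length > len(blob):
            raise InvalidDefinitionFile(f"entry {i} points outside the file")
        rules.append(blob[offset:offset + length])
    return rules


def load_or_keep_previous(blob: bytes, previous: list[bytes]) -> list[bytes]:
    """Fall back to the last known-good rules if the new file is invalid."""
    try:
        return parse_definitions(blob)
    except InvalidDefinitionFile:
        return previous
```

Under these assumptions, a file consisting entirely of zero bytes fails the magic check and the previously loaded rules stay in effect, rather than the parser acting on meaningless offsets.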
User vs. Kernel
Software on modern operating systems runs in broadly one of two modes: user mode or kernel mode.
The vast majority of user-facing programs run exclusively in user mode because of the many advantages this affords them. One is that if a user-mode program tries to behave in certain invalid ways (such as accessing a memory location that doesn’t exist), the operating system will force the program to immediately exit (or “crash”) without causing other programs or the operating system to fail. In this way, user mode enforces a degree of isolation between programs.
Kernel mode, on the other hand, lacks many of these protections. If a kernel-mode driver attempts to access an invalid memory location (as CrowdStrike’s did), this will cause the operating system to crash. The trade-off is that in kernel mode, software has the kind of access to system data and the ability to communicate with hardware components that is not possible in user mode. On Windows this access is necessary for certain types of software to do their jobs, such as hardware drivers and - unfortunately - endpoint detection and response (EDR) applications.
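The isolation described above can be observed from user mode with a small Python sketch. It starts a child process that deliberately reads from address 0 using `ctypes.string_at`, then shows that the parent carries on after the child is terminated. The function name `run_faulty_child` is hypothetical, and the exact exit code varies by operating system, so the sketch only assumes it will be non-zero.

```python
import subprocess
import sys

# The child deliberately reads from address 0, an invalid memory access.
CHILD_SOURCE = "import ctypes; ctypes.string_at(0)"


def run_faulty_child() -> int:
    """Run the faulty program in its own process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    code = run_faulty_child()
    # The child has failed, but this process and the OS keep running.
    print(f"child exited abnormally with code {code}; parent is unaffected")
```

The same invalid access made from a kernel-mode driver has no such containment, which is why it brings down the whole system.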
Endpoint security and the Windows kernel
So why does EDR software need kernel drivers on Windows?
Because running in kernel mode on Windows is the only way to reliably get access to data that’s necessary for threat detection, and to reliably disrupt malicious activity once it has been detected. Such access and functionality are intentionally not exposed to user-mode applications. You can think of the kernel as “owning” certain data and privileges that EDR software needs. For example, real-time data about files being accessed gives EDRs the ability to scan files “on-access”, while the interception of file operations gives EDRs the ability to prevent other programs from accessing files determined to be malicious.
The problem is that this forces EDR vendors to take high risks with the availability of their customers’ systems, and forces customers to accept these risks if they want to be protected.
It doesn’t have to be this way.
Less risk with better operating system design
The risks of running software in kernel mode, and alternative approaches to operating system design that greatly reduce these risks, have been understood for a long time.
Microsoft hasn’t completely ignored this. They’ve implemented mitigations in the form of static and dynamic analysis tooling to help kernel driver developers reduce the risk of missing bugs during the development process; a sanctioned hooking framework; a mandatory driver certification programme; and early-stage research efforts to add support for Rust, a memory-safe language, to the Windows kernel. In total, these mitigations moderately reduce the risk of bugs that could cause total system failure.
But it’s just not enough. Humans will continue to make mistakes that will result in bugs. Software test suites will never be guaranteed to find all defects. Edge cases arising out of complex interactions between systems will go unforeseen. Some unknowns will remain unknown. More drastic changes to remove the need for running security software in kernel mode on Windows are called for.
Microsoft doesn’t have to look far afield to see an example of such changes that have been battle-tested in production, at scale, for a mainstream operating system. In macOS 10.15 (Catalina), Apple introduced DriverKit. DriverKit is a technology that allows software traditionally requiring kernel mode access to run in user mode instead. A dedicated framework - Endpoint Security - provides access to the data required by EDR products in user mode, with much lower risk to system stability.
Your role as a customer
It’s very likely that Microsoft could allocate more resources to making the kind of changes that Apple introduced with DriverKit. They’re more likely to do so if it’s clearly a priority for their customers.
If you hold a significant budget that’s spent on Windows licences or cloud services running Windows workloads, speak to your Microsoft account manager. Tell them you’re concerned that security vendors are forced to deploy kernel-mode drivers to get access to the security telemetry needed to protect Windows systems. Explain that you would like to see Microsoft make changes to Windows similar to those Apple made to macOS with DriverKit and the Endpoint Security framework.
Footnotes
1. This confusion perhaps arose from the fact that Channel Files have a .sys extension (conventionally used for driver binaries on Windows) and that a CrowdStrike driver did in fact crash.