On July 19th, 2024, CrowdStrike identified a defect in one of its Falcon sensor content updates for Windows hosts that caused crashes and blue screen errors on some systems. While the immediate fix has been deployed, many organizations are still working to recover affected machines and ensure full remediation. In this blog post, we cover how to identify impacted systems, the automated and manual recovery options, recommended best practices, and lessons learned from this event.
Understanding the Root Cause
CrowdStrike quickly determined the root cause to be a single problematic content update released at 0409 UTC on July 19th and delivered to Windows hosts that were online before it was reverted at 0527 UTC the same day. This update contained a defect in one of the Falcon sensor's channel files, which caused hosts that received it to crash. The impacted file was a channel file named “C-00000291*.sys” with a timestamp of 0409 UTC. Despite the .sys extension, channel files are content read by the sensor rather than kernel mode drivers.
At 0527 UTC on July 19th, CrowdStrike reverted the problematic update and pushed out a corrected version of the file, which carries a timestamp of 0527 UTC or later. Hosts brought online after this point should not be affected. The main indications of impact were blue screen errors referencing the Falcon sensor, or repeated crashes that prevented the machine from staying online long enough to receive the reverted update automatically.
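The timestamp rule above can be expressed as a small check. This is a minimal sketch rather than an official tool: the function name and the returned labels are hypothetical, and treating every timestamp from 0409 up to (but not including) 0527 UTC as impacted is a conservative assumption.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase

# Times quoted in the text; the date is July 19th, 2024.
BAD_FROM = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIXED_FROM = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
PATTERN = "c-00000291*.sys"  # compared in lowercase, so matching ignores case


def classify_channel_file(name: str, modified: datetime) -> str:
    """Return 'not-291', 'older', 'impacted', or 'reverted' (hypothetical labels)."""
    if modified.tzinfo is None:
        raise ValueError("modified must be timezone-aware")
    if not fnmatchcase(name.lower(), PATTERN):
        return "not-291"
    # Compare at minute granularity because the text quotes HHMM only.
    stamp = modified.astimezone(timezone.utc).replace(second=0, microsecond=0)
    if stamp >= FIXED_FROM:
        return "reverted"
    if stamp >= BAD_FROM:
        return "impacted"  # assumption: anything in the window counts as bad
    return "older"
```

For example, a file named C-00000291-example.sys (a made-up name) stamped 04:09 UTC on July 19th, 2024 is classified as impacted, while the same name stamped 05:27 UTC or later is classified as reverted.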
Identifying Possibly Affected Hosts
To understand the scope of impact, organizations need to identify any hosts that may have received the problematic channel file between 0409 and 0527 UTC on July 19th. CrowdStrike provides guidance on running an advanced event search to flag these systems. Some key things to look for include:
- Event ID 14 from the Microsoft-Windows-DriverFrameworks-UserMode source
- Channel file version matching “C-00000291*.sys” and timestamp between 0409-0527 UTC
- Processes or threads crashing with a bugcheck code of 0x0000007E
Automation can help organizations run this search efficiently across all managed Windows endpoints. The results will show which hosts likely need direct remediation.
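Once search results have been exported, the triage itself is simple to script. The sketch below assumes a hypothetical record format (a host name, a file name, and an ISO 8601 modification time that includes a UTC offset); real exports will look different. It also assumes that a host holding any copy of the file stamped 0527 UTC or later has already picked up the reverted content.

```python
from collections import defaultdict
from datetime import datetime, timezone
from fnmatch import fnmatchcase

WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
PATTERN = "c-00000291*.sys"  # compared in lowercase


def hosts_needing_remediation(records):
    """records: iterable of dicts with 'host', 'file', and 'modified' keys.

    'modified' must be ISO 8601 with an explicit offset, e.g.
    '2024-07-19T04:09:00+00:00'. The record layout is an assumption.
    """
    has_bad = defaultdict(bool)
    has_fixed = defaultdict(bool)
    for rec in records:
        if not fnmatchcase(rec["file"].lower(), PATTERN):
            continue  # some other channel file; not relevant here
        stamp = datetime.fromisoformat(rec["modified"]).astimezone(timezone.utc)
        if stamp >= WINDOW_END:
            has_fixed[rec["host"]] = True
        elif stamp >= WINDOW_START:
            has_bad[rec["host"]] = True
    # Flag hosts that hold the bad file and no reverted copy.
    return sorted(h for h, bad in has_bad.items() if bad and not has_fixed[h])
```

The returned list is sorted so that repeated runs over the same export produce a stable worklist.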
Automated Recovery Methods
For environments that have automation capabilities in place, there are options to automatically remediate some affected hosts without requiring direct hands-on actions:
- Reset disk/volume: Detach the operating system disk/volume from the affected virtual machine instance, optionally create a backup snapshot, delete the problematic channel file from the volume (for example, by attaching it to a helper instance), then reattach it to the original instance. This works well for virtual infrastructures.
- Rollback snapshot: For platforms like Amazon Web Services (AWS) that support point-in-time snapshots, roll back to a snapshot taken before the faulty update was released at 0409 UTC on July 19th.
- Sensor self-update: Keep network-connected endpoints online, rebooting them if needed, so that the Falcon sensor can download the reverted channel file. Falcon content updates are delivered by CrowdStrike through the sensor rather than through Windows Update.
- WSUS/SCCM: Leverage existing Windows Server Update Services (WSUS) or System Center Configuration Manager (SCCM) infrastructure to approve and deploy the necessary CrowdStrike hotfix package to clients.
When possible, automated solutions can mass remediate large numbers of systems with minimal manual effort. But some environments may require individual machine interventions.
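The disk reset approach can be sketched as an ordered sequence of steps. Everything named here is hypothetical: VolumeOps is an illustrative interface rather than any cloud provider's SDK, and the use of a separate rescue instance to mount the disk is an assumption about how the file gets deleted while the disk is detached. Each method would need to be mapped onto your platform's real API.

```python
from typing import Optional, Protocol


class VolumeOps(Protocol):
    """Hypothetical interface; not a real SDK."""

    def stop_instance(self, instance_id: str) -> None: ...
    def detach_volume(self, volume_id: str, instance_id: str) -> None: ...
    def snapshot_volume(self, volume_id: str) -> str: ...
    def attach_volume(self, volume_id: str, instance_id: str) -> None: ...
    def delete_channel_file(self, rescue_id: str, volume_id: str) -> int: ...
    def start_instance(self, instance_id: str) -> None: ...


def repair_via_rescue(ops: VolumeOps, instance_id: str, volume_id: str,
                      rescue_id: str, take_snapshot: bool = True) -> dict:
    """Detach the OS disk, clean it on a rescue instance, and reattach it."""
    ops.stop_instance(instance_id)
    ops.detach_volume(volume_id, instance_id)
    # Optional safety net before touching the disk contents.
    snapshot_id: Optional[str] = ops.snapshot_volume(volume_id) if take_snapshot else None
    ops.attach_volume(volume_id, rescue_id)
    try:
        removed = ops.delete_channel_file(rescue_id, volume_id)
    finally:
        # Always hand the disk back to its owner, even if cleanup failed.
        ops.detach_volume(volume_id, rescue_id)
        ops.attach_volume(volume_id, instance_id)
    # Only reached when cleanup succeeded; failures leave the VM stopped.
    ops.start_instance(instance_id)
    return {"snapshot": snapshot_id, "files_removed": removed}
```

Leaving the instance stopped when the cleanup step raises is a deliberate choice in this sketch: a host that would only crash again is better left for someone to inspect than restarted automatically.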
Manual Recovery Steps
For hosts that cannot automatically receive the fix, more hands-on actions are required:
- Reboot the machine to give the Falcon sensor a chance to download the reverted channel file from CrowdStrike. Put the host on a wired network before rebooting, since a wired connection typically comes up sooner than Wi-Fi.
- If the system continues crashing on reboot, boot into Safe Mode or the Windows Recovery Environment, which load a limited driver set and so avoid the crash.
- Navigate to the CrowdStrike folder at %WINDIR%\System32\drivers\CrowdStrike and delete the file matching C-00000291*.sys with the outdated timestamp (0409 UTC).
- Reboot the machine normally. In some cases, repairing the Windows image may also help (dism /online /cleanup-image /restorehealth).
- For BitLocker encrypted systems, recovery keys may be needed to access the drive in various boot modes. Centralized key management solutions can help with recovery access.
- As a last resort, restore the operating system from a backup taken before the faulty update was released.
Having mitigation procedures documented streamlines individual machine fix efforts. Consider automating manual steps when possible.
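As one way to automate the file deletion step, the sketch below lists, and optionally removes, matching files whose modification time predates the reverted version. The function name is hypothetical, and treating every C-00000291*.sys file older than 0527 UTC on July 19th, 2024 as safe to delete is an assumption. Python is usually not available in Safe Mode or the Windows Recovery Environment, so this is better suited to a disk mounted on a separate repair system.

```python
from datetime import datetime, timezone
from pathlib import Path

# Files stamped at or after this time are the reverted (good) version.
FIXED_FROM = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def remove_stale_channel_files(folder, dry_run=True):
    """Return the C-00000291*.sys files older than the reverted version.

    With dry_run=True (the default) nothing is deleted; the list only
    shows what would be removed.
    """
    stale = []
    for path in sorted(Path(folder).glob("C-00000291*.sys")):
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if modified < FIXED_FROM:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return stale
```

Running it first with the default dry_run=True and reviewing the returned paths before passing dry_run=False keeps the destructive step explicit.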
Lessons Learned
Every security incident holds important lessons. From this CrowdStrike update issue, here are some takeaways organizations may apply:
- Maintain recent system backups or snapshots outside production for quick rollbacks.
- Where possible, use hypervisor or OS automation to reset affected systems rather than repair them one by one.
- Centralize encryption keys for rapid BitLocker access in recovery scenarios.
- Test patching processes under different network conditions (wired, wireless, low-bandwidth).
- Establish communication pathways with key security vendors for promptly learning scope and remediation details.
- Consider alternative update delivery methods like WSUS to avoid single points of failure.
- Run preparatory event searches to proactively identify systems requiring updates or attention during known patching windows.
- Document robust incident response plans that evolve from lessons learned through real-world experiences.
While this CrowdStrike update caused challenges, it also serves as an opportunity for organizations to strengthen processes through reflection and improvement. An incident-informed approach will help boost security posture and preparedness over the long run.
In conclusion, a thorough recovery methodology combined with preventative planning can help turn a disruption into a learning experience. With effort on both the automated and manual fronts, affected systems can be restored. More importantly, future incidents can be better mitigated when processes are revised in light of this one.
