CrowdStrike crashes approximately 8.5 million Microsoft Windows machines
This was the headline last week (19 July 2024). I picked up on it when it was a breaking news story on TVNZ. When they announced it had caused problems in NZ and Australia, I realised it could be an international issue. I started gathering as much information as I could as the event’s impact unfolded, both to manage any risk to our international client base and to assess and learn from it for future BC planning.
The CrowdStrike incident was a massive failure that affected many “cloud”-hosted services and large organisations, triggered by the release of a security update that contained a defect in its code.
CrowdStrike provides what is known as endpoint security software, whose job is to prevent malware entering corporate networks via Microsoft Windows workstations. This is particularly important for organisations with people working from home, but it also applies to businesses running remote digital displays.
Providers of this sort of software regularly push their updates out very quickly, as they have to move rapidly to block new and known attacks from the bad guys. Unfortunately, on this occasion CrowdStrike appears not to have fully tested the update or put it through a robust enough quality control process before sending it out. The flaw in the update caused Microsoft Windows to crash with what the industry calls the Blue Screen of Death (BSOD) error message and then fall into what is known as a boot loop, where the machine crashed again each time it tried to restart.
Fixing the problem was relatively easy for a knowledgeable IT person: bring each affected machine up in Safe Mode, then locate and delete the errant file. The catch is that the fix had to be actioned individually on every Windows machine that failed, and the person doing it needed Administrator rights to perform the correction. Whilst straightforward, this still created significant downtime for many organisations.
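For readers who want to see what that manual fix actually involved, below is a minimal sketch in Python of the “locate and delete the errant file” step. It assumes the directory and filename pattern reported in CrowdStrike’s remediation guidance at the time (files matching C-00000291*.sys under C:\Windows\System32\drivers\CrowdStrike) and that it is run from Safe Mode with Administrator rights; it is illustrative only, not an official tool.

```python
# Illustrative sketch of the manual CrowdStrike remediation step, assuming the
# directory and file pattern published in CrowdStrike's guidance at the time.
# Intended to be run from Safe Mode with Administrator rights; defaults to a
# dry run that only reports what it would delete.

from pathlib import Path

# Location and pattern of the faulty channel file, per the published guidance.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
ERRANT_PATTERN = "C-00000291*.sys"


def remove_errant_channel_files(dry_run: bool = True) -> None:
    """Locate the faulty channel file(s) and delete them, or just report them."""
    if not CROWDSTRIKE_DIR.exists():
        print(f"Directory not found: {CROWDSTRIKE_DIR}")
        return

    matches = list(CROWDSTRIKE_DIR.glob(ERRANT_PATTERN))
    if not matches:
        print("No matching channel files found; this machine may not be affected.")
        return

    for path in matches:
        if dry_run:
            print(f"Would delete: {path}")
        else:
            path.unlink()
            print(f"Deleted: {path}")


if __name__ == "__main__":
    # Switch to dry_run=False only once you are sure this is the affected file.
    remove_errant_channel_files(dry_run=True)
```

In practice this step generally had to be carried out by hand in Safe Mode or the Windows Recovery Environment, because the affected machines could not boot far enough for normal remote-management tooling to reach them, which is a large part of why remediation took so long across big fleets.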
Lessons to learn from the CrowdStrike crash
This incident has highlighted the danger of relying solely on cloud providers: should they fail, the impact is potentially massive. Cloud providers such as AWS, Microsoft and Google have multiple data centres, and should they lose physical services or hardware in one data centre they will move that service to another. In this case, however, the faulty update was pushed rapidly to affected systems everywhere, regardless of which data centre they ran in. For a long time I have been concerned that so many organisations and governments are putting their services onto these cloud providers, essentially putting all their eggs in one basket. If a cybercriminal wants to impact a government now, they just need to get into one of these cloud providers and cause their systems to fail. The cloud providers consistently say this cannot happen. Yeah right – look what happened with CrowdStrike.
I also noted some people saying during the CrowdStrike incident that they should activate their business continuity and disaster recovery plans. In theory that may have been feasible, but as soon as you brought up a Windows machine it would pull down its software updates, and while the faulty update was still being distributed that would trigger the same problem. So caution had to be exercised here; in this scenario, the best course of action was not to turn on your machines to check whether they had been affected, but to delay this until CrowdStrike had put its fix out.
Sam Mulholland is the founder and lead Business Continuity consultant at Standby Consulting. To discuss how to test your organisation’s response to unplanned disruptions from events like the CrowdStrike crash, contact Sam directly.