Thoughts on the CrowdStrike Outage

Unless you’ve been living under a rock, you probably know that last Friday a global crash of computer systems caused by ‘CrowdStrike’ led to widespread chaos and mayhem: flights were cancelled, shops closed their doors, and even some hospitals and pharmacies were affected. When things like this happen, I first have the smug feeling that “this would never happen at our place”, and then I start thinking: could it?

Broken Software Updates

Our department does take responsibility for keeping quite a lot of software up-to-date. For most of our customers, we rely on unattended upgrading for important bug fixes (including security fixes). This typically happens at night and at roughly the same time for many of our customers. A broken kernel fix applied to multiple versions of Ubuntu could theoretically take down a large number of our customers before our on-call staff could identify the cause and disable the upgrades, and fixing it would take a lot of time. However, on the systems that really matter we do have good redundancy, often with servers distributed over three sites - and we ensure that some time passes from when one site is patched until the next. The details vary from customer to customer, but for some solutions we let more than a day pass between the upgrade of one site and the next. In case of serious security problems we don’t have time to wait a full day for the upgrade, but then we would handle it manually - and we would usually do a quick check that nothing is obviously broken on one site before upgrading the next.
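As an illustration only, the staggered patching described above could be sketched roughly like this in Python. All names here (`Site`, `plan_rollout`, `run_rollout`, the `patch` and `is_healthy` callbacks) are hypothetical and not taken from any real tooling, and the 26-hour gap is just an assumed value for "more than a day".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Site:
    name: str  # hypothetical site label, e.g. "site-a"


def plan_rollout(
    sites: List[Site], start: datetime, gap: timedelta
) -> List[Tuple[str, datetime]]:
    """Schedule one site per `gap`, so a bad patch hits only one site at first."""
    return [(site.name, start + i * gap) for i, site in enumerate(sites)]


def run_rollout(
    sites: List[Site],
    patch: Callable[[Site], None],
    is_healthy: Callable[[Site], bool],
) -> Tuple[List[str], Optional[Site]]:
    """Patch sites in order; stop as soon as a freshly patched site looks broken.

    Returns (names of sites that were patched and stayed healthy,
    the first unhealthy site or None). Waiting for each site's scheduled
    time is left out to keep the sketch short.
    """
    healthy: List[str] = []
    for site in sites:
        patch(site)
        if not is_healthy(site):
            return healthy, site  # remaining sites stay on the old version
        healthy.append(site.name)
    return healthy, None


if __name__ == "__main__":
    sites = [Site("site-a"), Site("site-b"), Site("site-c")]
    # Assumed values: nightly start at 02:00, 26 hours between sites.
    for name, when in plan_rollout(sites, datetime(2024, 7, 19, 2, 0), timedelta(hours=26)):
        print(name, when.isoformat())
```

The point of the design is that a broken update stops at the first site, while the other sites keep serving traffic on the old version.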

Time Bombs

I do see a risk in a really serious bug or security hole, deliberately or accidentally introduced in some popular software component, lurking there undiscovered for a long time before causing significant disruption. While I consider it unlikely, I do think there is some risk that a serious bug could cause as much havoc as the CrowdStrike incident - particularly if it’s intentionally crafted by malicious persons.

We’ve seen supply chain attacks recently in the open source communities - polyfill.io, ua-parser-js, ctx, xz and probably many more. I don’t think this problem is limited to open source software; insider attacks could also happen in companies producing proprietary software. State powers also tend to run long-term projects planting or recruiting “moles” in different places, so why not in software development companies as well? I believe such supply chain attacks can be difficult to prevent or discover and could cause similar or worse outages or security breaches in the future - but I hope not.

Self-Updating Software Packages

CrowdStrike wasn’t an ordinary software package - it was a self-updating software package. We do have some of them in production - like customers using WordPress and installing various extensions through the web-ui. It’s for sure way easier to click on some buttons in a web-ui than to have to send a manual request to some sysadmin “could you install package X and configure it to work in my environment” - but security-wise I think it’s a bad practice. A running application should ideally not have the permissions to update its own code.
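One rough way to check the last point is to ask, from inside the running application, which parts of its own code tree it is allowed to modify. The following is a minimal sketch under stated assumptions: `writable_code_paths` is a hypothetical helper, the path in the usage note is only an example, and `os.access` only reflects ordinary file permissions (a process running as root will typically see everything as writable).

```python
import os
from pathlib import Path
from typing import List


def writable_code_paths(code_root: str) -> List[Path]:
    """Return every path under code_root that the current process may modify.

    An empty list means the application cannot rewrite its own code
    through normal file permissions.
    """
    root = Path(code_root)
    candidates = [root, *root.rglob("*")]
    return [p for p in candidates if os.access(p, os.W_OK)]
```

Run as the web server user against, say, a WordPress installation directory, a non-empty result would indicate that the application is able to update or replace its own code.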

If I understand it correctly, CrowdStrike was also updating itself at high frequency, almost in real-time - and that’s how so many customers could be affected simultaneously before anyone had time to pull the plug. There is quite often a dilemma here: urgent security upgrades need to be pushed really fast, but sometimes the new bugs introduced can be worse than the bugs the upgrade is trying to fix.

By gaining full privileged access to so many customer systems and being able to auto-update their software in real-time, CrowdStrike has effectively created a gigantic “command and control”-network containing a great many critical servers. Even if things hadn’t gone bad by accident, there is a high risk that sooner or later something would have gone bad due to a malicious attack, e.g. from an insider. Such an attack may not be as “innocent” as causing downtime; it may involve things like secretly siphoning off sensitive data from the customers. If I understand their business model right, siphoning data from customers’ systems is already an important aspect - albeit for the legitimate purpose of discovering attacks.

IDS/IPS and Third-Party Solutions

As I understand it, CrowdStrike is some kind of Intrusion Detection System (IDS) and/or Intrusion Prevention System (IPS). After two minutes of research, I find this:

it’s a software package with comprehensive prevention features, including the ability to stop malicious code execution, block zero-day exploits, kill processes, and contain command and control callbacks. It has a Cloud-based architecture, unlike traditional on-premises IDS/IPS solutions, CrowdStrike Falcon is a 100% cloud-based Security as a Service (SaaS) offering.

Now, giving a 100% cloud-based SaaS-offering the permission to stop code execution and kill processes, that sounds like a recipe for disaster to me! To be really useful (stopping zero-day attacks in the window between a security hole becoming known and it being patched), upgrades to an IDS/IPS-system need to be pushed out to all the computers really fast - but that in turn brings the risk of breaking a great many systems really fast.

Some of our customers are buying and installing security software like this one. I usually discourage it - they are giving away control to third parties without getting much security benefit in return. I think such systems are mostly “politically motivated”, giving some managers without much technical knowledge a “warm and fuzzy feeling” that they have done something to improve the security, or maybe they had a checkbox for “anti-virus software” that needed to be ticked off. So to answer my initial question: yes, a few of our customers could be affected by problems like this one. I read that the Linux version of CrowdStrike is a kernel module - that is the highest privilege you can have in Linux, and it also carries the highest risk of causing the computer to behave erratically or crash. It may seem easy to say this in hindsight, but I would for sure ask my customers three times “are you really sure about this?” before installing a third-party cloud-based SaaS self-updating kernel-module for the customer.

We do offer some IDS/IPS, and we occasionally scan our systems for known malware signatures - but for one thing, our security would never rely on any software of this sort. It’s only for monitoring purposes and “defence in depth”; in the very best case it could be a faster route for blocking certain zero-day attacks than patching the software. If it’s necessary to kill processes, then the system is already to be considered compromised, and then the Big Red Button is to be pressed. Luckily we very rarely have such incidents, but dealing with them is always a very manual process - it’s not sufficient to just kill a few rogue processes. If some very bad zero-day appears, then we also deal with it manually. While speed is of the essence, I still believe it’s important to have humans in the loop making decisions before ending up with such a big mess as we did last Friday. So our IPS would only be allowed to do one thing - block internet traffic. Well, admittedly, if the IPS goes haywire and blocks all the internet traffic, it may be hard both to push updates and to log into the servers to fix it, but I still believe the risk is very much lower than having a third-party kernel module updating itself.
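The policy described above, where the IPS may block traffic on its own and everything else goes to a human, could be sketched like this. This is a minimal illustration, not an actual IPS configuration: the `Action` names, the `AUTOMATIC` set and the `decide` function are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names, for illustration only.
    BLOCK_TRAFFIC = "block_traffic"
    KILL_PROCESS = "kill_process"
    UPDATE_SELF = "update_self"


# The only thing the automated system may do without a human decision.
AUTOMATIC = frozenset({Action.BLOCK_TRAFFIC})


def decide(action: Action) -> str:
    """Apply allowlisted actions automatically; escalate everything else."""
    return "apply" if action in AUTOMATIC else "escalate_to_human"
```

With this allowlist, `decide(Action.KILL_PROCESS)` escalates instead of acting, which keeps a human in the loop for anything more drastic than blocking traffic.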

A Note on Security

Some news sources state that this was not a security incident. Probably no malicious actor was involved, and probably no data was leaked, but I’d still classify it as a security problem:

Sometimes systems go down due to malicious actors (e.g. deliberate DDoS-attacks); cases like this may rather be considered a “friendly fire incident”. The cause of the outage does not matter: important solutions should also have some protection against broken software upgrades.
A third-party SaaS had access to disable the servers completely, which I consider a security problem.
Giving a third-party SaaS unfettered access to gather data from the server it’s running on may be considered a security breach.
It’s partly based on AI - giving some AI privileged access to the servers is maybe not a brilliant idea.
This time the outage was probably due to an accident, but consider that it could have been a malicious attack by an insider at CrowdStrike.

Conclusion

Thinking thoroughly through “could a similar problem happen to us”, the answer is that it’s highly unlikely.

The CrowdStrike incident wasn’t a freak accident - this was a disaster waiting to happen. The cause was not that computer systems have become “too complex”, as some media report. I also read some expert claiming that the problem can only be fixed through tighter regulations in the US. Now CrowdStrike may be a US company, but nobody was forced to use it.

I strongly believe that it was a bad idea to give a “100% cloud-based SaaS-offering” full privileged access to important servers. No matter how secure and stable the systems have been made, and how well thought-out the redundancy is, by installing CrowdStrike or similar software - and even paying for it - you have effectively ruined the design completely. Perhaps the management needs to stop considering “anti-virus” as a checkbox that needs to be ticked off, and rather prioritize good in-house security competence.

The CrowdStrike Terms and Conditions even spell it out in capital letters: THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT.

Post-Scriptum

This has nothing to do with Windows. After writing this article, I heard that CrowdStrike had similar issues on some Debian versions in April - but I think the Linux community is more wary of installing such applications, so very few got their hands burned. While SELinux doesn’t do quite the same as CrowdStrike, it’s a battle-tested solution that can stop many server applications from doing things they are not supposed to do. Combine it with frequent but not too frequent auto-patching of bugfixes, a competent sysadmin team keeping track of security vulnerabilities as they get published, and possibly some auto-updating Web Application Firewall, and I think one should be pretty much covered - without running the risk of a major service outage due to rogue updates.

Disclaimers

Typos and other minor corrections were made following suggestions from ChatGPT.
Minor edits have been made after publication.

July 23, 2024

Tobias Brox
