How did this happen?
The official word from CrowdStrike on the day of the outage was that it "…was caused by a defect found in a Falcon [cyber security software] content update…". Content updates are routinely made by IT vendors without such significant unintended consequences, so many in the IT industry have begun to question whether the update was adequately tested before release and whether the approach to the roll-out was flawed; we do not yet have the answer to either question. In the meantime, what we do know already is of real interest to all organisations: the ability to push updates and patches centrally to systems around the world is generally considered a great benefit of connected systems and cloud hosted solutions, yet in this case it was the cause of the problem.
IT contracts, particularly for cloud services, often do not deal with the release of patches and updates in detail; in fact, the focus is more likely to be on requiring the supplier to patch and update to keep the solution current (particularly important in the cyber security space, to deal with rapidly evolving threats). Cloud hosted or centrally managed solutions allow central control, with customers relying on the vendor to manage this and "push" updates out to systems without any physical interaction. A customer may not even be notified, unless downtime or other system changes are required to carry out the update. This is an agile, efficient and cost-effective method but, as has been shown, it is not without risks.
These benefits, coupled with: (i) many customers focusing on the "big", more obvious contractual issues (e.g. liability and termination rights, as evidenced in AG's Technology & Outsourcing Risk Report 2024); and (ii) major IT and cloud vendors generally being reluctant to negotiate, have put customers on the back foot. As noted in our 2024 Report, the continued complexity of systems, and the inter-operability between them, are expected to be a growing challenge, something the CrowdStrike incident would appear to validate. Extensive testing prior to deployment, initial runs in a test environment and a phased roll-out, whether all of these or a combination, are the measures we would expect to see to protect against defects in an update.
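To make the last of those measures concrete: a phased (sometimes called "ringed") roll-out releases an update to progressively larger groups of systems, halting if problems appear, so a defect never reaches the full estate at once. The sketch below illustrates the idea; the ring names, sizes and error budget are our own hypothetical examples, not a description of CrowdStrike's or any particular vendor's actual process.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ring:
    name: str        # e.g. internal test fleet, early adopters, full estate
    fraction: float  # share of the estate receiving the update at this stage

# Hypothetical rings; real names, sizes and soak periods would be agreed per contract.
RINGS = [Ring("internal-test", 0.01), Ring("canary", 0.05), Ring("broad", 1.0)]

def roll_out(apply_update: Callable[[Ring], None],
             error_rate: Callable[[Ring], float],
             budget: float = 0.001) -> bool:
    """Push the update ring by ring, halting if the observed error rate in
    any ring exceeds the agreed budget."""
    for ring in RINGS:
        apply_update(ring)
        if error_rate(ring) > budget:
            print(f"Halting roll-out at ring '{ring.name}': error rate over budget")
            return False
    return True

# Usage: roll_out(push_to_ring, measure_crash_rate) -- both supplied by the
# deployment tooling; a failure at 'internal-test' stops everything downstream.
```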
These protections should be included in the contract and, as importantly, actively managed by the customer. Outsourcing IT systems, shifting to cloud solutions and relying on third party vendors does not outsource the risk of an issue, as those enterprises impacted by the IT outage are now unfortunately well aware. Even if contractual protections are included, there may be little to gain in enforcing them after the event. We would expect organisations to assess their position in the wake of the CrowdStrike incident (particularly as details of the cause become clearer).
Practical steps
As businesses around the world look inwards, as well as outwards, in considering how to prevent, or at least mitigate the impact of, such an outage in the future, there are some points to think about:
- Notice: How much notice are you being provided with before a patch/update is released? Does this provide sufficient time to carry out system testing?
- Testing: Do you have a clear release process in place to manage the release of patches/updates, setting out how you can engage in the process and request support from vendors? A phased release approach (as sketched above) could potentially have avoided some of the issues experienced with the CrowdStrike update on Friday.
- Obligation to test: Do your cloud subscriptions include a clear obligation on vendors to manage the release of patches/updates in a way that ensures they do not impact the service? Do you understand the release processes your vendors use to support your key systems?
- Support: What level of commitment do you have from your vendors if there is a critical outage, particularly one impacting multiple customers? Is there enhanced support following an update or patch?
- Business Continuity and Disaster Recovery: Do your plans cover situations where a failed patch/update needs to be managed? We would commonly see customers look to roll back to the previous version, but is there a contingency if physical intervention is required (as appears to be the case in this instance), for example for device or system re-boots? The first sketch after this list illustrates the distinction.
- Current Version: If the software being updated is not a cloud service, you as a customer will likely not be obligated to be on the most recent version, and may instead be permitted to stay two, or more, versions behind the current one. This provides time to test, and plan, the release of any updates or new versions of the software. For minor releases and patches, there may only be an obligation to take those that are critical, i.e. required to fix vulnerabilities rather than to add functionality. The second sketch after this list shows a simple version-policy check.
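On the Business Continuity and Disaster Recovery point, recovery plans benefit from distinguishing up front between devices that can be rolled back remotely and those needing hands-on recovery. The sketch below assumes a hypothetical Device record and version labels of our own invention; it illustrates the triage logic only, not any vendor's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class Device:
    """Hypothetical stand-in for an endpoint managed by your deployment tooling."""
    hostname: str
    boots: bool    # does the device still come up after the update?
    version: str

def recover(device: Device, last_known_good: str) -> str:
    """Branch the recovery path: remote roll-back where the device still boots,
    escalation to hands-on recovery where it does not."""
    if device.boots:
        device.version = last_known_good  # remote roll-back to the prior version
        return f"{device.hostname}: rolled back remotely to {last_known_good}"
    return f"{device.hostname}: physical intervention required (e.g. safe-mode re-boot)"

# Triaging the estate up front shows how much recovery can be automated and
# how much needs engineers physically at machines.
fleet = [Device("till-01", boots=False, version="7.11"),
         Device("hq-laptop-42", boots=True, version="7.11")]
for d in fleet:
    print(recover(d, last_known_good="7.10"))
```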
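On the Current Version point, an "up to N versions behind" clause can be monitored with a simple check. The version format and the two-versions-behind window below are hypothetical illustrations of how such a policy might be expressed, not the terms of any particular licence.

```python
def versions_behind(installed: str, current: str) -> int:
    """Count major releases between installed and current, assuming simple
    'major.minor' version strings (a hypothetical format)."""
    return int(current.split(".")[0]) - int(installed.split(".")[0])

def compliant(installed: str, current: str, max_behind: int = 2) -> bool:
    """True if the installed version sits within the permitted window,
    here up to two major versions behind the current release."""
    return 0 <= versions_behind(installed, current) <= max_behind

print(compliant("7.10", "9.2"))  # True: two majors behind, still permitted
print(compliant("6.4", "9.2"))   # False: three majors behind, upgrade due
```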