The CrowdStrike Outage Proves Why Security and Risk Management are So Essential

Gabe Dimeglio talks about the CrowdStrike outage and what it means for security.

Many businesses rely on software made by other companies to function. But as the CrowdStrike outage this summer showed, if something happens to those critical software programs, your business could be in huge trouble. And if that issue requires hands-on solutions, it can cause even more delays. It’s essential to make sure your business is set up to recover and repair after an incident like this.

See The Update that Broke America with Gabe Dimeglio for a complete transcript of the Easy Prey podcast episode.

Gabe Dimeglio has been working in IT and security for twenty years. Currently he is the GVP and GM of Rimini Protect, a project at Rimini Street. Over the years, at Rimini Street, he has built out a global practice of cybersecurity and compliance solutions. His team at Rimini Protect does full implementations and managed services for security solutions, primarily for enterprise software.

The CrowdStrike Outage: What Happened?

You’ve probably heard about the CrowdStrike outage – whether it was because you came into work and found your Windows computer inoperable or because people on the internet were suddenly talking about how computers weren’t working and planes couldn’t fly. Gabe lives in North Carolina, but he was in California when it happened, and he was concerned about being able to fly home the next day.

CrowdStrike makes security software that a huge number of businesses use. The outage came down to poor coding in an update. Windows devices have two modes – user mode and kernel mode. User mode is what you see. If an update goes wrong in user mode, you get an error and a suggestion to try the update again. But kernel mode is much more deeply embedded in the system, so a failure there has the potential for much greater issues and data loss. If something goes wrong in the kernel, the whole system crashes to protect against data loss and corruption. That’s the cause of the “blue screen of death.”

Some programs need to use kernel mode, and most programs that do go through extensive testing before releasing updates to make sure they won’t cause issues. CrowdStrike also goes directly into the kernel. That is great for a security company because it gives them good visibility. But they release updates frequently, much too fast to test each one rigorously first. Their idea was that faster updates led to better protection. But that meant when faulty code got sent out as an update, it went directly into the kernel – and every Windows device that received it crashed.

Why the CrowdStrike Outage was Such a Big Problem

Theoretically, a bunch of computers crashing at the same time would cause some issues, but not on the scale that the CrowdStrike outage caused. But there were a couple of factors that made this outage a huge deal. One was that a lot of businesses and industries use CrowdStrike. The airline industry is the one people talked about most, because the sudden scramble to get flights running again affected a lot of travelers. The healthcare and banking industries also had major problems.

The other reason that it was such a big problem was the way to fix it. The fix was relatively easy. You had to boot the device into Safe Mode, remove all the files associated with the faulty update, and then reboot the device. But even though it was easy, it required someone to physically boot up each device and remove those files. Approximately 8.5 million devices were affected, and someone had to manually fix every single one. It took a lot of time, effort, and manpower. Some companies running Windows servers had to fly people out to fix thousands of servers in remote server buildings.
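The “remove the files” step can be sketched in a few lines of Python. This is only an illustration, not the official remediation tool. The function name `remove_matching_files` is hypothetical, and the directory and file pattern in the usage section come from widely reported remediation guidance, not from the podcast. The sketch defaults to a dry run so that nothing is deleted by accident.

```python
from pathlib import Path


def remove_matching_files(directory: Path, pattern: str, dry_run: bool = True) -> list[Path]:
    """Return the files in `directory` that match `pattern`.

    The files are deleted only when dry_run is False.
    """
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Assumed values from widely reported guidance, not from this article.
    driver_dir = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for path in remove_matching_files(driver_dir, "C-00000291*.sys", dry_run=True):
        print(f"would remove {path}")
```

Even with a script like this, someone still had to get each machine into Safe Mode first, which is why the cleanup took so long.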

It also caused a lot of reputational damage for CrowdStrike. They were a revolutionary company when they started and captured a huge amount of market share. Some of that is because they’re still one of the very few options for certain things. But now, even when CrowdStrike would be the best solution for a customer, Gabe gets pushback. It’s only one mistake out of many thousands of updates, which is a really good track record. But the outage was so big and so publicized that people don’t want to work with them anyway.

Across the board, there’s a lot of reputational damage [to CrowdStrike].
Gabe Dimeglio

It Could Happen Again

The CrowdStrike outage highlighted vulnerabilities that can affect other industries, too, especially those reliant on technology for critical infrastructure. Transportation, energy, and manufacturing all face similar challenges when it comes to software updates. Outages could cause a lot of problems or be challenging to fix. If a system managing oil pipelines goes down, for example, someone may have to fly to a remote area by helicopter to fix it. And if a computer controlling manufacturing machinery stops working, the machinery could potentially damage itself or injure workers.

Anything critical should ideally be segmented into its own environment to minimize the risk. This is the standard approach for operational technology (OT): any technology that keeps your business operational lives in its own separate network environment. Anything new coming in should go through a separate network environment and get tested there before it can access your OT. If it’s essential to your business or could be a huge problem if it’s out of control, you need to protect it.

The CrowdStrike outage illustrates some of the risks of using subscription-based services.

Gabe is also concerned about the increasing number of cloud-managed solutions in critical infrastructure. There are a lot of risks associated with these cloud services, especially given the frequent reports of cloud breaches. And if the cloud provider hasn’t properly configured things, putting critical infrastructure in the cloud is putting it at huge risk of breach or even complete loss if something happens to the cloud provider.

Many companies may have already been compromised and just not know it yet. It’s a harsh reality, but it’s the reality we’re living in. To effectively protect critical assets, organizations should start by assuming that breaches will happen and may have already happened. Baseline your critical assets as they run in production, so you know what normal looks like and can spot when something has changed.
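One simple way to picture a baseline is a record of file hashes that you compare against later. The sketch below is a minimal illustration under that assumption. The function names `snapshot` and `drift` are hypothetical, and a real baseline would likely also cover things like running services, configurations, and network connections.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    result: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and list added, removed, and modified files."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(k for k in shared if baseline[k] != current[k]),
    }
```

Taking a snapshot while the system is known to be healthy, then running `drift` on a schedule, turns “something changed” from a guess into a list you can investigate.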

Every single company out there has been compromised, and for those that haven’t, they just don’t know they’ve been compromised.
Gabe Dimeglio

Protect Your Critical Infrastructure

Simply leaving your critical infrastructure systems vulnerable is not an option. The key is rigorous testing whenever changes are introduced. Many people might dismiss older software as “legacy,” but the truth is that these systems are generally very reliable.

If you think about it, software programs are just a series of ones and zeroes. They don’t wear out, and they’ll continue functioning as expected. So what causes problems? Changes. Any time a change is made, whether it’s updating software or altering data, there’s risk. Problems arise when these changes affect the system’s state in ways that haven’t been tested, leading to unexpected issues.

Any time you introduce change, that’s when you introduce risk.
Gabe Dimeglio

The root cause of many issues, as with the CrowdStrike outage, is an update that alters how a platform operates. CrowdStrike is an interesting case because auto-updates are part of how the product is designed to work. You can turn them off, but that requires manipulating the uninstall/maintenance service and has other ramifications. So most people leave auto-update on. It’s quick, effective, and comprehensive, but it can also cause problems.

Advice for Smaller Businesses

All of this advice is great for big companies that have IT and security teams and budgets to spend on technology and security. But many small and mid-sized businesses operate with just one person handling multiple roles. What roles those actually are can vary based on the business. But no matter what you do, it’s important to start by understanding what in your business is really valuable.

Understand what it is that you have that’s of value to anyone, yourself included.
Gabe Dimeglio

If you’re a one-man shop making widgets and collecting payments through Venmo, you may not feel like you have a lot that’s worthwhile. But if you’re using a CAD system to design a unique product that you’re selling, that changes the situation. If someone steals those designs and starts making their own, not only does it jeopardize your business, it means you won’t be able to file a patent on those designs.

It’s crucial to identify what is important to you and to others, and then take steps to protect those assets. As businesses grow and evolve, many find themselves relying on subscription-based services from vendors. Not everyone has the funds to invest in robust server rooms. But when everything is subscription-based, you’re dependent on these vendors. What happens if one of them has a major issue and stops functioning the way you need? The CrowdStrike outage serves as a reminder of how important it is for even small businesses to take cybersecurity seriously and plan ahead to manage these kinds of risks.

Invest in Security and Control Changes

The good news is that cybersecurity is a top priority for every company Gabe has worked with in recent years. Security is important to executives and board members, and it has become easier to get security budgets. The key to getting executive buy-in is making the risks real. Just warning them that hacks are possible isn’t enough. If you can put numbers behind it – this is the likelihood of an incident happening, this is the business impact of an incident, and this is the likelihood of an incident if we invest this amount – they are much more likely to fund specific security measures. After all, the only thing more expensive than security is shutting down the business because of poor security.
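To show what putting numbers behind it can look like, here is a minimal sketch of a common expected-loss calculation: likelihood multiplied by impact, before and after an investment. The function names and every figure in the usage section are hypothetical and chosen only for illustration. Real estimates would come from your own incident data and business impact analysis.

```python
def expected_annual_loss(annual_likelihood: float, impact: float) -> float:
    """Expected yearly loss: probability of an incident times its cost."""
    if not 0.0 <= annual_likelihood <= 1.0:
        raise ValueError("annual_likelihood must be between 0 and 1")
    if impact < 0:
        raise ValueError("impact must not be negative")
    return annual_likelihood * impact


def net_benefit(likelihood_before: float, likelihood_after: float,
                impact: float, annual_cost: float) -> float:
    """Reduction in expected loss minus what the security measure costs per year."""
    before = expected_annual_loss(likelihood_before, impact)
    after = expected_annual_loss(likelihood_after, impact)
    return (before - after) - annual_cost


if __name__ == "__main__":
    # Hypothetical figures: 10% yearly chance of a $2M incident,
    # reduced to 2% by a measure that costs $50K per year.
    benefit = net_benefit(0.10, 0.02, 2_000_000, 50_000)
    print(f"Net yearly benefit: ${benefit:,.0f}")
```

With those made-up inputs, the measure lowers the expected loss by about $160,000 a year for a cost of $50,000, which is the kind of comparison that is easier for a board to act on than a general warning.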

There is nothing more expensive than shutting down the business.
Gabe Dimeglio

For the actual security, the important part is change control. Companies need to change and adapt. If you’re not evolving and transforming, you’re getting beaten by the competition. But the CrowdStrike outage happened because an untested change caused problems. Never give in to the temptation to skip fully understanding how technologies work and their implications. Every time you want to make a change, assess it thoroughly first. And make sure you have a backup plan. If everything goes wrong and the change breaks things, what’s your rollback plan? Having all of this worked out in advance will limit the damage an unexpected change can do.
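One way to picture change control with a rollback plan is a staged rollout: apply the change to a small group first, check that it is healthy, and only then continue. The sketch below is a simplified illustration and not how any particular vendor deploys updates. The function `staged_rollout` and the `apply`, `healthy`, and `rollback` callbacks are hypothetical placeholders for whatever your own tooling does.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a change one ring of hosts at a time.

    If any host in a ring fails its health check, every host changed so far
    is rolled back (most recent first) and later rings are never touched.
    """
    changed: list[str] = []
    for ring in rings:
        for host in ring:
            apply(host)
            changed.append(host)
        if not all(healthy(host) for host in ring):
            for host in reversed(changed):
                rollback(host)
            return False
    return True
```

A call such as `staged_rollout([["test-pc"], ["office-1", "office-2"]], apply, healthy, rollback)` would stop at the test machine if it failed its check. The rollback step also needs to work when a machine will not boot normally, which is exactly what made the CrowdStrike cleanup so slow.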

You can connect with Gabe Dimeglio and Rimini Street on LinkedIn.