Business
SRE

The hidden costs of outages and quick first steps to preventing them

We all know that downtime is a problem, but how do you make that tangible for leadership? Here’s some data that quantifies the cost of outages, and tips for getting started early on preventing this worst-case scenario.

By Cortex - November 3, 2021

Over Halloween weekend, Roblox had a massive outage and was down for three days, and its competitors saw a 12-13% increase in usage. Not only did Roblox lose money, but it also has to pay back content creators on its site who lost revenue. Earlier in October, another outage made the news when Facebook, Instagram, and WhatsApp were down for more than five hours, costing the company an estimated $60-$100 million.

Even with astonishing losses like these, Roblox and Facebook can weather the storm. But when your company is small and just building a reputation, a major outage could mean the end of your business. If your team isn’t already talking about how to invest in reliability to prevent outages, it’s time to get the conversation going and develop your influence as an SRE.

Why outages are so expensive

There are many ways that outages can cost your business, and it’s important to be upfront about these risks when you’re communicating with leadership. Here are some points you can use when talking to management about why downtime matters:

Lost revenue from customers

Of course, every minute your site is down is a minute that your business is losing money. But how much? According to Gartner, that number averages to $5,600 per minute, with a wide variance. Businesses like financial institutions and eCommerce marketplaces can expect to be hit even harder, and an outage during peak traffic hours is much more costly. 
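To make that number concrete for leadership, the arithmetic can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a model: it assumes a flat per-minute cost (defaulting to the Gartner average above), and the `peak_multiplier` parameter is a hypothetical knob for peak-hour outages, not something taken from Gartner's data.

```python
# Gartner's cited average cost of downtime, in USD per minute.
GARTNER_AVG_COST_PER_MINUTE = 5_600


def estimate_outage_cost(
    minutes_down: float,
    cost_per_minute: float = GARTNER_AVG_COST_PER_MINUTE,
    peak_multiplier: float = 1.0,  # hypothetical: set above 1.0 for peak-hour outages
) -> float:
    """Return a back-of-the-envelope revenue loss for an outage, in USD."""
    if minutes_down < 0 or cost_per_minute < 0 or peak_multiplier < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_down * cost_per_minute * peak_multiplier


if __name__ == "__main__":
    # A 90-minute outage at the average rate.
    print(f"${estimate_outage_cost(90):,.0f}")  # prints $504,000
```

Because the variance is wide, you will likely get a more persuasive figure by passing your own revenue per minute as `cost_per_minute` instead of relying on the industry average.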

Downtime = wasted developer hours

When your site is down, your engineering team is pulled away from the work of building your product for hours or days. Stripe surveyed developers around the world and found that they spend on average about 4 hours a week (3.8 hours of a 41.1-hour workweek) dealing with “bad code” and errors. That’s an expensive 9.25% productivity loss, time that could otherwise be spent building new features.

It lowers morale

No one likes being paged in the middle of the night, or that feeling that you’ve let your users and customers down. It’s one thing if this happens infrequently. But if you don’t get a handle on technical debt early, you can end up in a situation where developers are basically afraid to make a change to the system — which is a terrible feeling to have as an engineer. 

Your competitors win

An outage damages your reputation and increases the odds that your hard-earned customers will go elsewhere. Depending on your business, downtime also carries the opportunity cost of a missed sale, or even of an investor hitting a 503 error. And if a security breach is involved, a survey by Security.org found that nearly one in four customers will leave your site permanently. All of this means that an outage is a gift to your competitors.

How to prevent outages—and make them less severe when they happen 

So, given that it’s important to stop outages from happening or at least reduce the amount of time that your site is down, what can you do? Here are some concrete tips. The good news is, it may actually be easier to implement these ideas while you’re a startup, since your system is less complex and your engineering culture isn’t set in stone. 

Invest in reliability and monitoring from Day 1

When you start building a new service, make sure that telemetry doesn’t end up as an afterthought. Make it part of your development process to set up alerting and to track the four golden signals (latency, traffic, errors, and saturation) along with your SLIs (service level indicators), SLOs (service level objectives), and SLAs (service level agreements).
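As one illustration of what tracking an SLO can look like, here is a minimal Python sketch of an availability SLI and the error budget that follows from an SLO target. The function names and the request counts are assumptions made for this example; in practice the counts would come from your monitoring system.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(
    slo_target: float, good_requests: int, total_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # The SLO allows this many failed requests in the measurement window.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 1.0
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 failures, 99.9% SLO.
    print(f"SLI: {availability_sli(999_500, 1_000_000):.2%}")
    print(f"Budget left: {error_budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

A negative result means the service has already failed more requests than the SLO allows for the window, which is a reasonable trigger for an alert or for pausing risky releases.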

Use tooling to encode best practices and reduce your MTTR 

When you enforce standardization and consistency across your engineering org, it’s less likely that something will go wrong, and when it does, everyone will know the process to follow to diagnose and fix issues. Tooling like runbooks and automated rollbacks will reduce the manual overhead for your team and therefore reduce your MTTR (mean time to resolve).
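To see whether tooling is actually moving your MTTR, you need to measure it. The sketch below assumes each incident is recorded as a pair of timestamps (when it was detected and when it was resolved); that record shape and the sample incidents are hypothetical, and your incident tracker may expose them differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average the (detected, resolved) durations across incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = timedelta()
    for detected, resolved in incidents:
        if resolved < detected:
            raise ValueError("incident resolved before it was detected")
        total += resolved - detected
    return total / len(incidents)


if __name__ == "__main__":
    # Two made-up incidents: one took 30 minutes, the other 90 minutes.
    sample = [
        (datetime(2021, 11, 1, 2, 0), datetime(2021, 11, 1, 2, 30)),
        (datetime(2021, 11, 2, 14, 0), datetime(2021, 11, 2, 15, 30)),
    ]
    print(mean_time_to_resolve(sample))  # prints 1:00:00
```

Tracking this figure before and after you introduce a runbook or an automated rollback gives you a simple way to show leadership whether the investment paid off.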

Drive accountability and ownership

When there’s an outage, you don’t want to end up with the bystander effect where everyone on the team thinks someone else will step up—and so the problem sits around for hours, not getting fixed. The answer to this is to have a clear owner for each part of your system and to document this ownership transparently. 
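One lightweight way to keep ownership documented is to check it automatically. The sketch below assumes a service catalog represented as a plain dictionary mapping service names to metadata with an `owner` field; that structure and the service names are hypothetical and are not a Cortex API.

```python
def find_unowned_services(catalog: dict[str, dict]) -> list[str]:
    """Return the names of services with a missing or blank owner."""
    unowned = []
    for name, metadata in catalog.items():
        owner = metadata.get("owner")
        # Treat absent, non-string, and whitespace-only owners as unowned.
        if not isinstance(owner, str) or not owner.strip():
            unowned.append(name)
    return sorted(unowned)


if __name__ == "__main__":
    # Hypothetical catalog for illustration only.
    catalog = {
        "checkout": {"owner": "payments-team"},
        "search": {"owner": ""},
        "auth": {},
    }
    print(find_unowned_services(catalog))  # prints ['auth', 'search']
```

Running a check like this in CI, and failing the build when the list is not empty, is one way to stop new services from shipping without a named owner.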

Hopefully outages are rare events for your business, but getting a plan in place can lead to improvements that you’ll see across your day-to-day work as well, like better documentation or standard templates for service creation. We’re here to help at Cortex. If you have any questions, just reach out to team@getcortexapp.com.
