Production Readiness

We need to talk about production readiness.

Every software team has some version of a production readiness checklist—some assortment of things they expect to happen before new software is released into the world. But have we lost the plot on what these lists were designed to achieve? To prevent? Deployments are now continuous, but our checklists are still frozen in time (and tools). It's time to rethink what we're doing here.

By Ganesh Datta, May 2, 2024

On December 31, 2008, all the Microsoft Zunes around the world stopped working. The development team hadn’t properly accounted for the leap year, and on the 366th day of the year, everything broke.

On February 29, 2024, card payments in a Swedish grocery chain went down, payment terminals in New Zealand gas stations crashed, and an EA Sports racing game was rendered unplayable for the day.

Over nearly two decades, the same basic and relatively predictable problem caused disasters across software companies and services, both big and small. These companies deemed fundamentally faulty software fit for consumption, either because their approval processes couldn’t catch the flaws or because they weren’t aware of the consequences of some of their code changes. 

Examples like these demonstrate a pattern: All of this software worked the day it was released but failed sometime later, even though that day and that failure were predictable and avoidable. 

These flaws, however, were only predictable and avoidable in theory. Until recently, it was impossible to look across dozens of tools, reconcile all the data types, and actually separate signal from noise. Some flaws were bound to hide beneath the waves and surface in production. With the rise of internal developer portals (IDPs), however, we can finally build a single system of record, and that ability requires us to rethink production readiness.

Too often, companies think about production readiness as the state of the product when it’s launched rather than as an ongoing alignment to evolving standards. The day the product launches sweeps aside the many days after, and the everyday suitability that determines whether a service should really remain in production fades into the background. 

It’s long past time to take a cold, hard look at the state of production readiness, rethink readiness from first principles, and work toward a new definition of continuous readiness. 

The state of production readiness

In The 2024 State of Software Production Readiness, we surveyed 50 engineering leaders at companies with more than 500 employees to better understand how modern companies think about production readiness and what limits they face.

Many of the companies we surveyed used a production readiness review checklist (or PRR) to establish a set of criteria developers can check to determine whether a service is ready for production. The goal, generally, is to ensure the organization can operate the service once there’s real customer traffic going through it. 

Checklists are blunt tools, but their form is often even blunter. Organizations often use a combination of Confluence pages, which outline the operating procedures, and spreadsheets, which document a list of services and criteria. The documents are static, and the process of checking and updating them is manual. 
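For contrast, here is a minimal sketch of what even a simple checklist looks like once it is expressed as data a program can evaluate, rather than a page a human has to reread. Every criterion, threshold, and piece of service metadata below is a hypothetical illustration, not any company's actual standard:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[dict], bool]  # takes service metadata, returns pass/fail

# Hypothetical service metadata, normally pulled from CI/CD, monitoring, and the like
service = {
    "name": "payments-api",
    "test_coverage": 0.74,
    "has_rollback_plan": True,
    "on_call_rotation": "payments-team",
}

# Hypothetical criteria of the kind a PRR spreadsheet might list
checklist = [
    Criterion("Test coverage >= 80%", lambda s: s.get("test_coverage", 0) >= 0.80),
    Criterion("Rollback plan documented", lambda s: bool(s.get("has_rollback_plan"))),
    Criterion("On-call rotation assigned", lambda s: bool(s.get("on_call_rotation"))),
]

for criterion in checklist:
    status = "PASS" if criterion.check(service) else "FAIL"
    print(f"{status}  {criterion.name}")
```

Nothing about this is sophisticated; the point is that once criteria are executable, checking them stops being a manual chore.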

Most of our respondents said their PRRs focused on things like adequate testing coverage, necessary security coverage, connection to CI/CD tools, and planned-out rollback protocols. Beyond these basics, standards varied widely across companies and within companies. Our survey showed zero duplication: no two leaders selected the same set of standards.

Nearly one-third of engineering leaders don’t have a formal process for ensuring services continue to meet production standards after their initial launches. Of the companies that do, the most popular rate of review was once a quarter. 

In these companies and many others, incident response has become the de facto way to warn teams that a service is no longer meeting their standards. Users then become the first line of defense – unpaid and unhappy QA testers who expect readiness, given that the service is in production, but don’t get it. 

That statement isn’t an exaggeration: According to our survey, overall program confidence was around 6/10, but 98% of participants reported witnessing at least one significant consequence as a result of failing to meet production readiness standards. Consequences included a downstream loss in revenue, an increase in change failure rate, an increase in mean time to resolve or remediate, and a decrease in developer productivity. 

When there’s a gap between confidence and results that’s this wide, it’s worth looking at the frameworks companies are operating with. Something is broken. 

A return to first principles

The concept of a product being “ready” has a long history that likely stretches back to the first time anyone made anything. (Surely, a caveman once tied a rock to a stick and thought, “No, this needs another pass before I take it on the hunt.”) The specific concept of “production readiness,” however, emerged from Google’s work on reliability engineering. 

SRE for the rest of us

Ben Treynor Sloss, now vice president of engineering at Google, joined in 2003 and built the first site reliability team. By 2016, Google employed over 1,000 site reliability engineers. Companies like Airbnb, Dropbox, and Netflix also adopted the role. 

Central to the SRE role was a tool and process called production readiness reviews (PRRs). Through PRRs, SREs verify that a service meets production standards and can demonstrate operational readiness. The PRR itself is essentially a checklist that verifies things like instrumentation, metrics, monitoring, emergency response, and more (often much more). 

Though the SRE role and the PRR process have become industry norms, Google is an uncommonly well-resourced company, and many (if not most) companies struggle to replicate its practices.

Google itself, for example, wrote in its book on SREs that “Onboarding each service required two or three SREs and typically lasted two or three quarters.” As a result, “The lead times for a PRR were relatively high (quarters away).” Often, the number of SREs available constrained the work that needed to be done (and this is Google we’re talking about!). 

Google, as well-resourced as it is, found hiring SREs to be expensive, too. “Hiring experienced, qualified SREs is difficult and costly,” they write. “Despite enormous effort from the recruiting organization, there are never enough SREs to support all the services that need their expertise.” Even after hiring, the costs didn’t stop: “Once SREs are hired, their training is also a lengthier process than is typical for development engineers.”

An SRE-based system, and PRRs in particular, just isn’t practical for most companies. Despite this, the concepts from this work have leaked beyond their original context, and even companies without SREs suffer from them. 

The right tension and the wrong solution

Google correctly identified a tension between product and operations, but its solution is wanting.

The tension is this: The operations team has a checklist they want the development team to follow, but the development team just wants to keep building and shipping. 

As Sloss explains in an interview, “When you strip away everything else, the incentive of the development team is to get features launched and to get users to adopt the product. That's it! And if you strip away everything else, the incentive of a team with operational duties is to ensure that the thing doesn't blow up on their watch.”

The tension exists at companies of all sizes, but artisanal SRE staffing is only possible at Google’s scale. We need an “SRE for the rest of us” model — one that returns to the first principles of continuously developed software without taking on the baggage of “readiness.”

The truth is that our concept of “production” has changed over time, but our concept of production readiness has not. 

Prior to the rise of Agile and DevOps, companies developed software via the waterfall method, delivered it “over the wall” for IT Operations staff to implement, and then sent it – on physical disks! – to customers to deploy and maintain. 

Software inherited an idea of readiness that was naturally dominant in every other industry: The car, shoe, hammer, and whatever else – all had to be ready before delivery to the customer. But with the rise of the cloud, SaaS, and continuous deployment, the basic idea of “production readiness” has outlived its usefulness. 

As an industry, we preach that companies should ship “minimally viable” products, that companies should ship patches fast and roll back as necessary, and that companies should start imperfectly and iterate toward perfection. 

These norms dispense with the black-and-white notion of “readiness,” but once we’re trying to maintain systems, this way of thinking comes back in. When we return to the first principles that underlie modern software, however, readiness feels like the wrong framework entirely. 

Why the checklist approach to production readiness is fundamentally flawed

We believe the current approach to production readiness is fundamentally flawed. We don’t make that criticism lightly. 

When you find a flaw in a methodology and decide it needs improvement, you need to figure out upfront whether a particular version of the methodology is flawed or whether the methodology itself is flawed on such a deep level that it requires refactoring or replacement. 

Over the years, we’ve seen numerous ways to make incremental improvements to production readiness. Better tools can help, as well as better practices around using the checklists and training who uses them and how. But the more we hear from our customers, the more we think the methodology itself is flawed and that the issues and pain points people tend to feel are downstream from some fundamental flaws. 

1. Manual, laborious workflow

Teams performing production readiness reviews tend to have static checklists that they check manually. In our survey, 56% of respondents said that manual follow-up was a huge blocker to determining production readiness. 

Manual work also invites mistakes by its very nature. As Thomas A. Limoncelli, an SRE TPM at Stack Overflow, once wrote, “Manual work is a bug.”

2. Non-standardized standards

Many teams don’t or can’t standardize their checklists. Our survey, for instance, showed that 66% of respondents cited a lack of consistency in how standards are defined across teams within their own company as their number one issue for ensuring production readiness. 

Sometimes, this is because the PRR isn’t a priority, but just as often, it’s a result of deadlines crashing into a format that is inherently resistant to standardization. As a result, a service could meet production readiness requirements in one team and not another. 

Non-standardization creates confusion over time because the organization hasn’t really determined, much less reinforced, what it means to be production-ready.

3. Arbitrary priorities

The checklist model gives teams no native way to differentiate between items on the list. In practice, only some items are strictly required for production; many others are beneficial but not necessary. 

If you have fifty items on the checklist, for example, a service could complete 80% of them – and be declared “production ready” – while still missing the handful that actually matter. As time goes on, the scope of items to check will likely increase, too, making the checklist more and more confusing and less and less useful.

4. Resistant to iteration

As you iterate on service scalability, reliability, and availability, the checklist format makes it difficult to apply the lessons you learned. Your team might get more knowledgeable over time, but your checklist will always lag behind (especially given the manual workflow flaw above).

This resistance to iteration worsens when companies don’t clearly identify who owns what. In our survey, 36% said unclear ownership made it difficult to determine production readiness. 

The more items you add to the checklist, the more difficult it is to know which services no longer pass your standards. It then gets even more difficult to actually apply and distribute your growing knowledge across your services, improve them, and bring them in line with your newer services. 

5. Prone to inaccuracy 

When you only have a checklist, you can never really know whether all your services in production are actually production-ready. The services themselves are changing, but so are your checklists and your broader sense of readiness. At worst, you could end up proceeding with false confidence, causing errors as you introduce new services and strain old dependencies.

6. Inflexible for growth or scale

Checklists, especially given some of the flaws already noted, are inherently inflexible – a flaw that affects both growing startups and scaling enterprises.

As startups grow, build, and rapidly iterate, production readiness checklists can rapidly become out of date, and old services can fail to meet new standards. Startups can improve in one direction while letting the rest of their services fall behind (often without realizing it). 

As enterprises manage webs of services and repositories, production readiness checklists can fail to accommodate the number of services — all changing — that need to be regularly checked. Tech debt can pile up in many places, and much of it will go unseen until errors in production force it into view. 

7. Binary, black-and-white model

The most fundamental flaw is that, conceptually, production readiness is defined in binary terms: ready or not ready, on or off, in production or out. 

Checklists are done once, even though the criteria for production readiness can change over time and the standards that determine whether a service passes can change, too. Services that are currently in production might not change even as the standards around them do, meaning services already in production can end up not being production-ready anymore. 

But how can you assess these issues, especially at a global level, when readiness is determined once and rarely revisited? 

Toward continuous readiness

All of the flaws above are downstream of the last flaw, the most fundamental one: If production readiness is treated as a black-and-white box to be checked or not checked, we’ll always be lagging behind our services and chasing our most ambitious quality and scalability goals. 

Here, I want to propose a new framework: Continuous readiness. This framework, like continuous deployment and integration, takes the continuous nature of software as its first principle. If we start by assuming continuity instead of ready or not ready, the whole concept of “readiness” shifts.

A continuous readiness system can check a new or old service against previous criteria and new criteria. When a team spins up a new service, they check it against standards they know are up to date, and as the company makes backend changes, the teams that own older services know when those services fall out of readiness (and why).
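In spirit, the core loop is small. The sketch below, with made-up rules and services, shows how re-evaluating everything against the current rule set immediately surfaces which services fell out of readiness and why:

```python
# Every service, old or new, is scored against the *current* rule set on every run,
# so a change in standards immediately shows who fell out of readiness and why.
# All rules and services here are hypothetical.
rules = {
    "ci_cd_connected": lambda s: s["ci_cd"],
    "runbook_linked": lambda s: s["runbook_url"] is not None,
    "error_budget_defined": lambda s: s["error_budget"] is not None,  # a newly added standard
}

services = [
    {"name": "checkout", "ci_cd": True, "runbook_url": "https://wiki.example/checkout", "error_budget": 0.001},
    {"name": "legacy-billing", "ci_cd": True, "runbook_url": None, "error_budget": None},
]

for svc in services:
    failures = [name for name, rule in rules.items() if not rule(svc)]
    if failures:
        print(f"{svc['name']} fell out of readiness: {', '.join(failures)}")
    else:
        print(f"{svc['name']} meets all current standards")
```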

A continuous readiness system provides a leveling system that explicitly prioritizes different criteria. At a glance, developers know which criteria are “basic” and essential and which are nice to have. Developers shift from feeling punished or monitored by a clipboard-wielding ops person to feeling confident about meeting minimum standards and motivated to make their service meet even higher goals.
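One way to picture that leveling idea, again as a hypothetical sketch rather than any particular tool's model, is a set of ordered tiers where a service's level is the highest tier for which it passes every rule:

```python
# Rules are grouped into ordered tiers; a service's level is the highest tier
# for which it (and every tier below it) passes all rules.
# Tier names, rules, and the sample service are hypothetical.
tiers = [
    ("Basic",  [("Has an owner",               lambda s: s["owner"] is not None),
                ("Deployed via CI/CD",         lambda s: s["ci_cd"])]),
    ("Silver", [("Test coverage >= 70%",       lambda s: s["coverage"] >= 0.70)]),
    ("Gold",   [("SLOs defined and monitored", lambda s: s["slos_defined"])]),
]

def readiness_level(service: dict) -> str:
    level = "Not ready"
    for tier_name, tier_rules in tiers:
        if all(rule(service) for _, rule in tier_rules):
            level = tier_name
        else:
            break  # higher tiers only count once every lower tier passes
    return level

svc = {"owner": "search-team", "ci_cd": True, "coverage": 0.82, "slos_defined": False}
print(readiness_level(svc))  # -> Silver
```

A developer can see at a glance that the service clears the essential tiers and exactly what stands between it and the next one.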

A continuous readiness system encourages developers to build and deploy services that contribute to the broader health of the ecosystem. By ensuring readiness, each service becomes a better “citizen” in a web of other services. And by letting developers see readiness from both a high level and a granular level, companies can encourage developers to maintain a holistic, systems-first perspective.

A continuous readiness system offers engineering managers a high-level way to monitor pain points and hot spots. If there are particular criteria numerous teams aren’t following, they can investigate why and potentially make it easier for those teams to meet that standard. Engineering managers can then propose systems-level solutions to systems-level problems, making developers more efficient across the board. 
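The reporting side can be equally simple. Assuming evaluation results have already been collected somewhere, a hot-spot report is little more than an aggregation (the services and rule names here are invented):

```python
from collections import Counter

# (service, failed_rule) pairs, as produced by a readiness evaluation run;
# the data here is invented for illustration.
failures = [
    ("checkout", "runbook_linked"),
    ("search", "runbook_linked"),
    ("billing", "runbook_linked"),
    ("search", "error_budget_defined"),
]

hot_spots = Counter(rule for _, rule in failures)
for rule, count in hot_spots.most_common():
    print(f"{rule}: failing in {count} services")

# If one rule dominates, the fix is probably systemic (e.g., provide a runbook
# template) rather than a matter of chasing individual teams.
```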

A continuous readiness system provides continuous risk mitigation. PRRs are already useful, so by automating, scaling, and improving the underlying idea, development teams can better reinforce best practices and reduce the risk of security and availability issues. As new risks emerge and services age, the monitoring remains effective. 

Software is never ready, and we need a framework that reflects that. 

This is why I founded Cortex

Chasing people down, spreadsheet in hand, to harangue them on this standard or that was a real pain point in my career. I could tell that it was painful for the developers, too. Eventually, I realized it wasn’t either party’s problem or a problem of incentives. The problem was the checklist between us. 

One of the original reasons we started Cortex was the idea of producing a true continuous production readiness system. Now, companies like Adobe, Unity, and Docker use scorecards, which are animated by this philosophy, to understand and operate their services and to set standards for service quality that they can reinforce and iterate on continuously. 

With the Cortex IDP, teams can create scorecards with a wide range of rule types: binary rules, such as checking that a service is connected to CI/CD, and target-based rules, such as meeting a code coverage threshold. They can define deadlines, integrate push alerts, and set up exemption lists. Production readiness as continuous readiness is a founding design principle, and it shows throughout. 

If you want to learn more about production readiness as it stands today, check out The 2024 State of Software Production Readiness. And if you’ve ever been chased down or have done the chasing, book a demo and see how this new way of thinking plays out in action. 
