Why we fail to fail better
Blameless postmortems and prerequisites for effective mistake response
"But the slothfulness that wasted and the arrogance that slew,
Shall we leave it unabated in its place?"
-- Rudyard Kipling, "Mesopotamia"
A dear friend of mine, Paul Cantrell, recently mused on Twitter about the institutional failures of our generation and their possible roots. Paul posits that we didn’t do enough to hold people accountable for the big failures of the GW Bush years, like the Iraq war and the botched response to Hurricane Katrina, and maybe that led to a wider belief among institutional leaders that they could screw up with impunity; and he asks what more positive examples we might learn from to make failures less common.
I have one positive example I’ve seen a lot and learned a lot from: the practice of blameless postmortems that Google made famous. I think by examining that practice and what makes it work, we can learn some lessons about recent institutional failures. My main claims are:
The conditions needed for blameless postmortems to succeed are also necessary for other forms of effective mistake response.
Thus, general decline in the availability of those conditions may explain recent poor institutional performance.
What blameless postmortems mean to me
Google products run on extremely complicated infrastructure that can fail in all sorts of ways. Accidental misconfigurations, cascading server failures, unexpected traffic spikes, design flaws that manifest at just the wrong time, even the occasional nation-state attack: all can disrupt service and cost customers and Google huge amounts of time and money. Yet these failures remain remarkably rare and typically get rarer over time as products mature, though human folly is not much[1] less common at Google than anywhere else. Why is that?
Part of the reason is what happens when an outage does occur. Once the immediate danger is resolved, the people involved in the outage are supposed to write a postmortem document that lays out clearly:
How the outage manifested, e.g. what services failed for what time period
What the overall impact of the outage was
What the root cause(s) was/were, if known
The timeline of actions and events related to the outage
What went well, what didn’t go well, and where luck played a role
Actionable steps to repair the proximate causes of the outage and to prevent or mitigate future such outages
The postmortem doc then gets a round of review and comment, sometimes asynchronously and sometimes as part of a team meeting, and the relevant engineering leaders are supposed to make sure that the action items get done.
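To make the shape of such a document concrete, here is a minimal sketch of those sections as a small Python data structure. The class and field names are mine, chosen only for illustration; this is not Google’s actual template or tooling, just one way the pieces could be organized.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActionItem:
    description: str    # the systemic repair or mitigation to make
    owner: str          # who is responsible for landing it
    done: bool = False  # leaders are supposed to track these to completion


@dataclass
class Postmortem:
    title: str
    impact: str                           # what failed, for whom, for how long
    root_causes: list[str]                # if known; may be refined during review
    timeline: list[tuple[datetime, str]]  # (timestamp, action or event)
    went_well: list[str]
    went_poorly: list[str]
    where_we_got_lucky: list[str]
    action_items: list[ActionItem] = field(default_factory=list)

    def open_action_items(self) -> list[ActionItem]:
        """The follow-ups that engineering leaders are meant to see through."""
        return [item for item in self.action_items if not item.done]
```

Modeling each action item with an owner and a completion flag mirrors the point above: the document only pays off if someone can later ask which repairs are still open and chase them down.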
If someone’s mistake was a proximate cause of the outage, this is noted in the postmortem but is almost never cause to blame or discipline that person. The focus is on what features of the system caused that mistake to result in an outage, and what actionable systemic repairs will prevent future such mistakes from doing the same— because if someone made that mistake once, future people will likely make it again, and blaming and punishing them won’t fix that. And crucially, not blaming or punishing the mistake-maker gives them the proper incentive to do the right thing: namely, to own up to the mistake and participate constructively in the postmortem. Their insight about why they did what they did, and what would have helped them to do otherwise, is often the key to formulating the right action items.
Application: the healthcare.gov rescue
This postmortem process can and does fail in all sorts of ways: the root causes can be murky or misdiagnosed; the action items can be too vague to implement, or can get dropped for lack of attention or interest. But it works often enough to make a huge difference to organizational effectiveness. And it’s been shown to work at least once outside Google, in one of the great near-disaster turnaround stories of the past decade.
It’s well known by now that the initial botching of the healthcare.gov rollout was a significant threat to the success of Obamacare. It’s also well known that a team of experienced techies led by Mikey Dickerson, then of Google, took leaves from their jobs to parachute in, work absurdly long hours for months, and save the day. But the story of what worked and why in the rescue should be better known.
I had the privilege to know many of the people involved in the rescue and to hear Dickerson talk about it at some length soon afterward. One factor they emphasized was that they had to stop the blame cycle to even figure out why the site was failing so badly. The contractors who had been hired to roll it out were so afraid of losing their jobs if they were blamed for the failure that they wouldn’t talk to each other, or initially to the rescue team, about it honestly. Only once Dickerson made clear to everybody that they were going to use the blameless postmortem process, and that nobody who honestly admitted mistakes and helped fix them would get fired, could they find out enough about what was going on to start fixing it.
Key prerequisites
I want to emphasize that my claim is not “every institution should use blameless postmortems for all failures, then everything will be as reliable as Google services.” Just the opposite: I think successful blameless postmortems require a bunch of underlying conditions which we could usually take for granted at Google, but which usually don’t hold elsewhere. And examining those can help us understand why modern institutions so often fail.
First, though, it’s also important to emphasize a strength of the postmortem process: it is resilient to variations in people quality; that is, it works reasonably well even when the people executing the process have a range of ordinary human flaws and limitations. It sounds almost tautological to say that this is a requirement for any good human coordination process, but there are plenty of human coordination processes that are not like this, including ones that seem like they should work well. It’s a central article of faith in the American civic religion that the Founders designed the Constitution to have this resilience, and this is why our system of government is so great and the Founders were geniuses. That claim is far from entirely true, but there is a great truth in its animating premise: designing processes to be resilient to people quality is hard and important!
Supermajority shared values
So what isn’t a blameless postmortem process resilient to?
Well, it assumes the people involved agree that
outages are bad
making systemic improvements to reduce future outages is important
efficiently and effectively reducing future outages is more important than punishing people who seem blameworthy
An occasional fringe dissenter from these premises won’t break the process. But a large and passionate bloc of dissenters definitely will, even if they are not the majority. Partly this is because there is no voting rule in postmortem review; instead, the authors must forge a consensus general enough that a vote isn’t needed. But adding a voting rule wouldn’t fix this failure mode, because dissenters can sabotage effective followup even if they lose the vote.
The lack of an analogous values consensus is common in political institutions especially, and has gotten more common in my adult lifetime as we’ve gotten more polarized. One of Paul’s saddest examples of recent political failures illustrates this point: we do not have a supermajority consensus that either torturing suspected terrorists, or forcibly separating unauthorized immigrant children from their parents, is actually that bad or that important to fix. Without that, we’re neither going to get individual accountability for the perpetrators nor systemic legal change to prevent such abuses, regardless of institutional mechanisms.
Supermajority shared facts
Just as effective blameless postmortems require consensus values, they also require consensus facts. Postmortem review won’t work well if a large enough, loud enough bloc— again, it doesn’t need to be a majority— insists that the outage which you thought was caused by a cascading server failure was actually due to cosmic ray beams fired at the servers by lizard people. Or that, rather than implementing better load balancing protocols, we should douse the servers with hydroxychloroquine.
The lack of consensus facts in our time, and the urgency of fixing that, is a theme thoroughly explored by myriad pundits, myself included. In retrospect this may be the way in which Google’s culture, as I was lucky enough to experience it, diverged most from the general culture of the 21st century West. Googlers often disagreed with experts about what should be done, but rarely distrusted relevant expert statements about the factual state of things. One reason for this was that Google experts did not generally expect that their pronouncements would be accepted without question. It was culturally OK, even when it got annoying, for a random junior engineer to argue with Jeff Dean or Rob Pike or even Vint Cerf and expect their arguments to be taken seriously and answered thoughtfully. There were several exceptions to this which proved the rule: times when experts— or even worse, executives who thought they were experts— did expect to just be taken at their word were among the worst, most divisive and damaging episodes of my time at Google. Which brings me to…
Mutual respect
For an effective postmortem, you need to be able to safely assume that:
The authors of the postmortem will lay out the evidence for their beliefs (about root causes and effective next actions) forthrightly, and also take proper account of any evidence that goes against those beliefs.
If they don’t, reviewers and commenters are entitled to point out where they haven’t done so, and expect the authors to respond constructively.
Everyone involved will keep an open mind, assume good faith, and not stoop to insults, point-scoring, whataboutery, etc.
Of course, even well-intended people will fall short of these norms sometimes, and that’s OK. But once again, if even a large passionate minority rejects these norms altogether, the whole thing falls apart.
In its efforts to maintain these norms, Google had the great advantage of being able to discipline and ultimately exclude those who repeatedly violated them. They didn’t always use that ability optimally— there were both false positives and false negatives— and things got worse as, well, things in the larger society got worse; but just knowing the option was there improved the standard of discourse greatly. Broader public institutions have it much harder: they cannot dissolve the people and elect another.
Punishing coverups and repeat offenders
While making a mistake that led to an outage would typically not get you blamed or punished at Google, other types of outage-related misconduct would. The most common types were:
Trying to hide or deflect attention from the role of your mistake.
Willfully learning nothing from your mistake and going on to do the exact same thing again.
Herein lies the wisdom of “it’s not the crime, it’s the coverup”. Blamelessness for people who own their mistakes and contribute to the learning process is the carrot; punishment for those who don’t is the stick. You need both to get the incentive structure right. Too often, in the larger world, you don’t have either.
Going beyond emergencies
A final ingredient in the unusual effectiveness of Google postmortems was their routine use even when the outage was not disastrous, or even when it was just a near-miss and not a “real” outage at all. Obviously there were unusually high-effort postmortems after the biggest outages, and execs would pay particular attention to those and declare special “code yellow/code red” projects to address the action items. And sometimes these would find that the problem had been foretold by earlier non-emergency postmortems whose action items were neglected. Still, much of the day-to-day reliability improvement that made Google products so solid was accomplished quietly by ordinary engineering teams responding to non-disaster postmortems.
I end with this because when I try to think of recent institutional successes worth learning from, the examples are so skewed toward emergency response, and prevention and resilience so consistently get short shrift. Operation Warp Speed brought us lifesaving COVID vaccines in record time— but broader pandemic preparedness measures still can’t get even modest funding increases. Modern wildfire fighting does an amazing job protecting lives, homes, and towns from even extraordinary megafires— but the forest management tactics that could prevent megafires get bogged down in turf wars and regulatory sclerosis. Problems we know will recur in our lifetimes, and that we know can be mitigated by modest upfront effort, don’t get the activation energy to make that effort happen.
Has the cultural myopia behind this gotten worse in this century, or have its costs just risen disproportionately? It’s very hard to tell. There are plenty of plausible theories out there about short-term-centric incentives, loss of faith in a better future, etc. Surely part of the progress studies program ought to be devoted to figuring out how to fix this. In the meantime, if anybody knows of any more countercultural organizations like Guarding Against Pandemics, please link to them in the comments!
[1] People often assume Google is so reliable because it selects for “the best people,” and if only we selected for “the best people” in other institutions they would perform reliably too. I do think Googlers are typically smarter and better-intentioned than average and that this explains some of Google’s high performance. Selection effects do matter, and e.g. electing better people to important public offices would make a significant difference! But the Fundamental Attribution Error leads us to overestimate both how big the people-quality difference is and how much it matters. And it leads us to undervalue systems and processes that get better results from ordinary, average-quality humans.