Blameless Post-Mortems
How do we handle errors and mistakes at Encoding Enhancers (EE)?
Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you labor over, the code you write and review, or the alerts and metrics you meticulously pore over.
So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that result from the actions (or, in some cases, the inaction) of individuals? What do you do with those careless humans who caused everyone to have a bad day?
Maybe they should be fired. Or maybe they need to be prevented from touching the dangerous bits again. Or maybe they need more training.
This is the conventional “human error” view, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory”: get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?
We're not taking the conventional view at EE. Instead, we want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.
Blameless Post-Mortems
What does it mean to have a ‘blameless’ Post-Mortem? Does it mean everyone gets off the hook for making mistakes? No. Well, maybe. It depends on what “gets off the hook” means. Let me explain.
Having a Just Culture means that you strive to balance safety and accountability. By analyzing failures in a way that reflects on the situational aspects of a failure’s mechanism and on the decision-making process of the people closest to the failure, an organization comes out safer than it would if it had simply disciplined the individuals involved as a remediation.
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
- what actions they took at what time,
- what effects they observed,
- expectations they had,
- assumptions they had made,
- and their understanding of the timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution.
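To make the shape of such an account concrete, here is a minimal sketch of how these fields might be captured. It is purely illustrative; every name in it is hypothetical, and it doesn’t describe any actual EE tooling:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class TimelineEntry:
    """One step in an engineer's first-person account of an incident."""
    when: datetime        # what time the action was taken
    action: str           # what the engineer did
    observed_effect: str  # what effects they observed afterward
    expectation: str      # what they expected would happen
    assumptions: List[str] = field(default_factory=list)  # assumptions in play

@dataclass
class BlamelessAccount:
    """A blame-free record of one person's understanding of the timeline.

    Note: no field identifies the individual; the account captures what
    happened and why it made sense at the time, not who to blame.
    """
    timeline: List[TimelineEntry] = field(default_factory=list)
    remediation_ideas: List[str] = field(default_factory=list)

# Hypothetical usage: one entry from an imagined database migration incident.
account = BlamelessAccount()
account.timeline.append(TimelineEntry(
    when=datetime(2023, 4, 1, 14, 5),
    action="Ran the schema migration against the primary database",
    observed_effect="Write latency spiked and replication-lag alerts fired",
    expectation="The migration would complete online without locking",
    assumptions=["The table was small enough to alter without a long lock"],
))
```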
Why shouldn’t they be punished or reprimanded? Because an engineer who feels they’re going to be reprimanded is disincentivized to provide the information required to understand the failure’s process, pathology, and function. This lack of knowledge of how the accident happened all but ensures that it will be repeated, if not by the original engineer, then by another one in the future. We believe that this detail is paramount to improving safety at EE. Fear of punishment doesn’t motivate people to act more safely in the future; it motivates them to stay silent.
This cycle of name/blame/shame can be looked at like this:
1. Engineer takes action and contributes to a failure or incident.
2. Engineer is punished, shamed, blamed, or retrained.
3. Trust is reduced between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat.
4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment).
5. Management becomes less aware of and less informed about how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure, due to the silence mentioned in #4, above.
6. Errors become more likely, and latent conditions can’t be identified, due to #5, above.
7. Repeat from step 1.
This is a loop we need to break. We want the engineer who has made a mistake to give specifics about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is key to understanding the anatomy of the failure. The action made sense to the individual at the time they took it; otherwise, they wouldn’t have taken it.
The fundamental principle here is something Erik Hollnagel has said:
We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.
Allowing Engineers to Own Their Own Stories
A funny thing happens when engineers make mistakes and feel safe when giving details about them: they are not only willing to be held accountable, they are also enthusiastic about helping the rest of the business prevent the same error in the future. They are, after all, the people most expert in their own mistakes, so they ought to be heavily involved in coming up with remediation items.
So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping EE become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.
So what do we do to enable a “Just Culture” at Encoding Enhancers?
- We encourage learning by having these blameless Post-Mortems on outages and accidents.
- The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
- We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
- Instead of punishing engineers, we give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
- We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization on how not to make them in the future.
- We accept that there is always a discretionary space in which humans can decide to take action or not, and that the judgement of those decisions lies in hindsight.
- We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
- We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
- We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine it’s getting done, via Gantt charts and procedures) on the sharp end.
- The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.
Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.
One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.
That’s why we have blameless Post-Mortems at Encoding Enhancers, and why we’re looking to create a Just Culture here.