I gave a presentation to the Seattle DevOps Meetup group on September 30th on The Troubleshooting Chart. While the talk was mostly well received, the topic of Blame proved unexpectedly controversial. "Blame" is negative. I carefully chose "Blame" because nothing else captures the human activity and emotion so succinctly. I was hoping to be provocative. But I was surprised by how much the emotional baggage of saying "Blame" overpowered everything I had to say after that.
My approach failed, but I would like to try laying out my thoughts on what Troubleshooters need to know about blame:
- Blame is inevitable when troubleshooting software issues
- Troubleshooting Engineers need to be conscious of blame
- Data-driven methods such as the Troubleshooting Chart can help
Inevitability of Blame in Software Troubleshooting
If a piece of hardware fails, the members of your team are not likely to feel personally responsible for the failure. It's the hardware's own fault. Or the vendor's fault. Later, we might assign the fault to company penny-pinching, testing, or whatever, but we don't need to determine that to get a fix.
Software is different because a person you know made the change. In the desperate search for a critical software fix, there is no way around finding the team member who made the breaking change. We need his or her help to understand the change and its unexpected impact, and he or she will most likely be the best qualified to make a fix.
We don't want to heap blame on Developers. We are actively trying not to. But it isn't entirely up to us. Blame develops naturally in the minds of Engineers and their managers as a result of the association with the issue. Everyone knows the system is on fire because of something Joe checked in, and he has to fix it ASAP.
We are never going to announce "Joe broke the system!" But we are going to be interrupting his regularly scheduled work to talk with him about the issue. Joe, the members of his team, and anyone within line of sight and hearing, will be aware that Joe is associated with the critical issue. Joe's association with the issue may taint him with suspicion. This suspicion may grow exponentially if Joe is involved with critical issues often. Wishing the blame away doesn't stop this natural process.
A proper investigation might reveal that it isn't Joe's fault. To the contrary, perhaps he is heroically fixing problems fundamentally created by other engineers, partners, managers, etc. Perhaps Ops goes to him first because he has proven his ability to turn around a fix faster than anyone else. Perhaps Joe works on the most complicated and valuable part of the system. Or maybe Joe is a bad engineer who screwed up. Eventually, all of these theories might get resolved in a postmortem or retrospective analysis.
But the system is on fire now. Customers are complaining now. We need a fix now. A responsible and deliberate process improvement investigation is not going to be conducted on the way to a fix. That's for later, much later. And it may not involve you.
Why Troubleshooters Should Care About Blame
You need to enlist the Developer's immediate help to get a fix. You will not be able to prevent the association between certain Developers ("Joe") and the issue. And you cannot fully control how that association will be interpreted by Joe, his team, and other Engineers. Being aware of the association may give you a chance to manage how it is perceived.
You also are associated with the issue, and in the eyes of observers you may also share some guilt, shame, and blame conveyed by association. If you work enough critical issues, the dark cloud of doom will permanently follow you around. Developers will fearfully eye your approach to their desks, expecting the worst, even if you were just about to invite them to lunch. Sorry.
Stay positive. Phrase your involvement as a request for help diagnosing and fixing the issue. Don't be too quick to lay out your case for why Joe broke the system. Do not get unnecessarily involved in the details of why exactly this change was made and how it should be prevented in the future. You will need to ask enough to make sure the fix is right, but don't dwell on who-did-what.
Always be vocally thankful for their timely help determining the cause and building a fix.
Remember that there is a future despite the apparent urgency of the current issue. You need to protect your credibility and your personal relationships with Developers to be both effective and happy long-term.
How Data Can Help Manage Blame
I mentioned above that you should phrase your interruption of a Developer as a request for help diagnosing and fixing an issue. Bringing some data and timeline evidence helps make this constructive and credible. Of course, I recommend a Troubleshooting Chart, but you might have other artifacts like error messages, customer observations, etc.
The key here is to let the Developers arrive at "blame" by themselves, through their own interpretation of the evidence. Don't pull the trigger. If their change caused the incident, they will almost always realize this when presented with the evidence. If they are innocent with respect to this incident, they can describe their innocence in the context of the data, which will be very helpful to your investigation.
Beyond the data and chart artifacts, going through the process of gathering evidence shows consideration for Developers by not throwing accusations around irresponsibly. It respects their time by presenting them with a summary of your investigation so far. It offers them a chance to heroically help toward the fix, rather than just being the cause of the break.