When things go wrong

An opportunity to impress, or cut your losses

In any business, things sometimes go wrong. You should of course try to prevent this, but in the real world sh*t happens. It's often a stressful time for the technical team and for anyone in a customer-facing role. It helps to remember that most people (at least in a B2B setting) have been on both sides of things going wrong and are likely to have some sympathy for you, even while they express frustration.

If your product/service isn't business-and-time-critical for your customers, they may not be all that bothered about a minor outage, even if it feels like a big deal to you. In any event, be proactive and fairly honest in your dealings with customers: give affected customers regular updates about mitigation measures, tell them once the incident has been resolved and thank them for their patience. Depending on the incident, you may also be able to use the fact that you've got their attention to deepen your relationship with them ... for example, by telling them you're taking the opportunity to review the code and would value their views about XYZ features.

If you can't turn an incident into an opportunity, then make an honest assessment of how much damage has been done to your reputation in the eyes of any affected customers. Consider how important (financially or in other ways) those customers are to your business, and consider putting your energy into maintaining the relationships that are still good rather than chasing after those that are irreparably damaged.

Behind the scenes, it really helps to have a clear and simple process in place for dealing with incidents (and things which might be incidents, but you're not sure yet). Don't make it too long or too complicated ... just cover what the escalation mechanism is, how and when to communicate internally and how to make decisions - including deciding when and what to say externally.

The best time to think about these things is before they happen ... ask yourself what information you'll need in order to make useful decisions, who should be responsible for gathering that information and who is responsible for making decisions as the incident unfolds. The second best time to think about these things is after an incident has just occurred - this is when things are fresh in your mind and you can learn useful lessons for next time.

Typically, a standing plan for incident response might look something like ...

  1. Customer or tech team becomes aware of potential incident and notifies x, y, z
  2. Investigate for 30 mins then huddle for 5 mins
  3. Decide on a plan for further investigation / urgent mitigation / further decisions
  4. Execute the plan and adapt as necessary
  5. Wash up <-- THIS IS IMPORTANT

During the wash-up (lessons learned) phase, think about how to prevent similar problems in the future, how to make incident response easier next time, what aspects of the incident could have been managed better and whether the process should change as a result. Be careful not to just fight yesterday's fire ... any plan needs, above all else, to be simple and flexible.
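
If it helps to keep the standing plan somewhere the team can't ignore it, the five steps above could be written down as a small checklist with the time boxes attached. This is only a rough sketch in Python, not a recommended tool: the names `Step`, `PLAN` and `schedule` are made up for illustration, the 30 and 5 minute time boxes come from the example plan above, and treating the other steps as open-ended is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One line of the standing plan (hypothetical structure)."""
    name: str
    time_box: Optional[timedelta] = None  # None = open-ended (assumption)


# The example plan from the list above; step 2 is split into its two time boxes.
PLAN = [
    Step("Notify x, y, z of the potential incident"),
    Step("Investigate", timedelta(minutes=30)),
    Step("Huddle", timedelta(minutes=5)),
    Step("Decide on a plan"),
    Step("Execute the plan and adapt as necessary"),
    Step("Wash up"),
]


def schedule(detected_at: datetime) -> List[Tuple[str, Optional[datetime]]]:
    """Return (step name, deadline) pairs; open-ended steps get no deadline."""
    clock = detected_at
    rows = []
    for step in PLAN:
        if step.time_box is None:
            rows.append((step.name, None))
        else:
            clock += step.time_box  # time-boxed steps run back to back
            rows.append((step.name, clock))
    return rows


if __name__ == "__main__":
    for name, deadline in schedule(datetime.now()):
        due = deadline.strftime("%H:%M") if deadline else "no fixed time"
        print(f"{name}: {due}")
```

The point of keeping it this small is the same as for the plan itself: if the checklist needs more than a glance to understand, it's too complicated to be useful mid-incident.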