A DevOps horror story

Computer says no

After watching the graphs for a few minutes, Morgan noticed that the restart didn’t seem to have fixed the issue. “Oh well,” they thought, “it was worth trying. So now what do I do?”

The next step in the checklist said “If the issue is still unresolved, click this button to fail over to the backup service, and continue following this checklist to troubleshoot.” Luckily, Morgan’s team had a Plan B for this service: a third-party equivalent which could take over temporarily and make sure requests still went through, albeit not as fast as with the primary service.

Morgan clicked the failover button, their confidence growing with each step of the runbook they had successfully followed. After confirming from the monitoring dashboard that the backup service was starting to clear the backlog of queued requests, they moved on to the next part of the checklist.

This advised them to read the log messages from the failing service (automatically captured in a previous step, and now available as part of the incident report document). It listed a few examples of error messages they might see, with suggestions for what to do in each case.

“Hmm, this says I should look for a message indicating that a new version of the service was deployed recently,” Morgan thought. “Let’s see… wait, here it is: ‘Deploying update… all instances successfully restarted… Version XXX running.’ So there was a fresh deploy about five minutes before the first errors started coming in. Very suspicious!”

The runbook advised Morgan that, in this situation, the first thing to try was to click the button to automatically roll back to the previously deployed version. “Oh, very useful!” thought Morgan. “This saves me from having to check out the code repo, figure out how to find the last known good version, and deploy it from the command line or via the CI/CD platform. Great. Click!”

It took a few minutes for the deploy to complete, while Morgan anxiously refreshed the service dashboard, but one by one, red lights started to turn green, and the graphs started returning to normal. “This is looking good!” Morgan thought. “What do I need to do now?”

That wasn’t too bad…

The checklist said that if the rollback fixed the problem, they should first click a button to switch traffic back to the primary service, cutting out the backup, and then add their own notes to the incident report explaining what they did and what happened. The report would go automatically to Morgan’s team lead for review the next day, and then to the whole team to discuss in their weekly meeting. The other team responsible for the process-payments service would be automatically notified, too, and a ticket would be opened for them to identify what went wrong and fix it; in the meantime, new deploys would be blocked, to prevent someone inadvertently putting the buggy service back into production.

With this done, and the incident officially closed, Morgan could finally relax. “That wasn’t anywhere near as scary as I’d thought it might be,” they mused. “The runbook took a lot of the fear out of the process, and it gave me something to focus on and a clear sequence of actions to take. Even better, most of the actions were automated, so that all I had to do was trigger them. I suppose every time something like this happened in the past, the team learned from it, updated the runbook with information about it, and built a little more automation to help fix the issue. That’s a very enlightened way to run your web operations, now I come to think of it.”
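The “checklist with buttons” Morgan imagines is simpler than it sounds. Here’s a minimal sketch in Go of one way it might work, treating the runbook as data that pairs each instruction with an optional automated action. Everything in it is hypothetical, invented for illustration rather than taken from any real tool.

```go
// A minimal, hypothetical sketch of a runbook as data: each step pairs
// a human-readable instruction with an optional automated action that
// the on-call engineer can trigger deliberately.
package main

import "fmt"

// Step is one entry in the runbook checklist.
type Step struct {
	Instruction string
	Action      func() error // nil means the step is manual
}

func main() {
	runbook := []Step{
		{"Restart the service and watch the graphs", func() error {
			fmt.Println("restarting service...")
			return nil
		}},
		{"If unresolved, fail over to the backup service", func() error {
			fmt.Println("switching traffic to the backup...")
			return nil
		}},
		{"Check the captured logs for a recent deploy", nil},
		{"If a deploy preceded the errors, roll back", func() error {
			fmt.Println("rolling back to the last known good version...")
			return nil
		}},
	}

	for i, step := range runbook {
		fmt.Printf("Step %d: %s\n", i+1, step.Instruction)
		if step.Action == nil {
			continue // manual step; nothing to trigger
		}
		if err := step.Action(); err != nil {
			fmt.Printf("automation failed: %v. Escalate!\n", err)
			return
		}
	}
}
```

In real life, each Action would call out to your deploy tooling or cloud APIs; the buttons on Morgan’s dashboard are just a friendlier front end to the same idea, with a little more automation added after every incident.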

The terror

Suddenly a harsh beeping jolted Morgan out of their reverie.

“Service process-payments is CRITICAL.”

With mounting horror, Morgan realized that there was no runbook link attached. It had all been a dream… and now Morgan’s true night of terror was about to begin.

