Friday, April 30, 2010

Tips for Handling Service Outages (multiple services affected)

This is the second post related to “Tips for Handling Events, Incidents, Outages, and Maintenance”.

You’re about to have an interesting day/night. Multiple business-critical services are offline or having intermittent problems. It affects revenue, your company-wide outage processes have been started, and you’re the lucky person on the on-call roster to lead the outage.

This is a different beast than resolving a problem with a single service. Here, you’re going to have to coordinate across several services and try to get the entire system up and running as soon as possible. The biggest obstacles here are coordination, communication, and discipline. This requires lots of practice before you get good in the role.


I’ve resisted role terminology as much as possible up to this point but I’m going to introduce three roles here: the Incident Manager, the Service Manager, and the Communication Manager. There are lots of names for these roles and you can use any name as long as it’s clear that the Incident Manager is in charge of the entire incident, each Service Manager is responsible for reporting on their service, and the Communication Manager is responsible for handling communication with everyone else and pulling people in as necessary. It’s critical that everyone understand these roles and relationships or a phone call/chat session can turn into a nightmare.

The Incident Manager’s job here is to drive communication, get regular updates from the Service Managers, and look for higher level patterns across the affected services. For example, let’s say three separate services are all having connectivity problems. Each Service Manager is going to be heads down looking at their particular service. The Incident Manager should be looking for a common theme here.

Let’s say that all of the services are having network problems. An Incident Manager might start with the following questions: “Are the services having problems trying to connect to the same place? Are they having problems trying to reach each other?” A quick view of a network architecture diagram can tell him what is in common, and he can start asking the right questions or have the Communication Manager pull in the right team. In this case, let’s say all the services are connected to each other via a common VPN circuit between data centers and the VPN team isn’t on the call. The Incident Manager could then pull in the VPN team to verify the systems even if the VPN team’s monitors haven’t gone off. The point is that the Incident Manager has the bandwidth to pull people in and explore ideas while the Service Managers are busy troubleshooting their particular service. The Incident Manager is a critical part of restoring service as quickly as possible, and it’s the bandwidth to look at the big picture that makes this possible.
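The "what do these services have in common?" question above can be roughed out in a few lines of Python. This is only a sketch: the dependency map, the service names, and the `shared_dependencies` helper are all made up for illustration, and in practice the data would come from whatever architecture diagram or inventory you already trust.

```python
from collections import Counter

# Hypothetical dependency map. Every service and dependency name here is an
# assumption for illustration, not something taken from a real environment.
DEPENDENCIES = {
    "storage": {"vpn-circuit", "dns", "san"},
    "billing": {"vpn-circuit", "dns", "payments-gateway"},
    "search": {"vpn-circuit", "index-cluster"},
}


def shared_dependencies(affected, dependencies):
    """Rank dependencies by how many of the affected services rely on them."""
    counts = Counter()
    for service in affected:
        # Unknown services contribute nothing rather than raising an error.
        counts.update(dependencies.get(service, set()))
    # Keep only dependencies shared by more than one affected service,
    # most widely shared first.
    return [(dep, n) for dep, n in counts.most_common() if n > 1]


if __name__ == "__main__":
    for dep, n in shared_dependencies(["storage", "billing", "search"], DEPENDENCIES):
        print(f"{dep}: shared by {n} affected services")
```

With this made-up data the VPN circuit comes out on top because all three affected services depend on it, which is the cue to pull in the VPN team even if their monitors are quiet.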

Here are some tips for Incident Managers:
  • Get your head straight (see previous post on Tips for Handling Service Incidents)
  • Stay calm and immediately address panic. Your demeanor will affect everyone on the call. Similarly, one panicky person can result in lost time or productivity. There’s no time for panic, stress, or sniping during a call. Address any of these problems immediately.
  • Control your call. Don’t give out the phone conference number. Non-essential people shouldn’t be joining your meeting. If they feel like they can contribute, they should go through the communication manager.
  • Don’t assume anything. “Should” is a dirty word during an outage. It’s your job as Incident Manager to validate assumptions while the teams are debugging their specific systems.
  • Enforce phone discipline. If someone is calling in and has a lot of background noise, tell them to mute, move to a quieter location, or switch to a different phone if possible.
  • Get comfortable with silence. Most people want to fill the void or speculate during these calls. Let the Service Managers do their job and spend your time as Incident Manager thinking and calling in the right people.
  • Do a roll-call every five minutes. You should read through the issues that you’re tracking and ask each Service Manager for an update. For example, “Five-minute roll-call. We’re tracking two issues. Networking team. You’re a go for update. [update happens]. Thank you. Storage team. You’re a go for update…” These five-minute updates help everyone stay focused on the problems and help provide context for anyone joining the call. Framing the list of issues also helps someone correct you if you’ve accidentally forgotten about an issue.
  • Consider an ATC-like model for communication. This is mostly my preference since I’m a pilot and a nerd, but I find that enforcing communication discipline greatly helps during an outage as you don’t lose time to random conversation. Treat the phone as a precious shared resource – something that Service Managers shouldn’t hog with a long question or discussion. They should first request time from the Incident Manager. In the example below, note the control flow and that everyone positively acknowledges their assignment. There’s nothing worse than silence when you ask someone to do something. They should confirm that they understand you and are working on the task. In the following example, John is the Incident Manager and Dave is the Communication Manager:
Storage team: “John, storage team has an update”
Incident Manager: “OK, thanks. Standby. Dave, please call in the VPN team on-call engineer”
Communication Manager: “OK. I’m calling in the VPN team on-call engineer”
Incident Manager: “Thanks. Storage team. Go ahead”
Storage team: “We think there may be a network problem outside of our rack”
Incident Manager: “We’re thinking the same. We’re narrowing in on VPN. Please do a traceroute and get back to me with the results”
Storage team: “OK. I’ll call back with the traceroute”
[Incident Manager adds this to his tracking list for the next five minute update]
[Meanwhile, the communication manager is capturing relevant events and sending out updates as required]
[Later, from the storage team…]
Storage team: “Storage team has the traceroute”
Incident Manager: “Go ahead storage”
Storage team: “It looks like we’re getting blackholed with our outbound carrier. We’re working with our colo provider on their BGP routes”
Incident Manager: “Thanks. I’ll track that. Network, please confirm the traceroute blackhole”
[All this can happen in parallel before the VPN team even joins the call]
  • Carefully track your events. Pen and paper are great but a text editor and a screen-sharing program in a designated room (with Adobe Connect for example) are even better. You can post all information and updates in windows and everyone immediately sees the current status as soon as they join the room.
  • Make sure you are all referring to a single version of the truth. If you have a monitoring system and an executive dashboard, make sure everyone is looking at the monitoring system and not some fancy dashboard that was thrown together by another group.
  • Start preparing a shift rotation if it’s a multi-day event. The worst thing you can do is stay up for 48 hours. I’ve done this before and, while it seemed heroic at the time, it was just plain stupid. You want your people at their best if you’re going into extra innings.
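
If you keep your tracking list in a text editor anyway, the five-minute roll-call from the tips above is easy to script. Here is a minimal sketch in Python, assuming a simple in-memory list of issues; the `Issue` and `RollCall` names are hypothetical, and a shared text file does the same job just as well.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """One item on the Incident Manager's tracking list (hypothetical structure)."""
    team: str
    summary: str
    updates: list = field(default_factory=list)


class RollCall:
    """Keeps the tracking list and produces the script read out every five minutes."""

    def __init__(self):
        self.issues = []

    def track(self, team, summary):
        # Add a new issue so it is not forgotten at the next roll-call.
        issue = Issue(team, summary)
        self.issues.append(issue)
        return issue

    def script(self):
        # Frame the full list first so anyone on the call can spot a missing issue.
        lines = [f"Five-minute roll-call. We're tracking {len(self.issues)} issue(s)."]
        for issue in self.issues:
            lines.append(f"{issue.team} team: {issue.summary}. You're a go for update.")
        return lines


if __name__ == "__main__":
    call = RollCall()
    call.track("Networking", "confirm the traceroute blackhole")
    storage = call.track("Storage", "BGP routes with the colo provider")
    storage.updates.append("Working with the colo provider on their BGP routes")
    print("\n".join(call.script()))
```

The only real design choice is that the script always opens with the issue count, which mirrors the point about framing the list so someone can correct you if an issue has dropped off.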
