Wheel of Misfortune


November 2, 2019

Identify Operational Risks with a Wheel of Misfortune

For several months, we have been working on migrating a key service to a new Kubernetes platform. Whilst we have done security assessments and implemented protective monitoring, as we near the switchover date, there are a couple of things that keep me up at night:

We don’t have any live, operational experience on the platform. This will be the first service on the new infrastructure.
The service currently has a 99.5% SLA and serves 0.2 requests per second. We will need to maintain both of these.

We needed a way to build our confidence in the new platform by identifying operational risks and implementing preventive measures to achieve sufficient reliability for the new service.

To do this, we used a new approach to brainstorm potential risks to our goal of 99.5% availability. Then we added some extra details to help us prioritise which risks to focus on.
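To make the 99.5% goal concrete, it can help to express it as a yearly error budget. The sketch below is only an illustration: the 99.5% target and the 0.2 requests per second come from our service, but treating the SLA as time-based over a 365-day year with steady traffic is an assumption, and the function names are made up for this example.

```python
# Sketch: translate an availability target into a yearly error budget.
# Assumptions: a 365-day year, a time-based SLA, and steady traffic.

HOURS_PER_YEAR = 365 * 24
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600


def downtime_budget_hours(availability_pct: float) -> float:
    """Hours per year the service may be unavailable and still meet the target."""
    return HOURS_PER_YEAR * (100 - availability_pct) / 100


def failed_request_budget(availability_pct: float, requests_per_second: float) -> float:
    """Requests per year that may fail, assuming steady traffic."""
    total_requests = requests_per_second * SECONDS_PER_YEAR
    return total_requests * (100 - availability_pct) / 100


budget_hours = downtime_budget_hours(99.5)
budget_requests = failed_request_budget(99.5, 0.2)
print(f"Downtime budget: {budget_hours:.1f} hours/year")
print(f"Failed-request budget: {budget_requests:.0f} requests/year")
```

Under those assumptions, 99.5% leaves roughly 43.8 hours of downtime, or about 31,500 failed requests, per year. Every risk the team identifies eats into that budget.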

We ended up with a table of reliability risks like this one below.

What do the headings mean?

Estimated time to detection (ETTD) – How long does it take to discover an incident?

Estimated time to recovery (ETTR) – How long does it take to fix the incident once you have discovered it?

Percentage of users affected – Does it affect all your users or a subset? Maybe users belong to a specific use case or user journey.

Estimated time between failures (ETBF) – How often does this incident happen? How likely is this risk to occur?

Unit – To keep things simple, we used High/Low to measure each risk, but you can use something more granular, e.g. Red/Amber/Green or Seconds/Minutes.
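If you do go more granular and put numbers against each heading, the four values can be combined into a rough expected cost per risk. This is not what we did in our session (we stuck to High/Low); it is a minimal sketch of one way to do it. The `Risk` class, the formula, and the three example risks with their numbers are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    ettd_minutes: float    # estimated time to detection
    ettr_minutes: float    # estimated time to recovery
    users_affected: float  # fraction of users affected, 0.0 to 1.0
    etbf_days: float       # estimated time between failures

    def bad_minutes_per_year(self) -> float:
        """Expected user-weighted downtime per year for this risk."""
        incidents_per_year = 365 / self.etbf_days
        outage_minutes = self.ettd_minutes + self.ettr_minutes
        return outage_minutes * self.users_affected * incidents_per_year


# Hypothetical risks with made-up numbers, for illustration only.
risks = [
    Risk("Bad deployment", ettd_minutes=10, ettr_minutes=30, users_affected=1.0, etbf_days=90),
    Risk("Expired certificate", ettd_minutes=60, ettr_minutes=120, users_affected=1.0, etbf_days=365),
    Risk("Node failure", ettd_minutes=5, ettr_minutes=15, users_affected=0.25, etbf_days=30),
]

# Yearly budget for a 99.5% availability target (assumes a time-based SLA).
budget_minutes = MINUTES_PER_YEAR * 0.005

# Highest expected cost first, so the team knows where to focus.
ranked = sorted(risks, key=lambda r: r.bad_minutes_per_year(), reverse=True)
for risk in ranked:
    cost = risk.bad_minutes_per_year()
    print(f"{risk.name}: {cost:.0f} min/year ({cost / budget_minutes:.0%} of budget)")
```

With these made-up numbers, the rare but slow-to-detect certificate expiry ranks above the more frequent bad deployment, which is the kind of trade-off the headings are meant to surface.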

Process

Set the scene

Start by giving the team some context on what’s at stake. We asked the team to imagine that we had gone live on the new platform with an SLA of 99.5%: what could jeopardise that?

Assumptions

It’s important to set some ground rules that will help the team focus; otherwise, you risk overwhelming the team and yourself with risks.

In our scenario, we made the following assumptions:

1. We are installing quarterly maintenance updates/patches to the platform and its dependencies.
2. We have monitoring on the new platform equivalent to what we have on the existing one.

Identify Risks

Ask the team to identify risks on post-it notes (or Trello). Each person should then describe the risks they’ve identified in order of likelihood or impact.

Go through each risk as a team and discuss each heading. This is where a Delivery Manager comes in handy to keep the discussion moving and avoid getting bogged down. We used a simple High/Med/Low rating to go through our list fast!

It’s useful to have someone other than the facilitator transcribing the important points and ‘aha’ moments that come out of the conversations.

Preventative Measures

At this stage, it really helps if you prioritise the most important risks before discussing preventative measures.

Here are some possible options you can implement:

An issue that takes a long time to discover can be caught earlier with monitoring and alerts.
An issue that occurs frequently can be added to the backlog to resolve once and for all.
An issue classed as BAU (business as usual) might be resolved with documentation.
A critical service might require dedicated support and classification of risks, e.g. P1, P2.
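The options above can be read as simple rules keyed on the ratings from the session. The sketch below encodes them as such; the function name, the `is_bau` and `is_critical` flags, and the exact wording of each measure are assumptions made for this example, not something we built.

```python
def suggest_measures(ettd: str, etbf: str,
                     is_bau: bool = False, is_critical: bool = False) -> list[str]:
    """Map High/Low ratings to the preventative measures listed above."""
    measures = []
    if ettd == "High":   # takes a long time to discover
        measures.append("Add monitoring and alerts")
    if etbf == "Low":    # short time between failures, so it happens often
        measures.append("Add a permanent fix to the backlog")
    if is_bau:           # routine, business-as-usual issue
        measures.append("Write documentation")
    if is_critical:      # critical service
        measures.append("Arrange dedicated support and classify risks (P1, P2)")
    return measures


# A risk that is slow to detect and happens often gets two suggestions.
print(suggest_measures(ettd="High", etbf="Low"))
```

Treat the output as a prompt for discussion rather than a decision; the team still needs to agree which measures are worth the effort.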

Note: this isn’t an exact science. The key outcome of this session is to get a shared understanding of the risks and their impact.

For my team, this helped us determine the important risks and plan the rest of our reliability work for the platform.

Tips

Take breaks
Keep it light
Split into multiple sessions
Continuously prioritise

P.S. Growing up, I was more of a Supermarket Sweep fan than a Wheel of Fortune one!

Rumman Amin
