Wheel of Misfortune
Identify Operational Risks with a Wheel of Misfortune
For several months, we have been working on migrating a key service to a new Kubernetes platform. Whilst we have done security assessments and implemented protective monitoring, as we near the switchover date, there are a couple of things that keep me up at night:
- We don’t have any live, operational experience on the platform. This will be the first service on the new infrastructure.
- The service currently has a 99.5% SLA and serves 0.2 requests per second. We will need to maintain both after the migration.
We needed a way to build our confidence in the new platform by identifying operational risks and implementing preventive measures to achieve sufficient reliability for the new service.
To do this, we used a new approach to brainstorm potential risks to our goal of 99.5% availability. Then we added some extra details to help us prioritise which risks to focus on.
We ended up with a table of reliability risks like this one below.
What do the headings mean?
Estimated time to detection (ETTD) – How long it takes to discover an incident.
Estimated time to recovery (ETTR) – How long it takes to fix an incident once it has been discovered.
Percentage of users affected – Does the incident affect all your users or a subset? A subset might map to a specific use case or user journey.
Estimated time between failures (ETBF) – How often this incident happens; how likely the risk is to occur.
Unit – To keep things simple, we used High/Low to measure each risk, but you can use something more granular, e.g. Red/Amber/Green or Seconds/Minutes.
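The headings above can be sketched as a small data structure with a crude ranking. This is a hypothetical illustration, not what we actually built: the risk descriptions, the 1/3 scoring weights, and the idea of summing them into a single priority number are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

# Illustrative weights for the High/Low units used in the session.
SCORES = {"Low": 1, "High": 3}
# A short time between failures means the incident is frequent,
# so ETBF scores in the opposite direction to the other columns.
FREQUENCY = {"Low": 3, "High": 1}

@dataclass
class Risk:
    description: str
    ettd: str            # estimated time to detection
    ettr: str            # estimated time to recovery
    users_affected: str  # percentage of users affected
    etbf: str            # estimated time between failures

    def priority(self) -> int:
        # Slow detection, slow recovery, wide impact and high
        # frequency all push the risk up the list.
        return (SCORES[self.ettd] + SCORES[self.ettr]
                + SCORES[self.users_affected] + FREQUENCY[self.etbf])

# Hypothetical example rows, not from the real session.
risks = [
    Risk("Cluster node fails during deploy", "Low", "High", "High", "Low"),
    Risk("Expired TLS certificate", "High", "Low", "High", "High"),
]
risks.sort(key=lambda r: r.priority(), reverse=True)
```

The point is not the arithmetic but the shared vocabulary: once every risk carries the same four attributes, the team can argue about rankings instead of definitions.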
Set the scene
Start by giving the team some context on what’s at stake. For us, we asked the team to imagine we have gone live on the new platform and our SLA is 99.5% – What can jeopardise that?
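To make the stakes concrete, it can help to translate the SLA into an error budget. A 99.5% availability target allows roughly 3.6 hours of downtime per 30-day month, which the back-of-the-envelope sketch below works out:

```python
# Error budget implied by a 99.5% availability SLA.
SLA = 0.995

minutes_per_month = 30 * 24 * 60               # 43,200 minutes in a 30-day month
budget_per_month = (1 - SLA) * minutes_per_month
print(f"{budget_per_month:.0f} minutes of downtime allowed per 30-day month")

hours_per_year = 365 * 24
budget_per_year = (1 - SLA) * hours_per_year
print(f"{budget_per_year:.1f} hours of downtime allowed per year")
```

Framing the question as "what could burn through 216 minutes a month?" tends to focus the brainstorming more than the abstract percentage does.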
It’s important to set some ground rules that will help the team focus otherwise you risk overwhelming the team and yourself with risks.
In our scenario, we made two assumptions:
1. We are installing quarterly maintenance updates/patches to the platform and its dependencies.
2. We have monitoring on the new platform equivalent to the existing platform's.
Ask the team to identify risks on post-it notes (or Trello). Each person should then describe the risks they’ve identified in the order of likelihood or impact.
Go through each risk as a team and discuss each heading. This is where a Delivery Manager comes in handy, keeping the discussion moving and avoiding getting bogged down. We used a simple High/Med/Low rating to get through our list fast!
It’s useful to have someone other than the facilitator transcribing the important points and ‘aha’ moments that come out of the conversation.
At this stage, it really helps if you prioritise the most important risks before discussing preventative measures.
Here are some possible options you can implement:
- An issue that takes a long time to discover can be caught earlier with monitoring and alerts.
- An issue that occurs frequently can be added to the backlog to resolve once and for all.
- An issue classed as BAU (business as usual) might be resolved with documentation.
- A critical service might require dedicated support and a classification of incidents, e.g. P1, P2.
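The options above amount to a rough decision rule: which attribute dominates the risk suggests which mitigation to reach for first. As a hypothetical sketch (the function name, the string labels, and the single-winner logic are all assumptions for illustration):

```python
# Rule-of-thumb mapping from a risk's attributes to a first mitigation,
# mirroring the options listed above. Purely illustrative.
def suggest_mitigation(ettd: str, etbf: str, bau: bool) -> str:
    if ettd == "High":
        # Slow to detect: shrink detection time with monitoring.
        return "add monitoring and alerting to catch it sooner"
    if etbf == "Low":
        # Short time between failures = frequent: fix the root cause.
        return "add a backlog item to fix the root cause"
    if bau:
        # Routine, well-understood issue: document the fix.
        return "write a runbook / documentation"
    # Otherwise, treat it as a support-process question.
    return "review support classification (e.g. P1/P2)"
```

In practice no single attribute wins cleanly, which is why the discussion matters more than any formula.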
Note: this isn’t an exact science. The key outcome of this session is to get a shared understanding of the risks and their impact.
For my team, this helped us determine the important risks and plan the rest of our reliability work for the platform.
Finally, a few tips for running the session:
- Take breaks
- Keep it light
- Split into multiple sessions
- Continuously prioritise