Novice |
- "Preventing future incidents is difficult, because we don't have enough data and/or resources."
- "We can use predictive metrics to completely prevent classes of incidents in the future."
|
- Focus of prevention efforts for future incidents is on documentation, process design, and metric collection
- The preventative focus in retrospective exercises is on specific static contributory causes, identified by the actors
|
Beginner |
- "Our system is reasonably covered with metrics, our standard operating processes for day-to-day operations are documented, and that documentation is current. We value our operators' interactions with these aspects of the system."
- "We use metrics to generally inform our sense of attack/risk surface."
|
- Documentation, processes, and metrics are established to a generally accepted level, which is less than total (100%) coverage of the system
- The preventative focus in retrospective exercises incorporates the actions of direct actors in the system, in addition to static identified contributory causes
|
Competent |
- "That old documentation has been deprecated and clearly marked as such, or has been destroyed. Metrics that are not useful have been removed from the system or hidden."
- "We've got statistically-validated, automated trend analysis hooked into our metrics, so possible 'soft' problems are raised to the operators for further inspection."
|
- Established documentation, processes, and metrics are generally employed and consulted in the normal operation of the system
- Preventative focus is on reviewing documentation, processes, and metric collection, both in the context of retrospective analysis and in day-to-day operations. Work to review these artifacts of the system and keep them up to date is prioritized
- Focus in retrospective exercises is on the response of the team to an incident
|
Proficient |
- "When we started game days/hack events/failure exercises, it was a real mess. But the team has improved over time. We've still got more improvement to work on, but these 'failure games' have been directly applicable to production failure situations."
- "Over time, we've become less focused on the specifics of individual operational incidents and more focused on how our team creates a crew to address the incident."
|
- We actively induce failure in our systems on a known schedule to drill our team's response abilities, patterns, and practices
- We review our responses to our induced failure and work to update our operational documentation, standard operating processes, and metrics
|
Advanced |
- "When we conduct game days now, the team is excited about testing its ability to respond to the unexpected."
- "There is a sense of camaraderie among the team, as they come together to form a crew to address a production incident. This exists even when it's an induced failure as part of a drill."
- "Our crews really care about their formation and dissolution, ensuring that they put their ‘gear’ back for the next crew that will use it."
|
- We actively induce failure in our systems on a random schedule to drill our team's response abilities, patterns, and practices
- We review our responses to our induced failure, amplifying behaviors and patterns that lead to positive outcomes and working to dampen those that lead to negative outcomes
- When forming crews to respond to incidents, the crew does not consider its work to be completed until all the tools are returned, reset, and available for the next crew that will form. This includes updates/changes to processes and documentation and general hygiene of the operational environment
- This process is considered our primary role and responsibility in addressing and remediating operational failure
|
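The Advanced level's practice of inducing failure "on a random schedule" could be sketched roughly as follows. This is only an illustration under stated assumptions, not a method prescribed by the model: the function name `random_drill_schedule`, the `mean_gap_days` parameter, and the choice of exponentially distributed gaps between drills are all hypothetical.

```python
import random
from datetime import datetime, timedelta


def random_drill_schedule(start, end, mean_gap_days, rng=None):
    """Return drill times between start and end, separated by random gaps.

    Hypothetical helper: gaps are drawn from an exponential distribution
    (an assumption), so the date of one drill says little about the next.
    """
    if mean_gap_days <= 0:
        raise ValueError("mean_gap_days must be positive")
    if rng is None:
        rng = random.Random()

    drills = []
    current = start
    while True:
        # expovariate takes a rate; 1 / mean gives the desired average gap.
        gap_days = rng.expovariate(1.0 / mean_gap_days)
        current = current + timedelta(days=gap_days)
        if current >= end:
            break
        drills.append(current)
    return drills


if __name__ == "__main__":
    # Example dates and seed are arbitrary; the seed only makes the run repeatable.
    schedule = random_drill_schedule(
        datetime(2024, 1, 1),
        datetime(2024, 7, 1),
        mean_gap_days=21,
        rng=random.Random(7),
    )
    for when in schedule:
        print(when.strftime("%Y-%m-%d %H:%M"))
```

A real schedule would likely add constraints this sketch leaves out, such as restricting drills to staffed hours or skipping periods when a production incident is already under way.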