Novice

Language Used:
- "Other teams (QA, customer support) will notify us of any problems."
- "Problems with our service are obvious; outages are obvious to everyone."

Behavior Displayed:
- The service team is notified of incidents via manual, external notification mechanisms (ticketing system, phone calls, etc.)
- No baseline metrics are established; service level is described only in terms of four broad categories ("available", "unavailable", "degraded", "answering, but unavailable") [1]
Beginner

Language Used:
- "Most of the time, we're the first to know when a service has transitioned from available to unavailable (or another state)."
- "We're the first to know when a service is impacted."

Behavior Displayed:
- External monitoring is in place to detect, in real time, when a service transitions from one of the four broad buckets to another (a minimal probe of this kind is sketched below)
- The team is notified in an automated way when monitoring detects a transition between these buckets
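
To make the Beginner behaviors concrete, here is a minimal sketch of an external probe that places a service into one of the four broad buckets and notifies the team on any transition. It assumes the `requests` library and a reachable health endpoint; the URL, latency threshold, poll interval, and `notify` callback are placeholders for whatever HTTP client and paging channel your team actually uses.

```python
# Minimal external probe: classify a service into the four broad buckets
# and notify the team automatically whenever the bucket changes.
import time

import requests  # assumed available; any HTTP client would do


def classify(url, timeout=5.0, degraded_after=2.0):
    """Place a service into one of the four broad buckets, from the outside."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return "unavailable"                    # no usable answer at all
    if resp.status_code >= 500:
        return "answering, but unavailable"     # responds, but without the requested functionality
    if resp.elapsed.total_seconds() > degraded_after:
        return "degraded"                       # answers, but noticeably slower than expected
    return "available"


def watch(url, notify, interval=30.0):
    """Poll the service and notify the team automatically on any bucket transition."""
    previous = None
    while True:
        current = classify(url)
        if previous is not None and current != previous:
            notify(f"{url} transitioned from '{previous}' to '{current}'")
        previous = current
        time.sleep(interval)


# `notify` stands in for the team's automated channel (chat webhook, pager, etc.);
# printing is enough for the sketch.
# watch("https://example.com/healthz", notify=print)
```

The point is only that the service team, not an outside party, learns of the transition first, and learns of it automatically.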
Competent

Language Used:
- "We've detected a number of service level transitions via the monitoring of very new (and maybe very old) API endpoints; in all cases, MTTD was reduced."
- "We use historical data to perform manual, 'first approximation' guesses of service level changes; we're starting to communicate this information outward, potentially in ongoing discussions about SLAs."

Behavior Displayed:
- Historical data has been collected to establish broad baselines of acceptable service, enough to infer bands within the four buckets (one way to derive such bands is sketched below)
- External monitoring of infrastructure, API endpoints, and other outward-facing interfaces exists and is recorded in the (historical) monitoring system
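
One way to turn that recorded history into broad baselines is sketched below. The band names, the percentile cut-offs, and the hypothetical `load_latency_history()` accessor are assumptions for illustration; the idea is simply that historical data, not intuition, defines what "acceptable service" looks like.

```python
# Derive broad baseline bands from historical measurements.
import statistics


def baseline_bands(samples):
    """Turn a history of latency samples (seconds) into broad 'acceptable service' bands."""
    cuts = statistics.quantiles(samples, n=100)      # 99 percentile cut points
    p90, p99 = cuts[89], cuts[98]
    return {
        "normal": (0.0, p90),                 # where the service usually operates
        "elevated": (p90, p99),               # a first-approximation hint that something changed
        "out_of_band": (p99, float("inf")),   # historically rare; worth a closer look
    }


def first_approximation(observation, bands):
    """Manual-style 'first approximation' guess: which band does this observation fall in?"""
    for name, (low, high) in bands.items():
        if low <= observation < high:
            return name
    return "out_of_band"


# Example: a week of recorded response times, then a guess about a fresh observation.
# history = load_latency_history()   # hypothetical accessor for the historical monitoring data
# bands = baseline_bands(history)
# print(first_approximation(3.2, bands))
```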
Proficient

Language Used:
- "Other teams can help us monitor our own service because we've provided hooks for them to integrate within their own systems."
- "We prioritize feature requests and bug reports for these monitoring hooks within our development sprints and in our organizational support work; monitoring is a first-class citizen for our team, and takes precedence over the deployment of new features."
- "I know that a specific code/infrastructure change caused this specific change in service level; here's how I know..."

Behavior Displayed:
- Baseline data is comprehensive enough to be statistically correlated with the current code state and mapped to specific code changes
- Application internals report monitoring data to the monitoring system
- Monitoring systems make extensive use of statistical significance to provide proof (and disproof) of service anomalies (a minimal version of such a check is sketched below)
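
The statistical claim in the last bullet can be illustrated with a very small significance check: compare the metric window observed after a specific change against the established baseline, and report a shift only when it is unlikely to be noise. The z-score approach and the 3-sigma threshold below are simplifying assumptions, not a recommendation of a particular test.

```python
# Tie a service level shift to a specific change via a simple significance check.
import math
import statistics


def shift_is_significant(baseline, current, z_threshold=3.0):
    """Is the metric window observed after a change meaningfully different from the baseline?"""
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(current) / len(current)
    )
    if se == 0:
        return False, 0.0
    z = (statistics.fmean(current) - statistics.fmean(baseline)) / se
    return abs(z) >= z_threshold, z


def attribute_to_deploy(samples, deploy_time):
    """Split timestamped samples around a deploy and report whether the deploy
    lines up with a statistically significant service level change."""
    before = [value for ts, value in samples if ts < deploy_time]
    after = [value for ts, value in samples if ts >= deploy_time]
    significant, z = shift_is_significant(before, after)
    if significant:
        return f"service level shift (z={z:.1f}) coincides with the deploy at {deploy_time}"
    return "no statistically significant shift attributable to this deploy"
```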
Advanced

Language Used:
- "We've decoupled the deployment of code and/or infrastructure changes from service level impact, because we can roll those changes back or forward, as necessary, to automatically remediate an issue before any impact becomes noticeable."
- "Our team isn't being paged anymore for changes that automation can react to; the number of incidents that on-call engineers have to respond to is measurably down."

Behavior Displayed:
- Monitoring output is reincorporated into operational behavior in an automated fashion (a sketch of such a feedback loop appears below)
- Anomalies do not result in declared "incidents," as operational systems can automatically react to statistically significant changes in metrics
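
A sketch of that feedback loop follows. The `deploy`, `rollback`, and `collect_metric_window` hooks are hypothetical stand-ins for your own deployment tooling and monitoring system; the significance check mirrors the Proficient sketch, but one-sided, so only a change for the worse triggers remediation.

```python
# Monitoring output fed straight back into operations: deploy, observe the key
# metric, and automatically roll back a statistically significant regression
# before anyone is paged.
import math
import statistics


def regression_is_significant(baseline, current, z_threshold=3.0):
    """One-sided check: only a change for the worse should trigger remediation."""
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(current) / len(current)
    )
    if se == 0:
        return False
    z = (statistics.fmean(current) - statistics.fmean(baseline)) / se
    return z >= z_threshold


def post_deploy_guard(deploy, rollback, collect_metric_window, baseline):
    """Deploy a change and keep it only if the service level holds."""
    deploy()
    current = collect_metric_window()        # e.g. latency samples gathered since the deploy
    if regression_is_significant(baseline, current):
        rollback()                           # automation remediates; no incident is declared
        return "rolled back automatically"
    return "change kept"
```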
[1] The distinction between "unavailable" and "answering, but unavailable" is that in the former, the service does not respond to requests at all; in the latter, the service responds, but does not provide the requested functionality, e.g. returning HTTP 5xx response codes.