Quick summary of what the user impact was, how long it lasted, and how we fixed it.
All times in {{ most convenient timezone }}
Start of the incident: first symptoms, and possibly how they were identified.
Investigation starts.
More details.
List of things that went well. For example,
- We were alerted to the outage by automated bots before it affected users
- The staging cluster helped us catch this before it went to prod
Things that could have gone better. Ideally, each of these should result in a concrete action item with a GitHub issue created for it and linked under Action items. For example,
- We do not record the number of hub spawn errors in a clear, useful way, and hence it took us a long time to discover that they were happening.
- Our culler process needs better logging: it is somewhat opaque right now, and we do not know why restarting it fixed the problem.
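The first item above is the kind of gap a small instrumentation sketch can close. Below is a minimal, hypothetical example in Python using only the standard library; names like `record_spawn_error` and `spawn_error_counts` are illustrative, not part of any real hub codebase:

```python
import logging
from collections import Counter

logger = logging.getLogger("hub.spawner")

# Running tally of spawn failures, keyed by failure reason, so the
# number of spawn errors is visible at a glance instead of being
# buried in unstructured log lines.
spawn_error_counts = Counter()

def record_spawn_error(user: str, reason: str) -> None:
    """Log a spawn failure and keep a per-reason count."""
    spawn_error_counts[reason] += 1
    logger.error(
        "spawn failed for %s: %s (total for this reason: %d)",
        user, reason, spawn_error_counts[reason],
    )

record_spawn_error("alice", "timeout")
record_spawn_error("bob", "timeout")
print(spawn_error_counts["timeout"])
```

A counter like this could also be exported to whatever monitoring system is in use, so an alert can fire when the rate crosses a threshold.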
These are good things that happened to us, but not because we had planned for them. For example,
- We noticed the outage was going to happen a few minutes before it did because we were watching logs for something unrelated.
These are only sample subheadings. Every action item should have a GitHub issue attached to it (even a small skeleton of one), so these do not get forgotten.
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]