Quick summary of what the user impact was, how long it lasted, and how we fixed it.
All times in {{ most convenient timezone }}
Start of the incident: first symptoms, and possibly how they were identified.
Investigation starts.
More details.
List of things that went well. For example,
- We were alerted to the outage by automated bots before it affected users
- The staging cluster helped us catch this before it went to prod
Things that could have gone better. Ideally, each of these should result in a concrete action item with a GitHub issue created for it and linked under Action items. For example,
- We do not record the number of hub spawn errors in a clear, useful way, and hence it took us a long time to discover that they were happening.
- Our culler process needs better logging: it is somewhat opaque right now, and we do not know why restarting it fixed the problem.
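The first item above is the kind of gap a small instrumentation sketch can close. Below is a minimal, hypothetical example in Python using only the standard library; names like `record_spawn_error` and `spawn_error_counts` are illustrative, not part of any real hub codebase:

```python
import logging
from collections import Counter

logger = logging.getLogger("hub.spawner")

# Running tally of spawn failures, keyed by failure reason, so the
# number of spawn errors is visible at a glance instead of being
# buried in unstructured log lines.
spawn_error_counts = Counter()

def record_spawn_error(user: str, reason: str) -> None:
    """Log a spawn failure and keep a per-reason count."""
    spawn_error_counts[reason] += 1
    logger.error(
        "spawn failed for %s: %s (total for this reason: %d)",
        user, reason, spawn_error_counts[reason],
    )

record_spawn_error("alice", "timeout")
record_spawn_error("bob", "timeout")
print(spawn_error_counts["timeout"])
```

A counter like this could also be exported to whatever monitoring system is in use, so an alert can fire when the rate crosses a threshold.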
These are good things that happened to us, but not because we had planned for them. For example,
- We noticed the outage was going to happen a few minutes before it did because we were watching logs for something unrelated.
These are only sample subheadings. Every action item should have a GitHub issue attached to it (even a small skeleton of one), so these do not get forgotten.
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]
- {{ summary }} [link to github issue]