Add post-mortem september 2023
saraycp committed Sep 19, 2023
1 parent ad1fa4b commit cba61a1
Showing 1 changed file with 47 additions and 0 deletions.
47 changes: 47 additions & 0 deletions _posts/deployments/2023-09-19-post-mortem.md
@@ -0,0 +1,47 @@
---
layout: post
title: "Post-mortem: Gap Between Deployment and Migration End"
category: deployments
---

## OBS Pages Inaccessible on 19th of September

After the morning deployment, many OBS pages were inaccessible and returned a 500 error.
The error messages referred to the `moderated_at` and `code_of_conduct` database columns that the deployment was meant to introduce.

**Date**: 19.09.2023

**Impact**: Most of the OBS pages were inaccessible to all our users.

**Root Causes**: Our deployment first updates the obs-api package (including restarting the servers) and only then runs the migrations. The failures occurred in the window between the restart and the end of the migrations: the updated code already called methods such as `moderated_at` and `code_of_conduct`, whose columns did not exist in the database yet.

**Trigger**: Morning deployment and migrations.

**Resolution**: Everything went back to normal once the migrations finished. No human intervention was needed.

**Detection**: Manual check after deployment and error tracking tool.
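
The root cause above can be reproduced in miniature. The following is a minimal sketch, not OBS code: it assumes an in-memory SQLite database as a stand-in for the production database, a simplified hypothetical `comments` table, and hypothetical helpers `render_comment` and `migrate`. It only illustrates that code reading `moderated_at` fails while the column is missing and works once the migration has finished.

```python
import sqlite3


def render_comment(conn, comment_id):
    """Hypothetical 'new' application code: reads the column the migration adds."""
    row = conn.execute(
        "SELECT body, moderated_at FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    return {"body": row[0], "moderated_at": row[1]}


def migrate(conn):
    """Hypothetical stand-in for the pending migration: adds the new column."""
    conn.execute("ALTER TABLE comments ADD COLUMN moderated_at TEXT")


# Old schema, as it still exists when the restarted servers come up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO comments (id, body) VALUES (1, 'hello')")

# Window between restart and migration end: new code runs against the old schema.
try:
    render_comment(conn, 1)
    failed_during_gap = False
    gap_error = ""
except sqlite3.OperationalError as error:
    failed_during_gap = True
    gap_error = str(error)

# Once the migration has finished, the same call succeeds without any other change.
migrate(conn)
after_migration = render_comment(conn, 1)
```

In this sketch the failing call corresponds to the 500 errors seen during the gap, and the successful call after `migrate` corresponds to the recovery without intervention.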

## Lessons Learned

**What went well?**

* We started the incident protocol as soon as possible.
* Another team member assumed the communicator role quickly.

**What went wrong?**

* During development, [strong_migrations](https://github.com/ankane/strong_migrations) didn't flag any of our migrations as unsafe.
* We didn't foresee that the migration could take as long as it did.

**Where did we get lucky?**

* OBS went back to normal without intervention.
* There was no data loss.

## Timeline (CEST)

- *10:27* Deployed with migrations, assuming there would be no downtime.
- *10:32* Noticed some errors when accessing pages with comments (package, project and request pages).
- *10:34* Detected Errbit errors about missing `moderated_at` and `code_of_conduct` columns in the Comments and Configuration tables, respectively.
- *10:40* Errors stopped.
- *10:46* Concluded that the errors occurred in the window between the server restart and the end of the migrations. The updated code tried to use new columns that weren't in the database yet.
