Docker + Ferm → Open Library Increased Errors + 503s #4706

Closed
7 tasks done
mekarpeles opened this issue Mar 1, 2021 · 1 comment
Labels: Affects: Admin/Maintenance (issues relating to support scripts, bots, cron jobs and admin web pages) [managed]; Priority: 0 (fix now: issue prevents users from using the site or active data corruption) [managed]; Type: Post-Mortem (log for when having to resolve a P0 issue)

Comments

@mekarpeles
Member

mekarpeles commented Mar 1, 2021

Summary

  • What is wrong?

openlibrary.org became sluggish, with increased 503s. We noticed that infobase was down on ol-home0. We restarted nginx and haproxy on ol-www1, as these sometimes become saturated, and checked memory usage on ol-mem* (in case of swapping).

  • What caused it?

Postmortem: @cdrini noticed that infobase was down on ol-home0. Further investigation showed that Docker was having iptables issues (i.e. everything Docker-related on that host seemed to be in a strange state). The hypothesis is that, during cron job testing 1h earlier, ferm + rsync rules + a checkout of olsystem may have affected the state of Docker (which mounts olsystem and which may rely on ferm).

The error presented itself as:

drini@ol-home0:/opt/olsystem$ sudo docker container restart openlibrary_infobase_nginx_1
Error response from daemon: Cannot restart container openlibrary_infobase_nginx_1: driver failed programming external connectivity on endpoint openlibrary_infobase_nginx_1 (b29fd27011f636faebe9d09b060cb9c208ee496f0ece261d8cd3591f6fe445c6):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 7000 -j DNAT --to-destination 172.24.0.3:7000 ! -i br-b870e8414bb9: iptables: No chain/target/match by that name.
 (exit status 1))
  • What fixed it?

Restarting docker (itself, i.e. the daemon) with sudo systemctl restart docker resolved the issue.

  • Followup actions:

Opportunity for us to add Nagios alerts for certain hosts, delivered to Slack (the only reason we noticed this issue was through use of the actual site).
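As a rough illustration of the alerting follow-up, a check could map the site's HTTP status to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). This is only a sketch: the function name nagios_status_for_http is hypothetical, the choice of which statuses count as critical is an assumption, and wiring the result to Slack is not shown.

```shell
# Hypothetical helper: translate an HTTP status code into a Nagios-style
# message and exit status. The thresholds are assumptions, not an
# existing Open Library check.
nagios_status_for_http() {
  case "$1" in
    2??|3??) echo "OK - HTTP $1"; return 0 ;;
    4??)     echo "WARNING - HTTP $1"; return 1 ;;
    5??|000) echo "CRITICAL - HTTP $1"; return 2 ;;  # curl reports 000 when it cannot connect
    *)       echo "UNKNOWN - HTTP ${1:-none}"; return 3 ;;
  esac
}

# Possible usage from a plugin script:
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://openlibrary.org/)
#   nagios_status_for_http "$code"
```

A 503 like the ones seen in this incident would then surface as CRITICAL instead of being noticed by hand.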

Steps to close

  1. Assignment: Is someone assigned to this issue? (notetaker, responder)
  2. Labels: Is there an Affects: label applied?
  3. Diagnosis: Add a description and scope of the issue
  4. Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. "What caused it?" - please answer in summary
  6. "What fixed it?" - please answer in summary
  7. "Followup actions:" actions added to summary
@mekarpeles mekarpeles added Affects: Admin/Maintenance Issues relating to support scripts, bots, cron jobs and admin web pages. [managed] Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] Type: Post-Mortem Log for when having to resolve a P0 issue labels Mar 1, 2021
@mekarpeles mekarpeles self-assigned this Mar 1, 2021
@mekarpeles mekarpeles changed the title from "Open Library Increased Errors + 503s" to "Docker + Ferm → Open Library Increased Errors + 503s" Mar 4, 2021
@mekarpeles
Member Author

The issue was that we were editing the rules in /etc/ferm/ferm.conf even though that file is auto-generated daily. We should have been putting them in /etc/ferm/input/, as described in @abezella's guide:
https://docs.google.com/document/d/1W4DtLPlzCUszovOj1yA6uy5Ws8GY_cpjlxu5VOo2aQo/edit#heading=h.3dy6vkm

The second issue is that reloading ferm via sudo service ferm reload causes Docker's iptables rules to go haywire. The solution is to restart Docker with sudo systemctl restart docker.
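The "No chain/target/match by that name" error suggests that the DOCKER chain in the nat table was gone after the ferm reload. A minimal sketch for confirming that before restarting Docker is below. The function name has_docker_nat_chain is hypothetical; it only inspects iptables-save output on stdin, so it can be tried without root.

```shell
# Hypothetical helper: succeed (exit 0) if iptables-save output on stdin
# declares a DOCKER chain inside the nat table, fail (exit 1) otherwise.
has_docker_nat_chain() {
  awk '
    /^\*/ { table = substr($0, 2) }               # "*nat", "*filter", ... start a table section
    table == "nat" && /^:DOCKER / { found = 1 }   # chain declarations look like ":DOCKER - [0:0]"
    END { exit (found ? 0 : 1) }
  '
}

# Possible usage on the affected host (requires root):
#   sudo iptables-save | has_docker_nat_chain \
#     || echo "DOCKER nat chain missing; try: sudo systemctl restart docker"
```

Whether the restart should be automated or left to an operator is a judgment call; the sketch only reports.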

If we reprovision ol-home0, adding this ferm/input rule needs to be a step, e.g. adding:
saddr $CLUSTER proto tcp dport rsync ACCEPT;
to /etc/ferm/input/rsync.conf on ol-home0
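As a sketch of what that provisioning step might look like, the rule can be written by a small helper. The function name write_rsync_rule is hypothetical, and this assumes $CLUSTER is defined by the generated ferm configuration and that files under /etc/ferm/input/ are picked up as the guide describes.

```shell
# Hypothetical helper: write the rsync rule into a ferm input directory.
# The directory is a parameter so the function can be tried on a scratch
# directory before touching /etc/ferm/input/.
write_rsync_rule() {
  input_dir="$1"
  # Single quotes keep $CLUSTER literal; ferm expands it, not the shell.
  printf '%s\n' 'saddr $CLUSTER proto tcp dport rsync ACCEPT;' > "$input_dir/rsync.conf"
}

# Possible usage on ol-home0 (as root), followed by the reload and the
# Docker restart described in this thread:
#   write_rsync_rule /etc/ferm/input
#   service ferm reload
#   systemctl restart docker
```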
