Faktory crashing when trying to be redeployed #487

ibrahima · 2024-09-04T00:54:16Z

I'm experiencing an issue where the Faktory service is failing to start when trying to redeploy it. This was triggered by an Amazon ECS platform update, so it happened without us explicitly trying to redeploy it. We have it set up to deploy the new instance of a service before decommissioning the old one, but the new one never boots successfully so the old one is never decommissioned. The service tries to mount a shared volume to /var/lib/faktory so I do wonder if there's some contention happening there. I am wondering if shutting down the existing service first will allow a new service to come up successfully but I don't have a way to test that because this issue isn't occurring in our staging environment. Would appreciate any pointers to resolving this, thanks!

Which Faktory package and version?
- Docker image, 1.9.0
Which Faktory worker package and version?
- This is a server issue but we do have Ruby and Python workers. One thing to note is that the Python client library seems less resilient to server connection issues, so the Python workers keep restarting because of the temporary split-brain situation. (I feel like the way we have Faktory deployed is not ideal, but... I'm not sure what the best way to fix it would be.)
Please include any relevant worker configuration
Please include any relevant error messages or stacktraces

logs from Amazon ECS

timestamp	message
1725410530251	Faktory 1.9.0
1725410530251	Copyright © 2024 Contributed Systems LLC
1725410530251	Licensed under the GNU Affero Public License 3.0
1725410530251	I 2024-09-04T00:42:10.251Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
1725410535910	E 2024-09-04T00:42:15.909Z Unable to create Faktory server: context deadline exceeded
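
For anyone reading the log above: the time between the "Initializing redis storage" line and the failure can be worked out from the ECS epoch-millisecond column. A minimal Python sketch of that arithmetic follows; the helper names `elapsed_ms` and `to_utc` are made up for illustration and are not part of Faktory or ECS tooling.

```python
from datetime import datetime, timedelta, timezone

# Epoch-millisecond timestamps copied from the ECS log lines above.
INIT_MS = 1725410530251    # "Initializing redis storage ..."
FAILED_MS = 1725410535910  # "Unable to create Faktory server ..."


def elapsed_ms(start_ms: int, end_ms: int) -> int:
    """Return the number of milliseconds between two epoch-ms timestamps."""
    return end_ms - start_ms


def to_utc(ms: int) -> datetime:
    """Convert an epoch-ms timestamp to an aware UTC datetime (integer math only)."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(elapsed_ms(INIT_MS, FAILED_MS))                       # 5659
print(to_utc(INIT_MS).isoformat(timespec="milliseconds"))   # 2024-09-04T00:42:10.251+00:00
```

So the server gave up roughly 5.7 seconds after it began initializing storage, which matches the gap between the two Faktory-side timestamps (00:42:10.251Z and 00:42:15.909Z).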

Are you using an old version? No
Have you checked the changelogs to see if your issue has been fixed in a later version? Yes.

mperham · 2024-09-04T00:57:28Z

🤷🏻‍♂️ try it and see. The datafile certainly can’t be shared by multiple Redises so that would make sense.

ibrahima · 2024-09-04T00:59:40Z

yea I think I will... I'm slightly nervous because I don't know what will happen if the service doesn't come back up but right now the service is degraded anyway, so it won't make things much worse.

ibrahima · 2024-09-04T01:14:20Z

Hmm.. seems to still be dying on start. Do you have an idea where that message might be coming from or any way to get more details? My limited understanding is that something is timing out but I'm not sure what it could be.

mperham · 2024-09-04T01:15:40Z

What does starting with “-l debug” print out?

ibrahima · 2024-09-04T01:28:22Z

oh hmm, after a few failed starts it seems like it might have succeeded! so it was probably the shared EFS volume thing, and maybe it took a while after the old service shut down for the volume to become fully accessible. maybe the old mount was hanging around in some zombie state or something lol... (should not have been the case, but IDK... gremlins...)
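
On the Python workers that kept restarting while the server was unreachable: one possible mitigation is to wait for the server with exponential backoff before starting the worker, rather than letting the process exit on the first failed connection. The sketch below is only a plain TCP reachability check under stated assumptions. It assumes Faktory's default port 7419; `wait_for_server`, the default host, and the delay values are hypothetical, and none of this is the Python client library's API.

```python
import socket
import time
from typing import Callable


def wait_for_server(
    host: str = "localhost",      # hypothetical; use the real Faktory host
    port: int = 7419,             # Faktory's default port
    attempts: int = 6,
    base_delay: float = 0.5,
    connect: Callable[..., socket.socket] = socket.create_connection,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a TCP connection succeeds, False after all attempts fail.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    `connect` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        try:
            conn = connect((host, port), timeout=2.0)
            conn.close()
            return True
        except OSError:
            # Refused, timed out, or unresolvable: back off unless this was the last try.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False
```

A worker entrypoint could call `wait_for_server(host=...)` first and only start consuming jobs once it returns True. With the defaults above, the loop sleeps for at most 0.5 + 1 + 2 + 4 + 8 = 15.5 seconds in total, plus connection timeouts, before giving up.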
