Faktory crashing when trying to be redeployed #487

Open
ibrahima opened this issue Sep 4, 2024 · 5 comments

Comments

@ibrahima
Contributor

ibrahima commented Sep 4, 2024

I'm experiencing an issue where the Faktory service is failing to start when trying to redeploy it. This was triggered by an Amazon ECS platform update, so it happened without us explicitly trying to redeploy it. We have it set up to deploy the new instance of a service before decommissioning the old one, but the new one never boots successfully so the old one is never decommissioned. The service tries to mount a shared volume to /var/lib/faktory so I do wonder if there's some contention happening there. I am wondering if shutting down the existing service first will allow a new service to come up successfully but I don't have a way to test that because this issue isn't occurring in our staging environment. Would appreciate any pointers to resolving this, thanks!

  • Which Faktory package and version?
    • Docker image, 1.9.0
  • Which Faktory worker package and version?
    • This is a server issue but we do have Ruby and Python workers. One thing to note is that the Python client library seems less resilient to server connection issues, so the Python workers keep restarting because of the temporary split-brain situation. (I feel like the way we have Faktory deployed is not ideal, but... I'm not sure what the best way to fix it would be.)
  • Please include any relevant worker configuration
  • Please include any relevant error messages or stacktraces

logs from Amazon ECS


timestamp message
1725410530251 Faktory 1.9.0
1725410530251 Copyright © 2024 Contributed Systems LLC
1725410530251 Licensed under the GNU Affero Public License 3.0
1725410530251 I 2024-09-04T00:42:10.251Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
1725410535910 E 2024-09-04T00:42:15.909Z Unable to create Faktory server: context deadline exceeded

Are you using an old version? No
Have you checked the changelogs to see if your issue has been fixed in a later version? Yes.
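On the side note about the Python workers restarting whenever the server connection drops: one possible mitigation is to wait for the server with exponential backoff before handing control to the worker. The sketch below is not the API of any Faktory client library; `wait_for_server` and its parameters are hypothetical names, and it only checks that a TCP connection can be opened (7419 is Faktory's default worker port).

```python
import socket
import time


def wait_for_server(host, port, attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Open a TCP connection, backing off exponentially between failed tries.

    Hypothetical helper, not part of any Faktory client library.
    Returns the connected socket, or re-raises the last OSError once
    `attempts` tries have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delay doubles each try: base, 2*base, 4*base, ... capped.
                sleep(min(max_delay, base_delay * (2 ** attempt)))
    raise last_error


# Example (hostname is a placeholder):
#   conn = wait_for_server("faktory.internal", 7419)
#   conn.close()  # then start the real worker
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without actually waiting.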

@mperham
Collaborator

mperham commented Sep 4, 2024

🤷🏻‍♂️ try it and see. The datafile certainly can’t be shared by multiple Redises so that would make sense.
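If the data file really cannot be shared, one way to avoid two tasks overlapping on the same volume is to have ECS stop the old task before starting the new one. A minimal sketch, assuming a rolling-update ECS service with a desired count of 1; the cluster and service names are placeholders, and the thread does not confirm that this alone resolves the startup failure:

```python
def stop_before_start_config():
    """Deployment settings under which ECS may drop to zero running tasks
    and never runs more than the desired count at once."""
    return {"minimumHealthyPercent": 0, "maximumPercent": 100}


def update_service_kwargs(cluster, service):
    """Build keyword arguments for the ECS UpdateService call."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": stop_before_start_config(),
    }


# With boto3 installed and AWS credentials configured, this could be applied as:
#   import boto3
#   boto3.client("ecs").update_service(
#       **update_service_kwargs("my-cluster", "faktory"))
```

The trade-off is a short outage on every deploy, since no task is running between the old one stopping and the new one starting.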

@ibrahima
Contributor Author

ibrahima commented Sep 4, 2024

yea I think I will... I'm slightly nervous because I don't know what will happen if the service doesn't come back up but right now the service is degraded anyway, so it won't make things much worse.

@ibrahima
Contributor Author

ibrahima commented Sep 4, 2024

Hmm.. seems to still be dying on start. Do you have an idea where that message might be coming from or any way to get more details? My limited understanding is that something is timing out but I'm not sure what it could be.

@mperham
Collaborator

mperham commented Sep 4, 2024

What does starting with “-l debug” print out?

@ibrahima
Contributor Author

ibrahima commented Sep 4, 2024

Oh hmm, after a few failed starts it seems like it might have succeeded! So it was probably the shared EFS volume thing, and maybe it took a while after the old service shut down for the volume to become fully accessible. Maybe it was hanging around in some zombie state or something lol... (should not have been the case, but IDK... gremlins...)
