Handling server upgrades? #461

Closed
ibrahima opened this issue Dec 24, 2023 · 4 comments

ibrahima commented Dec 24, 2023

I noticed that #372 is an open issue. In our Faktory installation, we deploy to AWS ECS. When we do upgrades, it seems that, depending on how you handle the deployment, there is a chance of jobs getting lost - e.g. if you temporarily have two servers running at the same time, some jobs might go to the "older" one and then get lost when the second server comes up, if they weren't persisted to disk in time. (I'm also not exactly sure how it behaves if two servers mount the same persistent volume.) Right now we're early in development, so we've just been doing upgrades live, but I'm guessing that's not the best approach.

Is the current recommended way to do a Faktory upgrade to shut your site down temporarily? To be more explicit:

  1. Put your main application into a maintenance mode so that it can't queue new Faktory jobs
  2. Quiet your workers so that they don't pick up new tasks
  3. Wait for workers to finish any in-progress work (see the sketch after this list for one way to script the wait)
  4. Do the upgrade
  5. (potentially) restart your clients/workers so that they can reconnect to the new server (I found that Ruby clients/workers reconnect by themselves after a while, but the Python ones seem to crash and burn)
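
For step 3, a deploy script could poll the server until queued work drains before swapping the container. Below is a minimal sketch using the Go client (github.com/contribsys/faktory/client); it assumes the INFO payload exposes per-queue depths under faktory → queues, which may vary by server version, and jobs already reserved by workers would still need a separate check (e.g. the Busy page).

```go
package main

import (
	"fmt"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// waitForDrain polls the server's INFO stats until every queue is empty
// (or the timeout elapses). Assumes queue depths are reported under
// info["faktory"]["queues"]; adjust if your server version differs.
func waitForDrain(timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		cl, err := faktory.Open() // reads FAKTORY_URL / FAKTORY_PROVIDER
		if err != nil {
			return err
		}
		info, err := cl.Info()
		cl.Close()
		if err != nil {
			return err
		}

		total := 0.0
		if f, ok := info["faktory"].(map[string]interface{}); ok {
			if queues, ok := f["queues"].(map[string]interface{}); ok {
				for _, depth := range queues {
					if n, ok := depth.(float64); ok {
						total += n
					}
				}
			}
		}
		if total == 0 {
			return nil // queues are empty; safe to proceed with the upgrade
		}
		fmt.Printf("still %v jobs queued, waiting...\n", total)
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for queues to drain")
}
```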

It might be nice to document this somewhere, but I'm not sure where yet. The details might depend on how the server is deployed, though the overall procedure above is probably general to most deployment types.

I'm realizing that #372 probably doesn't help in a containerized setup, because the new server will be in a new container and so isn't spawned by a parent process that could share its port or socket. With a load balancer in front you get behavior similar to the "reused socket" situation, but that still feels non-ideal, because some jobs might go to the "old" server instead of the new one. And since Faktory isn't designed to have multiple servers running (e.g. #447), there's probably no way around that.

Thinking out loud... if you could tell a server to stop accepting jobs once its replacement is online, and have the clients retry operations a few times on failure, then you might be able to achieve something like zero-downtime deploys. But that certainly complicates things further, and it kinda feels like it's better to just minimize the downtime rather than try to guarantee correctness in these scenarios.
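
For the client-side retry idea, here's a rough sketch of what that could look like with the Go client. This is only an illustration of retrying PUSH on a fresh connection after a failure - it's not something the server or clients provide today, and the backoff values are arbitrary.

```go
package main

import (
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// pushWithRetry opens a fresh connection for each attempt, so a PUSH that
// fails mid-deploy (old server gone, new one not yet up) is retried against
// whatever server the load balancer routes to next.
func pushWithRetry(jobType string, args []interface{}, attempts int) error {
	job := faktory.NewJob(jobType, args...)
	var lastErr error
	for i := 0; i < attempts; i++ {
		cl, err := faktory.Open()
		if err == nil {
			err = cl.Push(job)
			cl.Close()
			if err == nil {
				return nil
			}
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * time.Second) // crude linear backoff
	}
	return lastErr
}
```

Wrapping enqueues like this would let the application ride out a few seconds of server unavailability during a deploy instead of failing outright, at the cost of slightly delayed pushes.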

mperham commented Jan 3, 2024

I've never been able to design a zero-downtime solution, unfortunately. The Faktory protocol is stateful, so we can't just swap out backends using a reverse proxy; existing client connections need to re-authenticate with the new server. Essentially you're right about the steps. On a good day, you can probably get those steps to take no more than 30 seconds; bringing everything down is the safest option.

shuber commented Jan 14, 2024

@mperham Does using an external persistent REDIS_URL via Faktory Enterprise change anything around being able to have a brief overlap as ECS containers are drained/swapped? Right now I have ECS deploys for Faktory configured with min: 0% and max: 100% to ensure only one instance is ever running when we deploy, but I'd love to make that min: 100% and max: 200% like all of our other services with zero-downtime deployments. If it helps at all, we could also pause/unpause all queues around deployments - I would just love for services to be able to keep enqueuing jobs during the process.

I've noticed the same client reconnection issues as @ibrahima: Ruby clients reconnect fine, but our Node ones do not (we'll be testing the golang client for the same issue soon), and we plan on forking/patching the clients to get that working. TLS support in the Node/golang Faktory clients is something else I'd like to get working as well (it works fine for Ruby).

mperham commented Jan 15, 2024

> Does using an external persistent REDIS_URL via Faktory Enterprise change anything

Nope. Faktory connections don't automatically migrate from old to new, so you'd need to reestablish all connections.

mperham commented May 17, 2024

Closing as dupe of #372

mperham closed this as completed May 17, 2024