Best practices in deploying and monitoring a queue #523

Closed
stas opened this issue Feb 16, 2022 · 6 comments · Fixed by #525
Comments

@stas
Contributor

stas commented Feb 16, 2022

Hi there,
first of all, thank you for all the work that everyone has dedicated to this project!!! ❤️

I'm reaching out because we're running a bunch of queues with GoodJob on Heroku, and recently we started seeing jobs piling up and not being processed.

After a bit of digging, we found that GoodJob would run out of available connections in the pool (which is totally fair):

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.045 seconds); all pooled connections were in use

But what's scary is that the running processes would be reported as healthy by Heroku, even though the performance of the queue had degraded (e.g. 500+ jobs pending for 10-20 minutes, while normally it would take minutes to process such a load)!

Based on this, my immediate takeaways are:

  1. the workers do not return connections to the pool
  2. the monitoring we have in place is not adequate

Anybody else had to go through something similar?

Should we explicitly try to release any connection back into the pool? Should we use the built-in health-check instead of the generic pid-based health-check?

@mikereczek
Contributor

Hi @stas - thank you for your work on jsonapi-serializer!

> Anybody else had to go through something similar?

I ran into a similar issue in dev a while back. If I remember correctly, I had failed to account for the GoodJob LISTEN/NOTIFY thread when setting the database pool size. I think the latest recommendation (@bensheldon - please correct me if I am wrong) is this:

> At the moment, I would recommend that your database pool size = [Web/Puma threads if async] + [GoodJob execution threads] + [1 GoodJob LISTEN/NOTIFY thread] + [2 GoodJob cron threads] + 20% margin
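
As a rough, back-of-the-envelope sketch of that arithmetic (the numbers below are only assumptions for a typical async setup with a single queue pool, not a recommendation for your app):

```ruby
# Illustrative pool sizing only; plug in your own process's numbers.
puma_threads      = 5   # web threads, relevant only if GoodJob runs in :async mode
execution_threads = 5   # GoodJob --max-threads, per queue pool
listen_notify     = 1   # one LISTEN/NOTIFY connection per process
cron_threads      = 2   # only if cron is enabled

pool_size = ((puma_threads + execution_threads + listen_notify + cron_threads) * 1.2).ceil
# => 16, which would go into `pool:` in config/database.yml for that process
```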

@bensheldon
Owner

@stas thanks for opening the issue and sorry for the trouble!

It shouldn't (typically) be necessary to manage the database connection pool yourself. GoodJob wraps each job with the Rails reloader, which checks out and checks in database connections:

```ruby
def create_task(delay = 0)
  future = Concurrent::ScheduledTask.new(delay, args: [performer], executor: executor, timer_set: timer_set) do |thr_performer|
    Thread.current.name = Thread.current.name.sub("-worker-", "-thread-") if Thread.current.name
    Rails.application.reloader.wrap do
      thr_performer.next
    end
  end
  future.add_observer(self, :task_observer)
  future.execute
end
```

Edit: just saw @reczy's comment. That is also my first thought as to where the culprit lies.

@stas
Contributor Author

stas commented Feb 17, 2022

Thanks @reczy @bensheldon

We're running GJ on a dedicated instance, so there's nothing but the workers consuming the pool connections.

From what I understand, running the default setup of bundle exec good_job start --max-threads=5 means we'd need a pool of:

  • 5 connections, one per execution thread
  • 1 connection for the LISTEN/NOTIFY notifier, per process

I'm not sure I understand the 20% margin though, @reczy do you think you can help clarify where this is coming from?

@bensheldon looking at the scheduler implementation, from what I can understand, every new job that has to be processed in the future will get a thread and wait to be processed. Is my assumption correct?
https://github.com/bensheldon/good_job/blob/main/lib/good_job/scheduler.rb#L266

That would explain how we managed to drain the connection pool: we have a lot of jobs that get scheduled, and it seems like those are just sitting there waiting.

We had a similar issue in pq, and we came up with a polling query that returns the upcoming job instead:
https://github.com/malthe/pq/blob/master/pq/__init__.py#L309-L339

Whereas que took the path of creating a trigger to solve this:
https://github.com/que-rb/que/blob/master/lib/que/migrations/4/up.sql#L126-L180

Appreciate your help here!!! 🙇

@bensheldon
Owner

bensheldon commented Feb 17, 2022

> bundle exec good_job start --max-threads=5

Yes, that would use 6 threads total (5 execution + 1 LISTEN/NOTIFY), assuming that you're using the default --queues=*. But if you start splitting your queues into multiple pools (e.g. --queues=slow;fast) that would be multiplicative (e.g. 2 pools X 5 max-threads => 10 threads + 1 LISTEN/NOTIFY = 11 threads).
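
To make the multiplication concrete, here's a rough sketch using a hypothetical split (the queue names and numbers are only for illustration):

```ruby
# Hypothetical: bundle exec good_job start --queues="slow;fast" --max-threads=5
queue_pools       = 2
max_threads       = 5
execution_threads = queue_pools * max_threads   # => 10
total_threads     = execution_threads + 1       # + 1 LISTEN/NOTIFY thread => 11
```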

> every new job that has to be processed in the future will get a thread and wait to be processed. Is my assumption correct?

Slightly different. The Concurrent::ScheduledTask's block isn't executed until the ThreadPoolExecutor allocates a thread and runs it, so no thread is allocated (and thus no connection checked out) until the pool has room for it.
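
Here's a minimal sketch (not GoodJob's actual code) of that behavior with concurrent-ruby; the scheduled block only runs once the executor has a free thread, and that's the only point where a connection would be checked out:

```ruby
require "concurrent"

pool = Concurrent::FixedThreadPool.new(1)

task = Concurrent::ScheduledTask.new(2, executor: pool) do
  # In GoodJob, this is where Rails.application.reloader.wrap runs the job,
  # checking out a connection and returning it when the block finishes.
  puts "executed on thread #{Thread.current.object_id}"
end
task.execute # while the task waits, it holds no pool thread and no DB connection

sleep 3
pool.shutdown
pool.wait_for_termination
```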

Regarding the 20% margin, I'm less confident that there isn't an extra thread or two that lingers a second longer than expected (though I'm hopeful someone would have reported it by now, and I've been running GoodJob successfully across multiple production applications). I've also seen multiple cases where multithreaded code is introduced into an application without awareness that each thread will consume a database connection. I don't see a benefit to running extra lean on configured database connections.

@stas
Contributor Author

stas commented Feb 17, 2022

Got it! Thanks a lot @bensheldon

Would you be OK with merging a PR if I prepare one with a note about how to set up GoodJob in production?

@bensheldon
Owner

@stas definitely. Thank you.

bensheldon added a commit that referenced this issue Mar 3, 2022
* Added a chapter on how to prepare for production.

See #523 for more feedback.

* Fix lint errors and move db connection details into section

Co-authored-by: Ben Sheldon <bensheldon@gmail.com>