Best practices in deploying and monitoring a queue #523

Closed
stas opened this issue Feb 16, 2022 · 6 comments · Fixed by #525
Comments

@stas
Contributor

stas commented Feb 16, 2022

Hi there,
first of all, thank you for all the work that everyone has dedicated to this project!!! ❤️

I'm reaching out because we're running a bunch of queues with GoodJob on Heroku, and recently we started seeing jobs piling up and not being processed.

After a bit of digging, we found that GoodJob would run out of available connections in the pool (which is totally fair):

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.045 seconds); all pooled connections were in use

But what's scary is that the running processes would be reported as healthy by Heroku, even though the performance of the queue had degraded (e.g. 500+ jobs pending for 10-20 minutes, while normally it would take minutes to process such a load)!

Based on this, my immediate takeaways are:

  1. the workers do not return connections to the pool
  2. the monitoring we have in place is not adequate

Anybody else had to go through something similar?

Should we explicitly try to release any connection back into the pool? Should we use the built-in health-check instead of the generic pid-based health-check?

@mikereczek
Contributor

Hi @stas - thank you for your work on jsonapi-serializer!

> Anybody else had to go through something similar?

I ran into a similar issue in dev a while back. If I remember correctly, I had failed to account for the GoodJob LISTEN/NOTIFY thread when setting the database pool size. I think the latest recommendation (@bensheldon - please correct me if I am wrong) is this:

> At the moment, I would recommend that your database pool size = [Web/Puma threads if async] + [GoodJob execution threads] + [1 GoodJob LISTEN/NOTIFY thread] + [2 GoodJob cron threads] + 20% margin
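
As a rough, back-of-the-envelope sketch of that arithmetic (the numbers below are only assumptions for a typical async setup with a single queue pool, not a recommendation for your app):

```ruby
# Illustrative pool sizing only; plug in your own process's numbers.
puma_threads      = 5   # web threads, relevant only if GoodJob runs in :async mode
execution_threads = 5   # GoodJob --max-threads, per queue pool
listen_notify     = 1   # one LISTEN/NOTIFY connection per process
cron_threads      = 2   # only if cron is enabled

pool_size = ((puma_threads + execution_threads + listen_notify + cron_threads) * 1.2).ceil
# => 16, which would go into `pool:` in config/database.yml for that process
```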

@bensheldon
Owner

@stas thanks for opening the issue and sorry for the trouble!

It shouldn't (typically) be necessary to manage the database connection pool yourself. GoodJob wraps each job with the Rails reloader, which checks out and checks in database connections:

```ruby
def create_task(delay = 0)
  future = Concurrent::ScheduledTask.new(delay, args: [performer], executor: executor, timer_set: timer_set) do |thr_performer|
    Thread.current.name = Thread.current.name.sub("-worker-", "-thread-") if Thread.current.name
    Rails.application.reloader.wrap do
      thr_performer.next
    end
  end
  future.add_observer(self, :task_observer)
  future.execute
end
```

Edit: just saw @reczy's comment. That is also my first thought as to where the culprit lies.

@stas
Contributor Author

stas commented Feb 17, 2022

Thanks @reczy @bensheldon

We're running GJ on a dedicated instance, so there's nothing but the workers consuming the pool connections.

From what I understand, running the default setup of bundle exec good_job start --max-threads=5 means we'd need a pool of:

  • 5 connections, one per execution thread
  • 1 connection for the LISTEN/NOTIFY notifier, per process

I'm not sure I understand the 20% margin though, @reczy do you think you can help clarify where this is coming from?

@bensheldon looking at the scheduler implementation, from what I can understand, every new job that has to be processed in the future will get a thread and wait to be processed. Is my assumption correct?
https://github.com/bensheldon/good_job/blob/main/lib/good_job/scheduler.rb#L266

That would explain how we managed to drain the connection pool: we have a lot of jobs that get scheduled, and it seems like those are just sitting there waiting.

We had a similar issue in pq, and we came up with a polling query that returns the upcoming job instead:
https://github.com/malthe/pq/blob/master/pq/__init__.py#L309-L339

Whereas que took the path of creating a trigger to solve this:
https://github.com/que-rb/que/blob/master/lib/que/migrations/4/up.sql#L126-L180

Appreciate your help here!!! 🙇

@bensheldon
Owner

bensheldon commented Feb 17, 2022

> bundle exec good_job start --max-threads=5

Yes, that would use 6 threads total (5 execution + 1 LISTEN/NOTIFY), assuming that you're using the default --queues=*. But if you start splitting your queues into multiple pools (e.g. --queues=slow;fast) that would be multiplicative (e.g. 2 pools X 5 max-threads => 10 threads + 1 LISTEN/NOTIFY = 11 threads).
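
To make the multiplication concrete, here's a rough sketch using a hypothetical split (the queue names and numbers are only for illustration):

```ruby
# Hypothetical: bundle exec good_job start --queues="slow;fast" --max-threads=5
queue_pools       = 2
max_threads       = 5
execution_threads = queue_pools * max_threads   # => 10
total_threads     = execution_threads + 1       # + 1 LISTEN/NOTIFY thread => 11
```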

> every new job that has to be processed in the future will get a thread and wait to be processed. Is my assumption correct?

Slightly different. The Concurrent::ScheduledTask's block isn't executed until the ThreadPoolExecutor allocates a thread and runs it, so no thread is allocated (and thus no connection checked out) until the pool has room for it.
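
Here's a minimal sketch (not GoodJob's actual code) of that behavior with concurrent-ruby; the scheduled block only runs once the executor has a free thread, and that's the only point where a connection would be checked out:

```ruby
require "concurrent"

pool = Concurrent::FixedThreadPool.new(1)

task = Concurrent::ScheduledTask.new(2, executor: pool) do
  # In GoodJob, this is where Rails.application.reloader.wrap runs the job,
  # checking out a connection and returning it when the block finishes.
  puts "executed on thread #{Thread.current.object_id}"
end
task.execute # while the task waits, it holds no pool thread and no DB connection

sleep 3
pool.shutdown
pool.wait_for_termination
```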

Regarding the 20% margin, I'm less confident that there isn't an extra thread or two that lingers a second longer than expected (though I'm hopeful someone would have reported it by now, and I've been running GoodJob successfully across multiple production applications). I've also seen multiple cases where multithreaded code is introduced into an application without awareness that each thread will consume a database connection. I don't see a benefit to running extra lean on configured database connections.

@stas
Contributor Author

stas commented Feb 17, 2022

Got it! Thanks a lot @bensheldon

Would you be OK with merging a PR if I prepare one with a note about how to set up GoodJob in production?

@bensheldon
Owner

@stas definitely. Thank you.

bensheldon added a commit that referenced this issue Mar 3, 2022
* Added a chapter on how to prepare for production.

See #523 for more feedback.

* Fix lint errors and move db connection details into section

Co-authored-by: Ben Sheldon <bensheldon@gmail.com>