
Improve database connection management #145

Closed
jcass77 opened this issue May 7, 2021 · 13 comments · Fixed by #148

Comments

@jcass77
Owner

jcass77 commented May 7, 2021

Various issues have been posted that relate to 'lost' database connections.

Database connection management is a fairly complex topic, and the official Django documentation contains some great pointers to help you configure connection management for your specific database properly.

Generally speaking, adding a database connection pooler like pgbouncer to handle database connections for you can also help avoid this class of errors altogether.

For everyone else, we might be able to improve things slightly by implementing our own connection management wrapper.

Taking a closer look at Celery's DjangoWorkerFixup implementation might be a good place to start.

@dongfanger

I ran into the same issue, and it confused me a lot. After reviewing issues #73 and #60, I wrote a patch like this:

from datetime import datetime

from django.db import connections
from django_apscheduler.models import DjangoJobExecution


def close_old_connections():
    # Monkey-patch DjangoJobExecution.atomic_update_or_create so that stale
    # connections are cleaned up after every job execution update.
    _old_atomic_update_or_create = DjangoJobExecution.atomic_update_or_create

    def atomic_update_or_create_with_close(
        lock,
        job_id: str,
        run_time: datetime,
        status: str,
        exception: str = None,
        traceback: str = None,
    ) -> "DjangoJobExecution":
        job_execution = _old_atomic_update_or_create(
            lock, job_id, run_time, status, exception, traceback
        )
        # Close any connection that errored or has outlived CONN_MAX_AGE.
        for conn in connections.all():
            conn.close_if_unusable_or_obsolete()
        return job_execution

    DjangoJobExecution.atomic_update_or_create = atomic_update_or_create_with_close


close_old_connections()

Somehow, it worked!

@jcass77
Owner Author

jcass77 commented Jun 13, 2021

In order to figure out how to approach this issue, it probably makes sense to first look at how the standard Django framework deals with database connection management. Keep in mind, however, that Django typically only needs to perform database operations in response to some request that it received from a user via their browser.

This request-response cycle does not exist when dealing with background process managers like APScheduler.

Having said that, from what I have been able to gather, Django manages connections in the following way:

  1. When Django starts and the database connection handler is first created, it subscribes to the request_started and request_finished signals so that django.db.close_old_connections() is called at the beginning and end of the request-response cycle (see the sketch after this list).
  2. django.db.close_old_connections() closes a database connection if:
  • the application didn't restore the original autocommit setting;
  • an error occurred; or
  • the connection's maximum age has been exceeded (as defined by the CONN_MAX_AGE configuration setting).
  3. The default for CONN_MAX_AGE is 0, which means that all connections are cleaned up at the end of every request. The tradeoff of not using persistent connections is that Django has to establish a new database connection on every request, which carries some additional overhead. The database engine also needs to allow at least as many concurrent connections as there are webserver worker threads, plus however many APScheduler jobs you might have running that also want to connect to the database.
  4. (3) above is why it makes sense to set CONN_MAX_AGE high enough to allow connections to be re-used (between requests, in the standard Django use case). But doing so means that the database server might close connections that Django was still intending to re-use at some point in the future (which causes the type of errors that this issue relates to).
  5. It is safe to close the connections at the beginning and end of each request since (a) Django's database connections are thread-local (i.e. each thread maintains its own connection) and (b) each request-response cycle is handled by a separate webserver worker process in a single thread. In other words, closing connections in one thread will not affect connections in another thread. By making use of a BlockingScheduler in a Django management command we ensure that we only have one thread running that wants to access the job store (i.e. database) as well.
  6. Closing DB connections at the end of the request means that all database operations performed as part of that request still get to share the same connection, which improves performance. By comparison, if we were to call close_old_connections or close_if_unusable_or_obsolete at arbitrary points in our code, we would lose the ability to re-use connections and honour the Django settings that the user has configured.
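
The signal hookup in step 1 is essentially what django.db does at import time. A condensed, illustrative paraphrase (not django-apscheduler code):

from django.core.signals import request_started, request_finished
from django.db import connections


def close_old_connections(**kwargs):
    # Close any connection that errored, lost its autocommit setting, or
    # has outlived CONN_MAX_AGE (the conditions listed in step 2 above).
    for conn in connections.all():
        conn.close_if_unusable_or_obsolete()


request_started.connect(close_old_connections)
request_finished.connect(close_old_connections)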

So there are essentially two problems to solve:

  • We do not have the benefit of being able to wait for a request to be initiated from the client side before initialising our database connections: a new job can be added or removed at any time requiring one or more database operations to be performed; and
  • It is also difficult to detect when all intended database operations have been completed, and it is safe to clean up our database connections, since we do not have anything analogous to a HTTP response being sent to mark the completion of all work required to produce that response.

@jcass77
Owner Author

jcass77 commented Jun 13, 2021

A potential solution has been proposed as part of #148.

The approach is rather speculative / experimental and I will merge it into the main branch, and cut a new release, once enough people have tried it out and provided some feedback.

@jcass77 added the "wait response" label Jun 13, 2021
@bluetech
Contributor

From my understanding, your analysis is accurate. Thanks a lot for looking at this.

we do not have anything analogous to a HTTP response being sent to mark the completion of all work required to produce that response.

So the way I see it, we do have an analogy for a request-response cycle -- a scheduler loop iteration. In my PR, I used get_due_jobs as an easy proxy for that. But we can also ask apscheduler to add an event for "iteration started" and "iteration finished" or such, and then hook close_old_connections there (or suggest to the user to do so on their own).

We currently use django-apscheduler with postgres, so full disclosure, I am mostly thinking about this database. As you say there are usually two scenarios for this:

  • Not using pgbouncer. In this case users would definitely want to set CONN_MAX_AGE to a non-zero value to avoid the connection overhead each time.
  • Using pgbouncer (probably with transaction pooling). In this case CONN_MAX_AGE = 0 is usually a good choice.
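
For illustration, the two scenarios above map onto Django's DATABASES setting roughly as follows (names and values are placeholders, not recommendations):

# Illustrative settings.py fragment; database name/host are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "HOST": "localhost",
        # Without pgbouncer: reuse connections for up to 10 minutes.
        "CONN_MAX_AGE": 600,
        # With pgbouncer (transaction pooling), CONN_MAX_AGE = 0 is usually
        # the better choice, so connections are handed back after each use.
        # "CONN_MAX_AGE": 0,
    }
}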

The solution you propose in your PR is to detect connection problems after the fact and then fix the connections. My main objections to this are:

  1. It does not respect CONN_MAX_AGE = 0. Using this setting (which usually only makes sense with pgbouncer), users want connections to be recycled "often", on the assumption that connections are fast & cheap.

  2. It does not respect a non-zero CONN_MAX_AGE. In postgres in particular it is recommended to recycle connections every so often (here is a discussion about this I have seen recently). I guess this is why CONN_MAX_AGE takes a duration instead of just a boolean.

In your PR you wrote

Calling db.close_old_connections() pre-emptively before the job store executes a DB operation: this would break Django's standard connection management. For example, if the CONN_MAX_AGE setting is set to 0, a new connection will be required for every database operation (as opposed to at the end of every request like in the Django standard).

This is true, but is not quite what I propose -- I suggest doing it at the start (and possibly end) of every scheduler iteration. If a connection problem occurs during an iteration, that particular iteration is lost, but the next one will try to recover. This is analogous to a connection problem occurring during a request -- the request itself is done for but the next one will try to reconnect.


BTW, django-apscheduler users are likely to also use the Django DB inside their jobs, and the connection problems will then also happen there (particularly with a thread pool executor etc.). But that is a problem for the executor, not the jobstore, so out of scope for this discussion (although after this issue is resolved we can maybe provide some guidance on this as a note in the docs).

@jcass77
Owner Author

jcass77 commented Jun 14, 2021

@bluetech, thanks for the feedback! Comments below...

...we do have an analogy for a request-response cycle -- a scheduler loop iteration. In my PR, I used get_due_jobs as an easy proxy for that.

I'm not sure it is safe to make assumptions about how the APScheduler internals work. In fact, I could not even find get_due_jobs in the codebase for the upcoming release anymore :(

...we can also ask apscheduler to add an event for "iteration started" and "iteration finished" or such, and then hook close_old_connections there (or suggest to the user to do so on their own).

It would be nice if APScheduler provided this type of hook so that job store implementers could integrate in a more decoupled fashion. The contributors to the APScheduler project appear to have their heads down working on release 4.0, however, and I'm not sure they will be able to entertain feature requests at this time (although we could ask). I think that there will be many breaking changes in the new release anyway, and they might even include a persistent job store in the standard solution (PostgresqlDataStore certainly appears to be a step in that direction).

There are also issues with basic event sequencing which makes an event-based approach unreliable at this stage (see agronholm/apscheduler#445, for example).

You are right about all of the other tradeoffs: Django will clean up connections at the start and end of every request, and we will basically wait until a problem occurs and then retry with a fresh DB connection. This is not ideal.

For now, I think the compromise is that this workaround at least gives users (most of whom are not used to having to worry about this sort of thing as part of typical Django app development) a way to avoid confusing database connection-related error conditions. Including a guideline to clean up connections at the start and end of your own jobs is a great idea (perhaps we could provide a utility context manager for that purpose).

The downside of all this however is that, for now, we won't strictly honour the spirit of CONN_MAX_AGE.

From my own experience: I have never encountered any connection issues in more than three years of running django-apscheduler in production. But this requires some extra care to configure everything properly. Relying on a connection pooler like pgbouncer simplifies connection management a lot.

This workaround is probably for those projects that do not have the ability to run a database connection pooler themselves for whatever reason. We will most likely have to revisit everything again when APScheduler 4.0 is released and the underlying architecture changes again.

@jcass77
Owner Author

jcass77 commented Jun 15, 2021

Using pgbouncer (probably with transaction pooling). In this case CONN_MAX_AGE = 0 is usually a good choice.

Wouldn't CONN_MAX_AGE = None be better? That way the connection would be persistent, and completely under pgbouncer's control.

@jcass77
Owner Author

jcass77 commented Jun 15, 2021

I have added an ensure_old_connections_are_closed decorator that can be used to close stale Django DB connections before and after a job is run.

This should mimic the Django standard behaviour for HTTP requests (i.e. each job run is handled as if it were a separate 'request').
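
A minimal sketch of what such a decorator could look like (illustrative only; the actual implementation in django-apscheduler may differ):

import functools

from django.db import close_old_connections


def ensure_old_connections_are_closed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        close_old_connections()  # Enforce CONN_MAX_AGE before the job runs.
        try:
            return func(*args, **kwargs)
        finally:
            close_old_connections()  # Clean up again after the job completes.

    return wrapper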

Of course, the question now becomes: why do we need both ensure_old_connections_are_closed and retry_on_db_operational_error? Maybe the rule of thumb should be:

  1. Jobs that require database access should be wrapped in ensure_old_connections_are_closed to ensure that the DB connections are in a consistent state (and CONN_MAX_AGE is enforced) before and after the job is run.

    All database operations within the same job will still get to share the same database connection (just like a Django request would).

    It is possible that the overhead of doing connection management on a per-job basis could become too great, but by the time this happens users should really be resorting to making use of connection poolers and persistent connections instead.

  2. django-apscheduler should use retry_on_db_operational_error internally, since we have no control over the order in which, or how frequently, APScheduler will call the various job store methods (and cleaning up DB connections on every method call is probably overkill). See the sketch after this list.
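
To make the distinction concrete, the retry idea looks roughly like this (a sketch only; the actual retry_on_db_operational_error may differ, e.g. in how many retries it attempts):

import functools

from django.db import OperationalError, close_old_connections


def retry_on_db_operational_error(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OperationalError:
            close_old_connections()  # Discard the broken connection...
            return func(*args, **kwargs)  # ...and retry once on a fresh one.

    return wrapper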

@jcass77
Owner Author

jcass77 commented Jun 15, 2021

The PR has been merged to make it easier for people to try it out. I am keeping this issue open for further comments and testing.

@jcass77 reopened this Jun 15, 2021
@bluetech
Contributor

I tested the current develop branch (d11fe38) in a local docker environment using postgres, and using BlockingScheduler running in its own service. It seems to fix the issue for me.

I tested as follows:

  • High CONN_MAX_AGE.
  • Set log_connections = on in postgres.
  • An interval job running every 5 seconds which performs 4 pg_sleep(1) queries.
  • Wrapped the job with @util.ensure_old_connections_are_closed.
  • Let it run a bit.
  • Stop the DB for a while.
  • Start the DB.

I made sure that both the job itself fails (so the worker needs to reconnect) and the jobstore fails (so the jobstore needs to reconnect). Both recovered after the DB went back up.

As a separate test, I did the same thing with a low CONN_MAX_AGE, and verified that the worker connections are re-established according to the max age (the jobstore connection is not, as expected).

@bluetech
Contributor

A few minor comments:


The name ensure_old_connections_are_closed is accurate but very "procedural". It might be better to give it a more open-handed "semantic" name like uses_django_db or such. (I can see it both ways).


The README now says

If your APScheduler jobs require database access, and you are not making use of a connection pooler and persistent connections, then it is probably a good idea to wrap those jobs in an ensure_old_connections_are_closed decorator to ensure that Django's CONN_MAX_AGE configuration setting is enforced before and after your job is run.

First the phrase "and you are not making use of a connection pooler and persistent connections" is a bit ambiguous - I'm not sure if you meant uses connection pooler && uses persistent connections or if you meant || there.

Either way I think it makes sense to wrap in both cases:

  • Using a connection pooler - the connection pooler can also go down, and respecting CONN_MAX_AGE is still a good idea (if the user doesn't want it, they can set CONN_MAX_AGE = None).

  • Using persistent connections - needed to recover from errors.


The README says

Django will not reconnect automatically and you need to re-start django-apscheduler as well.

which is not quite accurate anymore.

@jcass77
Owner Author

jcass77 commented Jun 16, 2021

Yes, the ensure_old_connections_are_closed name is clunky. I can't think of a better one that still makes it clear what the decorator does. The problem is that it is optional to a certain degree (you can use the Django DB without it). Maybe we should just change it to close_old_connections to mirror the standard Django functionality and be done with it.

Changed README to:

The close_old_connections decorator should be applied to APScheduler jobs that require database access. Doing so ensures that Django's CONN_MAX_AGE configuration setting is enforced before and after your job is run. This mirrors the standard Django functionality of doing the same before and after handling each HTTP request.
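
Usage would then look something like this (hypothetical job; this assumes the decorator is exposed via django_apscheduler.util, as referenced earlier in this thread):

from django_apscheduler import util


@util.close_old_connections
def my_job():
    # Any Django ORM work here is bracketed by connection cleanup.
    ...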

I'm not sure about removing the recommendation to restart django-apscheduler (and by implication Django) in tandem with the database. The open issue that is linked to in the Django project seems to indicate that this can still be necessary under the right circumstances.

It is also meant to address unusual user workflows that people have logged issues about in the past (e.g. taking the database down for backups, or setting the system clock to the past) while expecting everything to keep working without a glitch.

Thanks for the time that you have spent to look into this in such detail @bluetech! If there is nothing else then I will cut a new release in the next day or so.

@bluetech
Contributor

Sounds good to me, thanks a lot!

@jcass77
Owner Author

jcass77 commented Jun 17, 2021

New functionality has been released as part of v0.6.0.

@jcass77 closed this as completed Jun 17, 2021
@jcass77 removed the "wait response" label Jun 17, 2021
@jcass77 unpinned this issue Jun 30, 2021