more pool fixes #1211
Conversation
* a task that is marked woken but didn't actually wake before being cancelled will instead wake the next task in the queue
* a task that wakes but doesn't get a connection will put itself back in the queue instead of waiting until it times out with no way to be woken
* the idle reaper now won't run if there are tasks waiting for a connection, and also uses the proper `SharedPool::release()` to return validated connections to the pool so waiting tasks get woken

closes #622, #1210 (hopefully for good this time)

Signed-off-by: Austin Bonander <austin@launchbadge.com>
Force-pushed from c74c26d to 0847a04
Has something changed? I'm getting this error:

```
error[E0277]: the trait bound `&mut Transaction<'_, Sqlite>: sqlx_core::executor::Executor<'_>` is not satisfied
   --> crates/auth/src/user/resolvers/user_mutation_resolvers.rs:106:21
    |
106 |     user.remove(&mut tx, false).await?;
    |          ^^^^^^^ the trait `sqlx_core::executor::Executor<'_>` is not implemented for `&mut Transaction<'_, Sqlite>`
```
@D1plo1d is there more context to that error? That impl exists: https://docs.rs/sqlx/0.5.2/sqlx/struct.Transaction.html#impl-Executor%3C%27t%3E-4
@abonander I will be back at work on Tuesday and will test it then. If you don't hear from me, please remind me with a ping.
@abonander I had another look - it turned out I had missed a few instances of async-graphql when I was switching my monorepo to this git branch. With that fixed, it looks like everything is working now, so I'd say this PR fixed the issue. Cheers!
@abonander Upon testing with
Sadly it looks like the relative slowness of my debug builds was what was preventing the race conditions earlier :( |
@D1plo1d does setting a higher
I also have timeouts with highly parallel accesses to the connection pool, using the Postgres backend. I am trying out this PR right now, and my application has run for a day without timeouts so far. I am cautiously optimistic that this fixes my problem.
It shouldn't be under high load, if I understand the throughput that would qualify as that. This is a dev instance of my server running locally, with a few transactions a second normally and occasional bursts of a hundred or so on a GraphQL request. This is not optimized, but it also worked perfectly fine up until this release, and still does when I roll back sqlx. I suspect the contention is due to the fact that I have many queries starting at approximately the same time, and less to do with throughput. I will try increasing the
@abonander I don't have time today, but I'll re-verify my claims here and also test out `connect_timeout` tomorrow or Wednesday.
I had the same issue and increasing the |
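For reference (a sketch, not taken from this thread): in sqlx 0.5 the time a task will wait in `Pool::acquire()` is bounded by `connect_timeout` on the pool options (later versions renamed it `acquire_timeout`). Raising it for the Postgres backend looks roughly like this; the URL and sizes are placeholders:

```rust
use std::time::Duration;

use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;

// Sketch of pool configuration under the sqlx 0.5 API.
async fn make_pool() -> Result<PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .max_connections(20)
        // How long `acquire()` waits for a free connection before failing
        // with a timeout error; the 0.5 default is 30 seconds.
        .connect_timeout(Duration::from_secs(60))
        .connect("postgres://localhost/app_db") // placeholder URL
        .await
}
```

A longer timeout only papers over bursty contention; if the pool's wake logic is losing wakeups (the bug this PR addresses), tasks can still time out no matter how high it is set.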
After re-testing I've only been able to reproduce the problem on. I think that my issues on this branch were most likely some kind of PEBKAC on my end. My apologies for the confusion. Thanks a bunch for patching this @abonander!
I also had the original issue with sqlx 0.5.2 and this PR fixed it – a release with this change would be greatly appreciated. Thank you for your work! |
Just a heads up @abonander - I've managed to observe the issue on This is with my It's more intermittent than before, but the pool is still deadlocking on something for me. I've downgraded back to
A solid repro case would really, really help here, or at least a good characterization of the load that triggers the issue. I know that's gonna be really hard to come up with though since this is ultimately a concurrency bug. A simple project like a one-route web API that reliably deadlocks under a synthetic stress test would be great. |
Yeah, I wish I knew what was specific to my application that triggered it. I noticed that it seems to only occur when I'm using sqlx transactions. Could it be related to my use of transactions somehow?
You mean like only the routes that call |
Yeah, |