Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single timeout thread never stops and could block #909

Closed
sehrope opened this issue Dec 13, 2018 · 8 comments · Fixed by #920
Closed

Single timeout thread never stops and could block #909

sehrope opened this issue Dec 13, 2018 · 8 comments · Fixed by #920

Comments

@sehrope
Copy link
Contributor

sehrope commented Dec 13, 2018

Driver version

7.1.3 (preview release)

Problem description

The shared single threaded timeout handler introduced in #842, once started via running a query on a connection with a non-zero queryTimeout, never stops. It keeps running even after all connections are closed. This will lead to a classloader issue in any environment with multiple classloaders (ex: Tomcat) that dynamically add and remove the driver class as the associated classes can never be unloaded.

I haven't recreated that situation myself (I don't use Tomcat) but recognized it as an issue as the PostgreSQL JDBC driver had a similar issue a couple years back. For background, the original pgjdbc issue reported by an end user is is here: pgjdbc/pgjdbc#188, PR that fixes it is here: pgjdbc/pgjdbc#197.

The TimeoutPoller also executes the timeout specific interrupt() action in the single timer thread. From skimming through the code I don't think that's an issue right now as the interrupts being executed are non-blocking. If that changes down the road, then it'd be possible for a single hanging interrupt() execution to cause the timer thread to stall indefinitely. That'd stall all timeouts on all connections using the same classloader as the thread is shared. Having the interrupt execution happen in a separate thread (say from a cached thread pool) should fix that.

Again, I'm not sure if that specific problem could happen as the only IO operations inside the interrupt() calls I could see are closing sockets but it could happen down the road with unrelated code changes. Synchronized blocks may lead to to a similar issue.

Expected Behavior

  1. When all connections are closed there should not be any extra threads running and all classes should be unloaded.
  2. Timeout interrupt() operations from one command should not be able to stall timeouts for other commands.

Regarding actual implementation, since the newer versions of the driver target JDK 8+, I'd suggest using a shared ScheduledThreadPoolExecutor with a pool size of one (1) that could shutdown when there are no more connections referencing it. That'd be a slightly more modern version of what pgjdbc implemented.

One other advantage compared vs the current implementation of a single thread in a sleep / check loop is that that the standard implementation of ScheduledThreadPoolExecutor uses a heap to sort the timeout commands rather than looping through the entire list every time and it sleeps until the next timeout period rather than waking up every second.

If there's confirmation and internal API agreement that command specific interrupt() calls cannot invoke any blocking operation then it should be fine to have a single thread invoking them. Otherwise, the interupt() calls should be handled via a separate thread pool to ensure that one operation does not block all other timeouts.

@ulvii
Copy link
Contributor

ulvii commented Dec 13, 2018

Hi @sehrope ,

Thank you for creating the issue. We will investigate and get back to you.

@ulvii
Copy link
Contributor

ulvii commented Dec 14, 2018

Hi @sehrope ,

Thank you for bringing this to our attention. The only concern I have is the change in behavior. In a scenario when all connections are closed and the application requests another connection, the driver will have to spin up a new timer thread. @shayaantx thoughts?

@sehrope
Copy link
Contributor Author

sehrope commented Dec 14, 2018

If the lifecycle of the shared timer is tied to the connections then that shouldn't be an issue as an active application will continuously have idle connections in a pool that will retain a reference to the timer.

The timer would shutdown only when either:

  • There is no activity in the application, all pooled connections expire out of the pool, and the connection pool is configured with a "min idle" of zero.
  • The entire application is shutdown (which would force all connections to close)

Yes, an application transitioning from idle to active and specifying a query timeout would need to spin up a new timer thread. But that same application must also have a pool size of zero and that means it'd have to open a new TCP socket, perform an SSL handshake, handle TDS init, authenticate, etc. All of that is going to take much longer than spinning up a new thread.

Separately, spawning a new thread to handle the interrupt() calls might cause some thrashing if there's a lot of it happening concurrently but short of guaranteeing that the calls will be short and non-blocking, it can't really be avoided. Any hanging interrupt could prevent all timeout operations across all connections (not even just it's "siblings" from the same data source) by blocking the primary timer thread.

In practice I doubt many applications are invoking the interrupt() much anyway. The vast majority of application code (that I've seen...) that enables query timeouts expects its operations to complete prior to the timeout. The timeout firing an interrupt to cancel the query is exceptional case and only they'd have to pay the cost for spinning up the thread that executes the interruption.

@shayaantx
Copy link
Contributor

shayaantx commented Dec 15, 2018

@ulvii I think @sehrope brings up some valid points.

If the classloader unloads the driver classes (either through garbage collection or someone manually unloading them), then this thread would hang around. Although I don't usually see a lot of that happen happening since people usually keep idle connections around in some pool for the duration of their application. Regardless seems worth addressing.

The possible blocking behavior in the lone timeout thread would of existed prior to my changes, as I reused the interrupt logic that already existed (it would of been even worse before since we used an unlimited thread pool for timeout handling). I only looked briefly and it looks like non blocking sockets are being used in the tds channel for tds writes/reads (didn't look that in depth to fully confirm), so the likelihood of this type of behavior occurring seems low. If the interrupt lock had synchronization issues, then yes this could happen.

Regardless I think addressing both would be ideal, I can attempt to open a PR for these changes this weekend.

Let me know of any questions or concerns

Thanks

@shayaantx
Copy link
Contributor

shayaantx commented Dec 15, 2018

@ulvii one thought I had regarding an actual implementation, wouldn't be appropriate to just start the timer thread when the driver is constructed and use finalize to stop the timer thread (in the sqlserverdriver class)?

In the scenario where the driver gets unloaded, when it gets registered again, the timer would start. Then if the driver gets unregistered finalize would be called and we can stop the timer thread. If the driver never gets unloaded, then the thread just sits idle. Normally I wouldn't use finalize, but with this presented use case I think this would work well here. I guess its possible the jvm only unloads some of the mssql package.

Is there already something where I can check all the active connections easily?

Thoughts?

@ulvii
Copy link
Contributor

ulvii commented Dec 18, 2018

Hi @sehrope, @shayaantx,

Agreed that the issues needs to be addressed. Please also note that the driver uses blocking sockets only.

I also think the driver should not start a new timer thread every time it is constructed, simply because query timeouts are not used that often.

@shayaantx, would you be able to create a PR?

@shayaantx
Copy link
Contributor

Hi @ulvii, @sehrope

I can make a PR, I'm not entirely sure regarding the complete fix.

Regarding the interrupt blocking:

If I'm understanding sehropes recommendation (please let me know if I'm missing something obvious), use a ScheduledThreadPoolExecutor per connection for timeouts? That would be essentially old implementation of the timeouts where you could have many timeout threads (basically a timeout thread per connection). If the suggestion is to use a single threaded pool for all timeouts and set some arbitrary timeout for how long we wait for an interrupt, that makes most sense to me so far.

Regarding the class loader leak:

The other thing I don't really know if its available in the api, are we recording anywhere in the driver where I can check all connections are closed or open? Are we storing some in-memory collection of connections anywhere? If not seems like I'd need to implement that to actually check if the timeout thread can be shutdown when all connections are closed. Otherwise I can stop/restart the thread every-time timeouts become available.

Let me know

Thanks

@sehrope
Copy link
Contributor Author

sehrope commented Dec 18, 2018

-1 to using finalizers. Better to handle things directly as finalizers both slow things down and add non-deterministic behavior.

The model I'm suggesting for this is similar to the pgjdbc code I linked to above with the main change being to internally use a ScheduledThreadPoolExecutor rather than a Timer (since we're on JDK 8+). With a core pool size of 1 it acts similar to a Timer with a cleaner API.

Let's say the new class is called, SqlServerTimer that wraps a ScheduledThreadPoolExecutor. The timer is shared by all active connections (even with different database details) in the same classloader as it'd be accessed via a static method (ex: SqlServerTimer.getTimer()) that also increments an internal reference count. Connections would retrieve a reference to it on first use and retain it on a Connection private variable. On Connection.close() if the private reference is non-null then it'd be released via timer.release() and set back to null.

The release() method on SqlServerTimer will decrement the reference count and check if there are no longer any references (i.e. <= 0). If so, it will shutdown the ScheduledThreadPoolExecutor and clear the shared static reference. If a timer is subsequently requested (say by a new Connection that calls getTimer()) then a new one will be created.

To work around issues with tasks blocking the cancellation thread they should be executed in their own thread so the primary thread receives the work to perform the cancellation but immediately defers it to a new thread. While this does incur some thread thrashing during timeouts, it's only paid by actions that actually time out which should be the minority of real world usage (most actions are expected to succeed, not fail).

This model has a couple of advantages:

  • All Connections using the same driver classloader will share a single timer thread.
  • Adding / removing tasks is O(log(n)) as internally it's managed by ScheduledThreadPoolExecutor as a heap.
  • There's no sleep / wake cycle to check if tasks are expired as the ScheduledThreadPoolExecutor uses its core thread to sleep only until the next time to wake up (i.e. the shortest task).
  • If no Connection ever requests a timeout then the timer thread is never created anyway (it's done on demand).
  • If a Connection requests a timeout but then is closed the timer thread does not stick around (so no classloader leaks).
  • If a Connection requests a timeout and is not physically closed (ex: it's instead returned to a connection pool), then the timer thread is not shutdown. Subsequent timeout requests will reuse the shared timer and there will not be any thread thrashing for the timer thread itself. <== This is the likely "steady state" for most apps using this driver with query timeouts
  • If a Connection's timeout requests never fire (say because the queries always complete prior to the queryTimeout) then no extra threads are created for them.
  • If a Connection's timeout fires but performs some blocking action then it won't slow down other timeouts as the timeout action operates in its own thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants