
Feature request: Smarter Connection Pooling #14033

Closed · vmg opened this issue Sep 19, 2023 · 0 comments · Fixed by #14034

vmg commented Sep 19, 2023
The issue with the current connection pool

The current implementation of connection pooling in the vttablets is old! It dates back to the early days of Vitess, and it has never been analysed or benchmarked. The only major change it has received in the past few years is the addition of “setting connections” to the pool. Setting connections are a very important optimization: when a MySQL client opens a connection to a vtgate and sets a connection option (e.g. the SQL mode, the default collation, or any of the myriad options that ORMs in most programming languages can set), this setting must be honored by all the connections that the upstream vttablets open to the MySQL server. Before we shipped this optimization last year, every such request meant grabbing a connection from the pool, setting the connection state via one or more SQL queries, and then resetting the state before returning it to the pool. It was very wasteful!

The new functionality with connection settings works well in practice largely because the old behavior was so inefficient, but it has a major shortcoming: it was implemented on top of the original connection pool, so it inherits all of its problems. The connection pool in Vitess is implemented using a Go channel. Measured purely as synthetic performance, this is a reasonable choice, because channels are well optimized and handled efficiently by the Go runtime. I spent some time last year trying to micro-optimize it without significant results.

But my problem was that I was targeting the wrong benchmark: using a channel for a connection pool is fast in a synthetic environment but very slow in a realistic one, because pooling connections through a FIFO (first-in, first-out) data structure forces MySQL to constantly cycle through all the connections in the system, with the resulting thrashing of cache state and connection buffers. The alternative, a LIFO (last-in, first-out) data structure, keeps reusing the same subset of connections (particularly when the system is not fully contended). This results in significant, measurable performance benefits that do not show up in synthetic benchmarks.

In an upcoming PR, I’m proposing a new design for a smart connection pool that fixes the shortcomings of the existing one. This issue details the new design, the statistical analysis performed to understand the behavior of the old pool, and how that analysis guided the implementation of the new one.

Proposal: a design for a new pool

The new connection pool being proposed here is not based around channels, but around classic lock-free stacks (Treiber, R.K., 1986. Systems programming: Coping with parallelism). I have considered other concurrent stack implementations, and they all have in common that they perform better than Treiber’s under heavy contention, and worse when uncontended. The data structures in a connection pool are very rarely contended because their access time pales in comparison with the time spent performing the upstream request. Hence, I’ve chosen the simpler option, which also performs best in practice.
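For reference, here is a minimal sketch of a Treiber stack in Go, using generics and compare-and-swap on the head pointer. This is an illustrative stand-in rather than the actual pool code; the type and method names are placeholders.

```go
package pool

import "sync/atomic"

// node is one entry in the lock-free stack.
type node[T any] struct {
	value T
	next  *node[T]
}

// Stack is a classic Treiber stack: a singly linked list whose head is
// swapped with compare-and-swap, so Push and Pop never take a lock.
type Stack[T any] struct {
	top atomic.Pointer[node[T]]
}

// Push places v on top of the stack.
func (s *Stack[T]) Push(v T) {
	n := &node[T]{value: v}
	for {
		n.next = s.top.Load()
		if s.top.CompareAndSwap(n.next, n) {
			return
		}
		// CAS failed: another goroutine changed the top; retry.
	}
}

// Pop removes and returns the most recently pushed value (LIFO order).
// Nodes are freshly allocated on every Push, so Go's garbage collector
// keeps this loop safe from the classic ABA problem.
func (s *Stack[T]) Pop() (T, bool) {
	for {
		top := s.top.Load()
		if top == nil {
			var zero T
			return zero, false
		}
		if s.top.CompareAndSwap(top, top.next) {
			return top.value, true
		}
		// CAS failed: retry.
	}
}
```

Uncontended, a Push or Pop is a single load plus a single CAS, which is exactly why this shape can beat a channel on the fast path.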

The pool is divided into N+1 stacks: one stack that holds clean connections with no settings applied, and N stacks that hold connections with a specific setting applied. In my analysis, a value of 8 for the setting stacks seems to work best in practice, so I’ve hardcoded it.
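Continuing the sketch above, the overall layout could look roughly like this; the type, field, and helper names (Pool, Conn, Setting, applySetting, resetSetting) are illustrative placeholders, not the actual Vitess types.

```go
// Continuing the sketch above (same package; also requires "sync/atomic").
const settingStacks = 8

// Setting is an interned connection setting.
type Setting struct {
	bucket uint32 // derived from a strictly increasing id, modulo settingStacks
	query  string // e.g. "SET sql_mode = 'ANSI'"
	reset  string // query that restores the default connection state
}

// Conn stands in for a pooled connection to MySQL.
type Conn struct {
	setting *Setting // the setting currently applied; nil if the connection is clean
}

// In the real pool these would run the setting's query / reset query on MySQL.
func (c *Conn) applySetting(s *Setting) { c.setting = s }
func (c *Conn) resetSetting()           { c.setting = nil }

// Pool holds one stack of clean connections plus eight stacks of
// connections bucketed by the setting applied to them.
type Pool struct {
	clean      Stack[*Conn]
	settings   [settingStacks]Stack[*Conn]
	lastPushed atomic.Uint32 // index of the setting stack that last received a connection
}
```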

There are two different paths when getting a connection from the pool: a path for getting a connection with no settings applied, and a path for a connection with a specific setting. The two paths have slightly divergent heuristics to optimize their wait times.

To get a connection without settings applied, we first try atomically popping from the clean stack. In an uncontended system, this always instantly results in a clean connection ready to return, without having to yield to the scheduler by waiting on any channels. It’s very efficient, and makes the new pool beat the old one in raw synthetic benchmarks! If the clean stack is empty, we fall back to checking the setting stacks.

The 8 setting stacks contain connections which have a setting applied to them. Connections with settings are not pushed onto the stacks at random: the new system “interns” the Setting struct for the pool, so that a specific connection setting is always represented by the same *Setting pointer, and each setting in the system is tagged with a strictly increasing number that decides which of the 8 stacks its connections are placed on. This makes it much more likely that popping from a stack will yield a connection with the setting you’re looking for.
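A minimal sketch of what that interning could look like, continuing the types above; the map, mutex, and function names here are hypothetical, not the Vitess API.

```go
// Continuing the sketch above (also requires "sync").
var (
	internMu sync.Mutex
	interned = map[string]*Setting{}
	nextID   uint32
)

// InternSetting returns the canonical *Setting for a given setting query:
// the same query always yields the same pointer, and each distinct setting
// gets a strictly increasing id whose modulo picks its stack.
func InternSetting(query, reset string) *Setting {
	internMu.Lock()
	defer internMu.Unlock()
	if s, ok := interned[query]; ok {
		return s
	}
	s := &Setting{bucket: nextID % settingStacks, query: query, reset: reset}
	nextID++
	interned[query] = s
	return s
}
```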

In this case, however, we don’t care about a specific setting, since we’re looking to return a clean connection and we’ll have to reset its setting anyway. To optimize for this, we keep an atomic index pointing to the last setting stack where a connection was pushed. We use this index to choose the stack where we’ll pop the connection from. If the stack happens to be empty (because we raced with another request), we’ll continue randomly iterating the setting stacks until we find a connection we can reset and return to the user. These operations all happen atomically and without locking.
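Put together, the no-settings path could look roughly like this; a sketch continuing the types above, with resetSetting being the hypothetical helper from the layout sketch and the waiting fallback elided.

```go
// getClean sketches the path for obtaining a connection with no setting applied.
func (p *Pool) getClean() (*Conn, bool) {
	// Fast path: an idle clean connection can be returned immediately.
	if conn, ok := p.clean.Pop(); ok {
		return conn, true
	}
	// Fall back to the setting stacks, starting at the one that most
	// recently received a connection, and reset whatever we find.
	start := p.lastPushed.Load()
	for i := uint32(0); i < settingStacks; i++ {
		if conn, ok := p.settings[(start+i)%settingStacks].Pop(); ok {
			conn.resetSetting()
			return conn, true
		}
	}
	// Every stack is empty: the pool is starved and the caller must join
	// the waitlist (shown further below).
	return nil, false
}
```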

If we’ve checked all the setting stacks and there are no connections available in the pool, we consider the system to be starved, and we’ll add ourselves to the waitlist. The waitlist in this new connection pool is a queue implemented as a concurrent linked list. The linked list is made thread-safe by a single mutex: alternative designs for lock-free and wait-free linked lists were studied, but in cases where the pool is starved for connections, the contention is completely dominated by the response times from the MySQL instance, so the mutex works well in practice and simplifies the code.
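A sketch of such a waitlist, continuing the types above. The real pool blocks each waiter on a runtime semaphore; here a one-slot channel stands in for that semaphore so the sketch stays self-contained, and all names are illustrative.

```go
// Continuing the sketch above (also requires "sync" and "sync/atomic").

// waiter is one node in the waitlist queue.
type waiter struct {
	setting *Setting   // nil if the waiter wants a clean connection
	conn    chan *Conn // stand-in for the semaphore used to hand a connection over
	next    *waiter
}

// waitlist is a FIFO of waiters guarded by a single mutex.
type waitlist struct {
	mu         sync.Mutex
	head, tail *waiter
	size       atomic.Int64 // lets returning goroutines check for waiters without locking
}

// waitFor enqueues the caller and blocks until another goroutine hands it a connection.
func (w *waitlist) waitFor(setting *Setting) *Conn {
	me := &waiter{setting: setting, conn: make(chan *Conn, 1)}
	w.mu.Lock()
	if w.tail != nil {
		w.tail.next = me
	} else {
		w.head = me
	}
	w.tail = me
	w.size.Add(1)
	w.mu.Unlock()
	return <-me.conn // block until a connection is handed over directly
}
```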

As you’d guess, the waitlist is a FIFO data structure, as opposed to the connection stacks which are LIFO. This is OK because the wait list doesn’t hold connections, but clients waiting for their turn to get a connection back from the pool. By making it a queue, we ensure fairness between all the waiters, so that requests for a specific connection are roughly served in the same order they arrived. The trick to maximizing the throughput and minimizing the latency of the system is cooperatively returning the connections to the pool.

When waiting for a returned connection, we add ourselves as a node to the waitlist with the relevant metadata, including the Setting we’re interested in (if we want a connection with a specific setting applied) and a semaphore on which we block.

When another Goroutine returns a connection to the pool, it first checks whether there are any waiters in the waitlist (this check is performed atomically, without locking the list). If there are, it iterates the list to find the best candidate to hand the connection over to directly: if there’s a waiter looking for the same setting as the one applied to the connection we’re returning, we hand the connection to that waiter, even if it’s not at the front of the queue. Otherwise we prioritize the front-most waiter that wants a clean connection (because resetting a connection is always cheaper), and failing that, we choose whoever is at the front of the list. In all cases, the connection is handed over directly from the Goroutine returning it to the pool to the Goroutine waiting for a connection, without going through the connection stacks or a channel. This is accomplished by clever use of Go runtime internals: we link into the runtime_semAcquire and runtime_semRelease APIs, which are private to the sync package (they’re used to implement Mutex and WaitGroup), and use them to block a specific Goroutine waiting for a connection, then wake it and hand the connection over directly to it.
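Sticking with the channel stand-in from the waitlist sketch (rather than the runtime semaphore internals the real pool links into), the hand-off could look roughly like this; everything here is illustrative.

```go
// tryHandOff gives a returned connection directly to the best-matching
// waiter, if any. It returns false when nobody is waiting, in which case
// the caller pushes the connection back onto the appropriate stack.
func (w *waitlist) tryHandOff(conn *Conn) bool {
	if w.size.Load() == 0 {
		return false // cheap, lock-free check: nobody is waiting
	}
	w.mu.Lock()
	if w.head == nil {
		w.mu.Unlock()
		return false // raced with another hand-off that drained the list
	}
	var prev, target, targetPrev, cleanCand, cleanPrev *waiter
	for it := w.head; it != nil; prev, it = it, it.next {
		if conn.setting != nil && it.setting == conn.setting {
			target, targetPrev = it, prev // exact setting match wins outright
			break
		}
		if cleanCand == nil && it.setting == nil {
			cleanCand, cleanPrev = it, prev // front-most waiter that wants a clean connection
		}
	}
	if target == nil {
		if cleanCand != nil {
			target, targetPrev = cleanCand, cleanPrev
		} else {
			target, targetPrev = w.head, nil // fall back to plain FIFO order
		}
	}
	// Unlink the chosen waiter from the queue.
	if targetPrev == nil {
		w.head = target.next
	} else {
		targetPrev.next = target.next
	}
	if w.tail == target {
		w.tail = targetPrev
	}
	w.size.Add(-1)
	w.mu.Unlock()
	target.conn <- conn // wake the waiter and hand the connection to it directly
	return true
}
```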

The algorithm for getting a connection with a specific setting applied is similar. We start by popping from the setting stack where connections with that setting live: there’s a very high chance (unless there’s a hash collision) that the popped connection has exactly the setting we’re looking for, so it can be returned right away. If we can’t find any connections on that stack, we check the clean stack, because applying a setting to a clean connection is cheaper than resetting and re-applying it on a connection that has the wrong setting. If the clean stack is also empty, we iterate the setting stacks we haven’t checked so far, and if that fails, we fall back to waiting.
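In sketch form, continuing the types above (applySetting and resetSetting are the hypothetical helpers from the layout sketch):

```go
// getWithSetting sketches the path for obtaining a connection with a
// specific setting applied.
func (p *Pool) getWithSetting(s *Setting) (*Conn, bool) {
	// 1. Try the stack where connections with this setting live.
	if conn, ok := p.settings[s.bucket].Pop(); ok {
		if conn.setting != s {
			// Hash collision: a different setting shares this bucket.
			conn.resetSetting()
			conn.applySetting(s)
		}
		return conn, true
	}
	// 2. A clean connection only needs the setting applied, which is
	//    cheaper than resetting and re-applying.
	if conn, ok := p.clean.Pop(); ok {
		conn.applySetting(s)
		return conn, true
	}
	// 3. Any other connection: reset it, then apply our setting.
	for i := uint32(0); i < settingStacks; i++ {
		if i == s.bucket {
			continue // already checked
		}
		if conn, ok := p.settings[i].Pop(); ok {
			conn.resetSetting()
			conn.applySetting(s)
			return conn, true
		}
	}
	// 4. The pool is starved: fall back to the waitlist.
	return nil, false
}
```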

Performance analysis

I have modeled several request traces to analyze how the old and the new systems behave under load. A trace is synthesized once and replayed on both systems, so that randomness doesn’t affect the outcome of the analysis. The arrival of requests at the pool follows a Poisson distribution with a configurable frequency for each trace. The duration of each request (i.e. the time it would spend on MySQL) follows a Pareto distribution whose scale parameter is based on the latency to the server. The setting requested for each connection follows a weighted distribution which favors “clean connections”. The traces are executed synthetically with the implementation available in load_test.go and their statistical output is analyzed using Python (Numpy + Pandas + Seaborn). All traces are generated to run for 30s, as that appears to give enough datapoints for all relevant analyses.
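Trace synthesis along those lines could look roughly like this. This is an illustrative sketch, not the code in load_test.go; the function names, parameters, and the Pareto shape value are assumptions.

```go
package loadtest

import (
	"math"
	"math/rand"
	"time"
)

// request is one entry in a synthesized trace.
type request struct {
	arrival  time.Duration // offset from the start of the trace
	duration time.Duration // simulated time the request spends on MySQL
	setting  int           // index into the weights slice; 0 can mean "no setting"
}

// synthesizeTrace builds a trace once so both pool implementations replay
// exactly the same workload. Poisson arrivals come from exponential
// inter-arrival gaps; request durations follow a Pareto distribution whose
// scale is the latency to the MySQL server.
func synthesizeTrace(rng *rand.Rand, total time.Duration, reqPerSec float64, latency time.Duration, weights []int) []request {
	const alpha = 2.0 // Pareto shape; illustrative value
	var trace []request
	var now time.Duration
	for now < total {
		// An exponential gap with mean 1/reqPerSec yields Poisson arrivals.
		now += time.Duration(rng.ExpFloat64() / reqPerSec * float64(time.Second))
		// Pareto sample: x_m / U^(1/alpha), with U drawn from (0, 1].
		u := 1 - rng.Float64()
		dur := time.Duration(float64(latency) / math.Pow(u, 1/alpha))
		trace = append(trace, request{arrival: now, duration: dur, setting: pickWeighted(rng, weights)})
	}
	return trace
}

// pickWeighted returns an index with probability proportional to its weight.
func pickWeighted(rng *rand.Rand, weights []int) int {
	total := 0
	for _, w := range weights {
		total += w
	}
	n := rng.Intn(total)
	for i, w := range weights {
		if n < w {
			return i
		}
		n -= w
	}
	return len(weights) - 1
}
```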

Contended trace

For this trace we configure a request frequency of 100 requests/s, 15ms latency, 4 different settings with weights [5, 1, 1, 1] and a pool capacity of 8 connections. The system is continuously starved for connections throughout the trace.

|         | before (ms) | after (ms) |
|---------|------------:|-----------:|
| average | 693.008     | 385.457    |
| std     | 691.064     | 532.622    |
| pct50   | 322.019     | 64.8804    |
| pct75   | 1367.5      | 725.48     |
| pct99   | 2075.23     | 1758.03    |
| max     | 2195.4      | 1810.6     |

The difference here is very stark: when the system is starving for connections, the cooperative logic of the new pool that hands connections over to other waiters makes all the difference in the world. It brings the average wait for a connection down from 693ms to 385ms, and the median from 322ms to 64ms! We can see the effects of the smarter logic when graphing the distribution of wait times between the two implementations:

[Figure: distribution of wait times, old vs. new implementation]

The vast majority of the requests in the new implementation (orange) are served almost immediately, while the long tail is much more condensed.

Uncontended traces

The heuristics of the new system should be less noticeable, if they show up at all, when the pattern of requests into the pool doesn’t cause any contention (i.e. when there’s always a pooled connection available to be returned immediately). This is the case where reading from a non-empty channel should give the old system very good performance. Let’s run an experiment with a pool capacity of 16 connections, 20 requests per second, 15ms of latency and 6 different settings with weights [5, 1, 1, 1, 1, 1].

|         | before (ms) | after (ms) |
|---------|------------:|-----------:|
| average | 13.2486     | 1.25488    |
| std     | 15.1242     | 3.12073    |
| pct50   | 0.968342    | 0.634032   |
| pct75   | 31.36       | 0.896502   |
| pct99   | 32.6947     | 16.4157    |
| max     | 33.0958     | 16.7035    |

The uncontended case should be optimal for the old system, and yet we see a 10x speedup in the average and a 30x speedup in the 75th-percentile request time. What’s going on here? A graph of the number of settings applied and reset on each connection gives us a clue:

[Figure: settings applied and reset per connection, old vs. new implementation]

The fact that we have 5 different settings (plus connections without settings) is another pathological scenario for the old implementation. All connections with a setting applied to them go into the same channel, so as soon as you grab a connection with the wrong setting (something very likely as soon as more than one setting is in use anywhere in the system), you have to pay the price of resetting and re-applying the setting over and over again. The new system minimizes this issue by using 8 different stacks for settings: collisions are far rarer in practice.

Another important insight for this uncontended scenario comes from graphing the number of requests served by each connection. Remember that one of the goals of this new design was replacing the FIFO behavior of the old implementation with a LIFO, to prevent the continuous cycling of connections in the MySQL server. The results are what we expected: requests are evenly distributed among all connections in the old implementation, while they are clustered among a few connections in the new one. Note that although both pools had a capacity of 16 connections, the new pool implementation only had to open 12 connections (!!) to the backend to handle the request trace.

[Figures: requests served per connection, old implementation and new implementation]

Point trace

Next, let’s take a look at what has historically been a very significant limitation of the vttablet: highly contended point queries with high throughput, like you would see if e.g. you were using MySQL as a key-value store. We’ll trace 2000 requests per second with 2ms latency, 16 connections in the pool and 4 different settings being used. Because of the contention and unfairness of the single channel, and the attrition of constantly changing the settings on these connections, the existing connection pool has always performed particularly poorly here. Let’s see if the new one fares any better:

|         | before (ms) | after (ms) |
|---------|------------:|-----------:|
| average | 1520.13     | 228.008    |
| std     | 856         | 238.17     |
| pct50   | 1866.11     | 130.452    |
| pct75   | 2057.54     | 419.699    |
| pct99   | 2862.24     | 807.027    |
| max     | 2942.3      | 880.779    |

Yes, the system is under heavy stress because starvation is severe, but once again we see that directly handing over connections from the Goroutines that return them to the pool to the Goroutines that are waiting for them provides a massive performance benefit in this scenario.

[Figure: wait times for the point trace, old vs. new implementation]

The histogram of wait times is also quite stark. The inherent randomness inside the Go channel implementation essentially hides away the “long tail” of wait times, resulting in clustering around the 2.0 seconds mark. Our optimized stack-based implementation doesn’t suffer from this: the vast majority of responses are served immediately.

[Figure: histogram of wait times for the point trace]

If we look at the reset/apply graph, we can see that there’s some churn in the new implementation (unavoidable when dealing with such high throughput), but it is significantly lower than in the old one.

Uncontended without settings

Lastly, we’ll review the worst-case scenario for the new implementation: when the system is fully uncontended and there’s only one setting being requested. Here, the old design, with one channel for clean connections and another for connections with settings, should really shine. Although it’s not a particularly relevant case, because it doesn’t happen in real production systems, it’s interesting to see if we can beat the latency of the old implementation. The parameters of the test are the same as for the uncontended trace, but we’re only using one connection setting with weight 1, and clean connections with weight 5.

|         | before (ms) | after (ms) |
|---------|------------:|-----------:|
| average | 0.730532    | 0.735382   |
| std     | 1.25725     | 1.23995    |
| pct50   | 0.655849    | 0.667169   |
| pct75   | 0.917489    | 0.90591    |
| pct99   | 1.17231     | 1.18814    |
| max     | 16.9025     | 16.5538    |

Great success! Our fast, uncontended path is roughly as efficient as reading from a non-empty Go channel. Decent feat. The histogram for wait times shows the same story:

[Figure: histogram of wait times for the uncontended trace without settings]

Next Steps

I'm going to open a PR with the new implementation to start the review process.
