
Improved Connection Pooling #14034

Merged: 18 commits, Oct 18, 2023
Conversation

@vmg (Collaborator) commented Sep 19, 2023

Description

The issue with the current connection pool

The current implementation of connection pooling in the vttablets is old! It dates back to the early days of Vitess, and it has never been analysed or benchmarked. The only major change it has received in the past few years is the addition of “setting connections” to the pool. Setting connections are a very important optimization: when a MySQL client opens a connection to a vtgate and sets a connection option (e.g. the SQL mode, the default collation, or any of the myriad options that ORMs from most programming languages can set), this setting must be honored by every connection the upstream vttablets make to the MySQL server. Before we shipped this optimization last year, this meant grabbing a connection from the pool, setting the connection state via one or more SQL queries, and then resetting the state before returning it to the pool. It was very wasteful!

The setting-connections functionality performs well in practice largely because the behavior it replaced was so inefficient, but it has a major shortcoming: it was implemented on top of the original connection pool, and it inherits all of its problems. The connection pool in Vitess is implemented using a Go channel. Measured on purely synthetic benchmarks, this is a reasonable choice, because channels are well optimized and handled efficiently by the Go runtime. I spent some time last year trying to micro-optimize it without significant results. But my problem was that I was targeting the wrong benchmark: using a channel for a connection pool is fast in a synthetic environment but very slow in a realistic one, because pooling connections through a FIFO (first-in, first-out) data structure forces MySQL to constantly cycle all the connections in the system, with the resulting thrashing of cache state and connection buffers. The alternative, a LIFO (last-in, first-out) data structure, keeps reusing the same subset of connections (particularly when the system is not fully contended). This results in significant, measurable performance benefits that do not show up in synthetic benchmarks.

In this PR, I’m proposing a new design for a smart connection pool that fixes the shortcomings of the existing one. The rest of this description details the new design, as well as the statistical analysis performed to understand the behavior of the old pool and to guide the implementation of the new one.

Proposal: a design for a new pool

The new connection pool being proposed here is not based around channels, but around classic lock-free stacks (Treiber, R.K., 1986. Systems programming: Coping with parallelism). I have considered other concurrent stack implementations, and they all have in common that they perform better than Treiber’s under heavy contention, and worse when uncontended. The data structures in a connection pool are very rarely contended because their access time pales in comparison with the time spent performing the upstream request. Hence, I’ve chosen the simpler option, which also performs best in practice.
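
For readers unfamiliar with the structure, a minimal Treiber stack looks roughly like this in Go. This is an illustrative sketch, not the code in this PR: the type and method names are made up, and a production version would likely embed the link field in the connection itself to avoid allocating a node per Push.

```go
package pool

import (
	"sync/atomic"
	"unsafe"
)

// Conn stands in for a pooled MySQL connection.
type Conn struct{}

// node wraps a connection inside the lock-free stack.
type node struct {
	conn *Conn
	next unsafe.Pointer // *node
}

// connStack is a Treiber stack: a singly linked list whose head is swapped
// with compare-and-swap, so Push and Pop never take a lock.
type connStack struct {
	top unsafe.Pointer // *node
}

// Push adds a connection to the top of the stack.
func (s *connStack) Push(c *Conn) {
	n := &node{conn: c}
	for {
		old := atomic.LoadPointer(&s.top)
		n.next = old
		if atomic.CompareAndSwapPointer(&s.top, old, unsafe.Pointer(n)) {
			return
		}
		// CAS failed: another goroutine changed the top concurrently; retry.
	}
}

// Pop removes and returns the most recently pushed connection (LIFO).
func (s *connStack) Pop() (*Conn, bool) {
	for {
		old := atomic.LoadPointer(&s.top)
		if old == nil {
			return nil, false // stack is empty
		}
		n := (*node)(old)
		if atomic.CompareAndSwapPointer(&s.top, old, n.next) {
			return n.conn, true
		}
	}
}
```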

The pool is divided into N+1 stacks: one stack that holds clean connections with no settings applied, and N stacks that hold connections with a specific setting applied. In my analysis, a value of 8 for the setting stacks seems to work best in practice, so I’ve hardcoded it.
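
Continuing the sketch above, the overall layout can be pictured roughly like this (field names are illustrative; the waitlist type is sketched further down):

```go
const settingStacks = 8 // number of per-setting stacks, hardcoded as described

// ConnPool is the top-level pool: one stack of clean connections, eight
// stacks of connections with a setting applied, and a waitlist for callers
// that find every stack empty.
type ConnPool struct {
	clean    connStack                // connections with no setting applied
	settings [settingStacks]connStack // connections with some setting applied

	// Index of the setting stack that most recently received a connection;
	// used when a clean connection is requested but only connections with
	// settings are available.
	lastPushed atomic.Uint32

	waiters waitlist // FIFO of goroutines waiting for a connection
}
```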

There are two different paths when getting a connection from the pool: a path for getting a connection with no settings applied, and a path for a connection with a specific setting. The two paths have slightly divergent heuristics to optimize their wait times.

To get a connection without settings applied we first try atomically popping from the clean stack. In an uncontended system, this always instantly results in a clean connection ready to return, without having to yield on the scheduler by waiting on any channels. It’s very efficient, and makes the new pool beat the old one in raw synthetic benchmarks! If the clean stack is empty, we fall back to checking the setting stacks.

The 8 setting stacks contain connections which have a setting applied to them. Connections with settings are not pushed randomly onto the stacks: the new system “interns” the Setting struct for the pool, so that a specific connection setting will always be represented by the same *Setting pointer, and each setting in the system is tagged with a strictly increasing bucket number that decides on which of the 8 stacks its connections are placed. This makes it much more likely that, when popping from a stack, you’ll get a connection with the setting you’re looking for.
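
The interning and bucket assignment could be sketched as below, continuing the sketch above. The field and type names here are hypothetical, chosen just to illustrate the idea; the real Setting type lives in the vttablet code.

```go
// Setting is the interned representation of a connection setting: the same
// apply query is always represented by the same *Setting pointer.
type Setting struct {
	ApplyQuery string
	ResetQuery string
	bucket     uint32 // which of the 8 setting stacks this setting maps to
}

// settingRegistry interns settings and assigns each new one a bucket from a
// strictly increasing counter.
type settingRegistry struct {
	mu       sync.Mutex
	interned map[string]*Setting
	counter  uint32
}

func (r *settingRegistry) Intern(apply, reset string) *Setting {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.interned == nil {
		r.interned = make(map[string]*Setting)
	}
	if s, ok := r.interned[apply]; ok {
		return s // already interned: same pointer every time
	}
	s := &Setting{
		ApplyQuery: apply,
		ResetQuery: reset,
		bucket:     r.counter % settingStacks,
	}
	r.counter++
	r.interned[apply] = s
	return s
}
```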

In this case, however, we don’t care about a specific setting, since we’re looking to return a clean connection and we’ll have to reset its setting anyway. To optimize for this, we keep an atomic index pointing to the last setting stack where a connection was pushed. We use this index to choose the stack where we’ll pop the connection from. If the stack happens to be empty (because we raced with another request), we’ll continue randomly iterating the setting stacks until we find a connection we can reset and return to the user. These operations all happen atomically and without locking.
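
Putting those two paragraphs together, the no-settings path looks roughly like the sketch below (the probing order is simplified to sequential, and ResetSetting is a stand-in for issuing the connection's reset query):

```go
// getClean returns a connection with no setting applied, or nil when the
// pool is starved and the caller has to join the waitlist.
func (p *ConnPool) getClean() *Conn {
	// Fast path: pop straight off the clean stack. No locks, no channels,
	// no scheduler yield.
	if conn, ok := p.clean.Pop(); ok {
		return conn
	}
	// Fall back to the setting stacks, starting at the one that most
	// recently received a connection, and reset whatever we find.
	start := p.lastPushed.Load()
	for i := uint32(0); i < settingStacks; i++ {
		if conn, ok := p.settings[(start+i)%settingStacks].Pop(); ok {
			conn.ResetSetting() // caller wants a clean connection
			return conn
		}
	}
	return nil // starved: the caller must wait
}
```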

If we’ve checked all the setting stacks and there are no connections available in the pool, we consider the system to be starved, and we’ll add ourselves to the waitlist. The waitlist in this new connection pool is a queue implemented as a concurrent linked list. The linked list is made thread-safe by a single mutex: alternative designs for lock-free and wait-free linked lists were studied, but in cases where the pool is starved for connections, the contention is completely dominated by the response times from the MySQL instance, so the mutex works well in practice and simplifies the code.
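
As a sketch, the waitlist is a mutex-guarded singly linked queue of waiter nodes. Here a channel stands in for the per-waiter semaphore that the real code uses (more on that below); all names are illustrative.

```go
// waiter is one goroutine blocked waiting for a connection.
type waiter struct {
	setting *Setting   // nil if the waiter wants a clean connection
	done    chan *Conn // stand-in semaphore: the connection is handed over here
	next    *waiter
}

// waitlist is a FIFO queue protected by a single mutex. The count lives in
// an atomic so the return path can ask "is anyone waiting?" without locking.
type waitlist struct {
	mu    sync.Mutex
	count atomic.Int64
	head  *waiter
	tail  *waiter
}

func (l *waitlist) enqueue(w *waiter) {
	l.mu.Lock()
	if l.tail != nil {
		l.tail.next = w
	} else {
		l.head = w
	}
	l.tail = w
	l.count.Add(1)
	l.mu.Unlock()
}
```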

As you’d guess, the waitlist is a FIFO data structure, as opposed to the connection stacks, which are LIFO. This is OK because the wait list doesn’t hold connections, but clients waiting for their turn to get a connection back from the pool. By making it a queue, we ensure fairness between all the waiters, so that requests for a connection are served roughly in the order they arrived. The trick to maximizing the throughput and minimizing the latency of the system is cooperatively returning the connections to the pool.

When waiting for a returned connection, we add ourselves as a node to the waitlist with relevant metadata, including the Setting we’re interested in (if we want a connection with a specific setting applied) and a semaphore, on which we block. When another Goroutine returns a connection to the pool, it first checks if there are any waiters in the waitlist (this check is performed atomically, without locking the list). If there are, it iterates the list to find the best candidate to hand the connection to directly: if there’s a waiter looking for the same setting as the one applied to the connection we’re returning, we hand the connection to that waiter, even if it’s not at the front of the queue. Otherwise, we prioritize the front-most waiter that wants a clean connection (because resetting a connection is always cheaper), and failing that, we choose whoever is at the front of the list. In all cases, the connection is handed over directly from the Goroutine returning it to the pool to the Goroutine waiting for a connection, without going through the connection stacks or a channel. This is accomplished by clever use of Go runtime internals: we link into the runtime_semAcquire and runtime_semRelease APIs, which are private to the sync package as they’re used to implement Mutex and WaitGroup, and use them to block a specific Goroutine waiting for a connection and then wake it, handing the connection over directly to it.
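
The return path then reads roughly like the sketch below. The real implementation parks waiters on the runtime semaphore primitives mentioned above; this sketch keeps the channel stand-in from the waitlist sketch, assumes ConnPool carries the waiters field shown earlier, treats conn.Setting() as an assumed accessor for the currently applied setting (nil when clean), and uses an assumed remove() helper that unlinks a node and decrements the count.

```go
// tryHandOver is called when a connection is returned to the pool. If there
// are waiters, it picks the best candidate and hands the connection to it
// directly, bypassing the connection stacks entirely.
func (p *ConnPool) tryHandOver(conn *Conn) bool {
	if p.waiters.count.Load() == 0 {
		return false // nobody is waiting: caller pushes the conn onto a stack
	}
	p.waiters.mu.Lock()
	var target, frontClean *waiter
	for w := p.waiters.head; w != nil && target == nil; w = w.next {
		switch {
		case w.setting == conn.Setting():
			target = w // exact setting match wins, even mid-queue
		case w.setting == nil && frontClean == nil:
			frontClean = w // front-most waiter that wants a clean connection
		}
	}
	if target == nil {
		if frontClean != nil {
			target = frontClean // resetting is cheaper than re-applying
		} else {
			target = p.waiters.head // otherwise strict FIFO: first in line
		}
	}
	p.waiters.remove(target)
	p.waiters.mu.Unlock()

	target.done <- conn // wake the waiter with the connection in hand
	return true
}
```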

The algorithm for getting a connection with a specific setting applied is similar. We start by popping from the setting stack where that setting lives: there’s a very high chance (unless there’s a hash collision) that the popped connection has exactly the setting we’re looking for, so it can be returned right away. If we can’t find any connections on that stack, we check the clean stack, because applying a setting to a clean connection is faster than resetting and re-applying on a connection that has the wrong setting. If the clean stack is also empty, we iterate the setting stacks we haven’t checked so far, and if that fails we fall back to waiting.
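
And the mirror-image path for a specific setting, again as a sketch (ApplySetting and ResetSetting stand in for issuing the corresponding queries):

```go
// getWithSetting returns a connection with the given setting applied, or
// nil when the caller has to join the waitlist.
func (p *ConnPool) getWithSetting(s *Setting) *Conn {
	// Most likely hit: the stack this setting maps to.
	if conn, ok := p.settings[s.bucket].Pop(); ok {
		if conn.Setting() != s {
			conn.ResetSetting() // rare collision: two settings share a bucket
			conn.ApplySetting(s)
		}
		return conn
	}
	// Applying a setting to a clean connection is cheaper than resetting
	// and re-applying on a connection that has the wrong setting.
	if conn, ok := p.clean.Pop(); ok {
		conn.ApplySetting(s)
		return conn
	}
	// Last resort before waiting: probe the remaining setting stacks.
	for i := uint32(0); i < settingStacks; i++ {
		if i == s.bucket {
			continue // already checked above
		}
		if conn, ok := p.settings[i].Pop(); ok {
			conn.ResetSetting()
			conn.ApplySetting(s)
			return conn
		}
	}
	return nil // starved: the caller must wait
}
```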

Performance analysis

I have modeled several request traces to analyze how the old and the new systems behave under load. A trace is synthesized once and replayed on both systems, to ensure randomness doesn’t affect the outcome of the analysis. The pattern of requests arriving at the pool follows a Poisson distribution with a configurable frequency for each trace. The duration of each request (i.e. the time it’d spend on MySQL) follows a Pareto distribution with its scale (x_m) based on the latency to the server. The connection setting requested for each connection follows a weighted distribution which favors “clean connections”. The traces are executed synthetically with the implementation available in load_test.go and their statistical output is analyzed using Python (Numpy + Pandas + Seaborn). All traces are generated to run for 30s, as that appears to give enough datapoints for all relevant analysis.
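
To make the trace shape concrete, something along these lines synthesizes a comparable workload. This is an illustrative sketch with made-up names, not the actual generator in load_test.go.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

type request struct {
	at       time.Duration // arrival time since the start of the trace
	duration time.Duration // time the request would spend on MySQL
	setting  int           // 0 means "no setting applied"
}

// synthesize builds a 30s trace: Poisson arrivals at the given frequency,
// Pareto-distributed durations scaled by the latency to the server, and a
// weighted choice of settings where index 0 ("no setting") carries the
// largest weight.
func synthesize(rng *rand.Rand, freq float64, latency time.Duration, weights []int) []request {
	total := 0
	for _, w := range weights {
		total += w
	}
	var trace []request
	for at := time.Duration(0); at < 30*time.Second; {
		// Exponential inter-arrival times give a Poisson arrival process.
		at += time.Duration(rng.ExpFloat64() / freq * float64(time.Second))

		// Pareto-distributed duration: scale x_m = latency, shape alpha = 2
		// (alpha picked arbitrarily for the sketch).
		const alpha = 2.0
		u := 1 - rng.Float64() // in (0, 1], avoids division by zero
		dur := time.Duration(float64(latency) / math.Pow(u, 1/alpha))

		// Weighted choice of which setting (if any) the request asks for.
		setting, pick := 0, rng.Intn(total)
		for i, w := range weights {
			if pick < w {
				setting = i
				break
			}
			pick -= w
		}
		trace = append(trace, request{at: at, duration: dur, setting: setting})
	}
	return trace
}
```

For instance, synthesize(rand.New(rand.NewSource(1)), 100, 15*time.Millisecond, []int{5, 1, 1, 1}) would roughly approximate the contended trace described next.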

Contended trace

For this trace we configure a request frequency of 100 requests/s, 15ms latency, 4 different settings with weights [5, 1, 1, 1] and a pool capacity of 8 connections. The system is continuously starved for connections throughout the trace.

|         | before (ms) | after (ms) |
|---------|-------------|------------|
| average | 693.008     | 385.457    |
| std     | 691.064     | 532.622    |
| pct50   | 322.019     | 64.8804    |
| pct75   | 1367.5      | 725.48     |
| pct99   | 2075.23     | 1758.03    |
| max     | 2195.4      | 1810.6     |

The difference here is very stark because, when the system is starving for connections, the cooperative logic of the new pool that hands over connections to other waiters makes all the difference in the world. Here it brings the average time waiting for a connection from 693ms down to 385ms, and the median from 322ms to 64ms! We can see the effects of the smarter logic when graphing the distribution of wait times between the two implementations:

[figure: distribution of wait times, old vs new pool (contended trace)]

The vast majority of the requests in the new implementation (orange) are served almost immediately, while the long tail is much more condensed.

Uncontended traces

The heuristics of the new system should be less noticeable, if at all, when the pattern of requests into the pool doesn’t cause any contention (i.e. when there’s always a pooled connection available to be returned immediately). This is the case where waiting for a non-empty channel should give the old system very good performance results. Let’s run an experiment with a pool capacity of 16 connections, 20 requests per second, 15ms of latency and 6 different settings with weights [5, 1, 1, 1, 1, 1].

|         | before (ms) | after (ms) |
|---------|-------------|------------|
| average | 13.2486     | 1.25488    |
| std     | 15.1242     | 3.12073    |
| pct50   | 0.968342    | 0.634032   |
| pct75   | 31.36       | 0.896502   |
| pct99   | 32.6947     | 16.4157    |
| max     | 33.0958     | 16.7035    |

The uncontended case should be optimal for the old system, and yet we see 10x speedups in the average and 30x speedups in the median request time. What’s going on here? A graph of the number of settings applied and reset on each connection gives us a clue:

[figure: settings applied and reset per connection (uncontended trace)]

The fact that we have 5 different settings (plus connections without settings) is another pathological scenario for the old implementation. All connections with a setting applied to them go into the same channel, so as soon as you grab a connection with the wrong setting (very likely once more than one setting is in use anywhere in the system), you have to pay the price of resetting and re-applying the setting over and over again. The new system minimizes this issue by using 8 different stacks for settings: collisions are much rarer in practice.

Another important insight for this uncontended scenario comes from graphing the number of requests performed on each connection. Remember that one of the goals of this new design was replacing the FIFO design of the old implementation with a LIFO, to prevent the continuous cycling of connections in the MySQL server. The results are what we expected: requests are evenly distributed among all connections in the old implementation, while they are clustered on a few connections in the new one. Note that although both pools had a capacity of 16 connections, the new pool implementation only had to open 12 connections (!!) to the backend to handle the request trace.

[figures: requests served per connection, old pool vs new pool]

Point trace

Next, let’s take a look at what historically has been a very significant limitation of the vttablet: highly contended point queries with high throughput, like you would see if e.g. you were using MySQL as a key-value store. We’ll trace 2000 requests per second with 2ms latency, 16 connections in the pool and 4 different settings being used. Because of the contention and unfairness in the single channel, and the attrition of constantly changing the settings on these connections, the existing connection pool has always performed particularly poorly here. Let’s see if the new one fares any better:

|         | before (ms) | after (ms) |
|---------|-------------|------------|
| average | 1520.13     | 228.008    |
| std     | 856         | 238.17     |
| pct50   | 1866.11     | 130.452    |
| pct75   | 2057.54     | 419.699    |
| pct99   | 2862.24     | 807.027    |
| max     | 2942.3      | 880.779    |

Yes, the system is under heavy stress because starvation is severe, but once again we see that directly handing over connections from the Goroutines that return them to the pool to the Goroutines that are waiting for them provides a massive performance benefit in this scenario.

[figure: point trace results]

The histogram of wait times is also quite stark. The inherent randomness inside the Go channel implementation essentially smears away the “long tail” of wait times, clustering most of them around the 2.0 second mark. Our optimized stack-based implementation doesn’t suffer from this: the vast majority of responses are served immediately.

[figure: histogram of wait times (point trace)]

If we look at the reset/apply graph, we can see that there’s some churn in the new implementation (unavoidable when dealing with such high throughput), but it is significantly lower than in the old one.

Uncontended without settings

Lastly, we’ll review the worst-case scenario for the new implementation: when the system is fully uncontended and there’s only one set of settings being requested. Here, the old design with a channel for clean connections and another channel for connections with settings should really shine. Although it’s not a particularly relevant case, because it doesn’t happen in real production systems, it’s interesting to see if we can beat the latency of the old implementation. The parameters of the test are the same as for the uncontended trace, but we’re only using one connection setting with weight 1, and clean connections with weight 5.

|         | before (ms) | after (ms) |
|---------|-------------|------------|
| average | 0.730532    | 0.735382   |
| std     | 1.25725     | 1.23995    |
| pct50   | 0.655849    | 0.667169   |
| pct75   | 0.917489    | 0.90591    |
| pct99   | 1.17231     | 1.18814    |
| max     | 16.9025     | 16.5538    |

Great success! Our fast, uncontended path is roughly as efficient as reading from a non-empty Go channel. Decent feat. The histogram of wait times tells the same story:

[figure: histogram of wait times (uncontended, no settings)]

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

@vitess-bot (bot) commented Sep 19, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Sep 19, 2023
@github-actions github-actions bot added this to the v18.0.0 milestone Sep 19, 2023
@vmg vmg removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Sep 20, 2023
@frouioui frouioui removed this from the v18.0.0 milestone Sep 20, 2023
@frouioui (Member) commented

I have removed the milestone on this Pull Request as we want this in v19.0.0.

@frouioui frouioui added this to the v19.0.0 milestone Sep 20, 2023
@vmg vmg mentioned this pull request Sep 22, 2023
@vmg vmg force-pushed the connpool branch 2 times, most recently from 54c23e7 to 1c8c859, on September 22, 2023 09:44
@vmg vmg marked this pull request as ready for review September 22, 2023 14:09
@vmg (Collaborator, Author) commented Sep 22, 2023

This PR is now 🍏 and ready for review. 🎉

Some notes on the development and integration of the PR:

  • As you can see, the new connection pool has different enough semantics that it hasn't been feasible to implement it as a swappable replacement for the old one. I think that's perfectly OK: I tried really hard to write the LFU cache replacement to be swappable, and I don't think that any of the users ever really opted out of the new implementation because the new cache was just better in every way. Looking at all the analysis I've performed so far, I think the same thing is gonna happen here: the new pool is just strictly better for all workloads than the old one. The only reason to switch to the old one would be bugs we haven't noticed yet, which brings me to the next point...
  • Obviously this shouldn't go into Vitess 18, which is being frozen next week. We'll merge after the release branch is cut, and we'll have the whole Vitess 19 period to iron out all the bugs and kinks in the new implementation. We'll be able to ship this pool in Vitess 19 with high confidence.
  • I've found several bugs in tests regarding broken connections which are not related to this PR, but were surfaced by it, because we now re-use connections in a LIFO manner. I've extracted those fixes into a separate PR which has already been merged: ci: pool-related test flakyness #14076

@vmg (Collaborator, Author) commented Sep 22, 2023

One last significant bug that is handled by this PR, and which was present in the old connection pool implementation:

Because of the way the MySQL protocol works (it is client-led), there's no way to detect when the MySQL server has closed a connection from its side. So there's this issue where we can pull a connection from the pool that has been sitting there idle for a while, and the connection is broken because there was a network flake or the MySQL server just killed us. However, we won't know that the connection is broken until we actually try to perform a query on it, and we get an actual error.

Most of the time, this is well handled in Vitess because our default implementation for a pooled connection has transparent retries: when you attempt a query and find out that the connection is broken, we'll automatically attempt a reconnection and retry the query.

But there are times when this is not an option: when using a StatefulConnection in the vttablet, we cannot perform any reconnection/retries, because the connection represents a consistent state (often an open transaction) and re-connecting it would break that invariant and lose our transaction. Whenever a query fails on such a connection, we cannot afford to retry it (even if it's the very first statement in the transaction, because the error could have happened after the statement was executed).

This leaves us in a situation where a StatefulConnection sometimes fails right away because it pulled a broken connection from the pool, and we have to return an error to the user. This situation has always been present, but it's become more apparent now in CI because the new LIFO pool finds more of these broken connections, particularly in the healthcheck endtoend tests, which often include killing or restarting mysqld instances to see how the tablet reacts.

So how do we fix this? Fortunately, I've been through this before: I found the same issue when deploying GitHub's first production service in Go a few years ago. The OSS MySQL driver for Go (go-mysql-driver) suffered from the same problem of returning broken connections from the pool. I devised an ingenious approach to figure out whether the connection is broken by bypassing the Go socket abstraction and directly performing a non-blocking read on the socket.

You can read a deep dive into the problem and the fix in this blog post I wrote: https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/
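
The gist of that check can be sketched against Go's standard net and syscall packages as below. This is a Unix-only, simplified illustration of the idea, not the actual change made to mysql.Conn in this PR.

```go
import (
	"errors"
	"io"
	"net"
	"syscall"
)

// connCheck does a non-blocking read on the raw socket of an idle
// connection. A healthy idle connection has nothing to read, so the read
// fails with EAGAIN/EWOULDBLOCK; if the server closed the connection we
// see EOF instead (or unexpected stray bytes).
func connCheck(conn net.Conn) error {
	sysConn, ok := conn.(syscall.Conn)
	if !ok {
		return nil // not a raw socket (e.g. a test fake); skip the check
	}
	rawConn, err := sysConn.SyscallConn()
	if err != nil {
		return err
	}
	var checkErr error
	err = rawConn.Read(func(fd uintptr) bool {
		var buf [1]byte
		n, rerr := syscall.Read(int(fd), buf[:])
		switch {
		case n == 0 && rerr == nil:
			checkErr = io.EOF // the peer closed the connection
		case rerr == syscall.EAGAIN || rerr == syscall.EWOULDBLOCK:
			checkErr = nil // nothing to read: the connection is still alive
		case rerr != nil:
			checkErr = rerr
		default:
			checkErr = errors.New("unexpected data on idle connection")
		}
		return true // don't wait for readability; we only wanted one probe
	})
	if err != nil {
		return err
	}
	return checkErr
}
```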

I've implemented the same fix in Vitess' mysql.Conn and, lo and behold, our StatefulConnections no longer fail when they pull a broken connection from the pool. This removes a good chunk of flaky tests from CI, and I'm sure it will improve reliability when using the new connection pool against busy mysqld instances.

@harshit-gangal harshit-gangal added the Benchmark me Add label to PR to run benchmarks label Sep 27, 2023
vmg added 17 commits October 18, 2023 16:02
Signed-off-by: Vicent Marti <vmg@strn.cat>
@vmg (Collaborator, Author) commented Oct 18, 2023

Whew! So the benchmarks for this PR were not looking as expected. AreWeFast was showing a ~4% reduction in QPS and, most shockingly, an 18%+ increase in CPU time in the vttablets when measuring this PR against the base branch.

I've spent the last 2 weeks (has it really been that long???) figuring this out. Fortunately you get to see the recap as a summary without having to read about all the wrong things I tested. And I tested many many things! The most puzzling thing about this performance regression is that it was impossible to reproduce anywhere but in the arewefastyet runners. But with some help from @frouioui, I acquired SSH access to those runners and started profiling benchmarks for our main branch against this PR.

The first clue about the performance regression came from comparing the CPU profiles for a main benchmark and this branch. Our branch was spending 4 times longer in the GC marking phase, but not in collection, compared to the old pool. This was weird, because one of the properties of the old pool is that it didn't allocate memory, and obviously the new pool doesn't allocate memory either (that would have been a performance regression). The large set of benchmarks that ship with this PR verify that this is the case: the pools don't allocate memory during usage. So why are we spending so much time in GC with the new pool then?

After a lot of misses, I managed to acquire an execution trace of a benchmark for main and for this PR, and the issue began taking shape:

[execution trace screenshots]

Existing code in main during the benchmark (top) vs code in this branch (bottom)

The difference is stark: when running with the new connection pool, the Go GC just keeps aggressively collecting, all the time. That definitely explains the 18% increase in CPU time seen in the benchmarks, and the regression in QPS. We're just spending a lot of time garbage collecting, and the old pool is not!

Did we do something wrong here that continuously triggers the garbage collector? That was our first thought, and we carefully reviewed the code for allocations I may have missed, but there's nothing. The new pool is very efficient and just doesn't allocate.

We then noticed something very interesting about the GC pattern in the new code. The Go GC continuously tries to keep the heap at around 30MB with the new pool. With the old pool, however, it lets the heap grow to about 100MB before triggering a garbage collection cycle, which is why GC cycles happen much less often.

We had looked at a lot of heap profiles for the old and new pools without success already, but knowing that the heap of the new pool is just much smaller than the old one, we knew there was something hidden in those heap dumps that we had to find.

The culprit was not visible because of the -nodefraction default that hides small allocations in a pprof view. The allocations were very few, but they were not small at all. Setting -nodefraction=0 and looking at a heap dump for the old pool implementation made everything make sense:
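
(For anyone retracing this: the flag in question is pprof's node-fraction threshold, so something like `go tool pprof -nodefraction=0 heap.pprof` makes it show every node regardless of how small its share of the total is.)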

[figure: heap profile of the old pool with -nodefraction=0]

Why is the new pool slower in arewefastyet? Because it does exactly what we set out to do. It fixes the concern that @rbranson has had since the beginning. It uses a stack of connections to ensure the vttablet keeps re-using the same small set of connections over and over (LIFO), while the old pool used a channel that caused connections to cycle continuously (FIFO).

With a limit of 2000 concurrent transactions in the AreWeFast benchmark, the result was that each mysqld instance used in the benchmark had more than 2000 connections open from the tablet. After switching to the new pool, the mysqld instances average 55 open connections for the entirety of the benchmark.

...And with only 55 connections open in the new pool, the old one had a serious, unintended performance advantage. The 2000 connections kept open by the old pool each had a 16KiB I/O buffer, providing a homemade memory ballast of ~30MB, as seen in the now infamous blog post from Twitch. As a refresher, the target heap (and hence the GC frequency) of the Go garbage collector is modeled as:

Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100

So losing those 30 extra megabytes of permanently live heap, plus more allocations per second because the new pool serves requests faster, results in much more frequent GC runs with the new pool.
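
To put rough, illustrative numbers on it (GOGC=100, ignoring GC roots): 2000 pooled connections × 16KiB ≈ 31MiB of permanently live buffers in the old pool, versus ~55 × 16KiB ≈ 1MiB in the new one. With the target heap being roughly twice the live heap, the old pool's target lands around 2 × (31MiB + other live data), while the new pool's is about 2 × (1MiB + other live data), which lines up with the ~100MB vs ~30MB GC targets we observed.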

Once we hacked an adjusted GOGC percentage into the AreWeFast benchmark, the results were what we expected: the new pool has between 5 and 6% more throughput in queries per second, 2% reduced latency, a 98% reduction in open connections to mysqld, and a 70% reduction in the heap of vttablet instances (because of the significantly reduced number of open connections).

For people running Vitess in production, we'll add a note to the release notes explaining that vttablets will now see significantly reduced memory usage, and that their throughput can be increased by setting a higher GOGC value in a production deployment.

For now, I believe this PR is ready to merge. 👍 please?

(lastly, I'd like to point out that I only figured this out because I paired with @dbussink)

@dbussink (Contributor) commented

> (lastly, I'd like to point out that I only figured this out because I paired with @dbussink)

🦆

@dbussink (Contributor) left a comment:

Phew, that was a tricky one to find, but we found it!


Successfully merging this pull request may close these issues.

Feature request: Smarter Connection Pooling