rfcs: sqlproxy connection migration #75707
Conversation
Force-pushed from 5a20549 to c02ceb3.
Force-pushed from 5c1a226 to 77db251.
Nicely written. I only did a quick skim in order to leave a few comments about relevant concerns with related work.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
draining phase lasts for at most 10 minutes, and if connections are still active at the end of the phase, they will be terminated abruptly, which can lead to poor user experiences.
In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 54 at r1 (raw file):
For every client connection, the proxy selects a backend based on some routing algorithm, performs authentication with it, and does a [transparent forwarding](https://github.com/cockroachdb/cockroach/blob/ced473fa0ed00660812a6f6678155fb30e6a871f/pkg/ccl/sqlproxyccl/proxy.go#L84-L86) of all packets from the client to the server without intercepting messages after
Nit: the messages are still intercepted, right? That is, we're still reading from one socket and writing to another socket. It is simply that we're not decoding or altering the messages after the authentication phase.
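To make the distinction concrete, here is a minimal sketch of what byte-level forwarding looks like: the proxy reads from one socket and writes to the other without decoding pgwire messages. The function name and structure are illustrative, not the actual sqlproxyccl code.

```go
package sketch

import (
	"io"
	"net"
)

// forward copies raw bytes between client and server until either
// side closes. The proxy never parses pgwire message boundaries
// here; it is a plain byte pipe in both directions.
func forward(client, server net.Conn) error {
	errCh := make(chan error, 2)
	go func() {
		_, err := io.Copy(server, client) // client -> server
		errCh <- err
	}()
	go func() {
		_, err := io.Copy(client, server) // server -> client
		errCh <- err
	}()
	// io.Copy returns a nil error on EOF, so a clean close on either
	// side also unblocks this.
	return <-errCh
}
```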
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):
is aware of Postgres message boundaries. We propose to make the proxy aware of pgwire message boundaries. This implies
Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.
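For reference, a sketch of what pgwire message-boundary awareness could look like. Every regular (post-startup) pgwire message is a 1-byte type followed by a 4-byte big-endian length that includes itself; the helper name below is hypothetical.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"io"
)

// copyOneMessage forwards exactly one pgwire message from src to dst
// and returns its type byte, so a caller can act on message
// boundaries (e.g. pause a transfer between messages) without
// decoding message bodies.
func copyOneMessage(src io.Reader, dst io.Writer) (msgType byte, err error) {
	var header [5]byte
	if _, err := io.ReadFull(src, header[:]); err != nil {
		return 0, err
	}
	// The length field counts its own 4 bytes but not the type byte.
	bodyLen := int64(binary.BigEndian.Uint32(header[1:])) - 4
	if bodyLen < 0 {
		return 0, fmt.Errorf("malformed pgwire message length")
	}
	if _, err := dst.Write(header[:]); err != nil {
		return 0, err
	}
	if _, err := io.CopyN(dst, src, bodyLen); err != nil {
		return 0, err
	}
	return header[0], nil
}
```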
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):
instances. These instances come with an additional decoding overhead for every message. For the proxy, there is no need to decode the messages in a steady state.
This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData
messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.
TFTRs! For some reason I'm not seeing the comments on Reviewable, so I quoted again.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
From Peter:
In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.
This is an interesting use-case, and I hadn't thought about that. Thanks for bringing this up. Thinking about it more, we could even use this for tenant upgrades within Serverless. Today, we just terminate old pods during upgrades, and spin up new ones, resulting in rude disconnects. There's an issue filed to make the process less disruptive: https://cockroachlabs.atlassian.net/browse/CC-4796. cc: @alyshanjahani-crl
This RFC proposes a foundation to how we will transfer connections from one pod to another, and will support the use-case that was described. The interesting bit that needs to be updated when that happens would be the pod selection algorithm. When transferring connections during upgrades, a caller could invoke the relevant APIs on the forwarder (e.g. RequestTransfer) to transfer a connection, and the pod selection algorithm would need to return a pod that is ready to serve connections (e.g. newer pods which are ready, or older pods which haven't been drained yet).
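To illustrate, a rough sketch of how a caller-driven transfer during an upgrade could be wired up. `RequestTransfer` is the forwarder API named above; the selection policy and all other names are hypothetical.

```go
package sketch

// forwarder stands in for the per-connection forwarder in the RFC;
// RequestTransfer asks it to move its session to another SQL pod.
// The exact signature is illustrative.
type forwarder interface {
	RequestTransfer()
}

// selectUpgradeTarget is a hypothetical pod-selection policy for
// rolling upgrades: prefer new-version pods that are ready, and fall
// back to older pods that have not started draining. A real
// implementation would be consulted by the forwarder when it picks
// the destination pod.
func selectUpgradeTarget(readyNew, notDrainingOld []string) (addr string, ok bool) {
	if len(readyNew) > 0 {
		return readyNew[0], true
	}
	if len(notDrainingOld) > 0 {
		return notDrainingOld[0], true
	}
	return "", false
}

// drainForUpgrade requests a transfer for every active connection on
// a pod that is about to be upgraded.
func drainForUpgrade(conns []forwarder) {
	for _, f := range conns {
		f.RequestTransfer()
	}
}
```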
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 55 at r1 (raw file):
From Peter:
Nit: the messages are still intercepted, right? That is, we're still reading from one socket and writing to another socket. It is simply that we're not decoding or altering the messages after the authentication phase.
Right, we're still reading from one, and writing to the other. My original intention was to emphasize that bytes are just forwarded in a transparent manner, without any notion of pgwire message boundaries. I'll rework this.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):
From Peter:
Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.
Ack - that's good to know.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):
From Peter:
This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.
That's fine, and the proposed design in this RFC can do that. We could peek on the interceptor to look for BackendKeyData messages during the steady state, and decode accordingly in the response processor. I'm not entirely sure how the cancel operation will be requested (e.g. user-initiated, server-initiated), but if we didn't want to decode all BackendKeyData messages during the steady state, we could have an additional signal (e.g. check on SimpleQuery in the case of user-initiated cancel operations) to start the decoding window. Regardless, with decoding, there will be some performance hit as there might be allocations and copying, depending on how large these messages are.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
From Peter:
This might need to change to support the pgwire cancel protocol. In particular, I think we'd need to decode BackendKeyData messages sent by the server in order to alter them before sending them to the client. It isn't clear yet exactly how this is going to work, but worth staying abreast of the cancel protocol requirements. Cc @rafiss who is leading the charge on that work.
That's fine, and the proposed design in this RFC can do that. We could peek on the interceptor to look for BackendKeyData messages during the steady state, and decode accordingly in the response processor. I'm not entirely sure how the cancel operation will be requested (e.g. user-initiated, server-initiated), but if we didn't want to decode all BackendKeyData messages during the steady state, we could have an additional signal (e.g. check on SimpleQuery in the case of user-initiated cancel operations) to start the decoding window. Regardless, with decoding, there will be some performance hit as there might be allocations and copying, depending on how large these messages are.
The BackendKeyData messages are very small: 12 or 16 bytes. I think we'd need to decode all of them. While cancellation is user-initiated, the CancelMessage from the user contains info in the BackendKeyData messages.
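For concreteness, a sketch of decoding such a message, assuming the standard PostgreSQL BackendKeyData layout (type byte 'K', int32 length of 12, int32 process ID, int32 secret key). CockroachDB's actual key may be larger, per the 16-byte figure above.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
)

type backendKeyData struct {
	ProcessID uint32
	SecretKey uint32
}

// decodeBackendKeyData parses a full wire message: 'K' + length + body.
// Standard PostgreSQL encodes a length of 12, so the whole message is
// 13 bytes including the type byte.
func decodeBackendKeyData(msg []byte) (backendKeyData, error) {
	if len(msg) != 13 || msg[0] != 'K' {
		return backendKeyData{}, fmt.Errorf("not a standard BackendKeyData message")
	}
	return backendKeyData{
		ProcessID: binary.BigEndian.Uint32(msg[5:9]),
		SecretKey: binary.BigEndian.Uint32(msg[9:13]),
	}, nil
}
```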
TFTR! Interesting how the comments appeared (see timestamps).
I marked some of them as done to avoid confusion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, @petermattis, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.
Done.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 105 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Coincidentally, we likely need to make the proxy aware of pgwire message contents in order to support the pgwire cancel protocol.
Done.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 378 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The BackendKeyData messages are very small: 12 or 16 bytes. I think we'd need to decode all of them. While cancellation is user-initiated, the CancelMessage from the user contains info in the BackendKeyData messages.
Ack. Since the messages are small, I wouldn't be too worried about this: it's likely that the memory will come from the internal buffer without any additional allocations (at least most of the time).
TFTR! Interesting how the comments appeared (see timestamps).
Looks like some sort of reviewable bug.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @rafiss)
LGTM. I do have a couple production concerns. You can add them to this RFC or open a Jira ticket.
- There should be metrics tracking the number of failed transfers and the number of connections closed mid-transfer. The metrics should be visible on the sqlproxy Grafana page (see the sketch after this list).
- We need a way to disable this for a specific tenant. That way, if there is a workload that really tickles the known edge cases in the protocol, we can opt the tenant out of session transfers. I'm imagining something like a flag on the sqlproxy or an option returned by the tenant directory.
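A sketch of what such metrics could look like, written against CockroachDB's pkg/util/metric package. The metric names here are hypothetical, not the ones that were ultimately merged.

```go
package sqlproxyccl

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// Hypothetical metric metadata for connection-migration outcomes.
var (
	metaConnMigrationErrorCount = metric.Metadata{
		Name:        "proxy.conn_migration.error",
		Help:        "Number of failed connection transfers",
		Measurement: "Connection Migrations",
		Unit:        metric.Unit_COUNT,
	}
	metaConnMigrationClosedCount = metric.Metadata{
		Name:        "proxy.conn_migration.closed",
		Help:        "Number of connections closed mid-transfer",
		Measurement: "Connection Migrations",
		Unit:        metric.Unit_COUNT,
	}
)

// Counters would then be registered on the proxy's metrics struct and
// surfaced on the sqlproxy Grafana dashboard.
var (
	connMigrationErrorCount  = metric.NewCounter(metaConnMigrationErrorCount)
	connMigrationClosedCount = metric.NewCounter(metaConnMigrationClosedCount)
)
```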
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jaylim-crl, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
From Peter:
In addition to migration due to scaling events, we may want to migrate connections when performing a rolling upgrade of the CRDB version the pods are running. This use case also applies to Dedicated clusters and I can imagine a day in the future where we deploy SQL proxy in front of Dedicated clusters in order to get hit-less rolling upgrades. I'm not arguing for supporting Dedicated clusters in this work, but it would be nice if anything done here doesn't preclude that eventuality.
This is an interesting use-case, and I hadn't thought about that. Thanks for bringing this up. Thinking about it more, we could even use this for tenant upgrades within Serverless. Today, we just terminate old pods during upgrades, and spin up new ones, resulting in rude disconnects. There's an issue filed to make the process less disruptive: https://cockroachlabs.atlassian.net/browse/CC-4796. cc: @alyshanjahani-crl
This RFC proposes a foundation to how we will transfer connections from one pod to another, and will support the use-case that was described. The interesting bit that needs to be updated when that happens would be the pod selection algorithm. When transferring connections during upgrades, a caller could invoke the relevant APIs on the forwarder (e.g. RequestTransfer) to transfer a connection, and the pod selection algorithm would need to return a pod that is ready to serve connections (e.g. newer pods which are ready, or older pods which haven't been drained yet).
The big asterisk for migrating sessions during the upgrade process is version skew. If we want to transfer between versions, we need the DB to guarantee the serialized session is compatible.
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 227 at r1 (raw file):
coupled with the [proxy's handler](https://github.com/cockroachdb/cockroach/blob/ad27839166526a92444bf0f719cf9ec5f62c2f7f/pkg/ccl/sqlproxyccl/proxy_handler.go#L209). At the same time, the [authenticate](https://github.com/cockroachdb/cockroach/blob/ad27839166526a92444bf0f719cf9ec5f62c2f7f/pkg/ccl/sqlproxyccl/authentication.go#L21) function that is used for authentication is coupled with the throttler. We
The sqlproxy does not need to check the throttler when it is transferring a connection between SQL pods. The throttler is intended to prevent credential guessing, and the proxy only transfers authenticated connections. The design you discuss handles this case well. The sqlproxy can inject no-op throttle hooks when transferring sessions.
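A sketch of the no-op throttle hook idea, assuming authenticate accepts a hook of roughly this shape; all names are illustrative.

```go
package sketch

import "errors"

// throttleHook mirrors the shape of the hook consulted on
// authentication attempts; it can reject an attempt by returning an
// error.
type throttleHook func() error

// realThrottle stands in for the hook used on brand-new client
// connections, where the throttler slows down credential guessing.
func realThrottle() error {
	return errors.New("too many attempts; throttled")
}

// noopThrottle is used when transferring an already-authenticated
// session between SQL pods: there is no credential guessing to
// defend against, so the hook always allows the attempt.
func noopThrottle() error { return nil }
```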
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jaylim-crl, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
The big asterisk for migrating sessions during the upgrade process is version skew. If we want to transfer between versions, we need the DB to guarantee the serialized session is compatible.
Currently CockroachDB changes query planning rules, adds new SQL session variables, and makes other user-visible changes automatically during upgrades.
This is not compatible with seamless session migration across versions: a SQL client could auto-detect a version upon connecting, make choices/assumptions about SQL semantics over that connection, and if we migrate to a new version these assumptions would break with inscrutable errors client-side.
At the very least, we should never transfer a session across an upgrade to a different major version of cockroachdb.
Then I am tempted to have us make the choice to transfer sessions across versions configurable for the end-user, so that they can disable this if it's incompatible with their client app. Something like a choice between the following three:
- enable seamless migrations, warning: this can cause your apps to experience different versions of CockroachDB within the same SQL connection.
- disable seamless migrations, block upgrades with open connections, warning: this can cause your apps to block cockroachdb bug fixes.
- disable seamless migrations, close connections upon upgrades, warning: this can cause your apps to experience a server-side disconnect upon upgrades.
Force-pushed from 77db251 to 76b8559.
TFTR! Updated this slightly to reflect the implementation that was merged.
There should be metrics tracking the number of failed transfers and the number of connections closed mid-transfer. The metrics should be visible on the sqlproxy grafana page.
I added this to the meta tracking issue here: #76000
We need a way to disable this for a specific tenant. That way if there is a workload that really tickles the known edge cases in the protocol, we can opt the tenant out of session transfers. I'm imaging something like a flag on the sqlproxy or an option returned by the tenant directory.
That's going to be part of the next RFC. This RFC is only scoped to how to transfer. The mechanism of when to transfer will be handled by the rebalancer, and for that, we could have it check on the tenant metadata to know whether connections can be transferred.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jeffswenson, and @rafiss)
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 29 at r1 (raw file):
Previously, knz (kena) wrote…
Currently CockroachDB changes query planning rules, adds new SQL session variables, and makes other user-visible changes automatically during upgrades.
This is not compatible with seamless session migration across versions: a SQL client could auto-detect a version upon connecting, make choices/assumptions about SQL semantics over that connection, and if we migrate to a new version these assumptions would break with inscrutable errors client-side. At the very least, we should never transfer a session across an upgrade to a different major version of cockroachdb.
Then I am tempted to have us make the choice to transfer sessions across versions configurable for the end-user, so that they can disable this if it's incompatible with their client app. Something like a choice between the following three:
- enable seamless migrations, warning: this can cause your apps to experience different versions of CockroachDB within the same SQL connection.
- disable seamless migrations, block upgrades with open connections, warning: this can cause your apps to block cockroachdb bug fixes.
- disable seamless migrations, close connections upon upgrades, warning: this can cause your apps to experience a server-side disconnect upon upgrades.
Those are all valid points. We will consider that in the upcoming RFC about when to transfer a connection. In CC today, tenant upgrades destroy the old SQL pods right away without putting them into the DRAINING state, causing connections to be terminated right away, so migrating across version upgrades is not possible. That said, if we wanted to be defensive to start, we could ensure that we never transfer connections whenever versions do not match.
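A sketch of the defensive check described here; the pod type and version field are hypothetical.

```go
package sketch

import "fmt"

type pod struct {
	Addr    string
	Version string // e.g. "v22.1.0"
}

// checkTransferAllowed refuses to migrate a session between SQL pods
// running different CockroachDB versions, per the defensive stance
// suggested above.
func checkTransferAllowed(src, dst pod) error {
	if src.Version != dst.Version {
		return fmt.Errorf(
			"refusing to transfer session from %s (%s) to %s (%s): version mismatch",
			src.Addr, src.Version, dst.Addr, dst.Version,
		)
	}
	return nil
}
```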
docs/RFCS/20220129_sqlproxy_connection_migration.md, line 227 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
The sqlproxy does not need to check the throttler when it is transferring a connection between sql pods. The throttle is intended to prevent credential guesses and the proxy only transfers authenticated connections. The design you discuss handles this case well. The sqlproxy can inject no-op throttle hooks when transferring sessions.
The current design only throttles whenever we connect using the regular auth method. If there's a need in the future, we could also throttle the token approach (assuming that it becomes an end-user feature). I removed the part of decoupling throttler and authenticator since that's already implicit with the code today (i.e. authenticate takes in a throttleHook, though we still import the throttler).
Force-pushed from 76b8559 to edb29eb.
Everything has been merged, and this feature is now completed. The RFC has been updated to reflect what we have merged. I'll merge this next week if no one else has concerns.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @andy-kimball, @jeffswenson, and @rafiss)
Force-pushed from edb29eb to 8ce1d8a.
bors r=JeffSwenson

If there are more concerns, I'd be happy to address them in another PR.

Build succeeded.
RFC Text
This proposal describes the connection migration mechanism within the SQL
Proxy that allows us to migrate SQL connections from one SQL pod to another
by building on top of the session migration work by the SQL Experience team.
This RFC is only scoped to how to perform the migration from one SQL pod to
another within the SQL proxy. The details of when to perform the migration
will be discussed in a follow-up RFC. Note that the main use-case here is for
CockroachDB Serverless, and we should optimize for that.
What this will eventually solve:
- rude disconnects during Serverless scale-down events
- load imbalance during steady state and Serverless scale-up events
From the user's perspective, the connection should remain usable without
interruption, and there will be better resource utilization across all
provisioned SQL pods.
This RFC has already received some reviews in an earlier form, an internal
Google Doc titled Proxy Connection Migration.
Release justification: RFC/docs only change.
Release note: None