
ccl/sqlproxyccl: add helpers related to connection migration #77109

Merged 2 commits into cockroachdb:master on Mar 3, 2022

Conversation

jaylim-crl
Collaborator

@jaylim-crl jaylim-crl commented Feb 27, 2022

ccl/sqlproxyccl: add helpers related to connection migration

Informs #76000. Extracted from #76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

Release note: None

Release justification: Helpers added in this commit are needed for the
connection migration work. Connection migration is currently not being used
in production, and CockroachCloud is the only user of sqlproxy.

ccl/sqlproxyccl: fix math for defaultBufferSize in interceptors

Previously, we incorrectly defined defaultBufferSize as 16K bytes. Note that
2 << 13 is 16K bytes. This commit fixes that behavior to match the original
intention of 8K bytes.

Release note: None

Release justification: This fixes an unintentional buglet within the sqlproxy
code that was introduced when the interceptors were added. Without this fix,
we use double the intended memory for each connection within the sqlproxy.
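The shift arithmetic behind the fix can be checked directly; a minimal sketch (the constant names here are illustrative, only defaultBufferSize matches the real sqlproxy identifier):

```go
package main

import "fmt"

const (
	// 2 << 13 shifts binary 10 left 13 places: 2 * 8192 = 16384 bytes (16K).
	// This was the buggy value.
	buggyBufferSize = 2 << 13
	// 2 << 12 (equivalently 1 << 13) yields 8192 bytes (8K), the intended size.
	defaultBufferSize = 2 << 12
)

func main() {
	fmt.Println(buggyBufferSize, defaultBufferSize) // 16384 8192
}
```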

@cockroach-teamcity
Member

This change is Reviewable

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Builds on top of cockroachdb#77109 and cockroachdb#77111.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.requested
- proxy.conn_migration.accepted
- proxy.conn_migration.rejected
- proxy.conn_migration.success
- proxy.conn_migration.error
- proxy.conn_migration.fail
- proxy.conn_migration.timeout_closed
- proxy.conn_migration.timeout_recoverable

For more details, see metrics.go in the sqlproxyccl package.

Release note: None
Resolved review threads on:
- pkg/ccl/sqlproxyccl/conn_migration.go (3 threads, outdated)
- pkg/ccl/sqlproxyccl/conn_migration_test.go
- pkg/ccl/sqlproxyccl/forwarder.go (outdated)
Collaborator Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)


pkg/ccl/sqlproxyccl/conn_migration.go, line 46 at r2 (raw file):

the sql server can determine the limit when serializing the session

Can you elaborate more on this? We don't have such limitations in the SQL server today. Also, adding a session serialization limit to the SQL server feels off without having a cluster setting that denotes such a limit. See the next sentence in this comment.

// bytes). This may occur if a user sets their cluster settings to large values.
// We can potentially add a TenantReadOnly cluster setting that restricts the
// maximum length of session variables for a better migration experience. For
// now, we'll just close the connection.

How do we protect against a user setting a session variable to a large string, and then us attempting to load everything into the proxy's memory during the read?

cc: @rafiss


pkg/ccl/sqlproxyccl/conn_migration.go, line 84 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: I think this use of expectNextServerMessage is doing more harm as a helper than good. The code would be easier to interpret if it used peekMsg directly.

E.g. I would prefer:

if ctx.Err() != nil {
    return ctx.Err()
}

typ, size, err := serverInterceptor.PeekMsg()
if err != nil {
    return errors.Wrap(err, "peeking message")
}
if typ == pgwirebase.ServerMsgErrorResponse {
    return errors.New("ambiguous ErrorResponse")
}
if typ != pgwirebase.ServerMsgRowDescription || size > 4096 {
    if _, err := serverInterceptor.ForwardMsg(clientConn); err != nil {
        return errors.Wrap(err, "forwarding message")
    }
    continue
}
msg, err := serverInterceptor.ReadMsg()

By using PeekMsg the code reads like normal control flow instead of error handling. It also allows you to remove the skipRead parameter.

Are you referring to just this call site, or all uses of expectNextServerMessage?

FYI, the original implementation in the other PR did that, but in a recursive manner: https://github.com/cockroachdb/cockroach/blob/6a8d61039dec0eb8e6f43cc13578d8f4857cfe04/pkg/ccl/sqlproxyccl/conn_migration.go#L88-L120

After rewriting it to this approach, I prefer this over the other one. If you're referring to just this call site, that's fine, and we could break it down to call PeekMsg directly, but if it's for all call sites, I don't think it's a good idea.


pkg/ccl/sqlproxyccl/conn_migration.go, line 364 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Is this here as an optimization? The only benefit I see is it guarantees ReadMsg never needs to allocate. I think we should remove this check. It adds an error case the caller needs to consider.

This case handles the other comment you pointed out. For connection migrations, imposing a hard limit on the maximum size that we'll read makes things easier to reason about, and we don't need to care about edge cases. The case that I was trying to handle here is if SHOW TRANSFER STATE returned a very large message (e.g. serialized session in the form of MBs).


pkg/ccl/sqlproxyccl/forwarder.go, line 214 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: I recommend deleting the markAsConnRecoverableError and isConnRecoverableError. These wrappers do not provide much value over errors.Mark(err, errConnRecoverable).

I can remove this, but FYI, this seems to be a common pattern in the DB code:

// MarkAsRetryJobError marks an error as a retriable job error which
// indicates that the registry should retry the job.
func MarkAsRetryJobError(err error) error {
    return errors.Mark(err, errRetryJobSentinel)
}

// Registry does not retry a job that fails due to a permanent error.
var errJobPermanentSentinel = errors.New("permanent job error")

// MarkAsPermanentJobError marks an error as a permanent job error, which indicates
// Registry to not retry the job when it fails due to this error.
func MarkAsPermanentJobError(err error) error {
    return errors.Mark(err, errJobPermanentSentinel)
}

// IsPermanentJobError checks whether the given error is a permanent error.
func IsPermanentJobError(err error) bool {
    return errors.Is(err, errJobPermanentSentinel)
}

// IsPauseSelfError checks whether the given error is a
// PauseRequestError.
func IsPauseSelfError(err error) bool {
    return errors.Is(err, errPauseSelfSentinel)
}

Maybe it made more sense if this were to be used outside the sqlproxyccl package, i.e. these functions are exported.

Collaborator Author

@jaylim-crl jaylim-crl left a comment


I've addressed the markAsConnRecoverableError and isConnRecoverableError nits. The rest are still pending discussion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)


pkg/ccl/sqlproxyccl/conn_migration.go, line 84 at r2 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Are you referring to just this call site, or all uses of expectNextServerMessage?

FYI, the original implementation in the other PR did that, but in a recursive manner: https://github.com/cockroachdb/cockroach/blob/6a8d61039dec0eb8e6f43cc13578d8f4857cfe04/pkg/ccl/sqlproxyccl/conn_migration.go#L88-L120

After rewriting it to this approach, I prefer this over the other one. If you're referring to just this call site, that's fine, and we could break it down to call PeekMsg directly, but if it's for all call sites, I don't think it's a good idea.

I've thought about this deeper, and I plan to leave it as-is if you don't feel strongly otherwise. Having all the parsing logic within a certain location, which in this case is expectNextServerMessage makes things easier to reason about and test. The biggest drawback to doing the approach that you suggested is that we now need to check the size (i.e. 4k) in two different places, which is what I'd like to avoid.


pkg/ccl/sqlproxyccl/forwarder.go, line 214 at r2 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I can remove this, but FYI, this seems to be a common pattern in the DB code:

// MarkAsRetryJobError marks an error as a retriable job error which
// indicates that the registry should retry the job.
func MarkAsRetryJobError(err error) error {
    return errors.Mark(err, errRetryJobSentinel)
}

// Registry does not retry a job that fails due to a permanent error.
var errJobPermanentSentinel = errors.New("permanent job error")

// MarkAsPermanentJobError marks an error as a permanent job error, which indicates
// Registry to not retry the job when it fails due to this error.
func MarkAsPermanentJobError(err error) error {
    return errors.Mark(err, errJobPermanentSentinel)
}

// IsPermanentJobError checks whether the given error is a permanent error.
func IsPermanentJobError(err error) bool {
    return errors.Is(err, errJobPermanentSentinel)
}

// IsPauseSelfError checks whether the given error is a
// PauseRequestError.
func IsPauseSelfError(err error) bool {
    return errors.Is(err, errPauseSelfSentinel)
}

Maybe it made more sense if this were to be used outside the sqlproxyccl package, i.e. these functions are exported.

Done.

@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch 2 times, most recently from 3321c8e to 91144f8 Compare March 1, 2022 04:23
Collaborator Author

@jaylim-crl jaylim-crl left a comment


I updated the waitForShowTransferState function to also return the transferErr. With that change, all the conn recoverable stuff is deferred to the other PR. Doing this also means that we can be sure that if we're in the stateTransferSessionSerialization state, the connection will be non-recoverable.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)

@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch from 91144f8 to 4019a47 Compare March 1, 2022 04:44
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 1, 2022
Informs cockroachdb#76000. Builds on top of cockroachdb#77109 and cockroachdb#77111.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.requested
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted
- proxy.conn_migration.protocol_error

For more details, see metrics.go in the sqlproxyccl package.

Release note: None

Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.
@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch from 4019a47 to 7a182b1 Compare March 1, 2022 20:12
Collaborator Author

@jaylim-crl jaylim-crl left a comment


As discussed offline, I went ahead and updated 4KB to 1MiB. Here are a couple of follow ups:

  1. Attempt to refactor expectNextServerMessage again. Having the proxy check the message size this way can be confusing as-is.
  2. Add metrics to measure DataRow size. This will allow us to figure out what's the right size.
  3. File DB ticket to add TenantReadOnly cluster setting that restricts serialized session size. i.e. we should get a SQL error if it exceeds that cluster setting's value.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, @jeffswenson, and @rafiss)

@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch from 7a182b1 to 2b1d593 Compare March 2, 2022 03:51
Collaborator

@jeffswenson jeffswenson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @rafiss)


pkg/ccl/sqlproxyccl/conn_migration.go, line 46 at r2 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

the sql server can determine the limit when serializing the session

Can you elaborate more on this? We don't have such limitations in the SQL server today. Also, adding a session serialization limit to the SQL server feels off without having a cluster setting that denotes such a limit. See the next sentence in this comment.

// bytes). This may occur if a user sets their cluster settings to large values.
// We can potentially add a TenantReadOnly cluster setting that restricts the
// maximum length of session variables for a better migration experience. For
// now, we'll just close the connection.

How do we protect against a user setting a session variable to a large string, and then us attempting to load everything into the proxy's memory during the read?

cc: @rafiss

We discussed this offline. For now we will use a 1 MiB constant in the sqlproxy. Long term we should have the sql server return an error if the session is too large.

@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch from 2b1d593 to e6806ae Compare March 2, 2022 18:57
Collaborator Author

@jaylim-crl jaylim-crl left a comment


TFTR! (1) and (2) are tracked in #76000. For (3), an issue has been created: #77302.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @rafiss)

Contributor

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @rafiss)


pkg/ccl/sqlproxyccl/conn_migration.go, line 46 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

We discussed this offline. For now we will use a 1 MiB constant in the sqlproxy. Long term we should have the sql server return an error if the session is too large.

I agree that 1M is a better limit than 4K. Even just a few prepared queries would quickly exceed a 4K limit. I also like the idea of having the SQL pod return a "can't serialize" error when the amount of state exceeds some pre-configured value (it'd be nice to be able to set this, in case we find that 1M is not right).


pkg/ccl/sqlproxyccl/conn_migration.go, line 50 at r17 (raw file):

// potentially add a TenantReadOnly cluster setting that restricts the maximum
// length of session variables for a better migration experience. For now, we'll
// just close the connection.

It seems pretty bad to "just close the connection", as that means we'll be regularly rudely killing user connections that have a lot of prepared statements. I don't think we can ship with this (but fine for this PR, until we get support in the database).


pkg/ccl/sqlproxyccl/conn_migration.go, line 396 at r17 (raw file):

}

// isErrorResponseError returns true if the error is a type mismatch due to

It's not great that we're parsing error messages like this. Why can't we use errors.Is function and a set of reference errors that we define at the top of this file?

if errors.Is(err, errorResponseErr)

pkg/ccl/sqlproxyccl/conn_migration.go, line 426 at r17 (raw file):

		return fmt.Sprintf("%v", msg)
	}
	return encoding.UnsafeString(m)

I don't think it's safe (or even if safe, it's too fragile) to use UnsafeString in situations like this. I don't know how pgproto3 works, but what if BackendMessage reuses the same underlying bytes memory for each incoming message? If you reinterpret that memory as being string rather than bytes, it could change out from underneath you in unpredictable ways and perhaps even cause hard process crashes / security exploits.

Looking at your usages of UnsafeString, the performance-critical cases are comparing a string with bytes. I think I'd create a little helper function called compareStringWithBytes or similar to do that operation. In that function, I'd first try to find some existing Go function that did what I needed it to do (compare string and byte slice byte-by-byte). If I couldn't find something like that, I might, when all other avenues are exhausted, do the unsafe string trick. But I don't think I'd export unsafeString from the encoding package, as it violates that package's purpose and clean API. It's a trivial function, so I'd either find a different shared package where it might live, or I'd just create a copy of it.

The other use cases for unsafeString are just error cases, so I'd just let those be slow and use a regular []byte-to-string conversion for those.

Collaborator Author

@jaylim-crl jaylim-crl left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @rafiss)


pkg/ccl/sqlproxyccl/conn_migration.go, line 46 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I agree that 1M is a better limit than 4K. Even just a few prepared queries would quickly exceed a 4K limit. I also like the idea of having the SQL pod return a "can't serialize" error when the amount of state exceeds some pre-configured value (it'd be nice to be able to set this, in case we find that 1M is not right).

The PR to implement the idea as discussed is already up: #77307. We'll rely on that cluster setting once we set it on the host.


pkg/ccl/sqlproxyccl/conn_migration.go, line 50 at r17 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

It seems pretty bad to "just close the connection", as that means we'll be regularly rudely killing user connections that have a lot of prepared statements. I don't think we can ship with this (but fine for this PR, until we get support in the database).

Yes - agreed. There will be a follow-up PR to simplify all this work, and rely on the limits set within the DB.


pkg/ccl/sqlproxyccl/conn_migration.go, line 396 at r17 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

It's not great that we're parsing error messages like this. Why can't we use errors.Is function and a set of reference errors that we define at the top of this file?

if errors.Is(err, errorResponseErr)

I originally had it that way, but realized that doing what I did here seems to be a common thing in the DB, so I used this approach instead. That said, there will be an upcoming PR to remove all this.


pkg/ccl/sqlproxyccl/conn_migration.go, line 426 at r17 (raw file):

I don't know how pgproto3 works, but what if BackendMessage reuses the same underlying bytes memory for each incoming message? If you reinterpret that memory as being string rather than bytes, it could change out from underneath you in unpredictable ways and perhaps even cause hard process crashes / security exploits.

The underlying memory is backed by the interceptor's memory, which reuses the same underlying bytes memory. When designing the API for the interceptor, it was clear that when using ReadMsg, callers should be informed that the underlying memory would not be valid the next time we call other methods on the interceptor.

But I don't think I'd export unsafeString from the encoding package, as it violates that package's purpose and clean API. It's a trivial function, so I'd either find a different shared package where it might live, or I'd just create a copy of it.

FWIW, that package exposes encoding.UnsafeConvertStringToBytes that does the opposite.

That said, I think you have a couple of good points there. I will back out all these changes, and do the regular []byte to string. It appears that the Go compiler is smart enough to optimize for this case: golang/go@69cd91a. Also see: https://go.dev/play/p/KtNa0nR_biU 😄

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkRegular-16         	160258896	         7.387 ns/op
BenchmarkBytesCompare-16    	23054590	        49.03 ns/op
BenchmarkByteByByte-16      	24618225	        56.12 ns/op
BenchmarkUnsafe-16          	155243484	         7.468 ns/op
BenchmarkUnsafeEqual-16     	157344313	         7.599 ns/op
PASS
ok  	command-line-arguments	9.155s
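The allocation-free comparison being benchmarked can be wrapped in a small helper along the lines of the compareStringWithBytes suggestion; a sketch (the helper name is illustrative, not from the PR):

```go
package main

import "fmt"

// equalStringBytes compares a string with a byte slice. The Go compiler
// recognizes the string(b) conversion inside a comparison and elides the
// allocation (golang/go commit 69cd91a), so this matches the speed of the
// unsafe-string trick without the memory-aliasing risk.
func equalStringBytes(s string, b []byte) bool {
	return s == string(b)
}

func main() {
	fmt.Println(equalStringBytes("SHOW TRANSFER STATE", []byte("SHOW TRANSFER STATE"))) // true
	fmt.Println(equalStringBytes("abc", []byte("abd")))                                 // false
}
```

This keeps the encoding package's API clean, since no unsafeString export is needed for the hot-path comparison.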

@jaylim-crl jaylim-crl force-pushed the jay/add-conn-migration-helpers branch from e6806ae to 4e668b1 Compare March 3, 2022 04:11
Collaborator Author

@jaylim-crl jaylim-crl left a comment


I'm taking back the approach of doing it in a different PR. I've updated this PR accordingly with all the necessary changes. The only follow up from here would be metrics for the DataRow size for us to tune the new cluster setting that will be added.

This is ready for another round of review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @rafiss)

Contributor

@andy-kimball andy-kimball left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @rafiss)

@jeffswenson
Collaborator

LGTM 💯

@jaylim-crl
Collaborator Author

TFTR!

bors r=JeffSwenson,andy-kimball

@craig craig bot merged commit 960f2b4 into cockroachdb:master Mar 3, 2022
@craig
Contributor

craig bot commented Mar 3, 2022

Build succeeded:

@jaylim-crl jaylim-crl deleted the jay/add-conn-migration-helpers branch March 3, 2022 21:26