-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli: compute --drain-wait based on cluster setting values #98390
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
I'm investigating where this should be tested. It looks like roachtest/tests/drain.go might be best. |
09da9af
to
a1f019a
Compare
f06bd91
to
50ad228
Compare
50ad228
to
d2f3830
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the first and third commit. For the second commit, I'm not familiar with txn state transition so I would like to have others finalize the review.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @rafiss, and @srosenberg)
-- commits
line 5 at r1:
According to the code, if the --drain-wait
is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."
pkg/cli/rpc_node_shutdown.go
line 95 at r1 (raw file):
"cluster settings require a value of at least %s; using the larger value", drainCtx.drainWait, minWait) drainCtx.drainWait = minWait + 10*time.Second
I'm not sure why we have these additional 10 seconds here. Also, should we instead log using the lager value {minWait + 10 * time.Second}
?
pkg/cmd/roachtest/tests/drain.go
line 258 at r2 (raw file):
} if err == nil { return errors.New("expected pg_sleep query to fail")
Is this code reachable? if err == nil
, then testutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)")
is false, and it will just reach return nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)
Previously, ZhouXing19 (Jane Xing) wrote…
According to the code, if the
--drain-wait
is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."
yes, i should document that. 0 just means "no timeout"
pkg/cli/rpc_node_shutdown.go
line 95 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
I'm not sure why we have these additional 10 seconds here. Also, should we instead log
using the lager value {minWait + 10 * time.Second}
?
it's basically just a little bit of buffer. yeah i will fix that log message
pkg/cmd/roachtest/tests/drain.go
line 258 at r2 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Is this code reachable? if
err == nil
, thentestutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)")
is false, and it will just reachreturn nil
.
whoops, i shouldn't have the !
. nice catch
Release note (cli change): The --drain-wait argument for the `drain` command will be automatically increased if the command detects that it is smaller than the sum of server.shutdown.drain_wait, server.shutdown.connection_wait, server.shutdown.query_wait times two, and server.shutdown.lease_transfer_wait. If the --drain-wait argument is 0, then no timeout is used. This recommendation was already documented, but now the advice will be applied automatically.
d2f3830
to
3274081
Compare
commit #2 i can explain: it just moves that check to the end of the function, which is after the state transitions occur: cockroach/pkg/sql/conn_executor.go Line 2185 in 2d63378
i think you're able to review it :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like Jane, i looked at 1st and 3rd commit, and the 2nd one got me a bit confused. Maybe more words to explain what is going on in the commit message would help.
Otherwise i really like this!
Reviewed 5 of 5 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @rafiss, @srosenberg, and @ZhouXing19)
pkg/sql/conn_executor.go
line 2131 at r4 (raw file):
// a Sync is processed. defer func() { if ex.idleConn() {
Where is the res.Close call moved to?
This fixes a minor bug in which the connection would not get closed at the earliest possible time during server shutdown. The connection is supposed to be closed as soon as we handle a Sync message when the conn_executor is in the draining state and not in a transaction. Since the transaction state was checked before state transitions occurred, this would cause the connection to remain open for an extra bit of time. This was particularly a problem because the Sync message is also the command that auto-commits an implicit transaction. So before this commit, it was actually impossible for the check to work as it was supposed to. Now we check the txn state after state transitions occur. Release note: None
The test now does much more: - Checks that --drain-wait is automatically increased if it is set lower than the cluster settings require. - Check that the /health?ready=1 endpoint fails during the drain_wait period. - Check for the proper error message during the connection_wait phase. - Check for the proper error message when trying to begin a new query/transaction during the query_wait phase. - Check that an open transaction is allowed to continue during the query_wait phase. - Check for the proper error message when a query is canceled during shutdown. Release note: None
3274081
to
d6cdd94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've added details to the commit message that should add more clarity.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)
pkg/sql/conn_executor.go
line 2131 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Where is the res.Close call moved to?
now we use the Close call that occurs during the normal message handling / state transitions, here:
cockroach/pkg/sql/conn_executor.go
Line 2238 in 2d63378
res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks indeed this helps.
Reviewed 4 of 4 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @srosenberg, and @ZhouXing19)
tftr! bors r+ |
👎 Rejected by code reviews |
bors r+ |
Build failed (retrying...): |
bors r+ |
Already running a review |
Build failed: |
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 18eac87 to blathers/backport-release-22.1-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. error creating merge commit from d6cdd94 to blathers/backport-release-22.2-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This includes a few commits related to draining
cli: compute --drain-wait based on cluster setting values
fixes #98388
Release note (cli change): The --drain-wait argument for the
drain
command will be automatically increased if the command detects that it is smaller than the sum of server.shutdown.drain_wait, server.shutdown.connection_wait, server.shutdown.query_wait times two, and server.shutdown.lease_transfer_wait.This recommendation was already documented, but now the advice will be applied automatically.
sql: fix check for closing connExecutor during draining
This fixes a minor bug in which the connection would not get closed at
the earliest possible time during server shutdown.
The connection is supposed to be closed as soon as we handle a Sync
message when the conn_executor is in the draining state and not in a
transaction. Since the transaction state was checked before state
transitions occurred, this would cause the connection to remain open for
an extra bit of time.
roachtest: enhance drain test
The test now does much more:
than the cluster settings require.
period.
query/transaction during the query_wait phase.
query_wait phase.
shutdown.