Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: compute --drain-wait based on cluster setting values #98390

Merged
merged 3 commits into from
Mar 14, 2023

Conversation

rafiss
Copy link
Collaborator

@rafiss rafiss commented Mar 10, 2023

This includes a few commits related to draining


cli: compute --drain-wait based on cluster setting values

fixes #98388

Release note (cli change): The --drain-wait argument for the drain command will be automatically increased if the command detects that it is smaller than the sum of server.shutdown.drain_wait, server.shutdown.connection_wait, server.shutdown.query_wait times two, and server.shutdown.lease_transfer_wait.

This recommendation was already documented, but now the advice will be applied automatically.


sql: fix check for closing connExecutor during draining

This fixes a minor bug in which the connection would not get closed at
the earliest possible time during server shutdown.

The connection is supposed to be closed as soon as we handle a Sync
message when the conn_executor is in the draining state and not in a
transaction. Since the transaction state was checked before state
transitions occurred, this would cause the connection to remain open for
an extra bit of time.


roachtest: enhance drain test

The test now does much more:

  • Checks that --drain-wait is automatically increased if it is set lower
    than the cluster settings require.
  • Check that the /health?ready=1 endpoint fails during the drain_wait
    period.
  • Check for the proper error message during the connection_wait phase.
  • Check for the proper error message when trying to begin a new
    query/transaction during the query_wait phase.
  • Check that an open transaction is allowed to continue during the
    query_wait phase.
  • Check for the proper error message when a query is canceled during
    shutdown.

@rafiss rafiss requested a review from knz March 10, 2023 18:37
@rafiss rafiss requested a review from a team as a code owner March 10, 2023 18:37
@blathers-crl
Copy link

blathers-crl bot commented Mar 10, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rafiss
Copy link
Collaborator Author

rafiss commented Mar 10, 2023

I'm investigating where this should be tested. It looks like roachtest/tests/drain.go might be best.

@rafiss rafiss requested a review from a team as a code owner March 10, 2023 20:19
@rafiss rafiss requested review from herkolategan and srosenberg and removed request for a team March 10, 2023 20:19
@rafiss rafiss force-pushed the auto-compute-drain-wait branch 4 times, most recently from 09da9af to a1f019a Compare March 13, 2023 05:07
@rafiss rafiss requested a review from a team March 13, 2023 05:07
@rafiss rafiss requested a review from a team as a code owner March 13, 2023 05:07
@rafiss rafiss force-pushed the auto-compute-drain-wait branch 3 times, most recently from f06bd91 to 50ad228 Compare March 13, 2023 16:21
@rafiss rafiss requested a review from ZhouXing19 March 13, 2023 16:29
@rafiss rafiss force-pushed the auto-compute-drain-wait branch from 50ad228 to d2f3830 Compare March 13, 2023 16:37
Copy link
Collaborator

@ZhouXing19 ZhouXing19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the first and third commit. For the second commit, I'm not familiar with txn state transition so I would like to have others finalize the review.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @rafiss, and @srosenberg)


-- commits line 5 at r1:
According to the code, if the --drain-wait is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."


pkg/cli/rpc_node_shutdown.go line 95 at r1 (raw file):

			"cluster settings require a value of at least %s; using the larger value",
			drainCtx.drainWait, minWait)
		drainCtx.drainWait = minWait + 10*time.Second

I'm not sure why we have these additional 10 seconds here. Also, should we instead log using the lager value {minWait + 10 * time.Second}?


pkg/cmd/roachtest/tests/drain.go line 258 at r2 (raw file):

		}
		if err == nil {
			return errors.New("expected pg_sleep query to fail")

Is this code reachable? if err == nil, then testutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)") is false, and it will just reach return nil.

Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)


-- commits line 5 at r1:

Previously, ZhouXing19 (Jane Xing) wrote…

According to the code, if the --drain-wait is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."

yes, i should document that. 0 just means "no timeout"


pkg/cli/rpc_node_shutdown.go line 95 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

I'm not sure why we have these additional 10 seconds here. Also, should we instead log using the lager value {minWait + 10 * time.Second}?

it's basically just a little bit of buffer. yeah i will fix that log message


pkg/cmd/roachtest/tests/drain.go line 258 at r2 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Is this code reachable? if err == nil, then testutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)") is false, and it will just reach return nil.

whoops, i shouldn't have the !. nice catch

Release note (cli change): The --drain-wait argument for the `drain`
command will be automatically increased if the command detects that it
is smaller than the sum of server.shutdown.drain_wait,
server.shutdown.connection_wait, server.shutdown.query_wait times two,
and server.shutdown.lease_transfer_wait. If the --drain-wait argument is
0, then no timeout is used.

This recommendation was already documented, but now the advice will be
applied automatically.
@rafiss rafiss force-pushed the auto-compute-drain-wait branch from d2f3830 to 3274081 Compare March 13, 2023 21:59
@rafiss rafiss requested a review from ZhouXing19 March 13, 2023 21:59
@rafiss
Copy link
Collaborator Author

rafiss commented Mar 13, 2023

commit #2 i can explain: it just moves that check to the end of the function, which is after the state transitions occur:

// If an event was generated, feed it to the state machine.

i think you're able to review it :)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like Jane, i looked at 1st and 3rd commit, and the 2nd one got me a bit confused. Maybe more words to explain what is going on in the commit message would help.

Otherwise i really like this!

Reviewed 5 of 5 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @rafiss, @srosenberg, and @ZhouXing19)


pkg/sql/conn_executor.go line 2131 at r4 (raw file):

			// a Sync is processed.
			defer func() {
				if ex.idleConn() {

Where is the res.Close call moved to?

This fixes a minor bug in which the connection would not get closed at
the earliest possible time during server shutdown.

The connection is supposed to be closed as soon as we handle a Sync
message when the conn_executor is in the draining state and not in a
transaction. Since the transaction state was checked before state
transitions occurred, this would cause the connection to remain open for
an extra bit of time. This was particularly a problem because the Sync
message is also the command that auto-commits an implicit transaction.
So before this commit, it was actually impossible for the check to work
as it was supposed to.

Now we check the txn state after state transitions occur.

Release note: None
The test now does much more:
- Checks that --drain-wait is automatically increased if it is set lower
  than the cluster settings require.
- Check that the /health?ready=1 endpoint fails during the drain_wait
  period.
- Check for the proper error message during the connection_wait phase.
- Check for the proper error message when trying to begin a new
  query/transaction  during the query_wait phase.
- Check that an open transaction is allowed to continue during the
  query_wait phase.
- Check for the proper error message when a query is canceled during
  shutdown.

Release note: None
@rafiss rafiss force-pushed the auto-compute-drain-wait branch from 3274081 to d6cdd94 Compare March 13, 2023 23:00
Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've added details to the commit message that should add more clarity.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)


pkg/sql/conn_executor.go line 2131 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Where is the res.Close call moved to?

now we use the Close call that occurs during the normal message handling / state transitions, here:

res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()))

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks indeed this helps.

Reviewed 4 of 4 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @srosenberg, and @ZhouXing19)

@rafiss
Copy link
Collaborator Author

rafiss commented Mar 13, 2023

tftr!

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 13, 2023

👎 Rejected by code reviews

@rafiss
Copy link
Collaborator Author

rafiss commented Mar 14, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2023

Build failed (retrying...):

@rafiss
Copy link
Collaborator Author

rafiss commented Mar 14, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2023

Already running a review

@craig
Copy link
Contributor

craig bot commented Mar 14, 2023

Build failed:

@rafiss
Copy link
Collaborator Author

rafiss commented Mar 14, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2023

Build succeeded:

@craig craig bot merged commit 19e5845 into cockroachdb:master Mar 14, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 14, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 18eac87 to blathers/backport-release-22.1-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from d6cdd94 to blathers/backport-release-22.2-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli: the --drain-wait arg for the drain command should check the server.shutdown.*_wait cluster settings
4 participants