Merge #106118 #109415 #111397

106118: pkg/ui: emphasize "cluster virtualization" in human-visible surfaces r=yuzefovich a=knz Informs #106068. Epic: CRDB-29380 At this stage, we are focusing on strings and comments that are read by humans. A separate pass will take care of the method and variable names later. Release note: None 109415: server, all: rename some cluster settings r=rafiss a=knz Fixes #75520. Epic: CRDB-28893 The following settings have been renamed, to emphasize how the initial drain wait is a *mandatory wait*, whereas the others are simply *timeouts*. Release note (cli change): The following cluster settings have been renamed; the previous names remain available for backward-compatibility. | Previous name | New Name | |---------------------------------------|----------------------------------------------------| | `server.shutdown.drain_wait` | `server.shutdown.initial_wait` | | `server.shutdown.lease_transfer_wait` | `server.shutdown.lease_transfer_iteration.timeout` | | `server.shutdown.query_wait` | `server.shutdown.queries.timeout` | | `server.shutdown.connection_wait` | `server.shutdown.transactions.timeout` | | `server.shutdown.jobs_wait` | `server.shutdown.jobs.timeout` | 111397: catalog/lease: reduce lock contention on descVersionState r=fqazi a=fqazi Recently, a regression was noticed on master related to AcquireByName being slower between builds. Using a profiler, this was tracked down to contention on descriptorVersionState's mutex. To address this, this patch avoids acquiring this lock twice in the hot path, which resolves the regression. After: BenchmarkLeaseAcquireByNameCached-24 2378610 451.9 ns/op Before: BenchmarkLeaseAcquireByNameCached-24 1493899 727.9 ns/op f13c021: BenchmarkLeaseAcquireByNameCached-24 2129446 568.1 ns/op Fixes: #111094 Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
cockroachdb · Oct 2, 2023 · 0be8311 · 0be8311
4 parents c133865 + 69b9cac + 7137d57 + 318b193
commit 0be8311
Show file tree

Hide file tree

Showing 23 changed files with 127 additions and 105 deletions.
diff --git a/docs/RFCS/20211207_graceful_draining.md b/docs/RFCS/20211207_graceful_draining.md
@@ -26,11 +26,11 @@ in which the server disallows new connections and waits until all sessions
 are closed by clients to shut down. While in CockroachDB’s case, the
 draining process consists of two consecutive periods, `drain_wait` and
 `query_wait`. Their duration is configured via two
-[cluster settings][cockroach cluster settings],`server.shutdown.drain_wait`
-and `server.shutdown.query_wait`, with the former “controlling the amount of
+[cluster settings][cockroach cluster settings],`server.shutdown.initial_wait`
+and `server.shutdown.transactions.timeout`, with the former “controlling the amount of
 time a server waits in an unready state before proceeding with the rest of the
 shutdown process”, and the latter specifying “the amount of time a server waits
-for existent queries to finish”. CockroachDB expects all SQL connections to be
+for current txns to finish”. CockroachDB expects all SQL connections to be
 closed by clients before the end of `drain_wait`. After `drain_wait` ends and
 `query_wait` starts, the server shuts down all SQL connections that are not with
 inflight queries (such as open transaction sessions). After the end of
@@ -50,7 +50,7 @@ and the workaround setting for `drain_wait`.
 
 ![Current Draining Process][current draining pic]
 The current workaround that CockroachDB provides is to set the cluster setting
-`server.shutdown.drain_wait` longer than the sum of the maximum lifetime of a
+`server.shutdown.initial_wait` longer than the sum of the maximum lifetime of a
 connection in the connection pool (`MaxLifeTime`), the time for the load
 balancer to stop routing new SQL connections, and the max transaction latency.
 This setting will let connections reach `MaxLifeTime` and be recycled by
@@ -61,9 +61,9 @@ from a different node.
 
 The downsides of this workaround are:
 
-  1. The setting `server.shutdown.drain_wait` has to be adjusted based on
+  1. The setting `server.shutdown.initial_wait` has to be adjusted based on
      clients’ configuration of the connection pooling and load balancer.
-  2. The server is doing a hard wait till the end of `server.shutdown.drain_wait`.
+  2. The server is doing a hard wait till the end of `server.shutdown.initial_wait`.
      So though setting a long `drain_wait` can help ensure all SQL connections
      are closed before the (true) draining starts, it brings a long downtime
      for this node.
@@ -165,8 +165,8 @@ connectionWait = settings.RegisterDurationSetting(
     continue with the rest of the shutdown process before the end of
     connection_wait.”,
     0*time.Second,
+    settings.WithName("server.shutdown.connections.timeout"),
 ).WithPublic()
-
 ```
 
 The default duration length of `connection_wait` is set to be 0 seconds.
@@ -250,11 +250,11 @@ timeline for the draining:
 - lease transfer stage (unchanged): The server keeps retrying until all range
   leases and raft leaderships on this draining node have been tranfered.
   **In each iteration**, the kvserver transfers the leases with timeout
-  specified by `server.shutdown.lease_transfer_wait`.
+  specified by `server.shutdown.lease_transfer_iteration.timeout`.
   (Which means the total duration of the lease transfer stage is not totally
-  determined by the value of `server.shutdown.lease_transfer_wait"`.
+  determined by the value of `server.shutdown.lease_transfer_iteration.timeout"`.
   We [propose][Lease Transfer Issue] to rename this cluster setting with
-  `server.shutdown.lease_transfer_timeout"`, and document it seperately with
+  `server.shutdown.lease_transfer_iteration.timeout"`, and document it seperately with
   the other three waiting periods.)
 
 # Demo with changes from Technical Design
@@ -313,7 +313,7 @@ during draining.
   upgrade? Which migrations are needed?
   - If certain nodes in the cluster are not updated to support this feature,
     it may lead to confusion. We may need a version gate for
-    `server.shutdown.connection_wait`.
+    `server.shutdown.connections.timeout`.
 - Is the result/usage of this change different for CC end-users than for
   on-prem deployments? How?
   - The new functionality in this design is expected to be used in multitenent
@@ -372,20 +372,20 @@ during draining.
     system during each waiting period. We can refer to the `final state of this
     design` in this RFC.
   - How to correctly tune the cluster settings
-      - `server.shutdown.drain_wait` (the wait for the load balancer to detect
+      - `server.shutdown.initial_wait` (the wait for the load balancer to detect
         the draining status of the node) should be in coordination with the load
         balancer's settings, such as the health check interval.
-      - `server.shutdown.connection_wait` (the timeout to wait for all SQL
+      - `server.shutdown.connections.timeout` (the timeout to wait for all SQL
         connections to be closed) should be longer than the maximum life
         time of a connection as specified in the connection pool.
-      - `server.shutdown.query_wait` (the timeout to wait all SQL queries to
+      - `server.shutdown.transactions.timeout` (the timeout to wait all SQL txns to
         finish) should be greater than any of the followings:
         - The longest possible transaction in the workload expected to complete
           successfully under normal operating conditions.
         - The `sql.defaults.idle_in_transaction_session_timeout` cluster setting.
         - The `sql.defaults.statement_timeout` cluster setting.
 
-    Note that the cluster setting `server.shutdown.drain_wait` is of different
+    Note that the cluster setting `server.shutdown.initial_wait` is of different
     meaning of the `--drain-wait` flag under the `cockroach node drain` command.
     The latter specifies the "_amount of time to wait for the node to drain
     before returning to the client_", and it is usually set with several

diff --git a/docs/generated/settings/settings-for-tenants.txt b/docs/generated/settings/settings-for-tenants.txt
@@ -87,10 +87,10 @@ server.oidc_authentication.provider_url	string		sets OIDC provider URL ({provide
 server.oidc_authentication.redirect_url	string	https://localhost:8080/oidc/v1/callback	sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback)	application
 server.oidc_authentication.scopes	string	openid	sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`)	application
 server.rangelog.ttl	duration	720h0m0s	if nonzero, entries in system.rangelog older than this duration are periodically purged	application
-server.shutdown.connection_wait	duration	0s	the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)	application
-server.shutdown.drain_wait	duration	0s	the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.)	application
-server.shutdown.jobs_wait	duration	10s	the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown	application
-server.shutdown.query_wait	duration	10s	the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)	application
+server.shutdown.connections.timeout	duration	0s	the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)	application
+server.shutdown.initial_wait	duration	0s	the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.initial_wait is to set the wait time for health probes to notice that the node is not ready.)	application
+server.shutdown.jobs.timeout	duration	10s	the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown	application
+server.shutdown.transactions.timeout	duration	10s	the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)	application
 server.time_until_store_dead	duration	5m0s	the time after which if there is no new gossiped information about a store, it is considered dead	application
 server.user_login.cert_password_method.auto_scram_promotion.enabled	boolean	true	whether to automatically promote cert-password authentication to use SCRAM	application
 server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled	boolean	true	if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt	application

diff --git a/docs/generated/settings/settings.html b/docs/generated/settings/settings.html
@@ -112,11 +112,11 @@
 <tr><td><div id="setting-server-oidc-authentication-redirect-url" class="anchored"><code>server.oidc_authentication.redirect_url</code></div></td><td>string</td><td><code>https://localhost:8080/oidc/v1/callback</code></td><td>sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-server-oidc-authentication-scopes" class="anchored"><code>server.oidc_authentication.scopes</code></div></td><td>string</td><td><code>openid</code></td><td>sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-server-rangelog-ttl" class="anchored"><code>server.rangelog.ttl</code></div></td><td>duration</td><td><code>720h0m0s</code></td><td>if nonzero, entries in system.rangelog older than this duration are periodically purged</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-server-shutdown-connection-wait" class="anchored"><code>server.shutdown.connection_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-server-shutdown-drain-wait" class="anchored"><code>server.shutdown.drain_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-server-shutdown-jobs-wait" class="anchored"><code>server.shutdown.jobs_wait</code></div></td><td>duration</td><td><code>10s</code></td><td>the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-server-shutdown-lease-transfer-wait" class="anchored"><code>server.shutdown.lease_transfer_wait</code></div></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-server-shutdown-query-wait" class="anchored"><code>server.shutdown.query_wait</code></div></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-server-shutdown-connection-wait" class="anchored"><code>server.shutdown.connections.timeout</code></div></td><td>duration</td><td><code>0s</code></td><td>the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-server-shutdown-drain-wait" class="anchored"><code>server.shutdown.initial_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.initial_wait is to set the wait time for health probes to notice that the node is not ready.)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-server-shutdown-jobs-wait" class="anchored"><code>server.shutdown.jobs.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-server-shutdown-lease-transfer-wait" class="anchored"><code>server.shutdown.lease_transfer_iteration.timeout</code></div></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-server-shutdown-query-wait" class="anchored"><code>server.shutdown.transactions.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-server-time-until-store-dead" class="anchored"><code>server.time_until_store_dead</code></div></td><td>duration</td><td><code>5m0s</code></td><td>the time after which if there is no new gossiped information about a store, it is considered dead</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-server-user-login-cert-password-method-auto-scram-promotion-enabled" class="anchored"><code>server.user_login.cert_password_method.auto_scram_promotion.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>whether to automatically promote cert-password authentication to use SCRAM</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-server-user-login-downgrade-scram-stored-passwords-to-bcrypt-enabled" class="anchored"><code>server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt</td><td>Serverless/Dedicated/Self-Hosted</td></tr>

diff --git a/pkg/ccl/changefeedccl/changefeed_processors.go b/pkg/ccl/changefeedccl/changefeed_processors.go
@@ -581,7 +581,7 @@ var aggregatorHeartbeatFrequency = settings.RegisterDurationSetting(
 	settings.ApplicationLevel,
 	"changefeed.aggregator.heartbeat",
 	"changefeed aggregator will emit a heartbeat message to the coordinator with this frequency; 0 disables. "+
-		"The setting value should be <=1/2 of server.shutdown.jobs_wait period",
+		"The setting value should be <=1/2 of server.shutdown.jobs.timeout period",
 	4*time.Second,
 	settings.NonNegativeDuration,
 )

diff --git a/pkg/ccl/serverccl/server_controller_test.go b/pkg/ccl/serverccl/server_controller_test.go
@@ -474,7 +474,7 @@ func TestServerStartStop(t *testing.T) {
 
 		// Don't wait for graceful jobs shutdown in this test since
 		// we want to make sure test completes reasonably quickly.
-		_, err = db2.Exec("SET CLUSTER SETTING server.shutdown.jobs_wait='0s'")
+		_, err = db2.Exec("SET CLUSTER SETTING server.shutdown.jobs.timeout='0s'")
 		require.NoError(t, err)
 
 		return nil

diff --git a/pkg/cli/rpc_node_shutdown.go b/pkg/cli/rpc_node_shutdown.go
@@ -17,6 +17,8 @@ import (
 	"math"
 	"time"
 
+	"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
+	"github.com/cockroachdb/cockroach/pkg/server"
 	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
 	"github.com/cockroachdb/cockroach/pkg/util/grpcutil"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
@@ -66,11 +68,11 @@ func doDrain(
 	if err := timeutil.RunWithTimeout(ctx, "get-drain-settings", 5*time.Second, func(ctx context.Context) error {
 		shutdownSettings, err := c.Settings(ctx, &serverpb.SettingsRequest{
 			Keys: []string{
-				"server.shutdown.drain_wait",
-				"server.shutdown.connection_wait",
-				"server.shutdown.jobs_wait",
-				"server.shutdown.query_wait",
-				"server.shutdown.lease_transfer_wait",
+				string(server.DrainWait.InternalKey()),
+				string(server.ConnectionShutdownTimeout.InternalKey()),
+				string(server.JobShutdownTimeout.InternalKey()),
+				string(server.QueryShutdownTimeout.InternalKey()),
+				string(kvserver.LeaseTransferPerIterationTimeout.InternalKey()),
 			},
 			UnredactedValues: true,
 		})
@@ -85,7 +87,7 @@ func doDrain(
 			}
 			minWait += wait
 			// query_wait is used twice during draining, so count it twice here.
-			if k == "server.shutdown.query_wait" {
+			if k == string(server.QueryShutdownTimeout.InternalKey()) {
 				minWait += wait
 			}
 		}
@@ -107,7 +109,7 @@ func doDrain(
 	if errors.HasType(err, (*timeutil.TimeoutError)(nil)) || grpcutil.IsTimeout(err) {
 		log.Infof(ctx, "drain timed out: %v", err)
 		err = errors.New("drain timeout, consider adjusting --drain-wait, especially under " +
-			"custom server.shutdown.{drain,query,connection,lease_transfer}_wait cluster settings")
+			"custom server.shutdown cluster settings")
 	}
 	return
 }