rfc: introduce contention event store RFC #71965
Nice work! I left a lot of nits (as I often do) plus some questions, but generally your idea makes sense to me.
I'll do another pass and will share more feedback.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
docs/RFCS/20211013_contentions.md, line 24 at r1 (raw file):
| Contending Transaction | The transaction that is holding the lock that blocks other transaction |
| Blocked Transaction | The transaction that is blocked by a contending transaction |
| Txn ID Cache | A cache that stores the mapping from transaction ID to transaction Fingerprint ID |
I think we also want to define what the "transaction fingerprint ID" is in this table.
nit: s/Fingerprint/fingerprint/.
docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):
contending transaction (TxnB), a contention event is generated at the KV layer when the TxnA is finally able to proceed. The contention event includes the of TxnB, the duration of the contention from TxnA's
nit: probably missing `ID`, i.e. "the of TxnB" -> "the ID of TxnB".
docs/RFCS/20211013_contentions.md, line 55 at r1 (raw file):
This event is then stored within the tracing span, and then is sent back to the gateway node at the end of DistSQL Flow and being recorded at the contention
nit: "being recorded" -> "is recorded".
docs/RFCS/20211013_contentions.md, line 94 at r1 (raw file):
However, the current data model experiences few drawbacks: * If one of user's transaction is running slow due to contention, (a blocked
super nit: it's unusual to see commas around text in parenthesis.
docs/RFCS/20211013_contentions.md, line 99 at r1 (raw file):
transactions. * Even if user is able to (somehow magically) identify a list of contending transactions, the only piece of information a user can get is .
Incomplete sentence.
docs/RFCS/20211013_contentions.md, line 121 at r1 (raw file):
For each blocked transaction, user can view all the contending transactions that caused contentions. Both blocked transaction's txn ID and contending
nit: "contentions" -> "contention".
docs/RFCS/20211013_contentions.md, line 157 at r1 (raw file):
-- This might not be needed if we choose not to adapt the system table design.
PRIMARY KEY (ts, blocked_txn_id, contending_txn_id, blocked_stmt_fingerprint_id)
I'm thinking that maybe `blocked_stmt_fingerprint_id` should be before `contending_txn_id`? I'm imagining a scenario where a user has already identified the time period when a particular statement from a particular txn experienced the contention, and then the user wants to find all contending txn IDs - if we switch the order, then we might get tighter range scans (this would matter for blocked txns with many statements).
docs/RFCS/20211013_contentions.md, line 163 at r1 (raw file):
In order to facilitate historical queries, Contention Event Store stores timeseries data with a timsstamp as its first column. Each row in the table
nit: s/timsstamp/timestamp/.
docs/RFCS/20211013_contentions.md, line 182 at r1 (raw file):
`contending_txn_fingerprint_id` column. Effectively, this means we need a mechanism to resolve a TxnID (that belongs to a transaction that was potentially executed on a different node) into its corresponding a transaction
nit: probably "corresponding a transaction" -> "contending transaction".
docs/RFCS/20211013_contentions.md, line 204 at r1 (raw file):
is as follow: Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently
This `TxnIDCache` doesn't exist, right? If so, I'd change the language slightly to use the future tense, i.e. "Each SQL node in the CockroachDB will maintain an in-memory FIFO cache...".
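For concreteness, a minimal sketch of the FIFO cache being discussed (hypothetical shape and names, not the eventual implementation):

```go
package contention

import "github.com/google/uuid"

// TxnIDCache is a minimal FIFO sketch of the proposed per-node cache that
// maps recently executed transaction IDs to transaction fingerprint IDs.
type TxnIDCache struct {
	capacity int
	order    []uuid.UUID          // insertion order, oldest first
	ids      map[uuid.UUID]uint64 // txn ID -> txn fingerprint ID
}

func NewTxnIDCache(capacity int) *TxnIDCache {
	return &TxnIDCache{capacity: capacity, ids: make(map[uuid.UUID]uint64)}
}

// Record is called when a transaction finishes execution; the oldest entry
// is evicted once the cache is full.
func (c *TxnIDCache) Record(txnID uuid.UUID, fingerprintID uint64) {
	if c.capacity > 0 && len(c.order) >= c.capacity {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.ids, oldest)
	}
	c.order = append(c.order, txnID)
	c.ids[txnID] = fingerprintID
}

// Lookup resolves a TxnID if it is still cached on this node.
func (c *TxnIDCache) Lookup(txnID uuid.UUID) (fingerprintID uint64, ok bool) {
	fingerprintID, ok = c.ids[txnID]
	return fingerprintID, ok
}
```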
docs/RFCS/20211013_contentions.md, line 205 at r1 (raw file):
Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently executed transactions. (namely `TxnIDCache`). The in-memory cache
nit: no period before `(namely`.
docs/RFCS/20211013_contentions.md, line 206 at r1 (raw file):
Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently executed transactions. (namely `TxnIDCache`). The in-memory cache stores the tuple of `(transaction_id, transaction_fingerprint_id)`.
nit: probably "the tuple" -> "the tuples".
docs/RFCS/20211013_contentions.md, line 209 at r1 (raw file):
`transaction_id` is a UUID with 16-byte storage overhead and `transaction_fingerprint_id` is a unsigned 64-bit integer with 8-byte storage overhead. Hence, with an overhead of 1 MiB, we will be able to store ~43,690
nit: I'd link this number to the estimation done below.
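For reference, the arithmetic behind the quoted ~43,690 figure:

$$\frac{1\,\text{MiB}}{16\,\text{B} + 8\,\text{B}} = \frac{1{,}048{,}576}{24} \approx 43{,}690 \text{ entries}$$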
docs/RFCS/20211013_contentions.md, line 217 at r1 (raw file):
``` proto message TransactionIDResolutionRequest {
nit: I'd consider using `Txn` instead of `Transaction` in the type/variable names for brevity.
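To illustrate the suggested rename, a rough Go rendering of the resolution messages with the shorter naming (hypothetical shapes; the RFC defines these as protos):

```go
package contention

import "github.com/google/uuid"

// TxnIDResolutionRequest is a hypothetical Go shape of the resolution RPC
// request after the suggested Transaction -> Txn rename.
type TxnIDResolutionRequest struct {
	TxnIDs []uuid.UUID // transaction IDs to resolve on the inquired node
}

// TxnIDResolutionResponse maps each TxnID found on the inquired node to its
// transaction fingerprint ID; TxnIDs not present on that node are simply
// omitted from the response.
type TxnIDResolutionResponse struct {
	ResolvedTxnIDs map[uuid.UUID]uint64
}
```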
docs/RFCS/20211013_contentions.md, line 228 at r1 (raw file):
// inquired node. If the transactionID is not present on the inquired // node, then it is not returned. bytes transactionIDs = 1 [(gogoproto.customname) = "ID",
nit: s/transactionIDs/transactionID/
docs/RFCS/20211013_contentions.md, line 238 at r1 (raw file):
// This would require the node to check ActiveQueries store in addition // to TxnID Cache. uint64 transactionFingerprintIDs = 2 [(gogoproto.customname) = "TransactionFingerprintIDs",
nit: s/transactionFingerprintIDs/transactionFingerprintID/.
docs/RFCS/20211013_contentions.md, line 253 at r1 (raw file):
Each SQL node collects contention events as they come in and stores the unresolved TxnID into a temporary buffer. Periodically, (subject to jitter), each SQL node
Can you expand on what is "subject to jitter"? Will we introduce a cluster setting for the period? Will it be probabilistic (i.e. after every 1000th contention event)?
docs/RFCS/20211013_contentions.md, line 278 at r1 (raw file):
contention event. (In Serverless mode, we would embed `SQLInstanceID`). This means when a contention event is generated, it will also embed this identifier. This would require the he KV layer to store this `NodeID` into the lock objects as
nit: "the he".
docs/RFCS/20211013_contentions.md, line 286 at r1 (raw file):
make here is on what we do with unresolved TxnIDs. Unlike the previous design where we stores a temporary buffer of unresolved
nit: s/stores/store/.
nit: unclear what is "unresolved s".
docs/RFCS/20211013_contentions.md, line 289 at r1 (raw file):
s, we expand that temporary buffer to store a map from each `NodeID` to a list of unresolved TxnIDs (`map[node_id][]transaction_id`). Periodically, instead of perform cluster-wide RPC fanout, each node iterates
nit: s/perform/performing/.
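A minimal sketch of the per-node buffer described in this excerpt (hypothetical types; `NodeID` stands in for `roachpb.NodeID`):

```go
package contention

import (
	"sync"

	"github.com/google/uuid"
)

// NodeID stands in for CockroachDB's roachpb.NodeID in this sketch.
type NodeID int32

// unresolvedBuffer sketches the proposed change: instead of one flat list of
// unresolved TxnIDs, keep a map from the node that generated the contention
// event to the TxnIDs that node can resolve, so resolution RPCs can target
// specific nodes rather than fanning out cluster-wide.
type unresolvedBuffer struct {
	mu     sync.Mutex
	byNode map[NodeID][]uuid.UUID
}

func (b *unresolvedBuffer) add(origin NodeID, txnID uuid.UUID) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.byNode == nil {
		b.byNode = make(map[NodeID][]uuid.UUID)
	}
	b.byNode[origin] = append(b.byNode[origin], txnID)
}
```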
docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):
See [Footnote](#Footnote) for examples. ### System Table vs. Virtual Table
Do you mention options with a virtual table only because we might run out of the IDs for system tables? Are there other advantages / disadvantages of the virtual table option?
docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):
#### Virtual Table Only As of 21.2, CockroachDB does not support dynamic system table IDs. 50 system
Maybe we don't support "fixed" or "static" system table IDs and not "dynamic"?
docs/RFCS/20211013_contentions.md, line 319 at r1 (raw file):
As of 21.2, CockroachDB does not support dynamic system table IDs. 50 system table descriptor IDs have been previously reserved. As of writing, there are 3 system table descriptor ID remains. This means there is a possibility that the
nit: s/remains/left/.
docs/RFCS/20211013_contentions.md, line 345 at r1 (raw file):
entries into the system table. The system table will adopt hash-sharded primary key to reduce write-write contention similar to SQL Stats tables. A scheduled job will be created to perform cleanup tasks and delete stale contention events.
Could you expand on this? What are "cleanup tasks" - is it deleting rows past certain TTL? What are the "stale contention events"?
docs/RFCS/20211013_contentions.md, line 352 at r1 (raw file):
necessary observability into the new subsystem. In addition, new cluster settings need to be introduced to provide runtime tuning knobs. This provides DB operators the ability to modify subsystem behaviours (to a certain extend)
nit: s/behaviors/behavior/ and s/extend/extent/.
docs/RFCS/20211013_contentions.md, line 367 at r1 (raw file):
Additional monitoring metrics that will be introduced: * `sql.contentions.unresolved_event.count`: this is a gauge metric that keeps
nit: "contentions" reads unusual to me, maybe s/contentions/contention/
?
docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):
memory monitor response for memory accounting for the new contention event store. * `sql.contentions.resolution_latency`: this is a latency metric that keeps
Does this track how long the resolution of a single batch of TxnIDs takes? Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?
docs/RFCS/20211013_contentions.md, line 385 at r1 (raw file):
contention events flushed into the system table. * `sql.contentions.flush.duration`: this is a latency metric that records the duration of flushing contention event into the system table.
nit: s/event/events/
docs/RFCS/20211013_contentions.md, line 386 at r1 (raw file):
* `sql.contentions.flush.duration`: this is a latency metric that records the duration of flushing contention event into the system table. * `sql.contentions.flush.error`: this is a counter metrics that records the
nit: s/metrics/metric/.
docs/RFCS/20211013_contentions.md, line 399 at r1 (raw file):
ID resolutions. All contention event generated will be hold in the unresolved buffer with FIFO evicton policies. * `sql.contentions.txn_id_resolution.{interval, jitter}`: these two cluster
What is "jitter" in this context?
docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):
constraints the maximum amount of entries stored in unresolved contention event buffer per node. * `sql.contentions.txn_id_cache.max_entry`: this is a size setting that
nit: "max_entry" to me sounds more like "the limit on a size of a single entry", so maybe do "max_entry" -> "max_count" or "num_max_entries".
docs/RFCS/20211013_contentions.md, line 417 at r1 (raw file):
will be retained in the system table. <!-- should we go with TTL-based expiration? !--> * `sql.contentions.cleanup.recurrence`: this is a cron expression that specify
Do we have a precedent for a similar (i.e. "cron expression") setting already?
docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):
<!-- should we go with TTL-based expiration? !--> * `sql.contentions.cleanup.recurrence`: this is a cron expression that specify how the frequency that the clean up job will be executed.
nit: "how frequently the cleanup job will be executed".
docs/RFCS/20211013_contentions.md, line 448 at r1 (raw file):
every transaction finishes execution, it is inserted into TxnID Cache, whereas contention event is inserted into unresolved buffer each time a contention event is generated. Two buffer are being populated at a drastically different
nit: s/Two buffer/Two buffers/.
docs/RFCS/20211013_contentions.md, line 466 at r1 (raw file):
the unresolved contention event. The rate at which we discard unresolved contention event is closely related to
nit: s/event/events/ here and on the next line.
docs/RFCS/20211013_contentions.md, line 467 at r1 (raw file):
The rate at which we discard unresolved contention event is closely related to the rate at whcih we collect contention event, and the size of TxnID Cache.
nit: s/whcih/which/.
docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):
Suppose a node is experiencing a consistent 2,000 QPS. Currently, we enable tracing on the statement with 1% probability. Suppose 40% of the statement experiences contentions (this number is hard to estimate, as one workload
nit: s/contentions/contention/, also something is off in "one workload and drastically different from another".
docs/RFCS/20211013_contentions.md, line 474 at r1 (raw file):
`2000 * 1% * 40% = 8` contention events per second. The amount of time each node has to perform TxnID resolution equals to how long TxnID Cache can hold each TxnID to transaction fingerprint ID cache. This depends on the
nit: something is off in "each TxnID to transaction fingerprint ID cache".
docs/RFCS/20211013_contentions.md, line 478 at r1 (raw file):
then each cache entry takes 24 bytes (UUID + uint64), we can store up to `~43,690` entries in this cache. This means with 2000 QPS, we have ~21 seconds before we start to discarding contention event at 8 events per second.
nit: s/start to discarding contention event/start discarding contention events/.
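Spelling out the estimate under discussion:

$$2000 \times 1\% \times 40\% = 8 \text{ events/s}, \qquad \frac{43{,}690 \text{ entries}}{2000 \text{ txns/s}} \approx 21.8 \text{ s}$$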
docs/RFCS/20211013_contentions.md, line 511 at r1 (raw file):
# Future work ## Statement-level reoslution
nit: s/reoslution/resolution/.
docs/RFCS/20211013_contentions.md, line 513 at r1 (raw file):
## Statement-level reoslution We want to extend from understanding the which contending transaction
nit: s/understanding the which/understanding which/.
docs/RFCS/20211013_contentions.md, line 543 at r1 (raw file):
information with each other (similar to the fashion outlined in this RFC), we should start considering a generalized framework that handles all of the background communication as well as error handlings.
nit: s/handlings/handling/.
docs/RFCS/20211013_contentions.md, line 549 at r1 (raw file):
## Normal Execution Path Exeucted as part of query exeuction code path
nit: s/exeuction/execution/.
docs/RFCS/20211013_contentions.md, line 552 at r1 (raw file):
Step 1: Two nodes executing conflicting Transactions
Nice diagrams!
docs/RFCS/20211013_contentions.md, line 565 at r1 (raw file):
| | | | +---------+ +-----------------------+ | | | | CES | |unresolved CES buffer | | |
super nit: I don't think you introduced the acronym `CES` - might be worth mentioning it the very first time you use "Contention Event Store" at the very top or here, right before the first diagram.
docs/RFCS/20211013_contentions.md, line 592 at r1 (raw file):
Step 2: each node populates TxnID Cache, node 2 receives contention events with unresolved transaciton ID
nit: s/transaciton/transaction/.
docs/RFCS/20211013_contentions.md, line 716 at r1 (raw file):
Step 5: More Transaction are being executed, the oldest entries in TxnID Cache gets evicted
nit: s/Transaction/Transactions/ and s/gets/get/.
docs/RFCS/20211013_contentions.md, line 1064 at r1 (raw file):
Scratch Note (delete before send out draft)
Do you wanna delete this now?
I think this RFC is skirting around the elephant in the room: isn't this contention table, by itself, going to become a major source of contention if stored in KV? Like, what provision is made to ensure it doesn't become a hotspot? Don't we want this to be split in such a way that different nodes can write their contention data into different ranges? Also, what is the eviction algorithm? What provision is made that this table doesn't grow too large, too fast?

The idea to have the data memory-only is good, especially if combined with a max size policy (and corresponding eviction rules). I personally prefer that option, although my opinion could be swayed the other way if there are suitable answers to the points above. If we go for memory-only, we'll probably want the RFC to also mention a way this data can be dumped to disk, included in UI, exported in …
I expanded the system table section to describe the reasoning behind why I think this new contention table will not introduce hot ranges nor cause contention. It also includes a section describing the provisions made so that this table doesn't grow too large or too fast.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
docs/RFCS/20211013_contentions.md, line 24 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think we also want to define what the "transaction fingerprint ID" is in this table.
nit: s/Fingerprint/fingerprint/.
Done.
docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: probably missing `ID`, i.e. "the of TxnB" -> "the ID of TxnB".
Done.
docs/RFCS/20211013_contentions.md, line 55 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "being recorded" -> "is recorded".
Done.
docs/RFCS/20211013_contentions.md, line 94 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
super nit: it's unusual to see commas around text in parenthesis.
Done.
docs/RFCS/20211013_contentions.md, line 99 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Incomplete sentence.
Done.
docs/RFCS/20211013_contentions.md, line 121 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "contentions" -> "contention".
Done.
docs/RFCS/20211013_contentions.md, line 157 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm thinking that maybe `blocked_stmt_fingerprint_id` should be before `contending_txn_id`? I'm imagining a scenario where a user has already identified the time period when a particular statement from a particular txn experienced the contention, and then the user wants to find all contending txn IDs - if we switch the order, then we might get tighter range scans (this would matter for blocked txns with many statements).
Good point. Done.
docs/RFCS/20211013_contentions.md, line 182 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: probably "corresponding a transaction" -> "contending transaction".
Done.
docs/RFCS/20211013_contentions.md, line 204 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This `TxnIDCache` doesn't exist, right? If so, I'd change the language slightly to use the future tense, i.e. "Each SQL node in the CockroachDB will maintain an in-memory FIFO cache...".
Done.
docs/RFCS/20211013_contentions.md, line 205 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: no period before `(namely`.
Done.
docs/RFCS/20211013_contentions.md, line 206 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: probably "the tuple" -> "the tuples".
Done.
docs/RFCS/20211013_contentions.md, line 209 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'd link this number to the estimation done below.
Done.
docs/RFCS/20211013_contentions.md, line 217 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'd consider using `Txn` instead of `Transaction` in the type/variable names for brevity.
Done.
docs/RFCS/20211013_contentions.md, line 228 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/transactionIDs/transactionID/
Done.
docs/RFCS/20211013_contentions.md, line 238 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/transactionFingerprintIDs/transactionFingerprintID/.
Done.
docs/RFCS/20211013_contentions.md, line 253 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Can you expand on what is "subject to jitter"? Will we introduce a cluster setting for the period? Will it be probabilistic (i.e. after every 1000th contention event)?
Expanded on the calculation using both interval and jitter cluster settings. It will be a probabilistic calculation.
In the cluster setting section below I introduced two additional cluster settings: `sql.contention.txn_id_resolution.{interval, jitter}`.
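As a sketch of the probabilistic calculation mentioned above (hypothetical helper; the exact formula in the implementation may differ):

```go
package contention

import (
	"math/rand"
	"time"
)

// jitteredInterval sketches the interval/jitter interaction: the next
// resolution runs after the base interval scaled by a random factor in
// [1-jitter, 1+jitter], staggering the RPC fanout across nodes so they
// don't all fire at once.
func jitteredInterval(interval time.Duration, jitter float64) time.Duration {
	factor := 1 + (2*rand.Float64()-1)*jitter // uniform in [1-jitter, 1+jitter]
	return time.Duration(float64(interval) * factor)
}
```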
docs/RFCS/20211013_contentions.md, line 278 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "the he".
Done.
docs/RFCS/20211013_contentions.md, line 286 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/stores/store/.
nit: unclear what is "unresolved s".
Done.
docs/RFCS/20211013_contentions.md, line 289 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/perform/performing/.
Done.
docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do you mention options with a virtual table only because we might run out of the IDs for system tables? Are there other advantages / disadvantages of the virtual table option?
The main reason that I'm weighing these two options right now is indeed our shortage of descriptor IDs.
I think the main trade-offs here are:
- simple in-memory system vs. complex system involving flushing and GC
- length of the retention period
- surviving node restarts
Longer retention period and data survivability across node restarts were the two main reasons that motivated us to implement persisted sql stats for 21.2, since we have received feedback that it was one of the primary pain points to users when sql stats were originally introduced in 21.1.
docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Maybe we don't support "fixed" or "static" system table IDs and not "dynamic"?
I'm not sure if I'm following here?
docs/RFCS/20211013_contentions.md, line 345 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Could you expand on this? What are "cleanup tasks" - is it deleting rows past certain TTL? What are the "stale contention events"?
Expanded this section to describe cleanup job in detail.
docs/RFCS/20211013_contentions.md, line 367 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "contentions" reads unusual to me, maybe
s/contentions/contention/
?
Done.
docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):
Does this track how long the resolution of a single batch of TxnIDs takes?
This is my original thinking. If this metric is spiking, then it's a good indicator that either:
- service latency for the resolution RPC is high (e.g. network issues?)
- the cluster size is too large? perhaps the cluster is reaching a size where cluster-wide RPC fanout is becoming prohibitively expensive?
- uncaught bugs?
Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?
Hmm, I think this is also a useful metric, though I think it overlaps with the previous metric to a certain extent. Though this metric will be sensitive to long-running transactions. This is because we cannot compute the transaction fingerprint ID for active transactions. This means the TxnID resolution needs to retry, which can spike this metric.
Hmm, do you think we would need both?
docs/RFCS/20211013_contentions.md, line 385 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/event/events/
Done.
docs/RFCS/20211013_contentions.md, line 386 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/metrics/metric/.
Done.
docs/RFCS/20211013_contentions.md, line 399 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
What is "jitter" in this context?
`jitter` here is a floating point number in `[0, 1]`. This is to introduce some randomness to the interval to prevent the thundering herd problem. AFAIK, it's currently used in the job system and sql stats system.
docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "max_entry" to me sounds more like "the limit on a size of a single entry", so maybe do "max_entry" -> "max_count" or "num_max_entries".
Switched to `max_count`. Though, do you think we should specify this limit in terms of count or in terms of byte size?
docs/RFCS/20211013_contentions.md, line 417 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do we have a precedent for a similar (i.e. "cron expression") setting already?
We currently do have it in sql stats. This is because the scheduled jobs system allows each schedule to have a cron expression for its schedule.
docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "how frequently the cleanup job will be executed".
Done.
docs/RFCS/20211013_contentions.md, line 448 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/Two buffer/Two buffers/.
Done.
docs/RFCS/20211013_contentions.md, line 466 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/event/events/ here and on the next line.
Done.
docs/RFCS/20211013_contentions.md, line 467 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/whcih/which/.
Done.
docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/contentions/contention/, also something is off in "one workload and drastically different from another".
Done.
docs/RFCS/20211013_contentions.md, line 474 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: something is off in "each TxnID to transaction fingerprint ID cache".
Reworded.
docs/RFCS/20211013_contentions.md, line 478 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/start to discarding contention event/start discarding contention events/.
Done.
docs/RFCS/20211013_contentions.md, line 549 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/exeuction/execution/.
Done.
docs/RFCS/20211013_contentions.md, line 552 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Nice diagrams!
Thanks! 😅
docs/RFCS/20211013_contentions.md, line 565 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
super nit: I don't think you introduced the acronym `CES` - might be worth mentioning it the very first time you use "Contention Event Store" at the very top or here, right before the first diagram.
Done.
docs/RFCS/20211013_contentions.md, line 716 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/Transaction/Transactions/ and s/gets/get/.
Done.
docs/RFCS/20211013_contentions.md, line 1064 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do you wanna delete this now?
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 38 at r2 (raw file):
Currently, CockroachDB collects transaction-level contention information from KV-layer when tracing is enabled. When a transaction (TxnA) is blocked by a
fwiw, I'm trying to get some level of tracing to be always active. This can mean (although it doesn't have to mean) that all conflicts encountered by statements could be reported to their gateways.
Small portion of nits from me with no additional feedback yet.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
I'm not sure if I'm following here?
Sorry, my comment was confusing.
I guess I'm just confused by the first sentence here - to me it implies that before 21.2 we had some "dynamic" scheme of assigning IDs to the system tables, but I don't think that it was true. I think the IDs have always been "fixed" - i.e. once a new system table is introduced, we pick the next available ID and use it for that table. We have a limit of 50 such IDs, and we're about to run out.
So, I'd probably rephrase the first sentence a little: "At the time of this writing, CockroachDB does not support dynamic ...".
docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Done.
I am still confused by this sentence.
docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Done.
something is off in "one workload and drastically different from another"
docs/RFCS/20211013_contentions.md, line 22 at r2 (raw file):
| Terminology | Explanation | | --- | --- | | Statement Fingerprint | SQL statement string with literal value replaced with underscores |
nit: s/value/values/.
docs/RFCS/20211013_contentions.md, line 329 at r2 (raw file):
As of 21.2, CockroachDB does not support dynamic system table descriptor IDs. 50 system table descriptor IDs have been previously reserved. As of writing, there are 3 unused system table descriptor ID left. This means there is a
nit: s/ID/IDs/.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 365 at r2 (raw file):
Since the contention event table stores the time series data, it will adapt hash-sharded primary index to ensure we do not run into the hot range problem.
can you spell out the schema?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Sorry, my comment was confusing.
I guess I'm just confused by the first sentence here - to me it implies that before 21.2 we had some "dynamic" scheme of assigning IDs to the system tables, but I don't think that it was true. I think the IDs have always been "fixed" - i.e. once a new system table is introduced, we pick the next available ID and use it for that table. We have a limit of 50 such IDs, and we're about to run out.
So, I'd probably rephrase the first sentence a little: "At the time of this writing, CockroachDB does not support dynamic ...".
Done.
docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I am still confused by this sentence.
Rewrote this sentence.
docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
something is off in "one workload and drastically different from another"
Ah, I meant "one workload can be drastically different from another". Fixed.
docs/RFCS/20211013_contentions.md, line 22 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/value/values/.
Done.
docs/RFCS/20211013_contentions.md, line 38 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
fwiw, I'm trying to get some level of tracing to be always active. This can mean (although it doesn't have to mean) that all conflicts encountered by statements could be reported to their gateways.
This is good to hear!
docs/RFCS/20211013_contentions.md, line 329 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/ID/IDs/.
Done.
docs/RFCS/20211013_contentions.md, line 365 at r2 (raw file):
Previously, knz (kena) wrote…
can you spell out the schema?
The schema is listed out in the SQL Interface section above.
Reviewed 1 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Done.
It's more than just the ID, it's the entire `TxnMeta`. It's worth calling that out, because there's additional information in there, including the `gateway_node_id` when we add it.
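For readers, an abridged sketch of the kind of metadata `TxnMeta` carries beyond the bare ID (field names illustrative; the gateway node ID field is the one proposed in this thread):

```go
package contention

import "github.com/google/uuid"

// TxnMeta is an abridged, illustrative sketch; the real type is a proto in
// pkg/storage/enginepb with more fields (epoch, timestamps, priority,
// sequence). GatewayNodeID is the field discussed in this thread; the
// actual proto field name may differ.
type TxnMeta struct {
	ID            uuid.UUID // transaction ID
	Key           []byte    // anchor key for the transaction record
	GatewayNodeID int32     // node that initiated the transaction
}
```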
docs/RFCS/20211013_contentions.md, line 92 at r3 (raw file):
It is evident that the existing contention event subsystem is built with a specific (table-centric) workflow in mind, that is: * allow user to understand which table/index is running hot
nit: consider making it more clear that these are sequential steps. It threw me while reading to see "identify the hot row in that table/index" here but then "the user does not have an easy way to identify which row the blocked transaction is trying to read" below.
docs/RFCS/20211013_contentions.md, line 109 at r3 (raw file):
In summary, user cannot easily connect blocked transaction to contending transaction, and even if user can somehow make that connection, the current
s/if user/if the user/
docs/RFCS/20211013_contentions.md, line 152 at r3 (raw file):
contending_txn_id UUID NOT NULL, blocked_stmt_fingerprint_id BYTES NOT NULL,
If the fingerprint IDs are 64-bits, should these be a fixed-width datatype like an INT8 for a more compact encoding?
docs/RFCS/20211013_contentions.md, line 289 at r3 (raw file):
when a contention event is generated, it will also embed this identifier. This would require the KV layer to store this `NodeID` into the lock objects as transactions acquire them.
Thanks for writing up #72303. We've discussed it on the KV team and don't think it will be a large lift. Do you mind linking to that issue from here?
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
stale data. One thing that we need to be careful is we need avoid writing too much data
These few paragraphs are really helpful. I was getting concerned that performance had not been taken into consideration. We've seen in the past that observability projects that don't define performance goals from the start are at risk of learning about performance problems late in the development cycle. I'm interested in whether we've defined such goals for this project (e.g. "enabling the contention event store with eager txn ID resolution on YCSB-A will reduce top-line throughput by no more than 3%") and how we plan to incrementally measure the various components to make sure they are performing sufficiently well.
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
stale data. One thing that we need to be careful is we need avoid writing too much data
Related to my other comment about performance, this back-of-the-envelope math is great. But the 2,000 QPS cited is also very conservative. We have customers that run production clusters with a steady-state pushing 1M QPS, and should likely design any new component to scale to an order of magnitude beyond this. The sampling and the low-pass filter give us the ability to significantly downsample, but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments, especially if the contention event store scales any less well than the rest of the user's workload.
So we should probably adjust these numbers to represent a typical large customer (maybe 200k QPS) and also do the math on the top end of the design.
Much like we're doing with the row-level TTL work, we should probably also find a way early on in the development cycle to benchmark the contention event store in isolation.
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
(third thread about performance in a row, because why not 😃)
but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments
Does capturing a static fraction of all user traffic make sense for larger deployments? I'd imagine that there are diminishing returns as we capture more and more information. Have you considered a design where either the sampling rate or the low-pass filter is adaptive and targets some maximum contention event rate?
docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):
This can provide us with guidance on the value for the low pass filter. To ensure contention event table doesn't grow too large, a new cleanup job
This may be a crazy idea, but instead of building a new cleanup job, can we instead take advantage of the new row-level TTL project? It seems like many of these pieces are already being built elsewhere by @otan. A benefit of this is that the cleanup would then benefit from any performance optimization (of which there are many possibilities) we add now or later to make row-level TTL faster or less disruptive.
pkg/sql/opt/exec/explain/plan_gist_factory.og.go, line 1 at r3 (raw file):
// Code generated by optgen; DO NOT EDIT.
I don't think you meant to add this to the PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, @otan, and @yuzefovich)
docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's more than just the ID, it's the entire `TxnMeta`. It's worth calling that out, because there's additional information in there, including the `gateway_node_id` when we add it.
Updated to `TxnMeta` and updated the terminology to specifically call out `TxnMeta`.
docs/RFCS/20211013_contentions.md, line 92 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider making it more clear that these are sequential steps. It threw me while reading to see "identify the hot row in that table/index" here but then "the user does not have an easy way to identify which row the blocked transaction is trying to read" below.
Done. Rewrote the paragraph to make it more clear that these are sequential steps.
docs/RFCS/20211013_contentions.md, line 109 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/if user/if the user/
Done.
docs/RFCS/20211013_contentions.md, line 152 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If the fingerprint IDs are 64-bits, should these be a fixed-width datatype like an INT8 for a more compact encoding?
When the fingerprint ID was originally introduced, it was a 64-bit unsigned integer. However, since we don't support unsigned integer types in SQL, trivially casting the fingerprint ID into signed integers led to a very strange UX. Since the fingerprint ID essentially is just a binary blob, we went with the `BYTES` type.
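A sketch of what this implies at the encoding level, assuming a big-endian 8-byte blob (hypothetical helpers; the byte order is an assumption):

```go
package contention

import "encoding/binary"

// fingerprintIDToBytes renders the unsigned 64-bit fingerprint ID as the
// 8-byte blob stored in the BYTES column, sidestepping SQL's lack of an
// unsigned integer type.
func fingerprintIDToBytes(id uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, id)
	return b
}

// bytesToFingerprintID recovers the uint64 fingerprint ID from the blob.
func bytesToFingerprintID(b []byte) uint64 {
	return binary.BigEndian.Uint64(b)
}
```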
docs/RFCS/20211013_contentions.md, line 289 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for writing up #72303. We've discussed it on the KV team and don't think it will be a large lift. Do you mind linking to that issue from here?
Done.
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Related to my other comment about performance, this back-of-the-envelope math is great. But the 2,000 QPS cited also very conservative. We have customers that run production clusters with a steady-state pushing 1M QPS, and should likely design any new component to scale to an order of magnitude beyond this. The sampling and the low-pass filter give us the ability to significantly downsample, but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments, especially if the contention event store scales any less well than the rest of the user's workload.
So we should probably adjust these numbers to represent a typical large customer (maybe 200k QPS) and also do the math on the top end of the design.
Much like we're doing with the row-level TTL work, we should probably also find a way early on in the development cycle to benchmark the contention event store in isolation.
Hmm, this is very interesting. Is the 200k QPS ~ 1M QPS at the cluster level or per-node level? The 2,000 QPS cited here is at a per-node level. I got this number from our docs on performance, which suggest we are able to handle 600K QPS with ~230 nodes. I think that roughly comes down to about 2600 QPS per node.
Also added a section discussing the component-level benchmark for this project using the idea from the row ttl project
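For reference, the cited figures work out to:

$$\frac{600{,}000 \text{ QPS}}{230 \text{ nodes}} \approx 2609 \text{ QPS per node}$$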
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
These few paragraphs are really helpful. I was getting concerned that performance had not been taken into consideration. We've seen in the past that observability projects that don't define performance goals from the start are at risk of learning about performance problems late in the development cycle. I'm interested in whether we've defined such goals for this project (e.g. "enabling the contention event store with eager txn ID resolution on YCSB-A will reduce top-line throughput by no more than 3%") and how we plan to incrementally measure the various components to make sure they are performing sufficiently well.
I split this paragraph into its own section, and explicitly called out the top-level performance objective. Quick question about the benchmark we are using: I'm not quite familiar with the YCSB workload; previously we have been using TPCC for measuring the overall performance impact. Is there anything special about using YCSB-A in this case?
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
(third thread about performance in a row, because why not 😃)
but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments
Does capturing a static fraction of all user traffic make sense for larger deployments? I'd imagine that there are diminishing returns as we capture more and more information. Have you considered a design where either the sampling rate or the low-pass filter is adaptive and targets some maximum contention event rate?
Added a section later discussing the idea of an adaptive low-pass filter threshold.
docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This may be a crazy idea, but instead of building a new cleanup job, can we instead take advantage of the new row-level TTL project? It seems like many of these pieces are already being built elsewhere by @otan. A benefit of this is that the cleanup would then benefit from any performance optimization (of which there are many possibilities) we add now or later to make row-level TTL faster or less disruptive.
Added using row-level TTL to "Rationale and Alternative" section. What is the current progress of the row-level TTL? Have we committed to shipping it for 22.1?
pkg/sql/opt/exec/explain/plan_gist_factory.og.go, line 1 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think you meant to add this to the PR.
Removed
I think we should go with the system table approach and with teaching the KV to embed the `NodeID`/`SQLInstanceID` identifier into the contention events.
Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, @nvanbenschoten, and @otan)
docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
The main reason that I'm weighing these two options right now is indeed our shortage of descriptor IDs.
I think the main trade-offs here are:
- simple in-memory system vs. complex system involving flushing and GC
- length of the retention period
- surviving node restarts
Longer retention period and data survivability across node restarts were the two main reasons that motivated us to implement persisted sql stats for 21.2, since we have received feedback that it was one of the primary pain points to users when sql stats were originally introduced in 21.1.
I see, I think the persisted contention information could be extremely useful. I remember seeing an email from Andrew W or Marius about reaching out to them in case you're considering introducing a new system table for 22.1, so I think it'd be worth checking with the SQL Schema team.
docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Does this track how long the resolution of a single batch of TxnIDs takes?
This is my original thinking. If this metric is spiking, then it's a good indicator that either:
- service latency for the resolution RPC is high (e.g. network issues?)
- the cluster size is too large? perhaps the cluster is reaching a size where cluster-wide RPC fanout is becoming prohibitively expensive?
- uncaught bugs?
Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?
Hmm, I think this is also a useful metric, though I think it overlaps with the previous metric to a certain extent. Though this metric will be sensitive to long-running transactions. This is because we cannot compute the transaction fingerprint ID for active transactions. This means the TxnID resolution needs to retry, which can spike this metric.
Hmm, do you think we would need both?
Your thinking makes sense to me, only the first seems sufficient (we can extrapolate the time it takes for a contention event between entering the temporary buffer and being resolved as "flush interval + resolution_latency").
docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Switched to `max_count`. Though, do you think we should specify this limit in terms of count or in terms of byte size?
Hm, I think having it in terms of byte size makes more sense to me.
My attempt at doing some math lol: an entry in the TxnIDCache is 24 bytes, so a 4MiB cache would be able to hold ~175k entries. IIUC any txn might become the contending one, so we have to keep track of all txns executed on each node that was the gateway for those txns. Assuming 20k QPS per node (and assuming single-stmt txns) this would give us about 9 seconds, which seems a bit too short. We need the TxnIDCache to keep the records roughly for the duration of the longest-running txn (probably in minutes in the worst cases) plus the flush interval (probably in seconds). If we bump the size to 64MiB, we get roughly 2.3 minutes in the worst case, which seems better.
As far as implementation details go, we probably should have lazy allocation of this cache (i.e. don't allocate the whole 64MiB right away if we choose 64MiB as the default). We might also consider introducing the eviction not only on size (FIFO style) but also based on some expiration time (say we'd track the node-wide longest-running txn duration) - this would, however, require adding a timestamp to each entry in the TxnIDCache. It seems a bit wasteful to keep very old txns in the TxnIDCache if we know that no other txn will ever need to resolve a TxnID for contention purposes. 64MiB seems like the largest size we could go for without introducing the expiration eviction.
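Checking the math above:

$$\frac{4\,\text{MiB}}{24\,\text{B}} \approx 174{,}763 \text{ entries}, \qquad \frac{174{,}763}{20{,}000\,\text{txn/s}} \approx 8.7\,\text{s}$$

$$\frac{64\,\text{MiB}}{24\,\text{B}} \approx 2{,}796{,}203 \text{ entries}, \qquad \frac{2{,}796{,}203}{20{,}000\,\text{txn/s}} \approx 140\,\text{s} \approx 2.3\,\text{min}$$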
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
Previously, Azhng (Archer Zhang) wrote…
I split this paragraph into its own section, and explicitly called out the top level performance objective. Quick question about the benchmark we are using, i'm not quite familiar with YCSB workload, previously we have been using TPCC for measuring the overall performance impact. Is there anything special about using YCSB-A in this case?
Thanks Nathan for bringing up the performance considerations, I'm a heavy +1 on defining acceptable performance regressions upfront.
In the past we have definitely been hit by working on the observability improvements only to find later the performance overhead was too much, and then we scrambled to keep the improvements at least partially, usually configurable by a cluster setting.
docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm, this is very interesting. Is the 200k QPS ~ 1M QPS at the cluster level or per-node level? The 2,000 QPS cited here is at a per-node level. I got this number from our docs on performance, which suggest we are able to handle 600K QPS with ~230 nodes. I think that roughly comes down to about 2600 QPS per node.
Also added a section discussing the component-level benchmark for this project using the idea from the row ttl project
I'm pretty sure that 200k-1M QPS range is on the cluster-level. I have seen that our larger customers have clusters with like 50-80 nodes that support this kind of range of QPS, so it might be better to base the calculation assuming that each node needs to support 20k QPS on the high end. Hopefully @nvanbenschoten can share his experience too.
docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Added using row-level TTL to "Rationale and Alternative" section. What is the current progress of the row-level TTL? Have we committed to shipping it for 22.1?
My understanding is that TTL project is still in a discovery stage, so it would probably be better to not assume it'll be available for production use for 22.1. That said it might be worth for you to chat with Oliver and Vy to see what their thinking/timeline is.
docs/RFCS/20211013_contentions.md, line 102 at r4 (raw file):
However, the current data model experiences few drawbacks: * If one of user has a transaction that is running slow due to contention (a blocked
nit: probably s/one of user/a user/.
docs/RFCS/20211013_contentions.md, line 287 at r4 (raw file):
least amount of unknown-unknowns and least amount of inter-team dependency. ### Option 2: KV Team embeds `NodeID` into Contention Event
This approach seems like the right way to go. As a non-KV person, I think that the changes in KV land will be very minimal, so I'd assume that it is safe to use this approach for the implementation. Hopefully @nvanbenschoten agrees.
This change augments the `TxnMeta` protobuf structure to include the gateway node ID (responsible for initiating the transaction) when serializing the intent. By doing so, this commit enables the Contention Event Store proposed in cockroachdb#71965, utilizing option 2. Release note: None
Wanted to call out a couple things:
- The KV-related work to store `NodeID`/`SQLInstanceID` in the `TxnMeta`, which is included in the `ContentionEvent` structure, is wrapped up by kv,storage: persist gateway node id in transaction intents #73500.
- As we discussed recently, there may be some considerations to make about the RPCs in a multi-tenant cluster, as the `SQLInstanceID`s, while guaranteed to live as long as a transaction, may not necessarily be long-lived. The ID embedded by KV is the `SQLInstanceID`, which, in a single-tenant cluster, is equivalent to a `NodeID`, but in a multi-tenant cluster is a tenant-specific SQL pod identifier. Once the work described in rfc: mini RFC for refactoring node IDs #73309 wraps up, we may be able to transition to using `GenericNodeID`.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, @nvanbenschoten, and @otan)
Resolves cockroachdb#71598 Release note: None
TFTR! bors r+
Build failed (retrying...):
Build failed (retrying...):
Build succeeded:
Resolves #71598
Rendered link: https://github.com/Azhng/cockroach/blob/rfc-ces/docs/RFCS/20211013_contentions.md
Release note: None