
rfc: introduce contention event store RFC #71965

Merged: 1 commit merged on May 7, 2022
Conversation

Azhng (Contributor)

@Azhng Azhng commented Oct 25, 2021

@Azhng Azhng requested review from yuzefovich and a team October 25, 2021 22:18
@Azhng Azhng requested a review from a team as a code owner October 25, 2021 22:18
@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich yuzefovich left a comment

Nice work! I left a lot of nits (as I often do) plus some questions, but generally your idea makes sense to me.

I'll do another pass and will share more feedback.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


docs/RFCS/20211013_contentions.md, line 24 at r1 (raw file):

| Contending Transaction |  The transaction that is holding the lock that blocks other transaction |
| Blocked Transaction | The transaction that is blocked by a contending transaction |
| Txn ID Cache | A cache that stores the mapping from transaction ID to transaction Fingerprint ID |

I think we also want to define what the "transaction fingerprint ID" is in this table.

nit: s/Fingerprint/fingerprint/.


docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):

contending transaction (TxnB), a contention event is generated at the KV layer
when the TxnA is finally able to proceed. The contention event includes the
 of TxnB, the duration of the contention from TxnA's

nit: probably missing ID, i.e. "the of TxnB" -> "the ID of TxnB".


docs/RFCS/20211013_contentions.md, line 55 at r1 (raw file):

This event is then stored within the tracing span, and then is sent back to the
gateway node at the end of DistSQL Flow and being recorded at the contention

nit: "being recorded" -> "is recorded".


docs/RFCS/20211013_contentions.md, line 94 at r1 (raw file):

However, the current data model experiences few drawbacks:
* If one of user's transaction is running slow due to contention, (a blocked

super nit: it's unusual to see commas around text in parenthesis.


docs/RFCS/20211013_contentions.md, line 99 at r1 (raw file):

  transactions.
* Even if user is able to (somehow magically) identify a list of contending
  transactions, the only piece of information a user can get is .

Incomplete sentence.


docs/RFCS/20211013_contentions.md, line 121 at r1 (raw file):

For each blocked transaction, user can view all the contending transactions
that caused contentions. Both blocked transaction's txn ID and contending

nit: "contentions" -> "contention".


docs/RFCS/20211013_contentions.md, line 157 at r1 (raw file):

  -- This might not be needed if we choose not to adapt the system table design.
  PRIMARY KEY (ts, blocked_txn_id, contending_txn_id, blocked_stmt_fingerprint_id)

I'm thinking that maybe blocked_stmt_fingerprint_id should be before contending_txn_id? I'm imagining a scenario where a user has already identified the time period when a particular statement from a particular txn experienced the contention, and then the user wants to find all contending txn IDs - if we switch the order, then we might get tighter range scans (this would matter for blocked txns with many statements).


docs/RFCS/20211013_contentions.md, line 163 at r1 (raw file):

In order to facilitate historical queries, Contention Event Store stores
timeseries data with a timsstamp as its first column.  Each row in the table

nit: s/timsstamp/timestamp/.


docs/RFCS/20211013_contentions.md, line 182 at r1 (raw file):

`contending_txn_fingerprint_id` column. Effectively, this means we need a
mechanism to resolve a TxnID (that belongs to a transaction that was
potentially executed on a different node) into its corresponding a transaction

nit: probably "corresponding a transaction" -> "contending transaction".


docs/RFCS/20211013_contentions.md, line 204 at r1 (raw file):

is as follow:

Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently

This TxnIDCache doesn't exist, right? If so, I'd change the language slightly to use the future tense, i.e. "Each SQL node in the CockroachDB will maintain an in-memory FIFO cache...".


docs/RFCS/20211013_contentions.md, line 205 at r1 (raw file):

Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently
executed transactions. (namely `TxnIDCache`). The in-memory cache

nit: no period before (namely.


docs/RFCS/20211013_contentions.md, line 206 at r1 (raw file):

Each SQL node in the CockroachDB maintains an in-memory FIFO cache of recently
executed transactions. (namely `TxnIDCache`). The in-memory cache
stores the tuple of `(transaction_id, transaction_fingerprint_id)`.

nit: probably "the tuple" -> "the tuples".


docs/RFCS/20211013_contentions.md, line 209 at r1 (raw file):

`transaction_id` is a UUID with 16-byte storage overhead and
`transaction_fingerprint_id` is a unsigned 64-bit integer with 8-byte storage
overhead. Hence, with an overhead of 1 MiB, we will be able to store ~43,690

nit: I'd link this number to the estimation done below.


docs/RFCS/20211013_contentions.md, line 217 at r1 (raw file):

``` proto 
message TransactionIDResolutionRequest {

nit: I'd consider using Txn instead of Transaction in the type/variable names for brevity.


docs/RFCS/20211013_contentions.md, line 228 at r1 (raw file):

     // inquired node. If the transactionID is not present on the inquired
     // node, then it is not returned.
     bytes transactionIDs = 1 [(gogoproto.customname) = "ID",

nit: s/transactionIDs/transactionID/


docs/RFCS/20211013_contentions.md, line 238 at r1 (raw file):

     // This would require the node to check ActiveQueries store in addition
     // to TxnID Cache.
     uint64 transactionFingerprintIDs = 2 [(gogoproto.customname) = "TransactionFingerprintIDs",

nit: s/transactionFingerprintIDs/transactionFingerprintID/.


docs/RFCS/20211013_contentions.md, line 253 at r1 (raw file):

Each SQL node collects contention events as they come in and stores the unresolved
TxnID into a temporary buffer. Periodically, (subject to jitter), each SQL node

Can you expand on what is "subject to jitter"? Will we introduce a cluster setting for the period? Will it be probabilistic (i.e. after every 1000th contention event)?


docs/RFCS/20211013_contentions.md, line 278 at r1 (raw file):

contention event. (In Serverless mode, we would embed `SQLInstanceID`). This means
when a contention event is generated, it will also embed this identifier. This
would require the he KV layer to store this `NodeID` into the lock objects as

nit: "the he".


docs/RFCS/20211013_contentions.md, line 286 at r1 (raw file):

make here is on what we do with unresolved TxnIDs.

Unlike the previous design where we stores a temporary buffer of unresolved

nit: s/stores/store/.

nit: unclear what is "unresolved s".


docs/RFCS/20211013_contentions.md, line 289 at r1 (raw file):

s, we expand that temporary buffer to store a map from each
`NodeID` to a list of unresolved TxnIDs (`map[node_id][]transaction_id`).
Periodically, instead of perform cluster-wide RPC fanout, each node iterates

nit: s/perform/performing/.
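The per-node buffer described in the quoted passage above can be sketched roughly as follows. This is a hypothetical Go illustration; the type and method names are illustrative, not CockroachDB's actual ones:

```go
package main

import "fmt"

// Sketch of the per-node unresolved buffer: instead of a flat list that
// requires a cluster-wide RPC fanout to resolve, unresolved txn IDs are
// grouped by the NodeID embedded in each contention event, so resolution
// RPCs can target only the nodes that coordinated the contending txns.
type NodeID int
type TxnID string // stand-in for a 16-byte UUID

type unresolvedBuffer struct {
	byNode map[NodeID][]TxnID
}

func newUnresolvedBuffer() *unresolvedBuffer {
	return &unresolvedBuffer{byNode: make(map[NodeID][]TxnID)}
}

// record buckets an unresolved txn ID under its coordinator node.
func (b *unresolvedBuffer) record(n NodeID, id TxnID) {
	b.byNode[n] = append(b.byNode[n], id)
}

// nodesToContact returns how many targeted RPCs a resolution pass needs.
func (b *unresolvedBuffer) nodesToContact() int {
	return len(b.byNode)
}

func main() {
	b := newUnresolvedBuffer()
	b.record(1, "txn-a")
	b.record(1, "txn-b")
	b.record(3, "txn-c")
	fmt.Println(b.nodesToContact()) // 2 distinct nodes, not a full fanout
}
```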


docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):

See [Footnote](#Footnote) for examples.

### System Table vs. Virtual Table

Do you mention options with a virtual table only because we might run out of the IDs for system tables? Are there other advantages / disadvantages of the virtual table option?


docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):

#### Virtual Table Only

As of 21.2, CockroachDB does not support dynamic system table IDs. 50 system

Maybe we don't support "fixed" or "static" system table IDs and not "dynamic"?


docs/RFCS/20211013_contentions.md, line 319 at r1 (raw file):

As of 21.2, CockroachDB does not support dynamic system table IDs. 50 system
table descriptor IDs have been previously reserved. As of writing, there are
3 system table descriptor ID remains. This means there is a possibility that the

nit: s/remains/left/.


docs/RFCS/20211013_contentions.md, line 345 at r1 (raw file):

entries into the system table. The system table will adopt hash-sharded primary
key to reduce write-write contention similar to SQL Stats tables. A scheduled
job will be created to perform cleanup tasks and delete stale contention events.

Could you expand on this? What are "cleanup tasks" - is it deleting rows past certain TTL? What are the "stale contention events"?


docs/RFCS/20211013_contentions.md, line 352 at r1 (raw file):

necessary observability into the new subsystem. In addition, new cluster
settings need to be introduced to provide runtime tuning knobs. This provides
DB operators the ability to modify subsystem behaviours (to a certain extend)

nit: s/behaviors/behavior/ and s/extend/extent/.


docs/RFCS/20211013_contentions.md, line 367 at r1 (raw file):

Additional monitoring metrics that will be introduced:

* `sql.contentions.unresolved_event.count`: this is a gauge metric that keeps

nit: "contentions" reads unusual to me, maybe s/contentions/contention/?


docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):

  memory monitor response for memory accounting for the new contention event
  store.
* `sql.contentions.resolution_latency`: this is a latency metric that keeps

Does this track how long the resolution of a single batch of TxnIDs take? Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?


docs/RFCS/20211013_contentions.md, line 385 at r1 (raw file):

  contention events flushed into the system table.
* `sql.contentions.flush.duration`: this is a latency metric that records the
  duration of flushing contention event into the system table.

nit: s/event/events/


docs/RFCS/20211013_contentions.md, line 386 at r1 (raw file):

* `sql.contentions.flush.duration`: this is a latency metric that records the
  duration of flushing contention event into the system table.
* `sql.contentions.flush.error`: this is a counter metrics that records the

nit: s/metrics/metric/.


docs/RFCS/20211013_contentions.md, line 399 at r1 (raw file):

  ID resolutions. All contention event generated will be hold in the unresolved
  buffer with FIFO evicton policies.
* `sql.contentions.txn_id_resolution.{interval, jitter}`: these two cluster

What is "jitter" in this context?


docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):

  constraints the maximum amount of entries stored in unresolved contention
  event buffer per node.
* `sql.contentions.txn_id_cache.max_entry`: this is a size setting that

nit: "max_entry" to me sounds more like "the limit on a size of a single entry", so maybe do "max_entry" -> "max_count" or "num_max_entries".


docs/RFCS/20211013_contentions.md, line 417 at r1 (raw file):

  will be retained in the system table.
<!-- should we go with TTL-based expiration?  !-->
* `sql.contentions.cleanup.recurrence`: this is a cron expression that specify

Do we have a precedent for a similar (i.e. "cron expression") setting already?


docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):

<!-- should we go with TTL-based expiration?  !-->
* `sql.contentions.cleanup.recurrence`: this is a cron expression that specify
  how the frequency that the clean up job will be executed.

nit: "how frequently the cleanup job will be executed".


docs/RFCS/20211013_contentions.md, line 448 at r1 (raw file):

every transaction finishes execution, it is inserted into TxnID Cache, whereas
contention event is inserted into unresolved buffer each time a contention
event is generated. Two buffer are being populated at a drastically different

nit: s/Two buffer/Two buffers/.


docs/RFCS/20211013_contentions.md, line 466 at r1 (raw file):

the unresolved contention event.

The rate at which we discard unresolved contention event is closely related to

nit: s/event/events/ here and on the next line.


docs/RFCS/20211013_contentions.md, line 467 at r1 (raw file):

The rate at which we discard unresolved contention event is closely related to
the rate at whcih we collect contention event, and the size of TxnID Cache.

nit: s/whcih/which/.


docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):

Suppose a node is experiencing a consistent 2,000 QPS. Currently, we enable
tracing on the statement with 1% probability. Suppose 40% of the statement
experiences contentions (this number is hard to estimate, as one workload

nit: s/contentions/contention/, also something is off in "one workload and drastically different from another".


docs/RFCS/20211013_contentions.md, line 474 at r1 (raw file):

`2000 * 1% * 40% = 8` contention events per second. The amount of time each
node has to perform TxnID resolution equals to how long TxnID Cache can hold
each TxnID to transaction fingerprint ID cache. This depends on the

nit: something is off in "each TxnID to transaction fingerprint ID cache".


docs/RFCS/20211013_contentions.md, line 478 at r1 (raw file):

then each cache entry takes 24 bytes (UUID + uint64), we can store up to
`~43,690` entries in this cache. This means with 2000 QPS, we have ~21 seconds
before we start to discarding contention event at 8 events per second.

nit: s/start to discarding contention event/start discarding contention events/.
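The back-of-envelope numbers in the quoted estimate can be checked with a small sketch (hypothetical Go; the function names are illustrative, not part of the RFC):

```go
package main

import "fmt"

// cacheEntries computes how many (txn ID, fingerprint ID) tuples fit in a
// byte budget; each entry is a 16-byte UUID plus an 8-byte uint64.
func cacheEntries(budgetBytes, entryBytes int) int {
	return budgetBytes / entryBytes
}

// secondsBeforeEviction computes how long an entry survives in a FIFO cache
// being filled at the given transactions-per-second rate.
func secondsBeforeEviction(entries, qps int) int {
	return entries / qps
}

func main() {
	entries := cacheEntries(1<<20, 16+8)
	fmt.Println(entries)                              // 43690 entries fit in 1 MiB
	fmt.Println(secondsBeforeEviction(entries, 2000)) // 21 seconds at 2,000 QPS
	fmt.Println(2000 * 1 / 100 * 40 / 100)            // 8 contention events per second
}
```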


docs/RFCS/20211013_contentions.md, line 511 at r1 (raw file):

# Future work

## Statement-level reoslution

nit: s/reoslution/resolution/.


docs/RFCS/20211013_contentions.md, line 513 at r1 (raw file):

## Statement-level reoslution

We want to extend from understanding the which contending transaction

nit: s/understanding the which/understanding which/.


docs/RFCS/20211013_contentions.md, line 543 at r1 (raw file):

information with each other (similar to the fashion outlined in this
RFC), we should start considering a generalized framework that handles all of
the background communication as well as error handlings.

nit: s/handlings/handling/.


docs/RFCS/20211013_contentions.md, line 549 at r1 (raw file):

## Normal Execution Path

Exeucted as part of query exeuction code path

nit: s/exeuction/execution/.


docs/RFCS/20211013_contentions.md, line 552 at r1 (raw file):

Step 1: Two nodes executing conflicting Transactions

Nice diagrams!


docs/RFCS/20211013_contentions.md, line 565 at r1 (raw file):

   |                                               |                                                                       |
   |         +---------+ +-----------------------+ |                                                                       |
   |         | CES     | |unresolved CES buffer  | |                                                                       |

super nit: I don't think you introduced the acronym CES - might be worth mentioning it the very first time you use "Contention Event Store" at the very top or here, right before the first diagram.


docs/RFCS/20211013_contentions.md, line 592 at r1 (raw file):

Step 2: each node populates TxnID Cache, node 2 receives contention events with unresolved transaciton ID

nit: s/transaciton/transaction/.


docs/RFCS/20211013_contentions.md, line 716 at r1 (raw file):

Step 5: More Transaction are being executed, the oldest entries in TxnID Cache gets evicted

nit: s/Transaction/Transactions/ and s/gets/get/


docs/RFCS/20211013_contentions.md, line 1064 at r1 (raw file):

Scratch Note (delete before send out draft)

Do you wanna delete this now?

@knz knz (Contributor) commented Oct 26, 2021

I think this RFC is skirting around the elephant in the room: isn't this contention table, by itself, going to become a major source of contention if stored in KV?

Like, what provision is made to ensure it doesn't become a hotspot?
How are its PK and indices designed to avoid contention?

Don't we want this to be split in such a way that different nodes can write their contention data into different ranges?

Also, what is the eviction algorithm? What provision is made that this table doesn't grow too large, too fast?
When are events deleted exactly? What happens with the deleted rows? What is the GC TTL?


The idea to have the data memory-only is good, especially if combined with a max size policy (and corresponding eviction rules). I personally prefer that option, although my opinion could be swayed the other way if there are suitable answers to the points above.

If we go for memory-only, we'll probably want the RFC to also mention a way this data can be dumped to disk, included in UI, exported in debug zip, etc.

@Azhng Azhng (Contributor, Author) left a comment

I expanded the system table section to describe the reasoning behind why I think this new contention table will not introduce hot ranges or cause contention. It also includes a section describing the provisions made so that this table doesn't grow too large, too fast.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


docs/RFCS/20211013_contentions.md, line 24 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we also want to define what the "transaction fingerprint ID" is in this table.

nit: s/Fingerprint/fingerprint/.

Done.


docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably missing ID, i.e. "the of TxnB" -> "the ID of TxnB".

Done.


docs/RFCS/20211013_contentions.md, line 55 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "being recorded" -> "is recorded".

Done.


docs/RFCS/20211013_contentions.md, line 94 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: it's unusual to see commas around text in parenthesis.

Done.


docs/RFCS/20211013_contentions.md, line 99 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Incomplete sentence.

Done.


docs/RFCS/20211013_contentions.md, line 121 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "contentions" -> "contention".

Done.


docs/RFCS/20211013_contentions.md, line 157 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm thinking that maybe blocked_stmt_fingerprint_id should be before contending_txn_id? I'm imagining a scenario where a user has already identified the time period when a particular statement from a particular txn experienced the contention, and then the user wants to find all contending txn IDs - if we switch the order, then we might tighter range scans (this would matter for blocked txns with many statements).

Good point. Done.


docs/RFCS/20211013_contentions.md, line 182 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably "corresponding a transaction" -> "contending transaction".

Done.


docs/RFCS/20211013_contentions.md, line 204 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This TxnIDCache doesn't exist, right? If so, I'd change the language slightly to use the future tense, i.e. "Each SQL node in the CockroachDB will maintain an in-memory FIFO cache...".

Done.


docs/RFCS/20211013_contentions.md, line 205 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: no period before (namely.

Done.


docs/RFCS/20211013_contentions.md, line 206 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably "the tuple" -> "the tuples".

Done.


docs/RFCS/20211013_contentions.md, line 209 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I'd link this number to the estimation done below.

Done.


docs/RFCS/20211013_contentions.md, line 217 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I'd consider using Txn instead of Transaction in the type/variable names for brevity.

Done.


docs/RFCS/20211013_contentions.md, line 228 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/transactionIDs/transactionID/

Done.


docs/RFCS/20211013_contentions.md, line 238 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/transactionFingerprintIDs/transactionFingerprintID/.

Done.


docs/RFCS/20211013_contentions.md, line 253 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Can you expand on what is "subject to jitter"? Will we introduce a cluster setting for the period? Will it be probabilistic (i.e. after every 1000th contention event)?

Expanded on the calculation using both interval and jitter cluster settings. It will be a probabilistic calculation.

In the cluster setting section below I introduced two additional cluster settings: sql.contention.txn_id_resolution.{interval, jitter}


docs/RFCS/20211013_contentions.md, line 278 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "the he".

Done.


docs/RFCS/20211013_contentions.md, line 286 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/stores/store/.

nit: unclear what is "unresolved s".

Done.


docs/RFCS/20211013_contentions.md, line 289 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/perform/performing/.

Done.


docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do you mention options with a virtual table only because we might run out of the IDs for system tables? Are there other advantages / disadvantages of the virtual table option?

The main reason that I'm weighing these two options right now is indeed our shortage of descriptor IDs.

I think the main trade-offs here are:

  • a simple in-memory system vs. a complex system involving flushing and GC
  • the length of the retention period
  • survival across node restarts

A longer retention period and data survivability across node restarts were the two main reasons that motivated us to implement persisted sql stats for 21.2, since we have received feedback that this was one of the primary pain points for users when sql stats was originally introduced in 21.1.


docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Maybe we don't support "fixed" or "static" system table IDs and not "dynamic"?

I'm not sure if I'm following here?


docs/RFCS/20211013_contentions.md, line 345 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Could you expand on this? What are "cleanup tasks" - is it deleting rows past certain TTL? What are the "stale contention events"?

Expanded this section to describe cleanup job in detail.


docs/RFCS/20211013_contentions.md, line 367 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "contentions" reads unusual to me, maybe s/contentions/contention/?

Done.


docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):

Does this track how long the resolution of a single batch of TxnIDs take?

This is my original thinking. If this metric is spiking then it's a good indicator of one of the following:

  • service latency for the resolution RPC is high (e.g. network issues?)
  • the cluster is too large? perhaps the cluster is reaching a size where cluster-wide RPC fanout is becoming prohibitively expensive?
  • uncaught bugs?

Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?

Hmm, I think this is also a useful metric, though I think it overlaps with the previous metric to a certain extent. However, this metric will be sensitive to long-running transactions, because we cannot compute the transaction fingerprint ID for active transactions. This means txn ID resolution needs to retry, which can spike this metric.

Hmm, do you think we would need both?


docs/RFCS/20211013_contentions.md, line 385 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/event/events/

Done.


docs/RFCS/20211013_contentions.md, line 386 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/metrics/metric/.

Done.


docs/RFCS/20211013_contentions.md, line 399 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What is "jitter" in this context?

jitter here is a floating point number in [0, 1]. This is to introduce some randomness to the interval to prevent the thundering herd problem. AFAIK, it's currently used in the job system and the sql stats system.


docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "max_entry" to me sounds more like "the limit on a size of a single entry", so maybe do "max_entry" -> "max_count" or "num_max_entries".

Switched to max_count. Though, do you think we should specify this limit in terms of count or in terms of byte size?


docs/RFCS/20211013_contentions.md, line 417 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do we have a precedent for a similar (i.e. "cron expression") setting already?

We currently do have it in sql stats. This is because the scheduled job system allows each schedule to have a cron expression for its recurrence.


docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "how frequently the cleanup job will be executed".

Done.


docs/RFCS/20211013_contentions.md, line 448 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Two buffer/Two buffers/.

Done.


docs/RFCS/20211013_contentions.md, line 466 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/event/events/ here and on the next line.

Done.


docs/RFCS/20211013_contentions.md, line 467 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/whcih/which/.

Done.


docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/contentions/contention/, also something is off in "one workload and drastically different from another".

Done.


docs/RFCS/20211013_contentions.md, line 474 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: something is off in "each TxnID to transaction fingerprint ID cache".

Reworded.


docs/RFCS/20211013_contentions.md, line 478 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/start to discarding contention event/start discarding contention events/.

Done.


docs/RFCS/20211013_contentions.md, line 549 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/exeuction/execution/.

Done.


docs/RFCS/20211013_contentions.md, line 552 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Nice diagrams!

Thanks! 😅


docs/RFCS/20211013_contentions.md, line 565 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: I don't think you introduced the acronym CES - might be worth mentioning it the very first time you use "Contention Event Store" at the very top or here, right before the first diagram.

Done.


docs/RFCS/20211013_contentions.md, line 716 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Transaction/Transactions/ and s/gets/get/

Done.


docs/RFCS/20211013_contentions.md, line 1064 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do you wanna delete this now?

Done.

@andreimatei andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 38 at r2 (raw file):

Currently, CockroachDB collects transaction-level contention information from
KV-layer when tracing is enabled. When a transaction (TxnA) is blocked by a

fwiw, I'm trying to get some level of tracing to be always active. This can mean (although it doesn't have to mean) that all conflicts encountered by statements could be reported to their gateways.

@yuzefovich yuzefovich (Member) left a comment

Small portion of nits from me with no additional feedback yet.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

I'm not sure if I'm following here?

Sorry, my comment was confusing.

I guess I'm just confused by the first sentence here - to me it implies that before 21.2 we had some "dynamic" scheme of assigning IDs to the system tables, but I don't think that it was true. I think the IDs have always been "fixed" - i.e. once a new system table is introduced, we pick the next available ID and use it for that table. We have a limit of 50 such IDs, and we're about to run out.

So, I'd probably rephrase the first sentence a little: "At the time of this writing, CockroachDB does not support dynamic ...".


docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Done.

I am still confused by this sentence.


docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Done.

something is off in "one workload and drastically different from another"


docs/RFCS/20211013_contentions.md, line 22 at r2 (raw file):

| Terminology | Explanation |
| --- | --- |
| Statement Fingerprint | SQL statement string with literal value replaced with underscores |

nit: s/value/values/.


docs/RFCS/20211013_contentions.md, line 329 at r2 (raw file):

As of 21.2, CockroachDB does not support dynamic system table descriptor IDs.
50 system table descriptor IDs have been previously reserved. As of writing,
there are 3 unused system table descriptor ID left. This means there is a

nit: s/ID/IDs/.

@knz knz (Contributor) left a comment

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @nvanbenschoten, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 365 at r2 (raw file):

Since the contention event table stores the time series data, it will adapt
hash-sharded primary index to ensure we do not run into the hot range problem.

can you spell out the schema?

@Azhng Azhng (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 317 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Sorry, my comment was confusing.

I guess I'm just confused by the first sentence here - to me it implies that before 21.2 we had some "dynamic" scheme of assigning IDs to the system tables, but I don't think that it was true. I think the IDs have always been "fixed" - i.e. once a new system table is introduced, we pick the next available ID and use it for that table. We have a limit of 50 such IDs, and we're about to run out.

So, I'd probably rephrase the first sentence a little: "At the time of this writing, CockroachDB does not support dynamic ...".

Done.


docs/RFCS/20211013_contentions.md, line 418 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I am still confused by this sentence.

Rewrote this sentence.


docs/RFCS/20211013_contentions.md, line 470 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

something is off in "one workload and drastically different from another"

Ah, I meant "one workload can be drastically different from another". Fixed.


docs/RFCS/20211013_contentions.md, line 22 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/value/values/.

Done.


docs/RFCS/20211013_contentions.md, line 38 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

fwiw, I'm trying to get some level of tracing to be always active. This can mean (although it doesn't have to mean) that all conflicts encountered by statements could be reported to their gateways.

This is good to hear!


docs/RFCS/20211013_contentions.md, line 329 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/ID/IDs/.

Done.


docs/RFCS/20211013_contentions.md, line 365 at r2 (raw file):

Previously, knz (kena) wrote…

can you spell out the schema?

The schema is listed in the SQL Interface section above.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Done.

It's more than just the ID, it's the entire TxnMeta. It's worth calling that out, because there's additional information in there, including the gateway_node_id when we add it.


docs/RFCS/20211013_contentions.md, line 92 at r3 (raw file):

It is evident that the existing contention event subsystem is built with a
specific (table-centric) workflow in mind, that is:
* allow user to understand which table/index is running hot

nit: consider making it more clear that these are sequential steps. It threw me while reading to see "identify the hot row in that table/index" here but then "the user does not have an easy way to identify which row the blocked transaction is trying to read" below.


docs/RFCS/20211013_contentions.md, line 109 at r3 (raw file):

In summary, user cannot easily connect blocked transaction to contending
transaction, and even if user can somehow make that connection, the current

s/if user/if the user/


docs/RFCS/20211013_contentions.md, line 152 at r3 (raw file):

  contending_txn_id           UUID  NOT NULL,

  blocked_stmt_fingerprint_id BYTES NOT NULL,

If the fingerprint IDs are 64-bits, should these be a fixed-width datatype like an INT8 for a more compact encoding?


docs/RFCS/20211013_contentions.md, line 289 at r3 (raw file):

when a contention event is generated, it will also embed this identifier. This
would require the KV layer to store this `NodeID` into the lock objects as
transactions acquire them.

Thanks for writing up #72303. We've discussed it on the KV team and don't think it will be a large lift. Do you mind linking to that issue from here?


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

stale data.

One thing that we need to be careful is we need avoid writing too much data

These few paragraphs are really helpful. I was getting concerned that performance had not been taken into consideration. We've seen in the past that observability projects that don't define performance goals from the start are at risk of learning about performance problems late in the development cycle. I'm interested in whether we've defined such goals for this project (e.g. "enabling the contention event store with eager txn ID resolution on YCSB-A will reduce top-line throughput by no more than 3%") and how we plan to incrementally measure the various components to make sure they are performing sufficiently well.


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

stale data.

One thing that we need to be careful is we need avoid writing too much data

Related to my other comment about performance, this back-of-the-envelope math is great. But the 2,000 QPS cited is also very conservative. We have customers that run production clusters with a steady-state pushing 1M QPS, and should likely design any new component to scale to an order of magnitude beyond this. The sampling and the low-pass filter give us the ability to significantly downsample, but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments, especially if the contention event store scales any less well than the rest of the user's workload.

So we should probably adjust these numbers to represent a typical large customer (maybe 200k QPS) and also do the math on the top end of the design.

Much like we're doing with the row-level TTL work, we should probably also find a way early on in the development cycle to benchmark the contention event store in isolation.


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):
(third thread about performance in a row, because why not 😃)

but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments

Does capturing a static fraction of all user traffic make sense for larger deployments? I'd imagine that there are diminishing returns as we capture more and more information. Have you considered a design where either the sampling rate or the low-pass filter is adaptive and targets some maximum contention event rate?


docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):

This can provide us with guidance on the value for the low pass filter.

To ensure contention event table doesn't grow too large, a new cleanup job

This may be a crazy idea, but instead of building a new cleanup job, can we instead take advantage of the new row-level TTL project? It seems like many of these pieces are already being built elsewhere by @otan. A benefit of this is that the cleanup would then benefit from any performance optimization (of which there are many possibilities) we add now or later to make row-level TTL faster or less disruptive.


pkg/sql/opt/exec/explain/plan_gist_factory.og.go, line 1 at r3 (raw file):

// Code generated by optgen; DO NOT EDIT.

I don't think you meant to add this to the PR.

@Azhng Azhng force-pushed the rfc-ces branch 2 times, most recently from adc4f2b to 1cc9d83 Compare November 3, 2021 21:28
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, @otan, and @yuzefovich)


docs/RFCS/20211013_contentions.md, line 38 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's more than just the ID, it's the entire TxnMeta. It's worth calling that out, because there's additional information in there, including the gateway_node_id when we add it.

Updated to TxnMeta and updated the terminology to specifically call out TxnMeta


docs/RFCS/20211013_contentions.md, line 92 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider making it more clear that these are sequential steps. It threw me while reading to see "identify the hot row in that table/index" here but then "the user does not have an easy way to identify which row the blocked transaction is trying to read" below.

Done. Rewrote the paragraph to make it clearer that these are sequential steps.


docs/RFCS/20211013_contentions.md, line 109 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/if user/if the user/

Done.


docs/RFCS/20211013_contentions.md, line 152 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If the fingerprint IDs are 64-bits, should these be a fixed-width datatype like an INT8 for a more compact encoding?

When the fingerprint ID was originally introduced, it was a 64-bit unsigned integer. However, since we don't support an unsigned integer type in SQL, trivially casting the fingerprint ID into a signed integer led to a very strange UX. Since the fingerprint ID is essentially just a binary blob, we went with the BYTES type.


docs/RFCS/20211013_contentions.md, line 289 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks for writing up #72303. We've discussed it on the KV team and don't think it will be a large lift. Do you mind linking to that issue from here?

Done.


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Related to my other comment about performance, this back-of-the-envelope math is great. But the 2,000 QPS cited also very conservative. We have customers that run production clusters with a steady-state pushing 1M QPS, and should likely design any new component to scale to an order of magnitude beyond this. The sampling and the low-pass filter give us the ability to significantly downsample, but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments, especially if the contention event store scales any less well than the rest of the user's workload.

So we should probably adjust these numbers to represent a typical large customer (maybe 200k QPS) and also do the math on the top end of the design.

Much like we're doing with the row-level TTL work, we should probably also find a way early on in the development cycle to benchmark the contention event store in isolation.

Hmm, this is very interesting. Is the 200k QPS ~ 1M QPS at the cluster level or the per-node level? The 2,000 QPS cited here is at the per-node level. I got this number from our docs on performance, which suggest we are able to handle 600K QPS with ~230 nodes. That comes down to roughly 2,600 QPS per node.

Also added a section discussing the component-level benchmark for this project using the idea from the row ttl project


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

These few paragraphs are really helpful. I was getting concerned that performance had not been taken into consideration. We've seen in the past that observability projects that don't define performance goals from the start are at risk of learning about performance problems late in the development cycle. I'm interested in whether we've defined such goals for this project (e.g. "enabling the contention event store with eager txn ID resolution on YCSB-A will reduce top-line throughput by no more than 3%") and how we plan to incrementally measure the various components to make sure they are performing sufficiently well.

I split this paragraph into its own section and explicitly called out the top-level performance objective. Quick question about the benchmark we are using: I'm not quite familiar with the YCSB workload; previously we have been using TPCC to measure the overall performance impact. Is there anything special about using YCSB-A in this case?


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(third thread about performance in a row, because why not 😃)

but work that is proportional to even a static fraction of the user traffic may be excessively costly in larger deployments

Does capturing a static fraction of all user traffic make sense for larger deployments? I'd imagine that there are diminishing returns as we capture more and more information. Have you considered a design where either the sampling rate or the low-pass filter is adaptive and targets some maximum contention event rate?

Added a section later in the document discussing the idea of an adaptive low-pass filter threshold.


docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This may be a crazy idea, but instead of building a new cleanup job, can we instead take advantage of the new row-level TTL project? It seems like many of these pieces are already being built elsewhere by @otan. A benefit of this is that the cleanup would then benefit from any performance optimization (of which there are many possibilities) we add now or later to make row-level TTL faster or less disruptive.

Added using row-level TTL to the "Rationale and Alternatives" section. What is the current progress of row-level TTL? Have we committed to shipping it for 22.1?


pkg/sql/opt/exec/explain/plan_gist_factory.og.go, line 1 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think you meant to add this to the PR.

Removed

@Azhng Azhng force-pushed the rfc-ces branch 2 times, most recently from 974b93c to 1f501ba Compare November 3, 2021 21:37
Member

@yuzefovich yuzefovich left a comment

I think we should go with the system table approach and with teaching the KV to embed the NodeID/SQLInstanceID identifier into the contention events.

Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, @nvanbenschoten, and @otan)


docs/RFCS/20211013_contentions.md, line 313 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

The main reason I'm weighing these two options right now is indeed our shortage of descriptor IDs.

I think the main trade-offs here are:

  • a simple in-memory system vs. a complex system involving flushing and GC
  • length of the retention period
  • surviving node restarts

A longer retention period and data survivability across node restarts were the two main reasons that motivated us to implement persisted SQL stats for 21.2, since we have received feedback that this was one of the primary pain points for users when SQL stats were originally introduced in 21.1.

I see; I think the persisted contention information could be extremely useful. I remember seeing an email from Andrew W or Marius about reaching out to them in case you're considering introducing a new system table for 22.1, so I think it'd be worth checking with the SQL Schema team.


docs/RFCS/20211013_contentions.md, line 375 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Does this track how long the resolution of a single batch of TxnIDs takes?

This is my original thinking. If this metric is spiking, then it's a good indicator that either:

  • service latency for the resolution RPC is high (e.g. network issues?)
  • the cluster size is too large? Perhaps the cluster is reaching a size where cluster-wide RPC fanout is becoming prohibitively expensive?
  • there are uncaught bugs?

Or is it about the latency between a contention event entering the temporary buffer and that event being resolved?

Hmm, I think this is also a useful metric, though it overlaps with the previous metric to a certain extent. However, this metric will be sensitive to long-running transactions, because we cannot compute the transaction fingerprint ID for active transactions. This means the txn ID resolution needs to retry, which can spike this metric.

Hmm, do you think we would need both?

Your thinking makes sense to me; the first metric alone seems sufficient (we can extrapolate the time it takes for a contention event between entering the temporary buffer and being resolved as "flush interval + resolution latency").


docs/RFCS/20211013_contentions.md, line 404 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Switched to max_count. Though, do you think we should specify this limit in terms of count or in terms of byte size?

Hm, I think having it in terms of byte size makes more sense to me.

My attempt at doing some math lol: an entry in the TxnIDCache is 24 bytes, so a 4MiB cache would be able to hold ~175k entries. IIUC any txn might become the contending one, so we have to keep track of all txns executed on each node that was the gateway for those txns. Assuming 20k QPS per node (and assuming single-stmt txns), this would give us about 9 seconds, which seems a bit too short. We need the TxnIDCache to keep the records roughly for the duration of the longest-running txn (probably minutes in the worst cases) plus the flush interval (probably seconds). If we bump the size to 64MiB, we get over 2 minutes in the worst case, which seems better.

As far as implementation details go, we probably should lazily allocate this cache (i.e. not allocate the whole 64MiB right away if we choose 64MiB as the default). We might also consider introducing eviction not only based on size (FIFO style) but also based on some expiration time (say, we'd track the node-wide longest-running txn duration); this would, however, require adding a timestamp to each entry in the TxnIDCache. It seems a bit wasteful to keep very old txns in the TxnIDCache if we know that no other txn will ever need to resolve a TxnID for contention purposes. 64MiB seems like the largest size we could go for without introducing expiration-based eviction.
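The sizing math above can be sketched directly. The 24-byte entry size and 20k single-stmt txns per second per node are the assumptions from this thread:

```go
package main

import "fmt"

// retentionSeconds estimates how long a FIFO TxnID cache of the given
// byte size can retain entries before they are evicted, at a given
// per-node transaction rate.
func retentionSeconds(cacheBytes, entryBytes, txnsPerSecond int) float64 {
	entries := cacheBytes / entryBytes
	return float64(entries) / float64(txnsPerSecond)
}

func main() {
	const entryBytes = 24 // rough size of one TxnID -> fingerprint ID entry
	const qps = 20_000    // assumed single-stmt txns per second per node
	fmt.Printf("4MiB:  %.1f s\n", retentionSeconds(4<<20, entryBytes, qps))  // ~8.7 s
	fmt.Printf("64MiB: %.1f s\n", retentionSeconds(64<<20, entryBytes, qps)) // ~139.8 s
}
```

The retention window needs to cover the longest-running transaction plus the flush interval, which is what pushes the cache toward the 64MiB end.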


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

Previously, Azhng (Archer Zhang) wrote…

I split this paragraph into its own section, and explicitly called out the top level performance objective. Quick question about the benchmark we are using, i'm not quite familiar with YCSB workload, previously we have been using TPCC for measuring the overall performance impact. Is there anything special about using YCSB-A in this case?

Thanks Nathan for bringing up the performance considerations, I'm a heavy +1 on defining acceptable performance regressions upfront.

In the past we have definitely been hit by working on the observability improvements only to find later the performance overhead was too much, and then we scrambled to keep the improvements at least partially, usually configurable by a cluster setting.


docs/RFCS/20211013_contentions.md, line 379 at r3 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm this is very interesting. Is the 200k QPS ~ 1M QPS on the cluster-level or per-node level ? The 2,000 QPS cited here is at a per-node level. I got this number from our docs on performance which seems like we are able to handle 600K QPS with ~230 nodes. I think that roughly comes down to about 2600 QPS.

Also added a section discussing the component-level benchmark for this project using the idea from the row ttl project

I'm pretty sure that the 200k-1M QPS range is at the cluster level. I have seen that our larger customers have clusters with 50-80 nodes that support this kind of QPS range, so it might be better to base the calculation on each node needing to support 20k QPS on the high end. Hopefully @nvanbenschoten can share his experience too.


docs/RFCS/20211013_contentions.md, line 399 at r3 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Added using row-level TTL to "Rationale and Alternative" section. What is the current progress of the row-level TTL? Have we committed to shipping it for 22.1?

My understanding is that the TTL project is still in a discovery stage, so it would probably be better not to assume it'll be available for production use in 22.1. That said, it might be worth chatting with Oliver and Vy to see what their thinking/timeline is.


docs/RFCS/20211013_contentions.md, line 102 at r4 (raw file):

However, the current data model experiences few drawbacks:
* If one of user has a transaction that is running slow due to contention (a blocked

nit: probably s/one of user/a user/.


docs/RFCS/20211013_contentions.md, line 287 at r4 (raw file):

least amount of unknown-unknowns and least amount of inter-team dependency.

### Option 2: KV Team embeds `NodeID` into Contention Event

This approach seems like the right way to go. As a non-KV person, I think that the changes in KV land will be very minimal, so I'd assume that it is safe to use this approach for the implementation. Hopefully @nvanbenschoten agrees.

AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 10, 2021
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 15, 2021
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 15, 2021
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 16, 2021
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 17, 2021
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 17, 2021
73500: kv,storage: persist gateway node id in transaction intents r=AlexTalks a=AlexTalks

This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in #71965, utilizing option 2.

Release note: None

73862: sql: add test asserting CREATE/USAGE on public schema r=otan a=rafiss

refs #70266

The public schema currently always has CREATE/USAGE privileges
for the public role. Add a test that confirms this.

Release note: None

73873: scdeps: tighten dependencies, log more side effects r=postamar a=postamar

This commit reworks the dependency injection for the event logger, among
other declarative schema changer dependencies. It also makes the test
dependencies more chatty in the side effects log.

Release note: None

73932: ui: select grants tab on table details page r=maryliag a=maryliag

Previosuly, when the grants view was selected on the Database
Details page, it was going to the Table Details with the Overview
tab selected.
With this commit, if the view mode selected is Grant, the grant
tab is selected on the Table Details page.

Fixes #68829

Release note: None

73943: cli: support --locality and --max-offset flags with sql tenant pods r=nvanbenschoten a=nvanbenschoten

This commit adds support for the `--locality` and `--max-offset` flags to the `cockroach mt start-sql` command.

The first of these is important because tenant SQL pods should know where they reside. This will be important in the future for multi-region serverless and also for projects like #72593.

The second of these is important because the SQL pod's max-offset setting needs to be the same as the host cluster's. If we want to be able to configure the host cluster's maximum clock offset to some non-default value, we'll need SQL pods to be configured identically.

Validation of plumbing:
```sh
./cockroach start-single-node --insecure --max-offset=250ms
./cockroach sql --insecure -e 'select crdb_internal.create_tenant(2)'

 # verify --max-offset

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0
 # CRDB crashes with error "locally configured maximum clock offset (250ms) does not match that of node [::]:62744 (500ms)"

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms
 # successful

 # verify --locality

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

ERROR: gateway_region(): no region set on the locality flag on this node

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms --locality=region=us-east1

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

  gateway_region
------------------
  us-east1
```

73946: ccl/sqlproxyccl: fix TestWatchPods under stressrace r=jaylim-crl a=jaylim-crl

Fixes #69220.
Regression from #67452.

In #67452, we omitted DRAINING pods from the tenant directory. Whenever a pod
goes into the DRAINING state, the pod watcher needs time to update the
directory. Not waiting for that while calling EnsureTenantAddr produces a
stale result. This commit updates TestWatchPods by polling on EnsureTenantAddr
until the pod watcher updates the directory.

Release note: None

73954: sqlsmith: don't compare voids for joins r=rafiss a=otan

No comparison expr is defined on voids, so don't generate comparisons
for them.

Resolves #73901
Resolves #73898
Resolves #73777

Release note: None

Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Contributor

@AlexTalks AlexTalks left a comment

Wanted to call out a couple of things:

  1. The KV-related work to store NodeID/SQLInstanceID in the TxnMeta, which is included in the ContentionEvent structure, is wrapped up by kv,storage: persist gateway node id in transaction intents #73500.
  2. As we discussed recently, there may be some considerations to make about the RPCs in a multi-tenant cluster, as the SQLInstanceIDs, while guaranteed to live as long as a transaction, may not necessarily be long-lived. The ID embedded by KV is the SQLInstanceID, which, in a single-tenant cluster, is equivalent to a NodeID, but in a multi-tenant cluster is a tenant-specific SQL pod identifier. Once the work described in rfc: mini RFC for refactoring node IDs #73309 wraps up, we may be able to transition to using GenericNodeID.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @Azhng, @knz, @nvanbenschoten, and @otan)

gustasva pushed a commit to gustasva/cockroach that referenced this pull request Jan 4, 2022
This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in cockroachdb#71965, utilizing option 2.

Release note: None
@Azhng
Contributor Author

Azhng commented May 6, 2022

TFTR!

bors r+

@craig
Contributor

craig bot commented May 6, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented May 6, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented May 7, 2022

Build succeeded:

@craig craig bot merged commit a86e6c9 into cockroachdb:master May 7, 2022
Development

Successfully merging this pull request may close these issues.

rfc: new contention event store RFC
8 participants