Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/concurrency: misleading duration of contention events with txn pushes #98104

Closed
yuzefovich opened this issue Mar 6, 2023 · 2 comments · Fixed by #99166
Closed

kv/concurrency: misleading duration of contention events with txn pushes #98104

yuzefovich opened this issue Mar 6, 2023 · 2 comments · Fixed by #99166
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-kv KV Team

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Mar 6, 2023

Here (internal slack) is a stmt bundle in which we have a case when the total execution time of subquery 1 was 1.9s yet we reported 28s of contention. This is due to the read in that subquery being repeatedly pushed, so we produced multiple contention events with constantly increasing duration, here is a subset (edited to remove possible PII):

"structuredRecords": [
    {
        "time": "2023-03-05T08:02:20.763129115Z",
        "payload": {
            "@type": "type.googleapis.com/cockroach.roachpb.ContentionEvent",
            "duration": "0.056493704s"
        }
    },
    {
        "time": "2023-03-05T08:02:20.823929395Z",
        "payload": {
            "@type": "type.googleapis.com/cockroach.roachpb.ContentionEvent",
            "duration": "0.117294565s"
        }
    },
    {
        "time": "2023-03-05T08:02:20.921669801Z",
        "payload": {
            "@type": "type.googleapis.com/cockroach.roachpb.ContentionEvent",
            "duration": "0.215034456s"
        }
    },
...

Notably, all these events occurred when performing the same read, but the duration keeps on increasing (due to the read being pushed many times AFAIU), until the last event records the actual contention of 1.88s.

SQL treats all of these events as separate, so it aggregates them here

func GetCumulativeContentionTime(

I believe it's a bug in the KV layer where we should be producing the contention events such that their duration doesn't include the duration of contention of previously reported events.

Jira issue: CRDB-25078

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 6, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 6, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 20, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 20, 2023
@nvanbenschoten
Copy link
Member

I think this was broken by 70ef641.

Before that change, we would reset the clock on each contention event 70ef641#diff-b2445fe50a89171ab341cf76c645ef668d8dcccd14143b7c96d07d8b52047941L902. These events were emitted on every new (key, txn) pair.

Now, we reset it whenever we wait on a new key:

g.mu.curLockWaitStart = clock.PhysicalTime()

This means that we may emit multiple contention events on the same key but for different lock holder txns without resetting the timer.

@nvanbenschoten nvanbenschoten self-assigned this Mar 21, 2023
craig bot pushed a commit that referenced this issue Mar 29, 2023
99166: kv/concurrency: compute contention event duration from (key,txn) wait start time r=andreimatei a=nvanbenschoten

Fixes #98104.

This commit resolves the bug identified in #98104 where multiple contention events could be emitted with overlapping durations, such that when these durations were added together (e.g. by `GetCumulativeContentionTime`), their sum was larger than the runtime of the entire query. This was possible because as of 70ef641, we were failing to reset the waiting start time on each new lock holder transaction for the same key.

This commit fixes this by computing the contention event duration using `contentionTag.waitStart` instead of `waitingState.lockWaitStart`. It also cleans up some of this code and makes it harder to make such a mistake in the future.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 1201c01 Mar 29, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 29, 2023
… start time

Fixes #98104.

This commit resolves the bug identified in #98104 where multiple contention
events could be emitted with overlapping durations, such that when these
durations were added together (e.g. by `GetCumulativeContentionTime`), their sum
was larger than the runtime of the entire query. This was possible because as of
70ef641, we were failing to reset the waiting start time on each new lock
holder transaction for the same key.

This commit fixes this by computing the contention event duration using
`contentionTag.waitStart` instead of `waitingState.lockWaitStart`. It also
cleans up some of this code and makes it harder to make such a mistake in the
future.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 29, 2023
… start time

Fixes cockroachdb#98104.

This commit resolves the bug identified in cockroachdb#98104 where multiple contention
events could be emitted with overlapping durations, such that when these
durations were added together (e.g. by `GetCumulativeContentionTime`), their sum
was larger than the runtime of the entire query. This was possible because as of
70ef641, we were failing to reset the waiting start time on each new lock
holder transaction for the same key.

This commit fixes this by computing the contention event duration using
`contentionTag.waitStart` instead of `waitingState.lockWaitStart`. It also
cleans up some of this code and makes it harder to make such a mistake in the
future.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 7, 2023
… start time

Fixes cockroachdb#98104.

This commit resolves the bug identified in cockroachdb#98104 where multiple contention
events could be emitted with overlapping durations, such that when these
durations were added together (e.g. by `GetCumulativeContentionTime`), their sum
was larger than the runtime of the entire query. This was possible because as of
70ef641, we were failing to reset the waiting start time on each new lock
holder transaction for the same key.

This commit fixes this by computing the contention event duration using
`contentionTag.waitStart` instead of `waitingState.lockWaitStart`. It also
cleans up some of this code and makes it harder to make such a mistake in the
future.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-kv KV Team
Projects
None yet
2 participants