kv/concurrency: misleading duration of contention events with txn pushes #98104
Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
I think this was broken by 70ef641. Before that change, we would reset the clock on each contention event (70ef641#diff-b2445fe50a89171ab341cf76c645ef668d8dcccd14143b7c96d07d8b52047941L902). These events were emitted on every new (key, txn) pair. Now, we reset it whenever we wait on a new key:
This means that we may emit multiple contention events on the same key but for different lock holder txns without resetting the timer.
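To make the failure mode concrete, here is a minimal sketch of the broken bookkeeping. The `contentionTimer` type and its method names are hypothetical stand-ins for the waiter's actual state, not the real `lockTableWaiter` fields:

```go
package main

import (
	"fmt"
	"time"
)

// contentionTimer is a hypothetical stand-in for the waiter's
// contention bookkeeping; the type and field names are illustrative.
type contentionTimer struct {
	curKey    string
	curTxnID  string
	waitStart time.Time
}

// onNewLockHolder is called when the waiter starts waiting on a
// (key, txnID) pair. Buggy behavior post-70ef641: waitStart is reset
// only when the key changes, so a new lock holder txn on the same
// key inherits the previous txn's start time.
func (t *contentionTimer) onNewLockHolder(key, txnID string, now time.Time) {
	if key != t.curKey {
		t.curKey = key
		t.waitStart = now // reset on new key only -- the bug
	}
	t.curTxnID = txnID
}

// eventDuration is the duration recorded in the contention event
// emitted when the waiter is done waiting on the current lock holder.
func (t *contentionTimer) eventDuration(now time.Time) time.Duration {
	return now.Sub(t.waitStart)
}

func main() {
	var t contentionTimer
	now := time.Now()
	// txn1 holds the lock on key k for 1s, then txn2 for another 1s.
	t.onNewLockHolder("k", "txn1", now)
	d1 := t.eventDuration(now.Add(1 * time.Second)) // 1s
	t.onNewLockHolder("k", "txn2", now.Add(1*time.Second))
	d2 := t.eventDuration(now.Add(2 * time.Second)) // 2s, overlapping d1
	fmt.Println(d1, d2, d1+d2) // 1s + 2s = 3s for only 2s of real waiting
}
```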
99166: kv/concurrency: compute contention event duration from (key,txn) wait start time r=andreimatei a=nvanbenschoten

Fixes #98104.

This commit resolves the bug identified in #98104 where multiple contention events could be emitted with overlapping durations, such that when these durations were added together (e.g. by `GetCumulativeContentionTime`), their sum was larger than the runtime of the entire query. This was possible because as of 70ef641, we were failing to reset the waiting start time on each new lock holder transaction for the same key.

This commit fixes this by computing the contention event duration using `contentionTag.waitStart` instead of `waitingState.lockWaitStart`. It also cleans up some of this code and makes it harder to make such a mistake in the future.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
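Sketched against the hypothetical `contentionTimer` from the snippet above (the actual change computes the duration from `contentionTag.waitStart`), the fix amounts to resetting the start time on every new (key, txn) pair:

```go
// onNewLockHolder, fixed: waitStart is reset whenever either the key
// or the lock holder txn changes, so consecutive events for different
// lock holders on the same key report disjoint durations. In the
// example above, d2 becomes 1s and d1+d2 matches the 2s of waiting.
func (t *contentionTimer) onNewLockHolder(key, txnID string, now time.Time) {
	if key != t.curKey || txnID != t.curTxnID {
		t.curKey = key
		t.curTxnID = txnID
		t.waitStart = now // reset on new (key, txn) -- the fix
	}
}
```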
Here (internal Slack) is a stmt bundle in which the total execution time of subquery 1 was 1.9s, yet we reported 28s of contention. This is because the read in that subquery was repeatedly pushed, so we produced multiple contention events with constantly increasing durations. Here is a subset (edited to remove possible PII):

Notably, all of these events occurred while performing the same read, but the duration keeps increasing (because the read was pushed many times, AFAIU), until the last event records the actual contention of 1.88s.
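To see how the overcount compounds, take some hypothetical round numbers (not the actual values from the bundle): if the read is pushed every 200ms and each push emits an event reporting the cumulative wait so far, then ten pushes produce events of 0.2s, 0.4s, …, 2.0s, which sum to 11s of reported contention for only 2s of actual waiting.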
SQL treats all of these events as separate, so it aggregates them here: cockroach/pkg/sql/execstats/stats.go, line 38 (at 2853470).
I believe this is a bug in the KV layer: we should produce contention events such that their durations don't include the contention already covered by previously reported events.
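For reference, the SQL-side aggregation is a plain sum over per-event durations, so any overlap in the durations emitted by KV is double-counted. A simplified sketch of that aggregation (the real `GetCumulativeContentionTime` walks the structured events in a trace recording; the event type here is a stand-in):

```go
package main

import (
	"fmt"
	"time"
)

// contentionEvent stands in for the structured contention event
// recorded in the trace.
type contentionEvent struct {
	Duration time.Duration
}

// cumulativeContentionTime mirrors the SQL-side aggregation: a plain
// sum of per-event durations. If the KV layer emits events with
// overlapping durations, this sum exceeds the wall-clock time that
// was actually spent waiting (28s reported for a 1.9s subquery).
func cumulativeContentionTime(events []contentionEvent) time.Duration {
	var total time.Duration
	for _, ev := range events {
		total += ev.Duration
	}
	return total
}

func main() {
	// Overlapping durations, as emitted by the buggy code path.
	events := []contentionEvent{
		{Duration: 200 * time.Millisecond},
		{Duration: 400 * time.Millisecond},
		{Duration: 600 * time.Millisecond},
	}
	fmt.Println(cumulativeContentionTime(events)) // 1.2s reported for 600ms of waiting
}
```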
Jira issue: CRDB-25078