
[dst] Improve tail-latency for operations of transactions using wait queues #13580

Closed

robertsami opened this issue Aug 11, 2022 · 2 comments

Labels: area/docdb YugabyteDB core features · kind/enhancement This is an enhancement of an existing feature · priority/medium Medium priority issue

robertsami (Contributor) commented Aug 11, 2022

Jira Link: DB-3158

The biggest contributor to high tail latency is the following starvation case: when there is a high degree of contention, waiting transactions may be starved by incoming operations which contend for the same latch. We currently have no mechanism to prevent this, which can lead to high tail latency in some workloads.

Less critically, our process for determining which waiters can be resumed, and subsequently resuming them, could be improved in a couple of ways:

  1. We currently iterate over each of the blocker's waiters and separately acquire a write lock on a mutex to remove the waiter from waiter_status_ before resuming it. We need not re-acquire this write lock for every waiter and could simply acquire it once for the whole batch
  2. We currently resume waiters in the order they arrived, serially on a single thread. It might be better to determine which of the waiters conflict with each other, and then either:
    a. Resume the first-in waiter and all non-conflicting waiters in parallel
    b. Resume the largest set of non-conflicting waiters in parallel, then the second largest, and so on
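As an illustrative sketch of the batching idea above (not the actual wait_queue.cc logic; `Waiter`, `Conflicts`, and `BuildBatches` are hypothetical names, and conflicts are modeled as overlapping key sets), waiters could be greedily partitioned into batches of mutually non-conflicting requests, scanning in arrival order and repeating on the leftovers:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical sketch (not the actual wait_queue.cc code): model each
// waiter's lock footprint as a set of key ids; two waiters conflict if
// their key sets intersect.
struct Waiter {
  int64_t serial_no;       // arrival order
  std::set<int64_t> keys;  // keys this waiter intends to lock
};

bool Conflicts(const Waiter& a, const Waiter& b) {
  for (int64_t k : a.keys) {
    if (b.keys.count(k)) return true;
  }
  return false;
}

// Greedily partition waiters (assumed sorted by serial_no) into batches
// that can each be resumed in parallel: scan in arrival order, adding a
// waiter to the current batch unless it conflicts with a waiter already
// in it; repeat on the remainder to form subsequent batches.
std::vector<std::vector<Waiter>> BuildBatches(std::vector<Waiter> waiters) {
  std::vector<std::vector<Waiter>> batches;
  while (!waiters.empty()) {
    std::vector<Waiter> batch;
    std::vector<Waiter> remaining;
    for (auto& w : waiters) {
      bool conflicts = false;
      for (const auto& b : batch) {
        if (Conflicts(w, b)) { conflicts = true; break; }
      }
      (conflicts ? remaining : batch).push_back(std::move(w));
    }
    batches.push_back(std::move(batch));
    waiters = std::move(remaining);
  }
  return batches;
}
```

This is a greedy approximation: it resumes the first-in waiter together with everything that doesn't conflict with the batch so far, rather than computing a true maximum independent set (which would be expensive).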
@robertsami robertsami self-assigned this Aug 11, 2022
@robertsami robertsami added the area/docdb YugabyteDB core features label Aug 11, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 11, 2022
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Oct 19, 2022
@robertsami robertsami moved this from To do to In progress in Wait-Queue Based Locking Jan 10, 2023
@robertsami robertsami changed the title [dst] Improve the continuation of waiters in wait_queue.cc [dst] Improve tail-latency for operations of transactions using wait queues Jan 17, 2023
robertsami added a commit to robertsami/yugabyte-db that referenced this issue Feb 28, 2023
Summary: tbd

Test Plan: Jenkins

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D22968
robertsami added a commit that referenced this issue Mar 1, 2023
Summary:
The main contribution of this revision is to drastically improve p99 performance of workloads using wait-on-conflict concurrency control under high contention, without harming p50 or average performance under normal amounts of contention. We achieve this by making the following improvements:
1. Force incoming requests to check the wait queue once for active blockers, to ensure incoming requests cannot starve waiting transactions which are racing to exit the wait queue
2. Assign serial numbers to incoming requests, and whenever a batch of waiters can be resumed at the same time, ensure they are resumed roughly in the order in which they arrived at the tserver
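A minimal sketch of the fairness mechanism in (1) and (2), under loudly stated assumptions: `FairQueue`, `TryAcquire`, and `Release` are illustrative names, not the actual tserver API, and the real wait queue tracks far more state. The key idea is that even a request arriving when the lock looks free must first check for parked waiters, so it cannot starve them, and waiters are resumed in serial-number (arrival) order:

```cpp
#include <cstdint>
#include <deque>
#include <map>

// Hypothetical sketch of the fairness fix, not the real implementation.
struct FairQueue {
  std::map<int64_t, bool> held;                           // key -> currently held?
  std::map<int64_t, std::deque<int64_t>> waiters_by_key;  // key -> waiter serials
  int64_t next_serial = 0;

  // Returns true if the request may proceed immediately; false if it was
  // parked behind existing waiters for this key.
  bool TryAcquire(int64_t key) {
    auto& q = waiters_by_key[key];
    // Even if the lock looks free, an incoming request must consult the
    // wait queue first so it cannot race past (and starve) parked waiters.
    if (!held[key] && q.empty()) {
      held[key] = true;
      return true;
    }
    q.push_back(next_serial++);  // park with the next serial number
    return false;
  }

  // Called on release: hand the lock directly to the oldest waiter.
  // Returns the serial of the resumed waiter, or -1 if none waited.
  int64_t Release(int64_t key) {
    auto& q = waiters_by_key[key];
    if (q.empty()) {
      held[key] = false;
      return -1;
    }
    int64_t s = q.front();  // resume in arrival order
    q.pop_front();
    return s;               // lock passes directly; held stays true
  }
};
```

Handing the lock directly to the oldest waiter on release is what prevents a newly arrived request from winning the race against waiters that are in the middle of exiting the wait queue.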

Additional enhancements include:
1. Reduce copying by consolidating on using TransactionData everywhere, which is pulled into a conflict_data.h file with associated data structures
2. Populate granular intent information on a sub-transaction basis for use by the wait queue
3. Piggy-back off transaction status request in conflict resolution to obtain status tablet info

Test Plan:
Performance was tested on a 16-core, 32 GB RAM AlmaLinux 8 server with a full-LTO release build. In both tests we used the following setup:
```
create table test (k int primary key, v int);
insert into test select generate_series(0, 11), 0;
```
In both cases, we also ran ysql_bench as follows:
```
build/latest/postgres/bin/ysql_bench --transactions=2000 --jobs=16 --client=16 --file=workload.sql --progress=1 --debug=fails
```

= First test: Max contention =
`workload.sql`
```
begin;
select * from test where k=1 for update;
commit;
```

Baseline:
```
latency average = 19.779 ms
latency stddev = 26.684 ms
tps = 792.780284 (including connections establishing)
tps = 793.793930 (excluding connections establishing)
```

With revision:
```
latency average = 22.632 ms
latency stddev = 3.266 ms
tps = 705.108285 (including connections establishing)
tps = 705.914647 (excluding connections establishing)
```

= Second test: Normal contention =
`workload.sql`
```
begin;
with rand as (select floor(random() * 10 + 1)::int as k) select * from test join rand on rand.k=test.k for update;
commit;
```

Baseline:
```
latency average = 7.317 ms
latency stddev = 6.516 ms
tps = 2117.437801 (including connections establishing)
tps = 2126.594897 (excluding connections establishing)
```

With revision:
```
latency average = 7.055 ms
latency stddev = 5.124 ms
tps = 2236.062486 (including connections establishing)
tps = 2244.260708 (excluding connections establishing)
```

==Takeaways==
1. The latency stddev is substantially improved by this revision (26.7 ms → 3.3 ms), at the expense of an ~11% drop in throughput and an ~14% increase in average latency under max contention
2. Throughput and latency are not significantly changed (in fact slightly improved) in the normal-contention case
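As a cross-check, the percentage changes implied by the max-contention numbers above can be computed directly (a standalone helper for this writeup, not part of the revision):

```cpp
#include <cmath>

// Percent change from a baseline measurement to a new measurement.
double PctChange(double before, double after) {
  return (after - before) / before * 100.0;
}

// Applied to the max-contention run reported above:
//   tps:     PctChange(792.780284, 705.108285) -> about -11.1%
//   latency: PctChange(19.779, 22.632)         -> about +14.4%
//   stddev:  PctChange(26.684, 3.266)          -> about -87.8%
```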

Reviewers: pjain, bkolagani, sergei

Reviewed By: sergei

Subscribers: mbautin, rthallam, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D22968
rthallamko3 (Contributor) commented Mar 7, 2023

@robertsami, can we close this, or were you planning additional follow-ups?

robertsami (Contributor, Author) commented Mar 10, 2023

The main fix for fairness has landed in f69dc2a.

Regarding the remaining minor points in the description:

  1. is no longer relevant
  2. is captured, and not as critical, in [DocDB] Avoid re-running conflict resolution for waiters which are still blocked #16389

Wait-Queue Based Locking automation moved this from In progress to Done Mar 10, 2023