Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

awoods187 · 2019-03-01T01:23:17Z

Describe the problem
Cluster degraded over multiple runs:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    26985.3  95.4%    695.9    637.5   1140.9   1342.2   2013.3  12884.9

run 2:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    24028.6  81.2%   3726.7   2818.6   6710.9  10737.4  22548.6  98784.2

run 3:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    21926.1  74.1%   5381.5   4295.0   9126.8  14495.5  26843.5 103079.2

These degradations come with a large number of transaction restarts and long durations:

Full debug zip here

To Reproduce

Normal three-node tpc-c setup on AWS

Environment:
v19.1.0-beta.20190225-213-g71681f6

The text was updated successfully, but these errors were encountered:

awoods187 · 2019-03-15T19:07:56Z

@ajwerner and I hit this today on a cluster.
First we are doing well:

elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  10m13s        0          220.7          212.3    184.5    352.3    402.7    469.8 delivery
  10m13s        0         2017.5         2072.8    151.0    285.2    369.1    570.4 newOrder
  10m13s        0          213.7          212.5      7.6     52.4    100.7    209.7 orderStatus
  10m13s        0         2326.1         2120.5    125.8    268.4    335.5    536.9 payment
  10m13s        0          173.8          212.0     22.0     83.9    134.2    285.2 stockLevel
  10m14s        0          189.6          212.2    234.9    503.3    671.1    771.8 delivery
  10m14s        0         1934.7         2072.7    184.5    335.5    419.4    536.9 newOrder
  10m14s        0          207.4          212.5      7.9     54.5    125.8    167.8 orderStatus
  10m14s        0         1980.1         2120.3    159.4    302.0    385.9    704.6 payment
  10m14s        0          239.0          212.1     23.1    109.1    209.7    243.3 stockLevel
  10m15s        0          214.2          212.2    218.1    503.3    637.5    771.8 delivery
  10m15s        0         2117.4         2072.8    234.9    402.7    503.3    771.8 newOrder
  10m15s        0          219.1          212.5      9.4     96.5    167.8    268.4 orderStatus
  10m15s        0         2066.0         2120.3    176.2    369.1    486.5    805.3 payment
  10m15s        0          204.3          212.1     24.1    104.9    260.0    352.3 stockLevel
  10m16s        0          189.7          212.2    285.2    570.4    771.8    939.5 delivery
  10m16s        0         2142.9         2072.9    285.2    503.3    604.0   1073.7 newOrder
  10m16s        0          216.8          212.5      9.4     65.0    109.1    130.0 orderStatus
  10m16s        0         2238.2         2120.4    243.3    453.0    570.4    872.4 payment
  10m16s        0          192.7          212.1     24.1     88.1    176.2    302.0 stockLevel

And then the spike in restarts happens:

Which then leads to poor performance:

elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  10m57s        0           71.9          203.7  12884.9  16643.0  26843.5  33286.0 delivery
  10m57s        0         1624.8         2007.9  23622.3  33286.0  36507.2  38654.7 newOrder
  10m57s        0           92.6          205.8  12348.0  12884.9  12884.9  12884.9 orderStatus
  10m57s        0          983.7         2054.6  13421.8  16106.1  18253.6  23622.3 payment
  10m57s        0           81.7          205.5  12884.9  15569.3  19327.4  19327.4 stockLevel
  10m58s        0           27.9          203.5  13958.6  15569.3  15569.3  15569.3 delivery
  10m58s        0          952.3         2006.3  22548.6  32212.3  36507.2  38654.7 newOrder
  10m58s        0           49.8          205.6  12884.9  13421.8  13421.8  13421.8 orderStatus
  10m58s        0          600.6         2052.4  13958.6  16106.1  17179.9  22548.6 payment
  10m58s        0           55.8          205.3  12884.9  16106.1  18253.6  20401.1 stockLevel
  10m59s        0           24.4          203.2  13958.6  15032.4  15569.3  15569.3 delivery
  10m59s        0          577.5         2004.1  25769.8  34359.7  36507.2  38654.7 newOrder
  10m59s        0           44.7          205.3  13421.8  13421.8  13421.8  13421.8 orderStatus
  10m59s        0          371.5         2049.9  13958.6  16643.0  18253.6  23622.3 payment
  10m59s        0           24.4          205.0  13421.8  13421.8  13958.6  13958.6 stockLevel
   11m0s        0           32.6          202.9  14495.5  16106.1  17179.9  17179.9 delivery
   11m0s        0          720.1         2002.1  24696.1  36507.2  38654.7  38654.7 newOrder
   11m0s        0           35.6          205.1  13958.6  13958.6  14495.5  14495.5 orderStatus
   11m0s        0          386.2         2047.3  14495.5  18253.6  19327.4  26843.5 payment
   11m0s        0           40.5          204.8  13958.6  18253.6  21474.8  21474.8 stockLevel

Which is also confirmed in the final numbers:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s   109082.6  84.8%   4170.8    218.1  15569.3  30064.8  49392.1 103079.2

And we aren't underprovisioned either:

This means that we have unpredictible performance which can be a dealbreaker

jordanlewis · 2019-05-09T03:59:35Z

Is keeping this open still useful?

ajwerner · 2019-05-09T04:10:49Z

I believe this was addressed by #36748 enough to close it.

awoods187 assigned ajwerner Mar 1, 2019

awoods187 added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Mar 1, 2019

awoods187 added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Mar 8, 2019

ajwerner mentioned this issue Mar 18, 2019

release: v19.1.0-beta.20190318 #35660

Closed

17 tasks

ajwerner mentioned this issue Mar 28, 2019

stats: excessive SELECTs against the system.table_statistics #36284

Closed

ajwerner closed this as completed May 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

awoods187 commented Mar 1, 2019

awoods187 commented Mar 15, 2019 •

edited

Loading

jordanlewis commented May 9, 2019

ajwerner commented May 9, 2019

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

Comments

awoods187 commented Mar 1, 2019

awoods187 commented Mar 15, 2019 • edited Loading

jordanlewis commented May 9, 2019

ajwerner commented May 9, 2019

awoods187 commented Mar 15, 2019 •

edited

Loading