Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

Closed
awoods187 opened this issue Mar 1, 2019 · 3 comments
Assignees
Labels
C-investigation Further steps needed to qualify. C-label will change. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@awoods187
Copy link
Contributor

Describe the problem
Cluster degraded over multiple runs:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    26985.3  95.4%    695.9    637.5   1140.9   1342.2   2013.3  12884.9

run 2:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    24028.6  81.2%   3726.7   2818.6   6710.9  10737.4  22548.6  98784.2

run 3:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    21926.1  74.1%   5381.5   4295.0   9126.8  14495.5  26843.5 103079.2

These degradations come with a large number of transaction restarts and long durations:
image

Full debug zip here

To Reproduce

Normal three-node tpc-c setup on AWS

Environment:
v19.1.0-beta.20190225-213-g71681f6

@awoods187 awoods187 added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Mar 1, 2019
@awoods187 awoods187 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. and removed C-performance Perf of queries or internals. Solution not expected to change functional behavior. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Mar 1, 2019
@awoods187 awoods187 added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Mar 8, 2019
@awoods187
Copy link
Contributor Author

awoods187 commented Mar 15, 2019

@ajwerner and I hit this today on a cluster.
First we are doing well:

elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  10m13s        0          220.7          212.3    184.5    352.3    402.7    469.8 delivery
  10m13s        0         2017.5         2072.8    151.0    285.2    369.1    570.4 newOrder
  10m13s        0          213.7          212.5      7.6     52.4    100.7    209.7 orderStatus
  10m13s        0         2326.1         2120.5    125.8    268.4    335.5    536.9 payment
  10m13s        0          173.8          212.0     22.0     83.9    134.2    285.2 stockLevel
  10m14s        0          189.6          212.2    234.9    503.3    671.1    771.8 delivery
  10m14s        0         1934.7         2072.7    184.5    335.5    419.4    536.9 newOrder
  10m14s        0          207.4          212.5      7.9     54.5    125.8    167.8 orderStatus
  10m14s        0         1980.1         2120.3    159.4    302.0    385.9    704.6 payment
  10m14s        0          239.0          212.1     23.1    109.1    209.7    243.3 stockLevel
  10m15s        0          214.2          212.2    218.1    503.3    637.5    771.8 delivery
  10m15s        0         2117.4         2072.8    234.9    402.7    503.3    771.8 newOrder
  10m15s        0          219.1          212.5      9.4     96.5    167.8    268.4 orderStatus
  10m15s        0         2066.0         2120.3    176.2    369.1    486.5    805.3 payment
  10m15s        0          204.3          212.1     24.1    104.9    260.0    352.3 stockLevel
  10m16s        0          189.7          212.2    285.2    570.4    771.8    939.5 delivery
  10m16s        0         2142.9         2072.9    285.2    503.3    604.0   1073.7 newOrder
  10m16s        0          216.8          212.5      9.4     65.0    109.1    130.0 orderStatus
  10m16s        0         2238.2         2120.4    243.3    453.0    570.4    872.4 payment
  10m16s        0          192.7          212.1     24.1     88.1    176.2    302.0 stockLevel

And then the spike in restarts happens:
image

Which then leads to poor performance:

elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  10m57s        0           71.9          203.7  12884.9  16643.0  26843.5  33286.0 delivery
  10m57s        0         1624.8         2007.9  23622.3  33286.0  36507.2  38654.7 newOrder
  10m57s        0           92.6          205.8  12348.0  12884.9  12884.9  12884.9 orderStatus
  10m57s        0          983.7         2054.6  13421.8  16106.1  18253.6  23622.3 payment
  10m57s        0           81.7          205.5  12884.9  15569.3  19327.4  19327.4 stockLevel
  10m58s        0           27.9          203.5  13958.6  15569.3  15569.3  15569.3 delivery
  10m58s        0          952.3         2006.3  22548.6  32212.3  36507.2  38654.7 newOrder
  10m58s        0           49.8          205.6  12884.9  13421.8  13421.8  13421.8 orderStatus
  10m58s        0          600.6         2052.4  13958.6  16106.1  17179.9  22548.6 payment
  10m58s        0           55.8          205.3  12884.9  16106.1  18253.6  20401.1 stockLevel
  10m59s        0           24.4          203.2  13958.6  15032.4  15569.3  15569.3 delivery
  10m59s        0          577.5         2004.1  25769.8  34359.7  36507.2  38654.7 newOrder
  10m59s        0           44.7          205.3  13421.8  13421.8  13421.8  13421.8 orderStatus
  10m59s        0          371.5         2049.9  13958.6  16643.0  18253.6  23622.3 payment
  10m59s        0           24.4          205.0  13421.8  13421.8  13958.6  13958.6 stockLevel
   11m0s        0           32.6          202.9  14495.5  16106.1  17179.9  17179.9 delivery
   11m0s        0          720.1         2002.1  24696.1  36507.2  38654.7  38654.7 newOrder
   11m0s        0           35.6          205.1  13958.6  13958.6  14495.5  14495.5 orderStatus
   11m0s        0          386.2         2047.3  14495.5  18253.6  19327.4  26843.5 payment
   11m0s        0           40.5          204.8  13958.6  18253.6  21474.8  21474.8 stockLevel

Which is also confirmed in the final numbers:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s   109082.6  84.8%   4170.8    218.1  15569.3  30064.8  49392.1 103079.2

And we aren't underprovisioned either:
image

This means that we have unpredictible performance which can be a dealbreaker

@jordanlewis
Copy link
Member

Is keeping this open still useful?

@ajwerner
Copy link
Contributor

ajwerner commented May 9, 2019

I believe this was addressed by #36748 enough to close it.

@ajwerner ajwerner closed this as completed May 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-investigation Further steps needed to qualify. C-label will change. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Projects
None yet
Development

No branches or pull requests

3 participants