roachtest: schemachange/index/tpcc/w=1000 failed #40566

Closed
cockroach-teamcity opened this issue Sep 6, 2019 · 7 comments · Fixed by #40924

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/4784fe3c51545db5fb5d411937ec1db2ef2b9761

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1472753&tab=buildLog

The test failed on branch=provisional_201909060000_v19.2.0-beta.20190910, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190906-1472753/schemachange/index/tpcc/w=1000/run_1
	cluster.go:1735,tpcc.go:184,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567786968-71-n5cpu16:5 -- ./workload run tpcc --warehouses=1000 --histograms=perf/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		   30014            0.0           23.4      0.0      0.0      0.0      0.0 stockLevel
		 2990.0s    37448            1.0           23.4  18253.6  18253.6  18253.6  18253.6 delivery
		 2990.0s    37448            0.0          233.6      0.0      0.0      0.0      0.0 newOrder
		 2990.0s    37448            1.0           23.4   4295.0   4295.0   4295.0   4295.0 orderStatus
		 2990.0s    37448            1.0          232.4  24696.1  24696.1  24696.1  24696.1 payment
		 2990.0s    37448            0.0           23.4      0.0      0.0      0.0      0.0 stockLevel
		 2991.0s    37448            0.0           23.4      0.0      0.0      0.0      0.0 delivery
		 2991.0s    37448            2.0          233.5  28991.0  66572.0  66572.0  66572.0 newOrder
		 2991.0s    37448            0.0           23.4      0.0      0.0      0.0      0.0 orderStatus
		 2991.0s    37448            5.0          232.3  19327.4  90194.3  90194.3  90194.3 payment
		 2991.0s    37448            0.0           23.4      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Sep 6, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Sep 6, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/47bb2a58c87fc1259291ec9dde78de3e54bd8a3d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1475396&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190909-1475396/schemachange/index/tpcc/w=1000/run_1
	cluster.go:1735,tpcc.go:184,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1568010398-75-n5cpu16:5 -- ./workload run tpcc --warehouses=1000 --histograms=perf/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		d
		 1276.0s    12560           16.0           26.2   2684.4   5905.6  40802.2  40802.2 delivery
		 1276.0s    12560          163.7          262.2   6442.5  28991.0  47244.6  53687.1 newOrder
		 1276.0s    12560           14.0           26.5    285.2   8053.1  26843.5  26843.5 orderStatus
		 1276.0s    12560          223.7          259.4  13421.8 103079.2 103079.2 103079.2 payment
		 1276.0s    12560           30.0           26.4   5368.7  49392.1  73014.4  73014.4 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		 1277.0s    12560           28.1           26.2   2952.8  19327.4  34359.7  34359.7 delivery
		 1277.0s    12560          214.4          262.2   6710.9  27917.3  64424.5  68719.5 newOrder
		 1277.0s    12560           13.0           26.4    604.0    973.1  51539.6  51539.6 orderStatus
		 1277.0s    12560          205.4          259.4  11274.3  94489.3 103079.2 103079.2 payment
		 1277.0s    12560           19.0           26.4   1073.7   8053.1  33286.0  33286.0 stockLevel
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1488785&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190914-1488785/schemachange/index/tpcc/w=1000/run_1
	cluster.go:1735,tpcc.go:184,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1568493418-82-n5cpu16:5 -- ./workload run tpcc --warehouses=1000 --histograms=perf/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		 4368.0s     9384            0.0           27.1      0.0      0.0      0.0      0.0 delivery
		 4368.0s     9384            0.0          271.0      0.0      0.0      0.0      0.0 newOrder
		 4368.0s     9384            1.0           27.1     10.5     10.5     10.5     10.5 orderStatus
		 4368.0s     9384           17.0          270.2  11811.2 103079.2 103079.2 103079.2 payment
		 4368.0s     9384            0.0           27.1      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		 4369.0s     9384            0.0           27.1      0.0      0.0      0.0      0.0 delivery
		 4369.0s     9384            0.0          271.0      0.0      0.0      0.0      0.0 newOrder
		 4369.0s     9384            0.0           27.1      0.0      0.0      0.0      0.0 orderStatus
		 4369.0s     9384            0.0          270.2      0.0      0.0      0.0      0.0 payment
		 4369.0s     9384            0.0           27.1      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=1000 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190915-1489712/schemachange/index/tpcc/w=1000/run_1
	cluster.go:1735,tpcc.go:184,cluster.go:2091,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1568575032-75-n5cpu16:5 -- ./workload run tpcc --warehouses=1000 --histograms=perf/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		2079           14.0          268.8  66572.0  81604.4 103079.2 103079.2 newOrder
		 4119.0s    22079            5.0           26.9    838.9  60129.5  60129.5  60129.5 orderStatus
		 4119.0s    22079           87.1          267.9  73014.4 103079.2 103079.2 103079.2 payment
		 4119.0s    22079           25.0           26.9  57982.1  98784.2 103079.2 103079.2 stockLevel
		E190916 01:53:45.224926 1 workload/cli/run.go:447  error in delivery: ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = transport is closing [propagate]) (SQLSTATE 40003)
		 4120.0s    22103           43.9           26.9  60129.5  85899.3 103079.2 103079.2 delivery
		 4120.0s    22103           64.9          268.7  64424.5  81604.4  90194.3 103079.2 newOrder
		 4120.0s    22103            8.0           26.9     46.1  60129.5  60129.5  60129.5 orderStatus
		 4120.0s    22103          142.7          267.9  68719.5 103079.2 103079.2 103079.2 payment
		 4120.0s    22103            8.0           26.9    385.9  73014.4  73014.4  73014.4 stockLevel
		: signal: killed

@nvanbenschoten
Member

Cockroach is being killed by the Linux OOM killer. In each case, the majority of the memory in the heap profiles is in optTableStat.init:
[Image: heap profile screenshot, "Screen Shot 2019-09-18 at 4 34 16 PM"]

@RaduBerinde we discussed this about a month ago on Slack: https://cockroachlabs.slack.com/archives/C8HD41C82/p1566678865056900. Do you mind taking a look at this?

@RaduBerinde
Member

We are running with --wait=false, so we have 10,000 workers and connections. For each connection we keep a bunch of tables, including a few different versions of some of them (because of the schema changes). I can work on improving that part (cleaning up older versions), but the root problem here is that we have so many connections. The memory in the picture comes out to 150 KB per connection (which is not great, but not horrifying either).
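
As a rough back-of-the-envelope check of the numbers above, the sketch below multiplies the per-connection figure by the connection count. It assumes the figure means roughly 150 decimal kilobytes and that memory scales linearly with connections; the helper name `catalog_memory_bytes` is made up for illustration and is not part of CockroachDB.

```python
# Back-of-the-envelope estimate. The constants come from this thread; the
# helper name and the linear-scaling assumption are illustrative only.
KB = 1000  # assumption: decimal kilobytes (the thread does not specify)


def catalog_memory_bytes(connections: int, per_connection_bytes: int = 150 * KB) -> int:
    """Total memory if every connection holds its own ~150 KB of table objects."""
    return connections * per_connection_bytes


warehouses = 1000
connections = 10 * warehouses  # 10,000 workers/connections with --wait=false
total = catalog_memory_bytes(connections)
print(f"{connections} connections -> ~{total / 1e9:.1f} GB")
```

Under these assumptions the per-connection table objects alone account for about 1.5 GB across the cluster's connections, which is why the connection count dominates.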

@RaduBerinde
Member

@nvanbenschoten what are your thoughts on passing --workers 1000 for this test, at least until we have a way of sharing these structures between connections?

@nvanbenschoten
Member

That sounds good to me. We'll also want to do the same thing with schemachange/mixed/tpcc.

craig bot pushed a commit that referenced this issue Sep 19, 2019
40924: roachtest: limit number of workers for schemachange tests r=RaduBerinde a=RaduBerinde

With `--wait=false`, tpcc defaults to `10*W` workers and connections.
The opt catalog objects are per-connection, so this ends up using a
lot of memory. The schema change part makes it worse because we keep
multiple versions of the changed tables.

Reduce the number of connections in the tests, at least until we
implement sharing of opt catalog objects between connections.

Fixes #40566.

Release justification: non-production code change.

Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
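
The behavior described in the commit message can be modeled with a small sketch. This is not the workload tool's actual code: `effective_workers` and `tpcc_args` are hypothetical helpers, and the value 1000 is only the number suggested earlier in this thread, not something the commit message confirms.

```python
from typing import List, Optional


def effective_workers(warehouses: int, workers_flag: Optional[int] = None) -> int:
    """Hypothetical model: an explicit --workers value wins; otherwise
    fall back to the 10*W default described for --wait=false."""
    if workers_flag is not None:
        return workers_flag
    return 10 * warehouses


def tpcc_args(warehouses: int, workers_flag: Optional[int] = None) -> List[str]:
    """Build an illustrative argument list for the tpcc workload invocation."""
    args = ["./workload", "run", "tpcc", f"--warehouses={warehouses}", "--wait=false"]
    if workers_flag is not None:
        # Capping workers also caps connections, per the commit message.
        args.append(f"--workers={workers_flag}")
    return args


print(effective_workers(1000))        # default: 10 workers per warehouse
print(effective_workers(1000, 1000))  # capped by an explicit flag
print(" ".join(tpcc_args(1000, 1000)))
```

With the cap in place the number of connections drops by a factor of ten for `w=1000`, and the per-connection opt catalog memory should shrink accordingly.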
@craig craig bot closed this as completed in 1de3562 Sep 19, 2019