
high latencies when running tpc-c 2100 on the same cluster as previous schema change testing #36856

Closed
awoods187 opened this issue Apr 15, 2019 · 6 comments


awoods187 commented Apr 15, 2019

I ran a series of schema change tests on this cluster including stopping and starting workload. As I went to conduct another schema change test, I noticed that latencies were climbing (before I made a schema change).
[screenshot: latency graph climbing before any schema change was made]

I wonder whether this was caused by the cluster entering a bad state after that work, or by something else.

At the time of the test I had run `SET CLUSTER SETTING kv.bulk_io_write.addsstable_max_rate = .1;` and was running `roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=2500 --active-warehouses=2100 --duration=2h --split --scatter {pgurl:1-3}"`

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    23420.4  86.7%   2852.3   2013.3   5637.1   8589.9  19327.4 103079.2

Cockroach Version
v19.1.0-rc.2-40-g0c83360

awoods187 added the C-investigation (Further steps needed to qualify. C-label will change.) label on Apr 15, 2019
@awoods187

Here is an extract that narrows this debug.zip down to just the window of the latency spike:
extract.txt


tbg commented Apr 16, 2019

There are a bunch of warnings from this assertion:

} else if v == nil {
// This should never happen as ok is true above.
log.Warningf(ctx, "unable to find value for %s @ %s (%+v vs %+v)",
intent.Key, intent.Txn.Timestamp, meta, intent.Txn)

I feel like we've seen this before. The comment states that it should never misfire. Do you remember what our thinking was there, @nvanbenschoten? I'm hoping that this assertion misfires if the committed value is a deletion.

ip-172-31-38-247> W190415 21:58:39.978179 145640881 storage/engine/mvcc.go:2249  [n2,s2,r7295/2:/Table/55/1/20{50/0-60/0}] unable to find value for /Table/55/1/2051/3/0 @ 1555360481.920732762,301 (timestamp:<wall_time:1555365499582658610 logical:0 > deleted:false key_bytes:12 val_bytes:106  vs {ID:8159bc9f-79fa-4cb0-a70a-9a25919a9686 Key:[196 137 247 8 3 136] Epoch:0 Timestamp:1555360481.920732762,301 Priority:55715 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-38-247> W190415 21:58:39.978341 145640881 storage/engine/mvcc.go:2249  [n2,s2,r7295/2:/Table/55/1/20{50/0-60/0}] unable to find value for /Table/55/1/2052/6/0 @ 1555360482.733941957,0 (timestamp:<wall_time:1555365491878235204 logical:0 > deleted:false key_bytes:12 val_bytes:97  vs {ID:9463d525-bcc8-418e-8815-96316b58602c Key:[196 137 247 8 4 136] Epoch:0 Timestamp:1555360482.733941957,0 Priority:1311503 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-38-247> W190415 21:58:39.978441 145640881 storage/engine/mvcc.go:2249  [n2,s2,r7295/2:/Table/55/1/20{50/0-60/0}] unable to find value for /Table/55/1/2054/3/0 @ 1555360484.869336202,8 (timestamp:<wall_time:1555365456596059562 logical:0 > deleted:false key_bytes:12 val_bytes:90  vs {ID:cd8cf356-6d33-492c-a9df-fa5b5e644e0b Key:[196 137 247 8 6 136] Epoch:0 Timestamp:1555360484.869336202,8 Priority:349399 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-38-165> W190415 21:58:44.362695 133998530 storage/engine/mvcc.go:2217  [n3,s3,r9886/3:/Table/{59/5/2019…-60}] unable to find value for /Table/59/5/2019-04-15T20:37:07Z/15/5/NULL/3160/0 ({ID:6bb14e37-52bd-4384-b25f-f4b097dba66b Key:[191 137 151 141 136] Epoch:0 Timestamp:1555360627.985619772,0 Priority:943357 Sequence:31 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-46-232> I190415 21:58:45.552033 28 server/status/runtime.go:500  [n1] runtime stats: 14 GiB RSS, 7501 goroutines, 2.1 GiB/255 MiB/3.0 GiB GO alloc/idle/total, 8.3 GiB/10 GiB CGO alloc/total, 237864.8 CGO/sec, 1246.4/104.9 %(u/s)time, 0.4 %gc (5x), 139 MiB/147 MiB (r/w)net
ip-172-31-38-165> I190415 21:58:46.467241 391 server/status/runtime.go:500  [n3] runtime stats: 13 GiB RSS, 5910 goroutines, 2.0 GiB/293 MiB/2.8 GiB GO alloc/idle/total, 8.3 GiB/10 GiB CGO alloc/total, 225542.5 CGO/sec, 1104.1/111.9 %(u/s)time, 0.1 %gc (5x), 144 MiB/139 MiB (r/w)net
ip-172-31-46-232> I190415 21:58:47.959057 161073369 storage/replica_consistency.go:195  [n1,consistencyChecker,s1,r5248/1:/Table/57/1/55{0/4/-…-2/2/-…}] triggering stats recomputation to resolve delta of {ContainsEstimates:true LastUpdateNanos:1555365524016863456 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SysBytes:0 SysCount:0 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
ip-172-31-46-232> W190415 21:58:48.070753 161139371 storage/engine/mvcc.go:2249  [n1,s1,r7320/1:/Table/55/1/6{10/0-20/0}] unable to find value for /Table/55/1/619/8/0 @ 1555360105.096335711,15 (timestamp:<wall_time:1555365493298003759 logical:0 > deleted:false key_bytes:12 val_bytes:96  vs {ID:93b62fec-13f2-417e-a84d-1225ec0106eb Key:[196 137 247 2 107 136] Epoch:0 Timestamp:1555360105.096335711,15 Priority:409409 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-46-232> W190415 21:58:48.070931 161139371 storage/engine/mvcc.go:2249  [n1,s1,r7320/1:/Table/55/1/6{10/0-20/0}] unable to find value for /Table/55/1/611/1/0 @ 1555358698.301905746,1 (timestamp:<wall_time:1555365458479325687 logical:0 > deleted:false key_bytes:12 val_bytes:99  vs {ID:7e0885fe-ce32-4818-9217-92d03a3d3dbd Key:[196 137 247 2 99 136] Epoch:1 Timestamp:1555358698.301905746,1 Priority:234933 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
ip-172-31-46-232> W190415 21:58:48.071115 161139371 storage/engine/mvcc.go:2249  [n1,s1,r7320/1:/Table/55/1/6{10/0-20/0}] unable to find value for /Table/55/1/616/10/0 @ 1555357966.540320719,580 (timestamp:<wall_time:1555365525791660656 logical:0 > deleted:false key_bytes:12 val_bytes:103  vs {ID:51ee6b54-42b5-4373-b45f-af9464df924f Key:[196 137 247 2 104 136] Epoch:0 Timestamp:1555357966.540320719,580 Priority:488197 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
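To make that suspicion concrete, here is a toy model in Go (not the actual mvcc.go logic, just an illustration) of how a committed deletion could surface as ok == true with a nil value and land in the warning path quoted above:

```go
package main

import "fmt"

// versions is a toy MVCC history for a single key: timestamp -> value, where
// a nil value stands in for a deletion tombstone. This only illustrates the
// suspicion above -- that the warning fires when the committed value at the
// intent's timestamp is a deletion -- and is not CockroachDB's storage code.
type versions map[int64][]byte

func resolveIntent(key string, hist versions, intentTS int64) {
	v, ok := hist[intentTS]
	if !ok {
		fmt.Printf("no version at all for %s @ %d\n", key, intentTS)
		return
	}
	if v == nil {
		// ok is true yet the value is nil: the same shape as the
		// "unable to find value" warning quoted above, if tombstones
		// surface as nil values.
		fmt.Printf("unable to find value for %s @ %d\n", key, intentTS)
		return
	}
	fmt.Printf("resolved %s @ %d -> %q\n", key, intentTS, v)
}

func main() {
	hist := versions{
		10: []byte("row payload"),
		20: nil, // a deletion committed at ts=20
	}
	resolveIntent("/Table/55/1/2051/3/0", hist, 10)
	resolveIntent("/Table/55/1/2051/3/0", hist, 20) // hits the warning path
}
```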

Then there's this:

ip-172-31-46-232> E190415 22:20:29.128484 171451729 sql/sqltelemetry/report.go:39  [intExec=create-stats] encountered internal error:
sql/distsqlpb/data.go:151 in NewError(): (XX000) internal error: uncaught error: batch timestamp 1555366200.694454690,0 must be after replica GC threshold 1555366210.980181922,0
-- detail --
data.go:151: uncaught error: %+v | *roachpb.BatchTimestampBeforeGCError
github.com/cockroachdb/cockroach/pkg/sql/distsqlpb/data.go:151: in NewError()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/stream_encoder.go:109: in AddMetadata()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:125: in addRow()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:293: in mainLoop()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:413: in run()

The rest of the log looks remarkably quiet. You can tell the system is working (lots of goroutines, etc.), but it doesn't seem like the core is creaking.
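For context on that create-stats error: it is the guard against reading below the replica's GC threshold, i.e. the job's read timestamp is older than the point up to which old versions may already have been garbage collected. A minimal sketch of the check, with simplified timestamp types (not the actual replica code):

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before u.
func (t Timestamp) Less(u Timestamp) bool {
	return t.WallTime < u.WallTime ||
		(t.WallTime == u.WallTime && t.Logical < u.Logical)
}

// checkBatchTimestamp sketches the guard implied by the error above: a read
// at or below the replica's GC threshold must be rejected, since the versions
// it would need may already have been garbage collected.
func checkBatchTimestamp(batchTS, gcThreshold Timestamp) error {
	if !gcThreshold.Less(batchTS) {
		return fmt.Errorf(
			"batch timestamp %d,%d must be after replica GC threshold %d,%d",
			batchTS.WallTime, batchTS.Logical,
			gcThreshold.WallTime, gcThreshold.Logical)
	}
	return nil
}

func main() {
	// Numbers taken from the error in the log: the create-stats read sits
	// roughly 10s below the GC threshold.
	batch := Timestamp{WallTime: 1555366200694454690}
	threshold := Timestamp{WallTime: 1555366210980181922}
	fmt.Println(checkBatchTimestamp(batch, threshold))
}
```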

@vivekmenezes

I was able to reproduce this problem after creating and dropping two indexes and then quickly reducing the GC TTL with `ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;`.

I grabbed @ajwerner and we took a look at the cluster. CPU was pegged above 80% on all nodes, and the cluster seemed to be busy doing RocksDB compactions. I reduced the load to 1000 active warehouses and the problem immediately went away.

See the dip at the end of these graphs:

[screenshot: cluster graphs showing the dip after the load was reduced]

with the latency going down dramatically:

[screenshot: latency graph dropping sharply after the load was reduced]
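For anyone who wants to script the reproduction described above, here is a rough sketch using Go's database/sql; the connection string, table, and index names are placeholders, not the exact statements that were run:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string for an insecure local cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Create and drop a couple of indexes to generate schema-change
		// work; the indexed columns here are illustrative TPC-C columns.
		`CREATE INDEX tmp_idx1 ON order_line (ol_i_id)`,
		`DROP INDEX order_line@tmp_idx1`,
		`CREATE INDEX tmp_idx2 ON stock (s_quantity)`,
		`DROP INDEX stock@tmp_idx2`,
		// Then aggressively lower the GC TTL, as in the comment above.
		`ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatalf("%s: %v", s, err)
		}
	}
}
```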

@ajwerner

@nvanbenschoten to what extent do you think #36748 informs this issue?

@ajwerner

I'm closing this as stale.
