
Receive: gets stuck at a certain point right after CompactHead and shipping. #3197

Closed
BhartiAgrawalNvidia opened this issue Sep 20, 2020 · 8 comments

Comments

@BhartiAgrawalNvidia commented Sep 20, 2020

Thanos, Prometheus and Golang version used:

thanos, version 0.14.0 (branch: HEAD, revision: 19165d7)
build user: circleci@3f2974e89cb7
build date: 20200710-12:59:18
go version: go1.14.2

Object Storage Provider: SwiftStack

What happened: In our production environment, after working fine for 6–20 hours, head truncation stops happening at a random point in time. The head gets compacted to disk and the block is shipped, but after that the process no longer compacts the head to disk. The WAL files keep growing and memory increases from ~7 GB to about 35 GB. CPU does not go any higher: usage sits around 2 cores, with peaks up to 17 cores. We receive about 250K samples per second from Prometheus across 8 receive replicas.

To get it unstuck, we stop the process on the node (so that it does not change nodes), which lets it replay the WAL files and continue where it left off. Sometimes we have to restart a couple of times if not all of the steps happen. We verify this from the logs, checking that the process performs all the expected steps: writing the block, uploading it, running head GC, and checkpointing the WAL.

What you expected to happen: Head truncation keeps happening and does not block compaction and shipping of blocks to object storage.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Logs

level=info ts=2020-09-15T19:56:36.874362831Z caller=main.go:151 msg="Tracing will be disabled" level=warn ts=2020-09-15T19:56:36.874546223Z caller=receive.go:209 component=receive msg="setting up receive" level=info ts=2020-09-15T19:56:36.87456456Z caller=options.go:23 component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable" level=info ts=2020-09-15T19:56:36.874633726Z caller=factory.go:46 component=receive msg="loading bucket configuration" level=info ts=2020-09-15T19:56:36.875220062Z caller=receive.go:559 component=receive msg="default tenant data dir already present, not attempting to migrate storage" level=info ts=2020-09-15T19:56:36.875654045Z caller=receive.go:551 component=receive msg="starting receiver" level=info ts=2020-09-15T19:56:36.875860482Z caller=options.go:23 component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable" level=info ts=2020-09-15T19:56:36.875924159Z caller=handler.go:185 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19291 level=info ts=2020-09-15T19:56:36.876077496Z caller=intrumentation.go:60 component=receive msg="changing probe status" status=healthy level=info ts=2020-09-15T19:56:36.876184742Z caller=http.go:56 component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:10902 level=info ts=2020-09-15T19:56:36.876341485Z caller=handler.go:212 component=receive component=receive-handler msg="Serving plain HTTP" address=0.0.0.0:19291 level=warn ts=2020-09-15T19:56:36.876324522Z caller=intrumentation.go:54 component=receive msg="changing probe status" status=not-ready reason="hashring has changed; server is not ready to receive web requests." level=info ts=2020-09-15T19:56:36.876435411Z caller=receive.go:399 component=receive msg="hashring has changed; server is not ready to receive web requests." 
level=info ts=2020-09-15T19:56:36.876475251Z caller=receive.go:331 component=receive msg="updating Multi DB" level=info ts=2020-09-15T19:56:36.876708097Z caller=multitsdb.go:233 component=receive tenant=default-tenant msg="opening TSDB" level=info ts=2020-09-15T19:56:36.877324396Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600174800000 maxt=1600176600000 ulid=01EJ905WE0HTBSRVHW97SJKMM0 level=info ts=2020-09-15T19:56:36.877422164Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600176600000 maxt=1600178400000 ulid=01EJ91WWYX526HJNV1KHCRC2T0 level=info ts=2020-09-15T19:56:36.87748863Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600178400000 maxt=1600180200000 ulid=01EJ93KS5XTN6QMJPBKEXMJ0HJ level=info ts=2020-09-15T19:56:36.877569684Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600180200000 maxt=1600182000000 ulid=01EJ95AVQ9X6Q2R2N2Z5J1PQMS level=info ts=2020-09-15T19:56:36.877631118Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600182000000 maxt=1600183800000 ulid=01EJ971P07FPAC7GHRAXHD7ZJ2 level=info ts=2020-09-15T19:56:36.87769127Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600183800000 maxt=1600185600000 ulid=01EJ98RHGTDM75D5YXY9Y9M53B level=info ts=2020-09-15T19:56:36.877750089Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600185600000 maxt=1600187400000 ulid=01EJ9AFESJK9GJC6Y340X7JAEG level=info ts=2020-09-15T19:56:36.877809505Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600187400000 maxt=1600189200000 ulid=01EJ9C6N8Q3E7FQ4ZXRFNJB2B0 level=info ts=2020-09-15T19:56:36.877878733Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600189200000 maxt=1600191000000 ulid=01EJ9DXAWXW30J1WF96TE9VFRQ level=info ts=2020-09-15T19:56:36.877938168Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600191000000 maxt=1600192800000 ulid=01EJ9FM9AMDMS981TD4FMY9099 level=info ts=2020-09-15T19:56:36.877997671Z caller=repair.go:59 component=receive tenant=default-tenant msg="Found healthy block" mint=1600196400000 maxt=1600198200000 ulid=01EJ9N4W07G45227ZKZFBYYEXX level=info ts=2020-09-15T19:56:37.596969133Z caller=head.go:645 component=receive tenant=default-tenant msg="Replaying WAL and on-disk memory mappable chunks if any, this may take a while" level=info ts=2020-09-15T20:11:37.953607842Z caller=head.go:682 component=receive tenant=default-tenant msg="WAL checkpoint loaded" level=info ts=2020-09-15T20:11:37.953821605Z caller=head.go:709 component=receive tenant=default-tenant msg="WAL replay completed" duration=15m0.356789614s level=info ts=2020-09-15T20:11:38.411295371Z caller=multitsdb.go:260 component=receive tenant=default-tenant msg="TSDB is now ready" level=info ts=2020-09-15T20:11:38.813443603Z caller=intrumentation.go:48 component=receive msg="changing probe status" status=ready level=info ts=2020-09-15T20:11:38.813529154Z caller=receive.go:344 component=receive msg="tsdb started, and server is ready to receive web requests" level=info ts=2020-09-15T20:11:38.813725562Z caller=grpc.go:95 component=receive service=gRPC/server component=receive msg="registering as gRPC StoreAPI" level=info ts=2020-09-15T20:11:38.814249773Z 
caller=receive.go:470 component=receive msg="listening for StoreAPI and WritableStoreAPI gRPC" address=0.0.0.0:10901 level=info ts=2020-09-15T20:11:38.814367283Z caller=grpc.go:119 component=receive service=gRPC/server component=receive msg="listening for serving gRPC" address=0.0.0.0:10901 level=info ts=2020-09-15T20:23:42.886622535Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600198200000 maxt=1600200000000 ulid=01EJ9PYDTYQPGJ2TY0N7N4GCDG duration=49.351947825s level=info ts=2020-09-15T20:23:44.719474601Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.107179541s level=info ts=2020-09-15T20:24:07.182535212Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJ9PYDTYQPGJ2TY0N7N4GCDG level=info ts=2020-09-15T20:24:13.300455287Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1355 last=1360 duration=28.425123989s level=info ts=2020-09-15T20:45:48.398087035Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600200000000 maxt=1600201800000 ulid=01EJ9R6X83SCT1MYFMDWNNMPCS duration=48.362994058s level=info ts=2020-09-15T20:45:50.182571991Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.133479156s level=info ts=2020-09-15T20:46:07.338824482Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJ9R6X83SCT1MYFMDWNNMPCS level=info ts=2020-09-15T20:46:18.201845106Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1361 last=1365 duration=27.843677829s level=info ts=2020-09-15T21:15:46.296917124Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600201800000 maxt=1600203600000 ulid=01EJ9SXVFC29F6P50Z35PTEC19 duration=45.836207378s level=info ts=2020-09-15T21:15:48.260326832Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.219429161s level=info ts=2020-09-15T21:16:06.958560742Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJ9SXVFC29F6P50Z35PTEC19 level=info ts=2020-09-15T21:16:15.975102858Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1366 last=1370 duration=27.554007681s level=info ts=2020-09-15T21:45:49.207246789Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600203600000 maxt=1600205400000 ulid=01EJ9VMV66G9DXR7G2GTM4B1ZS duration=46.800727963s level=info ts=2020-09-15T21:45:50.961584111Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.089455833s level=info ts=2020-09-15T21:46:06.982097654Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJ9VMV66G9DXR7G2GTM4B1ZS level=info ts=2020-09-15T21:46:17.58835615Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1371 last=1375 duration=26.467043501s level=info ts=2020-09-15T22:15:47.933200949Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600205400000 maxt=1600207200000 ulid=01EJ9XBQN5PJJNCKQCTW9NAY72 duration=46.903855314s level=info ts=2020-09-15T22:15:49.770176347Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.049333236s level=info ts=2020-09-15T22:16:06.9333884Z caller=shipper.go:361 
component=receive tenant=default-tenant msg="upload new block" id=01EJ9XBQN5PJJNCKQCTW9NAY72 level=info ts=2020-09-15T22:16:17.613740511Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1376 last=1380 duration=27.696871631s level=info ts=2020-09-15T22:45:47.93030111Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600207200000 maxt=1600209000000 ulid=01EJ9Z2MGFSJBDV4VA5822YNMZ duration=47.882836461s level=info ts=2020-09-15T22:45:49.907252548Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.312065918s level=info ts=2020-09-15T22:46:06.974475046Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJ9Z2MGFSJBDV4VA5822YNMZ level=info ts=2020-09-15T22:46:17.633415762Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1381 last=1385 duration=27.434076773s level=info ts=2020-09-15T23:15:49.769250381Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600209000000 maxt=1600210800000 ulid=01EJA0SJCWBYQ81RHYMN7KSMPZ duration=49.644888434s level=info ts=2020-09-15T23:15:51.684732013Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.215486614s level=info ts=2020-09-15T23:16:07.016463093Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJA0SJCWBYQ81RHYMN7KSMPZ level=info ts=2020-09-15T23:16:18.925318996Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1386 last=1390 duration=27.06613359s level=info ts=2020-09-15T23:45:50.349407402Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600210800000 maxt=1600212600000 ulid=01EJA2GG61VSM8RJR8MAZD97K0 duration=50.252372767s level=info ts=2020-09-15T23:45:52.334685158Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.217770361s level=info ts=2020-09-15T23:46:07.015296797Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJA2GG61VSM8RJR8MAZD97K0 level=info ts=2020-09-15T23:46:20.716419829Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1391 last=1395 duration=28.239839239s level=info ts=2020-09-16T00:15:51.498679461Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600212600000 maxt=1600214400000 ulid=01EJA47G1A4A0JVBY002X687GE duration=49.311982905s level=info ts=2020-09-16T00:15:53.592792438Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.29684659s level=info ts=2020-09-16T00:16:07.071473041Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJA47G1A4A0JVBY002X687GE level=info ts=2020-09-16T00:16:21.116460639Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1396 last=1400 duration=27.421283531s level=info ts=2020-09-16T00:45:55.221039118Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600214400000 maxt=1600216200000 ulid=01EJA5YF180MQMGVZ4JPSW2Z1N duration=51.820967744s level=info ts=2020-09-16T00:45:57.447021289Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.352488694s level=info ts=2020-09-16T00:46:06.957157097Z caller=shipper.go:361 component=receive 
tenant=default-tenant msg="upload new block" id=01EJA5YF180MQMGVZ4JPSW2Z1N level=info ts=2020-09-16T00:46:25.170960856Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1401 last=1405 duration=27.625500028s level=info ts=2020-09-16T01:16:22.579063711Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600216200000 maxt=1600218000000 ulid=01EJA7NDMEMPRCZ785AGH5K06Z duration=1m18.372924917s level=info ts=2020-09-16T01:16:25.882142796Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=2.352102153s level=info ts=2020-09-16T01:16:37.344549031Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJA7NDMEMPRCZ785AGH5K06Z level=info ts=2020-09-16T01:17:01.098146474Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1406 last=1410 duration=35.093052313s level=info ts=2020-09-16T01:45:53.054436585Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600218000000 maxt=1600219800000 ulid=01EJA9CB1X3NED5KCTC2N9XSYS duration=49.249259889s level=info ts=2020-09-16T01:45:55.090116845Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.246436826s level=info ts=2020-09-16T01:46:06.970426103Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJA9CB1X3NED5KCTC2N9XSYS level=info ts=2020-09-16T01:46:23.366813856Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1411 last=1415 duration=28.151879333s level=info ts=2020-09-16T02:15:50.986749163Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600219800000 maxt=1600221600000 ulid=01EJAB356EN15N1XPTX32258KM duration=50.940480691s level=info ts=2020-09-16T02:15:53.115517406Z caller=head.go:792 component=receive tenant=default-tenant msg="Head GC completed" duration=1.361866934s level=info ts=2020-09-16T02:16:06.955136662Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJAB356EN15N1XPTX32258KM level=info ts=2020-09-16T02:16:21.20090877Z caller=head.go:869 component=receive tenant=default-tenant msg="WAL checkpoint complete" first=1416 last=1420 duration=27.966189826s level=info ts=2020-09-16T02:45:54.993949887Z caller=compact.go:495 component=receive tenant=default-tenant msg="write block" mint=1600221600000 maxt=1600223400000 ulid=01EJACT6V23EY2JX7YRVFCYC3X duration=51.023719573s level=info ts=2020-09-16T02:46:06.944723383Z caller=shipper.go:361 component=receive tenant=default-tenant msg="upload new block" id=01EJACT6V23EY2JX7YRVFCYC3X

Anything else we need to know:

Configuration:
- receive
- --log.level=info
- --http-address=0.0.0.0:10902
- --grpc-address=0.0.0.0:10901
- --remote-write.address=0.0.0.0:19291
- --tsdb.path=/var/thanos/receive
- --objstore.config=$(OBJSTORE_CONFIG)
- --label=replica="$(NAME)"
- --label=receive="true"
- --tsdb.retention=6h
- --tsdb.min-block-duration=30m
- --tsdb.max-block-duration=30m
- --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
- --receive.local-endpoint=$(NAME).thanos-receive-default.$(NAMESPACE).svc.cluster.local:10901
- --shipper.allow-out-of-order-uploads
- --receive.replication-factor=1

@BhartiAgrawalNvidia (Author)

We also see a lot of duplicate-sample errors (same timestamp, different value):

level=warn ts=2020-09-20T22:20:26.155502953Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=2
level=warn ts=2020-09-20T22:20:26.159337854Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T22:20:26.159504587Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T22:20:26.16749575Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T23:20:23.00367522Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T23:20:23.00718343Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T23:20:23.008136578Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=2
level=warn ts=2020-09-20T23:20:23.009006264Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=3
level=warn ts=2020-09-20T23:20:23.011247461Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
level=warn ts=2020-09-20T23:20:23.015944995Z caller=writer.go:95 component=receive component=receive-writer msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

@bwplotka (Member)

Thanks for this. We experienced something like this, but it went away when we increased the CPU limit slightly. Let's investigate. A goroutine dump taken while this is happening would be interesting to see!
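
If it is easy to grab, the dump can be taken straight from the receiver's HTTP port (10902 in your config). These are the standard net/http/pprof paths; I'm assuming Receive exposes them like the other Thanos components do, and <receive-pod> is just a placeholder for the pod address:

# goroutine dump in text form, taken while the receiver is stuck
curl -s "http://<receive-pod>:10902/debug/pprof/goroutine?debug=2" > goroutines.txt

# 30s CPU profile in pprof format
curl -s -o cpu.pb.gz "http://<receive-pod>:10902/debug/pprof/profile?seconds=30"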

@BhartiAgrawalNvidia (Author) commented Sep 24, 2020

Here is the goroutine dump from the time it was in this state:
pprof.thanos.goroutine.009.pb.gz
and the CPU profile:
pprof.thanos.samples.cpu.002.pb.gz
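
They can be opened with the standard Go pprof tooling, e.g. something like:

# quick text summary of where goroutines are parked
go tool pprof -top pprof.thanos.goroutine.009.pb.gz

# interactive web UI for the CPU profile
go tool pprof -http=:8080 pprof.thanos.samples.cpu.002.pb.gz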

@BhartiAgrawalNvidia (Author) commented Sep 24, 2020

I see spikes in CPU usage that correlate with http_requests_total, which I think come from Prometheus remote write, as that is the only component using HTTP; the others use gRPC. We have a CPU resource setting, but usage is mostly around 2 cores with spikes that go up to 21 cores.

[graph] CPU usage per thanos-receive pod:
namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate{pod_name=~".*thanos-receive.*", container_name!="POD"}

[graph] HTTP request rate:
rate(http_requests_total{pod=~".*thanos-receive.*", service=~".*thanos-receive-default.*"}[1m])
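
As a rough way to spot when truncation has stalled (assuming Receive exposes the usual prometheus_tsdb_* head metrics), the head time span should stay well under an hour with --tsdb.min-block-duration=30m; when the receiver gets stuck it just keeps growing:

# head time span in hours per receiver; with 30m blocks this should stay well under ~1h
(prometheus_tsdb_head_max_time - prometheus_tsdb_head_min_time) / 1000 / 3600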

@BhartiAgrawalNvidia (Author)

I do see CPU throttling happening on the receive controller. Could this be causing these issues?

@BhartiAgrawalNvidia (Author)

Hi, can you tell me how best to investigate this? Should I create a unit test for it?

@BhartiAgrawalNvidia (Author)

@bwplotka it would be great to get some input here, as we are considering pulling this out. I am not able to recreate this in a non-prod environment.

@BhartiAgrawalNvidia (Author)

This was caused by head.gc() getting stuck on the pendingReaders WaitGroup, which happened when long-running queries timed out. It is fixed in version 0.15 by #2866 ("Receive, Querier: Fixed leaks on receive and querier Store API Series, which were leaking on errors").
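
For anyone hitting this, here is a minimal sketch of the failure mode (not the actual Thanos/Prometheus TSDB code; acquireReader is a made-up name for illustration): head GC waits on the pendingReaders WaitGroup, so a query path that errors out or times out without releasing its reader blocks truncation, and with it compaction and shipping, forever.

package main

import (
	"fmt"
	"sync"
	"time"
)

type head struct {
	pendingReaders sync.WaitGroup
}

// acquireReader registers a reader and returns the release function that
// every query path (including error and timeout paths) must call.
func (h *head) acquireReader() (release func()) {
	h.pendingReaders.Add(1)
	return func() { h.pendingReaders.Done() }
}

// gc cannot truncate the head while any reader is still registered.
func (h *head) gc() {
	h.pendingReaders.Wait() // blocks forever if a release() call is leaked
	fmt.Println("head truncated; compaction and shipping can continue")
}

func main() {
	h := &head{}
	release := h.acquireReader()

	go func() {
		// Simulate a long-running Series call that times out. The 0.15 fix
		// (#2866) ensures release() is still called on the error path;
		// without it, gc() below would hang and truncation would stall.
		time.Sleep(2 * time.Second)
		release()
	}()

	h.gc()
}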
