panic: runtime error: slice bounds out of range [:-6] after upgrade v1.0 #752

Closed
guyguy333 opened this issue Jun 10, 2021 · 12 comments · Fixed by #763

guyguy333 commented Jun 10, 2021

Describe the bug

I upgraded Tempo to v1.0 (thanks for the release, amazing product 👍). The first boot after the upgrade was fine. Now I'm getting this error on restart and I can no longer start Tempo:

[...]
level=info ts=2021-06-10T08:05:43.391436169Z caller=main.go:94 msg="Starting Tempo" version="(version=1.0.0, branch=HEAD, revision=b92405378)"
level=info ts=2021-06-10T08:05:43.392640963Z caller=server.go:239 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-06-10T08:05:44.310411806Z caller=frontend.go:29 msg="creating tripperware in query frontend to shard queries"
ts=2021-06-10T08:05:44Z level=info msg="OTel Shim Logger Initialized" component=tempo
level=warn ts=2021-06-10T08:05:44.312534002Z caller=modules.go:145 msg="Worker address is empty in single binary mode. Attempting automatic worker configuration. If queries are unresponsive consider configuring the worker explicitly." address=127.0.0.1:9095
level=info ts=2021-06-10T08:05:44.312759423Z caller=worker.go:113 msg="Starting querier worker connected to query-frontend" frontend=127.0.0.1:9095
ts=2021-06-10T08:05:44Z level=info msg="No sampling strategies provided or URL is unavailable, using defaults" component=tempo
ts=2021-06-10T08:05:44Z level=info msg="Starting GRPC server on endpoint 0.0.0.0:55680" component=tempo
level=info ts=2021-06-10T08:05:44.313354175Z caller=module_service.go:59 msg=initialising module=store
level=info ts=2021-06-10T08:05:44.313415725Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-06-10T08:05:44.313576065Z caller=module_service.go:59 msg=initialising module=memberlist-kv
level=info ts=2021-06-10T08:05:44.313622115Z caller=module_service.go:59 msg=initialising module=query-frontend
level=info ts=2021-06-10T08:05:44.313620395Z caller=module_service.go:59 msg=initialising module=overrides
level=info ts=2021-06-10T08:05:44.313696956Z caller=module_service.go:59 msg=initialising module=compactor
level=info ts=2021-06-10T08:05:44.313703076Z caller=module_service.go:59 msg=initialising module=ring
level=info ts=2021-06-10T08:05:44.313719756Z caller=compactor.go:98 msg="waiting for compaction ring to settle" waitDuration=1m0s
level=info ts=2021-06-10T08:05:44.313804716Z caller=ring.go:271 msg="ring doesn't exist in KV store yet"
level=info ts=2021-06-10T08:05:44.313837016Z caller=module_service.go:59 msg=initialising module=ingester
level=info ts=2021-06-10T08:05:44.313840656Z caller=client.go:247 msg="value is nil" key=collectors/ring index=1
level=info ts=2021-06-10T08:05:44.313852036Z caller=ingester.go:320 msg="beginning wal replay"
level=info ts=2021-06-10T08:05:44.313899407Z caller=module_service.go:59 msg=initialising module=querier
level=info ts=2021-06-10T08:05:44.313950537Z caller=module_service.go:59 msg=initialising module=distributor
level=info ts=2021-06-10T08:05:44.313959637Z caller=wal.go:93 msg="beginning replay" file=521ddd7b-d4d9-4e60-8272-85842bacefcc:single-tenant:v2:none:v1 size=3690496
level=info ts=2021-06-10T08:05:44.314419638Z caller=worker.go:193 msg="adding connection" addr=127.0.0.1:9095
panic: runtime error: slice bounds out of range [:-6]

goroutine 423 [running]:
github.com/grafana/tempo/tempodb/encoding/v2.unmarshalPageFromReader(0x29277a0, 0xc000121548, 0x29556f0, 0x39af8a0, 0xc040dfe280, 0x25a, 0x25f, 0x10, 0xc0417ff150, 0xc00007a070)
/drone/src/tempodb/encoding/v2/page.go:84 +0x428
github.com/grafana/tempo/tempodb/encoding/v2.(*dataReader).NextPage(0xc040e70d40, 0x0, 0x0, 0x0, 0x0, 0xc0417ff150, 0x10, 0x10, 0x0, 0x10)
/drone/src/tempodb/encoding/v2/data_reader.go:71 +0x9d
github.com/grafana/tempo/tempodb/encoding/v1.(*dataReader).NextPage(0xc040e7fb00, 0x0, 0x0, 0x0, 0x10, 0xc0417ff150, 0x0, 0x10, 0x0, 0x2920ea0)
/drone/src/tempodb/encoding/v1/data_reader.go:85 +0x5e
github.com/grafana/tempo/tempodb/wal.newAppendBlockFromFile(0xc040e123cf, 0x3d, 0xc000143010, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0)
/drone/src/tempodb/wal/append_block.go:107 +0x3bc
github.com/grafana/tempo/tempodb/wal.(*WAL).RescanBlocks(0xc00073e9a0, 0x2922800, 0xc000363ef0, 0x2, 0x0, 0x0, 0xc040f0a720, 0xc040e7f500)
/drone/src/tempodb/wal/wal.go:94 +0x425
github.com/grafana/tempo/modules/ingester.(*Ingester).replayWal(0xc0008a5600, 0x7fd89d59a2e8, 0x10)
/drone/src/modules/ingester/ingester.go:322 +0x177
github.com/grafana/tempo/modules/ingester.(*Ingester).starting(0xc0008a5600, 0x29669c0, 0xc040e70d00, 0xc040f0c001, 0xc040e6f9c0)
/drone/src/modules/ingester/ingester.go:98 +0x2f
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000940f00)
/drone/src/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:157 +0x46a
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
/drone/src/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo

Expected behavior

No crash

Environment:

  • Infrastructure: Kubernetes with latest helm chart
  • Deployment tool: helm

Additional Context

I increased grpc_server_max_recv_msg_size to 41943040 in the server config; I'm not sure whether that's related.

joe-elliott (Member)

This sounds like a corrupt wal. Please remove 521ddd7b-d4d9-4e60-8272-85842bacefcc:single-tenant:v2:none:v1 from the wal folder on this ingester and it should be fine.

It's possible this is just a one-off corruption, or it may be due to a bug in Tempo. We are not seeing this ourselves, but I will leave this open in case it persists (or others hit similar problems).

guyguy333 (Author)

Thanks @joe-elliott, that fixed the issue.

guyguy333 (Author)

I got the same issue this morning. We stop the Tempo pod at night and start it early in the morning. Maybe something isn't cleaned up properly during the stop?

joe-elliott (Member)

How are you stopping the pod? If you send a SIGTERM and let it shut down on its own, it should not do this. If it's a SIGKILL, or a SIGTERM followed by a SIGKILL after a timeout, then it makes sense that you'd see something like this.

guyguy333 (Author) commented Jun 11, 2021

The pod is stopped when we reduce the node pool size at night, so the node running Tempo is drained (SchedulingDisabled, then removed). I guess in that case Kubernetes sends SIGTERM because of the node drain. Maybe termination takes too long and Kubernetes then sends SIGKILL.

joe-elliott (Member)

I don't know for certain, but I believe drains follow all of the standard rules for shutting down a pod (including the termination grace period). We regularly see ingesters get terminated and have not seen this issue.

Things you can try:

  • Up terminationGracePeriodSeconds
  • Check kubelet/control plane logs to confirm or rule out that it is getting SIGKILLed
  • Set the WAL compression type. This defaults to "none", but we use "snappy" at Grafana. Maybe compression is why we don't see this bug?
storage:
  trace:
    wal:
      encoding: snappy

ghost commented Jun 14, 2021

I'm having the same issue running Tempo on Docker for Windows after restarting the PC.

joe-elliott (Member)

@alexander-klang @guyguy333

Is the number always [:-6]? Reviewing the code, I noticed that 6 bytes is the header size, and if a page were somehow written with a totalLength of 0, it would attempt to use this -6 index into the slice.
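
To make that concrete, here is a minimal sketch of the arithmetic. It is not the actual code in tempodb/encoding/v2/page.go; the headerSize constant, the payloadFromPage function, and the page layout are hypothetical names used only to show how a totalLength of 0 turns into a [:-6] slice bound.

package main

import "fmt"

// Hypothetical layout for illustration only: a 6-byte page header carrying
// totalLength, with the payload taken as the first totalLength-headerSize
// bytes of the buffer.
const headerSize = 6

func payloadFromPage(page []byte, totalLength uint32) []byte {
	// A torn write that leaves totalLength == 0 makes the upper bound
	// int(totalLength)-headerSize equal to -6, and the slice expression
	// panics with "slice bounds out of range [:-6]".
	return page[:int(totalLength)-headerSize]
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: recovered: runtime error: slice bounds out of range [:-6]
			fmt.Println("recovered:", r)
		}
	}()
	_ = payloadFromPage(make([]byte, 16), 0)
}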

joe-elliott self-assigned this Jun 14, 2021
joe-elliott (Member)

Also, if anyone could share the shutdown logs of the ingester from just before this issue is seen, that would be helpful.

joe-elliott (Member) commented Jun 14, 2021

We have confirmation from @Whyeasy that kubectl delete po <ingester-pod> does not reproduce this issue for him, and that the issue occurs on both snappy and uncompressed WALs. This makes me more confident that this is WAL corruption caused by a non-graceful shutdown, but an ingester's shutdown logs would help confirm that.

I will submit a PR with guard code that prevents the panic and replays the WAL up to the point of corruption, and we can cut a release with that.
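
As a rough sketch of what guard code like that could look like (assumed names carried over from the sketch above, not the actual change in the linked PR), the length read from the page header would be validated before slicing, so a torn page becomes an error the replay loop can stop on rather than a panic:

package main

import (
	"errors"
	"fmt"
)

// Assumed names for illustration, mirroring the earlier sketch; this is not
// the actual Tempo code or the diff that fixed the issue.
const headerSize = 6

var errCorruptPage = errors.New("wal: corrupt or truncated page")

// payloadFromPage validates totalLength before slicing, so a page written
// with totalLength == 0 surfaces as an error instead of a panic.
func payloadFromPage(page []byte, totalLength uint32) ([]byte, error) {
	if int(totalLength) < headerSize || int(totalLength)-headerSize > len(page) {
		return nil, fmt.Errorf("%w: totalLength=%d", errCorruptPage, totalLength)
	}
	return page[:int(totalLength)-headerSize], nil
}

func main() {
	// The replay loop can log this error, keep the traces recovered so far,
	// and skip the rest of the damaged WAL file.
	if _, err := payloadFromPage(make([]byte, 16), 0); err != nil {
		fmt.Println("stopping wal replay:", err)
	}
}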

guyguy333 (Author)

@joe-elliott I confirm that I got -6 every time I hit the issue (I haven't been able to reproduce it since).

joe-elliott (Member)

This has been released as 1.0.1:

https://github.com/grafana/tempo/releases/tag/v1.0.1
