panic: runtime error: slice bounds out of range [:-6] after upgrade v1.0 #752

Closed
guyguy333 opened this issue Jun 10, 2021 · 12 comments · Fixed by #763

guyguy333 commented Jun 10, 2021

Describe the bug

I upgraded Tempo to v1.0 (thanks for the release, amazing product 👍). The first boot after the upgrade was fine. Now I'm getting this error on restart and I can no longer start Tempo:

[...]
level=info ts=2021-06-10T08:05:43.391436169Z caller=main.go:94 msg="Starting Tempo" version="(version=1.0.0, branch=HEAD, revision=b92405378)"
level=info ts=2021-06-10T08:05:43.392640963Z caller=server.go:239 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-06-10T08:05:44.310411806Z caller=frontend.go:29 msg="creating tripperware in query frontend to shard queries"
ts=2021-06-10T08:05:44Z level=info msg="OTel Shim Logger Initialized" component=tempo
level=warn ts=2021-06-10T08:05:44.312534002Z caller=modules.go:145 msg="Worker address is empty in single binary mode. Attempting automatic worker configuration. If queries are unresponsive consider configuring the worker explicitly." address=127.0.0.1:9095
level=info ts=2021-06-10T08:05:44.312759423Z caller=worker.go:113 msg="Starting querier worker connected to query-frontend" frontend=127.0.0.1:9095
ts=2021-06-10T08:05:44Z level=info msg="No sampling strategies provided or URL is unavailable, using defaults" component=tempo
ts=2021-06-10T08:05:44Z level=info msg="Starting GRPC server on endpoint 0.0.0.0:55680" component=tempo
level=info ts=2021-06-10T08:05:44.313354175Z caller=module_service.go:59 msg=initialising module=store
level=info ts=2021-06-10T08:05:44.313415725Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-06-10T08:05:44.313576065Z caller=module_service.go:59 msg=initialising module=memberlist-kv
level=info ts=2021-06-10T08:05:44.313622115Z caller=module_service.go:59 msg=initialising module=query-frontend
level=info ts=2021-06-10T08:05:44.313620395Z caller=module_service.go:59 msg=initialising module=overrides
level=info ts=2021-06-10T08:05:44.313696956Z caller=module_service.go:59 msg=initialising module=compactor
level=info ts=2021-06-10T08:05:44.313703076Z caller=module_service.go:59 msg=initialising module=ring
level=info ts=2021-06-10T08:05:44.313719756Z caller=compactor.go:98 msg="waiting for compaction ring to settle" waitDuration=1m0s
level=info ts=2021-06-10T08:05:44.313804716Z caller=ring.go:271 msg="ring doesn't exist in KV store yet"
level=info ts=2021-06-10T08:05:44.313837016Z caller=module_service.go:59 msg=initialising module=ingester
level=info ts=2021-06-10T08:05:44.313840656Z caller=client.go:247 msg="value is nil" key=collectors/ring index=1
level=info ts=2021-06-10T08:05:44.313852036Z caller=ingester.go:320 msg="beginning wal replay"
level=info ts=2021-06-10T08:05:44.313899407Z caller=module_service.go:59 msg=initialising module=querier
level=info ts=2021-06-10T08:05:44.313950537Z caller=module_service.go:59 msg=initialising module=distributor
level=info ts=2021-06-10T08:05:44.313959637Z caller=wal.go:93 msg="beginning replay" file=521ddd7b-d4d9-4e60-8272-85842bacefcc:single-tenant:v2:none:v1 size=3690496
level=info ts=2021-06-10T08:05:44.314419638Z caller=worker.go:193 msg="adding connection" addr=127.0.0.1:9095
panic: runtime error: slice bounds out of range [:-6]

goroutine 423 [running]:
github.com/grafana/tempo/tempodb/encoding/v2.unmarshalPageFromReader(0x29277a0, 0xc000121548, 0x29556f0, 0x39af8a0, 0xc040dfe280, 0x25a, 0x25f, 0x10, 0xc0417ff150, 0xc00007a070)
/drone/src/tempodb/encoding/v2/page.go:84 +0x428
github.com/grafana/tempo/tempodb/encoding/v2.(*dataReader).NextPage(0xc040e70d40, 0x0, 0x0, 0x0, 0x0, 0xc0417ff150, 0x10, 0x10, 0x0, 0x10)
/drone/src/tempodb/encoding/v2/data_reader.go:71 +0x9d
github.com/grafana/tempo/tempodb/encoding/v1.(*dataReader).NextPage(0xc040e7fb00, 0x0, 0x0, 0x0, 0x10, 0xc0417ff150, 0x0, 0x10, 0x0, 0x2920ea0)
/drone/src/tempodb/encoding/v1/data_reader.go:85 +0x5e
github.com/grafana/tempo/tempodb/wal.newAppendBlockFromFile(0xc040e123cf, 0x3d, 0xc000143010, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0)
/drone/src/tempodb/wal/append_block.go:107 +0x3bc
github.com/grafana/tempo/tempodb/wal.(*WAL).RescanBlocks(0xc00073e9a0, 0x2922800, 0xc000363ef0, 0x2, 0x0, 0x0, 0xc040f0a720, 0xc040e7f500)
/drone/src/tempodb/wal/wal.go:94 +0x425
github.com/grafana/tempo/modules/ingester.(*Ingester).replayWal(0xc0008a5600, 0x7fd89d59a2e8, 0x10)
/drone/src/modules/ingester/ingester.go:322 +0x177
github.com/grafana/tempo/modules/ingester.(*Ingester).starting(0xc0008a5600, 0x29669c0, 0xc040e70d00, 0xc040f0c001, 0xc040e6f9c0)
/drone/src/modules/ingester/ingester.go:98 +0x2f
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000940f00)
/drone/src/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:157 +0x46a
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
/drone/src/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo

Expected behavior

No crash

Environment:

  • Infrastructure: Kubernetes with latest helm chart
  • Deployment tool: helm

Additional Context

I increased grpc_server_max_recv_msg_size to 41943040 in the server config; I'm not sure whether that's related.

joe-elliott (Member)

This sounds like a corrupt wal. Please remove 521ddd7b-d4d9-4e60-8272-85842bacefcc:single-tenant:v2:none:v1 from the wal folder on this ingester and it should be fine.

It's possible this is just a one-off corruption, or it may be due to a bug in Tempo. We are not seeing this ourselves, but I will leave this open in case it persists (or others hit similar problems).

guyguy333 (Author)

Thanks @joe-elliott, that fixed the issue.

guyguy333 (Author)

I got the same issue this morning. We stop the Tempo pod at night and start it early in the morning. Maybe something isn't cleaned up properly during the stop?

joe-elliott (Member)

How are you stopping the pod? If you send a SIGTERM and let it shut down on its own, it should not do this. If it's a SIGKILL, or a SIGTERM followed by a SIGKILL after a timeout, then it makes sense that you'd see something like this.

guyguy333 (Author) commented Jun 11, 2021

The pod is stopped when we reduce the node pool size at night, so the node running Tempo is drained (SchedulingDisabled, then removed). I guess in that case Kubernetes sends SIGTERM because of the node drain. Maybe termination takes too long and Kubernetes then sends SIGKILL.

joe-elliott (Member)

I don't know for certain, but I believe drains follow all of the standard rules for shutting down a pod (including the termination grace period). We regularly see ingesters get terminated and have not seen this issue.

Things you can try:

  • Up terminationGracePeriodSeconds
  • Check kubelet/control plane logs to confirm or rule out that it is getting SIGKILLed
  • Set the WAL compression type. This defaults to "none", but we use "snappy" at Grafana. Maybe compression is why we don't see this bug?
storage:
  trace:
    wal:
      encoding: snappy

ghost commented Jun 14, 2021

I'm having the same issue running Tempo on Docker for Windows after restarting the PC.

joe-elliott (Member)

@alexander-klang @guyguy333

Is the number always [:-6]? Reviewing the code, I noticed that 6 bytes is the header size, and if a page were somehow written with a totalLength of 0, it would attempt to use this -6 index into the slice.
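
To make that concrete, here is a minimal sketch of the arithmetic. It is not the actual code in tempodb/encoding/v2/page.go; the headerSize constant, the payloadFromPage function, and the page layout are hypothetical names used only to show how a totalLength of 0 turns into a [:-6] slice bound.

package main

import "fmt"

// Hypothetical layout for illustration only: a 6-byte page header carrying
// totalLength, with the payload taken as the first totalLength-headerSize
// bytes of the buffer.
const headerSize = 6

func payloadFromPage(page []byte, totalLength uint32) []byte {
	// A torn write that leaves totalLength == 0 makes the upper bound
	// int(totalLength)-headerSize equal to -6, and the slice expression
	// panics with "slice bounds out of range [:-6]".
	return page[:int(totalLength)-headerSize]
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: recovered: runtime error: slice bounds out of range [:-6]
			fmt.Println("recovered:", r)
		}
	}()
	_ = payloadFromPage(make([]byte, 16), 0)
}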

joe-elliott self-assigned this Jun 14, 2021
joe-elliott (Member)

Also, if anyone could share the shutdown logs of the ingester from just before this issue is seen, that would be helpful.

joe-elliott (Member) commented Jun 14, 2021

We have confirmation from @Whyeasy that kubectl delete po <ingester-pod> does not reproduce this issue for him, and that the issue occurs on both snappy and uncompressed WALs. This makes me more confident that this is WAL corruption caused by a non-graceful shutdown, but an ingester's shutdown logs would help confirm that.

I will submit a PR with guard code that prevents the panic and replays the WAL up to the point of corruption, and we can cut a release with that.
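
As a rough sketch of what guard code like that could look like (assumed names carried over from the sketch above, not the actual change in the linked PR), the length read from the page header would be validated before slicing, so a torn page becomes an error the replay loop can stop on rather than a panic:

package main

import (
	"errors"
	"fmt"
)

// Assumed names for illustration, mirroring the earlier sketch; this is not
// the actual Tempo code or the diff that fixed the issue.
const headerSize = 6

var errCorruptPage = errors.New("wal: corrupt or truncated page")

// payloadFromPage validates totalLength before slicing, so a page written
// with totalLength == 0 surfaces as an error instead of a panic.
func payloadFromPage(page []byte, totalLength uint32) ([]byte, error) {
	if int(totalLength) < headerSize || int(totalLength)-headerSize > len(page) {
		return nil, fmt.Errorf("%w: totalLength=%d", errCorruptPage, totalLength)
	}
	return page[:int(totalLength)-headerSize], nil
}

func main() {
	// The replay loop can log this error, keep the traces recovered so far,
	// and skip the rest of the damaged WAL file.
	if _, err := payloadFromPage(make([]byte, 16), 0); err != nil {
		fmt.Println("stopping wal replay:", err)
	}
}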

guyguy333 (Author)

@joe-elliott I confirm that I got -6 every time I hit the issue (I haven't been able to reproduce it since).

joe-elliott (Member)

This has been released as 1.0.1:

https://github.com/grafana/tempo/releases/tag/v1.0.1
