Panic on block read in querier #2987

Closed
gebn opened this issue Oct 5, 2023 · 21 comments

@gebn
Contributor

gebn commented Oct 5, 2023

Describe the bug
This looks like another crash on corrupt data in segmentio/parquet-go, this time when reading.

grafana/tempo:2.2.2 official Docker image

Expected behavior
No crash, and an error to be returned.

Environment:

  • Infrastructure: Kubernetes

Additional Context

level=error ts=2023-10-05T10:33:17.447036816Z caller=frontend_processor.go:69 msg="error processing requests" address=11.2.26.104:9095 err="rpc error: code = Canceled desc = 5 errors: context canceled; context canceled; context canceled; context canceled; context canceled"
level=info ts=2023-10-05T10:33:17.449977566Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=9d22b20e-cf15-4d3f-b6f5-ee6ab9f6ea63 found=false
level=info ts=2023-10-05T10:33:17.450423773Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=1e62336e-9419-4894-85ec-2671ca9236e0 found=false
level=info ts=2023-10-05T10:33:17.453207662Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=70c1e0dc-412f-4145-a52c-5670fe6f5c21 found=false
level=info ts=2023-10-05T10:33:17.453711417Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6efa857f-1ab7-45bf-ac4a-4f468e19bf3a found=false
level=info ts=2023-10-05T10:33:17.453966892Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6e3ff854-1408-46e8-9b2b-9e32150eead1 found=false
level=info ts=2023-10-05T10:33:17.454321399Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6a1fec72-36f2-4b3c-bdac-4ce1b205762e found=false
level=info ts=2023-10-05T10:33:17.454608092Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6eba7ea1-452f-4357-81dc-02478d60b697 found=false
level=info ts=2023-10-05T10:33:17.45491022Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=722bf777-4d57-408b-b03b-473aff1e16de found=false
panic: runtime error: makeslice: len out of range

goroutine 12665 [running]:
github.com/segmentio/parquet-go.(*bufferPool).newBuffer(0x3cec060, 0xffffffffb7e70dc5, 0x1000)
        /drone/src/vendor/github.com/segmentio/parquet-go/buffer.go:340 +0x33
github.com/segmentio/parquet-go.(*bufferPool).get(0x3cec060?, 0xffffffffb7e70dc5?)
        /drone/src/vendor/github.com/segmentio/parquet-go/buffer.go:362 +0xdf
github.com/segmentio/parquet-go.(*Column).decompress(0xc0001aab60, {0xc00a800000, 0x224ed43a, 0x22a359ed}, 0x3b0bba8?)
        /drone/src/vendor/github.com/segmentio/parquet-go/column.go:515 +0x3e
github.com/segmentio/parquet-go.(*Column).decodeDictionary(0xc0001aab60, {0x30?}, 0x2371300?, 0x3b0bcd8?)
        /drone/src/vendor/github.com/segmentio/parquet-go/column.go:744 +0xfb
github.com/segmentio/parquet-go.(*filePages).readDictionaryPage(0xc0011683c0, 0xc002316690?, 0xc002316690?)
        /drone/src/vendor/github.com/segmentio/parquet-go/file.go:615 +0x3f
github.com/segmentio/parquet-go.(*filePages).ReadPage(0xc0011683c0)
        /drone/src/vendor/github.com/segmentio/parquet-go/file.go:550 +0x205
github.com/segmentio/parquet-go.(*multiPages).ReadPage(0xc00153f5a0)
        /drone/src/vendor/github.com/segmentio/parquet-go/multi_row_group.go:230 +0x3f
github.com/segmentio/parquet-go.readPages({0x2a330c0, 0xc00153f5a0}, 0xc000ba2a80, 0xc002075030, 0xc002118e40)
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:208 +0xc9
created by github.com/segmentio/parquet-go.(*asyncPages).init
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:150 +0x16a
@joe-elliott
Member

joe-elliott commented Oct 6, 2023

panic: runtime error: makeslice: len out of range

is an interesting error. The value passed would have to be either negative or very large. Given the panic in the previous issue involving a nearly 4 GB slice, I'm wondering if you have a write pattern that is testing the limits of the allowed dictionary sizes in parquet-go.
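For illustration only (this is not Tempo or parquet-go code), here is a minimal Go sketch of how a corrupt length produces exactly this panic; the second argument in the newBuffer frame above, 0xffffffffb7e70dc5, looks like a negative value when read as a signed integer:

package main

// Hypothetical stand-in for a length decoded from corrupt block data.
// Passing a negative length to make() panics at runtime with
// "panic: runtime error: makeslice: len out of range", as in the trace above.
func main() {
	badLen := -1
	_ = make([]byte, badLen)
}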

I'd be interested in the results of this command:

go run ./cmd/tempo-cli analyse blocks <tenant id> --backend="???" --bucket="???"

We use this command to help determine which fields to make dedicated columns in vParquet3. It would reveal if you have an attribute or attributes that are blowing up the size of your dictionaries.
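For instance, an illustrative invocation using the tempo-cli binary from a release tarball (the tenant ID is taken from the logs above; the bucket name is a placeholder):

./tempo-cli analyse blocks single-tenant --backend="s3" --bucket="<your-tempo-bucket>"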

@gebn
Contributor Author

gebn commented Oct 6, 2023

Is there a binary release of tempo-cli anywhere? I'd like to help, but unfortunately I can't compile this due to AGPL.

@joe-elliott
Member

The .tar.gz archives on our release page contain a compiled copy of tempo-cli:

https://github.com/grafana/tempo/releases

However, this command was added recently and won't be in 2.2.3. We are talking about cutting a 2.3 release, which you could use to run the analysis.

@joe-elliott
Member

We have seen this issue internally now and are digging deeper.

@stoewer stoewer assigned ie-pham and stoewer and unassigned stoewer and ie-pham Oct 17, 2023
@joe-elliott
Member

Ok, this manifests itself differently on our side, but I believe this issue as well as #2731 are related to: parquet-go/parquet-go#79

These occur when Tempo severely underestimates the size of your traces and writes very large row groups that overflow sizes in the parquet-go library. The most likely way this can happen is if Tempo is consuming traces with lots of long string data in attributes.

To alleviate this you should try reducing your target row group size:

storage:
    trace:
        block:
            parquet_row_group_size_bytes: 100000000  # 100 MB default

@gebn
Contributor Author

gebn commented Oct 19, 2023

Thanks @joe-elliott. I've deployed this with a target row group size of 10 MB and upgraded to 2.2.3 at the same time. I'll keep an eye on our ingesters.

@joe-elliott
Member

Out of curiosity, do you have some very large or high-cardinality attribute values? That is what we've observed creating this issue in our cloud environment.

I'd also check the meta.json of your blocks. You can divide the size field by the totalRecords field to determine the average row group size. If the issue is similar to what we're seeing, you'll notice that the actual average row group size is significantly larger than the configured target size.
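As a rough sketch of that division (not part of tempo-cli; it assumes a locally downloaded meta.json with the size and totalRecords fields):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal sketch: read a block's meta.json and print size/totalRecords,
// i.e. the average row group size.
type blockMeta struct {
	Size         int64 `json:"size"`
	TotalRecords int64 `json:"totalRecords"`
}

func main() {
	raw, err := os.ReadFile("meta.json") // path is illustrative
	if err != nil {
		panic(err)
	}
	var m blockMeta
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	if m.TotalRecords == 0 {
		panic("totalRecords is zero")
	}
	fmt.Printf("average row group size: %d bytes (%.0f MB)\n",
		m.Size/m.TotalRecords, float64(m.Size)/float64(m.TotalRecords)/1e6)
}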

@gebn
Contributor Author

gebn commented Oct 19, 2023

Definitely, some applications are fond of adding a stack trace attribute to error spans. Unfortunately, having everything in the trace UI is rather convenient, so clients have a subconscious bias against using logs, even when they would be more suitable. We definitely want to drop attributes over a certain length; however, limiting cardinality is much harder. Different teams may use a given non-semconv attribute in different ways.

I took a look at a few meta.jsons in our bucket. The result of that division ranged from 155MB to 350MB. Here's the 350MB block:

{
  "format": "vParquet2",
  "blockID": "4fe5d3b5-6def-44dc-a3c8-9b69e6dcc4a9",
  "minID": "SVShJPNZyECpSsvIJ24fMg==",
  "maxID": "///oyKrsAYnLGzK1kTGXqw==",
  "tenantID": "single-tenant",
  "startTime": "2023-10-17T20:16:01Z",
  "endTime": "2023-10-17T20:41:18Z",
  "totalObjects": 949402,
  "size": 1050823950,
  "compactionLevel": 2,
  "encoding": "none",
  "indexPageSize": 0,
  "totalRecords": 3,
  "dataEncoding": "",
  "bloomShards": 12,
  "footerSize": 48425
}
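
That works out to 1,050,823,950 bytes across totalRecords=3, i.e. roughly 350 MB per row group, versus the 100 MB default target.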

@joe-elliott
Member

however limiting cardinality is much harder.

Yeah. I wouldn't ask your teams to make changes. It's just that high entropy data compresses poorly and would be more likely to trip this issue.

Here's the 350MB block:

Yup. This is basically what we were seeing as well. If you are on defaults, then you are configured for a 100 MB row group size; your actual row groups are obviously quite a bit larger than that. I think you're definitely experiencing the same thing.

While we get a proper fix in, the following will help alleviate it:

  • Reduce the configured row group size until your actual row groups average 50 to 100 MB
  • 2.3 (coming soon) should include the hacky fix linked, as well as (hopefully) the parquet-go fix, which will prevent Tempo from writing these corrupt blocks
  • Once 2.3 comes out, I'd also turn on vParquet3 and pull these large columns out of your main attribute column (see the config sketch below). It will speed up search and reduce the chances of hitting this issue.
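
As a hedged sketch only, assuming the 2.3 block options are named version and parquet_dedicated_columns and taking a large span attribute such as exception.stacktrace as a hypothetical candidate (check the vParquet3 docs for the exact schema), that last point could look like:

storage:
    trace:
        block:
            version: vParquet3
            parquet_dedicated_columns:
                - scope: span
                  name: exception.stacktrace    # hypothetical large attribute to promote
                  type: string

The analyse blocks output is what you would use to pick the attributes worth listing here.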

@gebn
Contributor Author

gebn commented Oct 20, 2023

Thanks, that all sounds positive. Should the parquet_row_group_size_bytes override be reverted when #3038 is merged, or do they work in tandem? I'll empty our bucket when 2.3 arrives - we're still in the PoC phase.

pull these large columns out of your main attribute column

Will this take the form of a set of attributes to not index/store differently? My worry is what happens when a new app decides to put long strings in a different attribute, and we crash or have latency issues until we hunt it down. Thinking out loud, a configured set of indexed attributes would be fine - we could publish that, along with length limits.

@joe-elliott
Member

joe-elliott commented Oct 20, 2023

Will this take the form of a set of attributes to not index/store differently? My worry is what happens when a new app decides to put long strings in a different attribute, and we crash or have latency issues until we hunt it down.

Yes. There is also a CLI command you can use to scan blocks and help you determine what your largest columns are. We are looking at ways to automate this in the future, but that won't be in 2.3.

tempo-cli analyse block ...
tempo-cli analyse blocks ...

cc @stoewer

@gebn
Contributor Author

gebn commented Oct 23, 2023

@joe-elliott That looks great - just waiting for the precompiled binaries to be available in 2.3!

@joe-elliott
Member

2.3-rc.0 is here, if you're interested in checking it out early: https://github.com/grafana/tempo/releases/tag/v2.3.0-rc.0

@gebn
Contributor Author

gebn commented Nov 13, 2023

No compactor crashes for the past 2h with v2.3, vParquet3, parquet_row_group_size_bytes: 10_000_000, and a new bucket! I'll keep an eye on it.

I'll also try to acquire the v2.3 tar.gz internally to run tempo-cli. This may take some time.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label Jan 13, 2024
@amirkkn

amirkkn commented Jan 17, 2024

We receive a similar error for half of the queries we run in Grafana, and it returns only a portion of the traces. We use an S3 bucket as the backend, and for our PoC we currently run a single querier pod.


tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-17T13:53:44.046250653Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"

We upgraded to the latest version, but it didn't help. We currently have no clues about the cause. Could you please guide me on this error? Thanks in advance!

@joe-elliott
Member

@amirkkn Some questions:

  • Are you using 2.3.1?
  • Can you share the querier logs?
  • Have you tried reducing row group size?
  • Have you run the analyse blocks command to determine if you have large attributes?

@gebn
Contributor Author

gebn commented Jan 17, 2024

Just confirming we haven't observed this since v2.3 and vParquet3, though it may be because we haven't run into #2731 since, so there are no corrupt blocks to trip up the querier.

@github-actions github-actions bot removed the stale label Jan 18, 2024
@amirkkn

amirkkn commented Jan 18, 2024

@amirkkn Some questions:

  • Are you using 2.3.1?
  • Can you share the querier logs?
  • Have you tried reducing row group size?
  • Have you run the analyse blocks command to determine if you have large attributes?

Yes, the running version is 2.3.1. I noticed the issue doesn't occur when we exclude the health-check tracing, which runs every 10s. We have Spring Boot applications running. I haven't analysed the blocks with the CLI yet. We actually send the same resource attributes from the OTel agent for all of them. I don't see any errors in the compactor or other components.

Here are some logs:

tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.680278145Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=green createdAt=2024-01-18T07:31:16.470071432Z metas=8 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.710239047Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=blue createdAt=2024-01-18T07:31:16.678207285Z metas=1 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.710278958Z caller=poller.go:131 msg="blocklist poll complete" seconds=0.167189891
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.697597907Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=green createdAt=2024-01-18T07:36:16.399216177Z metas=8 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.745481608Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=blue createdAt=2024-01-18T07:36:16.554437477Z metas=1 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.745523309Z caller=poller.go:131 msg="blocklist poll complete" seconds=0.202553383
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:15.316553765Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 3 errors: context canceled; context canceled; context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:15.316374961Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 3 errors: context canceled; context canceled; context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983094922Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983975799Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983830746Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 5 errors: context canceled; context canceled; context canceled; context canceled; context canceled"

@joe-elliott
Member

Perhaps understanding the content of the healthcheck would help. Can you post the logs of the panic?

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label Mar 19, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 3, 2024