Panic on block read in querier #2987

Closed
gebn opened this issue Oct 5, 2023 · 21 comments

@gebn
Contributor

gebn commented Oct 5, 2023

Describe the bug
This looks like another crash on corrupt data in segmentio/parquet-go, this time when reading.

grafana/tempo:2.2.2 official Docker image

Expected behavior
No crash, and an error to be returned.

Environment:

  • Infrastructure: Kubernetes

Additional Context

level=error ts=2023-10-05T10:33:17.447036816Z caller=frontend_processor.go:69 msg="error processing requests" address=11.2.26.104:9095 err="rpc error: code = Canceled desc = 5 errors: context canceled; context canceled; context canceled; context canceled; context canceled"
level=info ts=2023-10-05T10:33:17.449977566Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=9d22b20e-cf15-4d3f-b6f5-ee6ab9f6ea63 found=false
level=info ts=2023-10-05T10:33:17.450423773Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=1e62336e-9419-4894-85ec-2671ca9236e0 found=false
level=info ts=2023-10-05T10:33:17.453207662Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=70c1e0dc-412f-4145-a52c-5670fe6f5c21 found=false
level=info ts=2023-10-05T10:33:17.453711417Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6efa857f-1ab7-45bf-ac4a-4f468e19bf3a found=false
level=info ts=2023-10-05T10:33:17.453966892Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6e3ff854-1408-46e8-9b2b-9e32150eead1 found=false
level=info ts=2023-10-05T10:33:17.454321399Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6a1fec72-36f2-4b3c-bdac-4ce1b205762e found=false
level=info ts=2023-10-05T10:33:17.454608092Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=6eba7ea1-452f-4357-81dc-02478d60b697 found=false
level=info ts=2023-10-05T10:33:17.45491022Z caller=tempodb.go:331 org_id=single-tenant msg="searching for trace in block" findTraceID=6a1a9539a8780720e0a4efc956af861b block=722bf777-4d57-408b-b03b-473aff1e16de found=false
panic: runtime error: makeslice: len out of range

goroutine 12665 [running]:
github.com/segmentio/parquet-go.(*bufferPool).newBuffer(0x3cec060, 0xffffffffb7e70dc5, 0x1000)
        /drone/src/vendor/github.com/segmentio/parquet-go/buffer.go:340 +0x33
github.com/segmentio/parquet-go.(*bufferPool).get(0x3cec060?, 0xffffffffb7e70dc5?)
        /drone/src/vendor/github.com/segmentio/parquet-go/buffer.go:362 +0xdf
github.com/segmentio/parquet-go.(*Column).decompress(0xc0001aab60, {0xc00a800000, 0x224ed43a, 0x22a359ed}, 0x3b0bba8?)
        /drone/src/vendor/github.com/segmentio/parquet-go/column.go:515 +0x3e
github.com/segmentio/parquet-go.(*Column).decodeDictionary(0xc0001aab60, {0x30?}, 0x2371300?, 0x3b0bcd8?)
        /drone/src/vendor/github.com/segmentio/parquet-go/column.go:744 +0xfb
github.com/segmentio/parquet-go.(*filePages).readDictionaryPage(0xc0011683c0, 0xc002316690?, 0xc002316690?)
        /drone/src/vendor/github.com/segmentio/parquet-go/file.go:615 +0x3f
github.com/segmentio/parquet-go.(*filePages).ReadPage(0xc0011683c0)
        /drone/src/vendor/github.com/segmentio/parquet-go/file.go:550 +0x205
github.com/segmentio/parquet-go.(*multiPages).ReadPage(0xc00153f5a0)
        /drone/src/vendor/github.com/segmentio/parquet-go/multi_row_group.go:230 +0x3f
github.com/segmentio/parquet-go.readPages({0x2a330c0, 0xc00153f5a0}, 0xc000ba2a80, 0xc002075030, 0xc002118e40)
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:208 +0xc9
created by github.com/segmentio/parquet-go.(*asyncPages).init
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:150 +0x16a
@joe-elliott
Member

joe-elliott commented Oct 6, 2023

panic: runtime error: makeslice: len out of range

is an interesting error. The value passed would have to be either negative or very large. Given the panic in the previous issue involving a nearly 4 GB slice, I'm wondering if you have a write pattern that is testing the limits of the allowed dictionary sizes in parquet-go.
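For illustration only (this is not Tempo or parquet-go code), here is a minimal Go sketch of how a corrupt length produces exactly this panic; the second argument in the newBuffer frame above, 0xffffffffb7e70dc5, looks like a negative value when read as a signed integer:

package main

// Hypothetical stand-in for a length decoded from corrupt block data.
// Passing a negative length to make() panics at runtime with
// "panic: runtime error: makeslice: len out of range", as in the trace above.
func main() {
	badLen := -1
	_ = make([]byte, badLen)
}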

I'd be interested in the results of this command:

go run ./cmd/tempo-cli analyse blocks <tenant id> --backend="???" --bucket="???"

We use this command to help determine which fields to make dedicated columns in vParquet3. It would reveal if you have an attribute or attributes that are blowing up the size of your dictionaries.
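For instance, an illustrative invocation using the tempo-cli binary from a release tarball (the tenant ID is taken from the logs above; the bucket name is a placeholder):

./tempo-cli analyse blocks single-tenant --backend="s3" --bucket="<your-tempo-bucket>"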

@gebn
Contributor Author

gebn commented Oct 6, 2023

Is there a binary release of tempo-cli anywhere? I'd like to help, but unfortunately I can't compile this due to AGPL.

@joe-elliott
Member

The .tar.gz archives on our release page contain a compiled copy of tempo-cli:

https://github.com/grafana/tempo/releases

However, this command was added recently and won't be in 2.2.3. We are talking about cutting a 2.3 release, which you could use to run the analysis.

@joe-elliott
Member

We have seen this issue internally now and are digging deeper.

@stoewer stoewer assigned ie-pham and stoewer and unassigned stoewer and ie-pham Oct 17, 2023
@joe-elliott
Member

Ok, this manifests itself differently on our side, but I believe this issue as well as #2731 are related to: parquet-go/parquet-go#79

These occur when Tempo severely underestimates the size of your traces and writes very large row groups that overflow sizes in the parquet-go library. The most likely way this can happen is if Tempo is consuming traces with lots of long string data in attributes.

To alleviate this you should try reducing your target row group size:

storage:
    trace:
        block:
            parquet_row_group_size_bytes: 100000000  # 100 MB default

@gebn
Contributor Author

gebn commented Oct 19, 2023

Thanks @joe-elliott. I've deployed this with a target row group size of 10 MB and upgraded to 2.2.3 at the same time. I'll keep an eye on our ingesters.

@joe-elliott
Member

Out of curiosity, do you have some very large or high-cardinality attribute values? That is what we've observed creating this issue in our cloud environment.

I'd also check the meta.json of your blocks. You can divide the size field by the totalRecords field to determine the average row group size. If the issue is similar to what we're seeing, you'll notice that the actual average row group size is significantly larger than the configured target size.
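As a rough sketch of that division (not part of tempo-cli; it assumes a locally downloaded meta.json with the size and totalRecords fields):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal sketch: read a block's meta.json and print size/totalRecords,
// i.e. the average row group size.
type blockMeta struct {
	Size         int64 `json:"size"`
	TotalRecords int64 `json:"totalRecords"`
}

func main() {
	raw, err := os.ReadFile("meta.json") // path is illustrative
	if err != nil {
		panic(err)
	}
	var m blockMeta
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	if m.TotalRecords == 0 {
		panic("totalRecords is zero")
	}
	fmt.Printf("average row group size: %d bytes (%.0f MB)\n",
		m.Size/m.TotalRecords, float64(m.Size)/float64(m.TotalRecords)/1e6)
}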

@gebn
Contributor Author

gebn commented Oct 19, 2023

Definitely, some applications are fond of adding a stack trace attribute to error spans. Unfortunately, having everything in the trace UI is rather convenient, so clients have a subconscious bias against using logs, even when they would be more suitable. We definitely want to drop attributes over a certain length; however, limiting cardinality is much harder. Different teams may use a given non-semconv attribute in different ways.

I took a look at a few meta.jsons in our bucket. The result of that division ranged from 155MB to 350MB. Here's the 350MB block:

{
  "format": "vParquet2",
  "blockID": "4fe5d3b5-6def-44dc-a3c8-9b69e6dcc4a9",
  "minID": "SVShJPNZyECpSsvIJ24fMg==",
  "maxID": "///oyKrsAYnLGzK1kTGXqw==",
  "tenantID": "single-tenant",
  "startTime": "2023-10-17T20:16:01Z",
  "endTime": "2023-10-17T20:41:18Z",
  "totalObjects": 949402,
  "size": 1050823950,
  "compactionLevel": 2,
  "encoding": "none",
  "indexPageSize": 0,
  "totalRecords": 3,
  "dataEncoding": "",
  "bloomShards": 12,
  "footerSize": 48425
}
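
That works out to 1,050,823,950 bytes across totalRecords=3, i.e. roughly 350 MB per row group, versus the 100 MB default target.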

@joe-elliott
Member

however limiting cardinality is much harder.

Yeah. I wouldn't ask your teams to make changes. It's just that high entropy data compresses poorly and would be more likely to trip this issue.

Here's the 350MB block:

Yup. This is basically what we were seeing as well. If you are on defaults, then you are configured for a 100 MB row group size; your actual row groups are obviously quite a bit larger than that. I think you're definitely experiencing the same thing.

While we get a proper fix in, the following will help alleviate it:

  • Reduce the configured row group size until your actual row groups average 50 to 100 MB
  • 2.3 (coming soon) should include the hacky fix linked, as well as (hopefully) the parquet-go fix, which will prevent Tempo from writing these corrupt blocks
  • Once 2.3 comes out, I'd also turn on vParquet3 and pull these large columns out of your main attribute column (see the config sketch below). It will speed up search and reduce the chances of hitting this issue.
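
As a hedged sketch only, assuming the 2.3 block options are named version and parquet_dedicated_columns and taking a large span attribute such as exception.stacktrace as a hypothetical candidate (check the vParquet3 docs for the exact schema), that last point could look like:

storage:
    trace:
        block:
            version: vParquet3
            parquet_dedicated_columns:
                - scope: span
                  name: exception.stacktrace    # hypothetical large attribute to promote
                  type: string

The analyse blocks output is what you would use to pick the attributes worth listing here.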

@gebn
Contributor Author

gebn commented Oct 20, 2023

Thanks, that all sounds positive. Should the parquet_row_group_size_bytes override be reverted when #3038 is merged, or do they work in tandem? I'll empty our bucket when 2.3 arrives - we're still in the PoC phase.

pull these large columns out of your main attribute column

Will this take the form of a set of attributes to not index/store differently? My worry is what happens when a new app decides to put long strings in a different attribute, and we crash or have latency issues until we hunt it down. Thinking out loud, a configured set of indexed attributes would be fine - we could publish that, along with length limits.

@joe-elliott
Member

joe-elliott commented Oct 20, 2023

Will this take the form of a set of attributes to not index/store differently? My worry is what happens when a new app decides to put long strings in a different attribute, and we crash or have latency issues until we hunt it down.

Yes. There is also a CLI command you can use to scan blocks and help you determine what your largest columns are. We are looking at ways to automate this in the future, but that won't be in 2.3.

tempo-cli analyse block ...
tempo-cli analyse blocks ...

cc @stoewer

@gebn
Contributor Author

gebn commented Oct 23, 2023

@joe-elliott That looks great - just waiting for the precompiled binaries to be available in 2.3!

@joe-elliott
Member

2.3-rc.0 is here, if you're interested in checking it out early: https://github.com/grafana/tempo/releases/tag/v2.3.0-rc.0

@gebn
Contributor Author

gebn commented Nov 13, 2023

No compactor crashes for the past 2h with v2.3, vParquet3, parquet_row_group_size_bytes: 10_000_000, and a new bucket! I'll keep an eye on it.

I'll also try to acquire the v2.3 tar.gz internally to run tempo-cli. This may take some time.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label Jan 13, 2024
@amirkkn

amirkkn commented Jan 17, 2024

We receive a similar error for half of the queries we run in Grafana, and it returns only a portion of the traces. We use an S3 bucket as the backend, and for our PoC we currently run a single querier pod.


tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-17T13:53:44.046250653Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"

We upgraded to the latest version, but it didn't help. We currently have no clues about the cause. Could you please guide me on this error? Thanks in advance!

@joe-elliott
Member

@amirkkn Some questions:

  • Are you using 2.3.1?
  • Can you share the querier logs?
  • Have you tried reducing row group size?
  • Have you run the analyse blocks command to determine if you have large attributes?

@gebn
Contributor Author

gebn commented Jan 17, 2024

Just confirming we haven't observed this since v2.3 and vParquet3, though it may be because we haven't run into #2731 since, so there are no corrupt blocks to trip up the querier.

@github-actions github-actions bot removed the stale label Jan 18, 2024
@amirkkn

amirkkn commented Jan 18, 2024

@amirkkn Some questions:

  • Are you using 2.3.1?
  • Can you share the querier logs?
  • Have you tried reducing row group size?
  • Have you run the analyse blocks command to determine if you have large attributes?

Yes, the running version is 2.3.1. I noticed the issue doesn't occur when we exclude the health-check tracing, which runs every 10s. We have Spring Boot applications running. I haven't analysed the blocks with the CLI yet. We actually send the same resource attributes from the OTel agent for all of them. I don't see any errors in the compactor or other components.

Here are some logs:

tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.680278145Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=green createdAt=2024-01-18T07:31:16.470071432Z metas=8 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.710239047Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=blue createdAt=2024-01-18T07:31:16.678207285Z metas=1 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:35:14.710278958Z caller=poller.go:131 msg="blocklist poll complete" seconds=0.167189891
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.697597907Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=green createdAt=2024-01-18T07:36:16.399216177Z metas=8 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.745481608Z caller=poller.go:194 msg="successfully pulled tenant index" tenant=blue createdAt=2024-01-18T07:36:16.554437477Z metas=1 compactedMetas=0
tracing-tempo-querier-6597bb54fc-xlm8q querier level=info ts=2024-01-18T07:40:14.745523309Z caller=poller.go:131 msg="blocklist poll complete" seconds=0.202553383
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:15.316553765Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 3 errors: context canceled; context canceled; context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:15.316374961Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 3 errors: context canceled; context canceled; context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983094922Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983975799Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = context canceled"
tracing-tempo-querier-6597bb54fc-xlm8q querier level=error ts=2024-01-18T07:42:20.983830746Z caller=frontend_processor.go:69 msg="error processing requests" address=10.x.x.x:9095 err="rpc error: code = Canceled desc = 5 errors: context canceled; context canceled; context canceled; context canceled; context canceled"

@joe-elliott
Member

Perhaps understanding the content of the healthcheck would help. Can you post the logs of the panic?

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label Mar 19, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 3, 2024