Panic on block read in querier #2987
That is an interesting error. The value passed would either have to be negative or very large. Given the panic in the previous issue involving a nearly 4GB slice, I'm wondering if you have a write pattern that is testing the size of allowed dictionaries in parquet-go. I'd be interested in the results of this command (sketched below):
We use this command to help determine which fields to make dedicated columns in vParquet3. It would reveal if you have an attribute or attributes that are blowing up the size of your dictionaries.
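The command itself did not survive in this copy of the thread. A minimal sketch of the kind of invocation meant here, assuming the analyse blocks subcommand that tempo-cli gained around the 2.3 release; the backend flags are assumptions, so check tempo-cli analyse blocks --help for the exact options:

# Point tempo-cli at the same object-storage backend Tempo uses,
# then summarise which attribute columns are largest for a tenant.
# Flag names below are illustrative assumptions.
tempo-cli analyse blocks \
  --backend=s3 \
  --bucket=<your-tempo-bucket> \
  single-tenant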
Is there a binary release of tempo-cli anywhere? I'd like to help, but unfortunately I can't compile this due to AGPL.
The releases are here: https://github.com/grafana/tempo/releases. However, this command was added recently and won't be in 2.2.3. We are talking about cutting a 2.3 which you could use to run the analysis.
We have seen this issue internally now and are digging deeper.
Ok, this manifests itself differently on our side, but I believe this issue as well as #2731 are related to parquet-go/parquet-go#79. These occur when Tempo severely underestimates the size of your traces and writes very large row groups that overflow sizes in the parquet-go library. The most likely way this can happen is if Tempo is consuming traces with lots of longer string data in attributes. To alleviate this you should try reducing your target row group size:
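The configuration snippet that followed was not preserved here. A minimal sketch, assuming the parquet_row_group_size_bytes block option under the trace storage section (verify the exact key and default against the Tempo configuration docs for your version; the 10MB value is only an example):

storage:
  trace:
    block:
      # Default is roughly 100MB; lowering it caps how large a single
      # row group can grow before it is cut.
      parquet_row_group_size_bytes: 10000000  # ~10MB, illustrative value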
Thanks @joe-elliott. I've deployed this with a target row group size of 10MB and upgraded to 2.2.3 at the same time. I'll keep an eye on our ingesters.
Out of curiosity, do you have some very large or high-cardinality attribute values? This is what we've observed creating this issue in our cloud environment. I'd also check the meta.json of your blocks. You can divide the size by totalRecords to get a rough per-row-group size.
Definitely, some applications are fond of adding a stack trace attribute to error spans. Unfortunately, having everything in the trace UI is rather convenient, so clients have a subconscious bias against using logs, even when they would be more suitable. We definitely want to drop attributes over a certain length; however, limiting cardinality is much harder. Different teams may use a given non-semconv attribute in different ways. I took a look at a few blocks; here is one meta.json:
{
"format": "vParquet2",
"blockID": "4fe5d3b5-6def-44dc-a3c8-9b69e6dcc4a9",
"minID": "SVShJPNZyECpSsvIJ24fMg==",
"maxID": "///oyKrsAYnLGzK1kTGXqw==",
"tenantID": "single-tenant",
"startTime": "2023-10-17T20:16:01Z",
"endTime": "2023-10-17T20:41:18Z",
"totalObjects": 949402,
"size": 1050823950,
"compactionLevel": 2,
"encoding": "none",
"indexPageSize": 0,
"totalRecords": 3,
"dataEncoding": "",
"bloomShards": 12,
"footerSize": 48425
}
Yeah. I wouldn't ask your teams to make changes. It's just that high entropy data compresses poorly and would be more likely to trip this issue.
Yup. This is basically what we were seeing as well. If you are on defaults then you are configured for a 100MB row group size. Your block above works out to roughly 1050823950 bytes / 3 records ≈ 350MB per row group, which is obviously quite a bit larger than that. I think you're definitely experiencing the same thing. While we get a proper fix in, the following will help alleviate it:
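The snippet that followed is likewise missing; presumably it repeated the row-group-size reduction shown earlier. A short sketch under the same assumption about the option name, with a deliberately conservative example value:

storage:
  trace:
    block:
      # Example only: keep row groups well under the ~350MB observed above.
      parquet_row_group_size_bytes: 20000000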
Thanks, that all sounds positive. Should the
Will this take the form of a set of attributes to not index/store differently? My worry is what happens when a new app decides to put long strings in a different attribute, and we crash or have latency issues until we hunt it down. Thinking out loud, a configured set of indexed attributes would be fine - we could publish that, along with length limits.
Yes. There is also a CLI command you can use to scan blocks and help you determine what your largest columns are. We are looking at ways to automate this in the future, but that won't be in 2.3.
cc @stoewer |
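For reference, a configured set of dedicated attribute columns in vParquet3 could look roughly like the per-tenant override below. This is a sketch, not taken from the thread: the attribute names are invented for illustration, and the parquet_dedicated_columns key and its fields should be checked against the 2.3 overrides documentation.

overrides:
  "single-tenant":
    parquet_dedicated_columns:
      # scope can target span or resource attributes; names here are made up.
      - scope: span
        name: db.statement
        type: string
      - scope: resource
        name: k8s.pod.name
        type: string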
@joe-elliott That looks great - just waiting for the precompiled binaries to be available in 2.3!
2.3-rc.0 is here: https://github.com/grafana/tempo/releases/tag/v2.3.0-rc.0 if you're interested in checking it out early |
No compactor crashes for the past 2h with v2.3 and vParquet3. I'll also try to acquire the v2.3 tar.gz internally to run the tempo-cli analysis.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. |
We see a similar error for about half of the queries we run in Grafana, and those queries return only a portion of the traces. We use an S3 bucket as the backend, and for this POC we currently run a single querier pod.
We upgraded to the latest version, but it didn't help. We currently have no clues about the cause. Could you please guide us on this error? Thanks in advance!
@amirkkn Some questions:
Just confirming we haven't observed this since v2.3 and vParquet3, though it may be because we haven't run into #2731 since, so there are no corrupt blocks to trip up the querier.
Yes, the running version is 2.3.1. I noticed the issue doesn't occur when we exclude the health-check tracing, which runs every 10s. We have Spring Boot applications running. I haven't analysed anything with the CLI yet. We actually send the same resource attributes from the OTEL agent for all of them. I don't see any errors in the compactor or other components. Here are some logs:
Perhaps understanding the content of the healthcheck would help. Can you post the logs of the panic? |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. |
Describe the bug
This looks like another crash on corrupt data in segmentio/parquet-go, this time when reading.
Expected behavior
No crash, and an error to be returned.
Environment:
grafana/tempo:2.2.2 official Docker image
Additional Context