Tempo 2.3.0, Ingester: panic: runtime error: slice bounds out of range [4294966737:509:] #3177
Something like issues #2987, #3038, #2690, #2804 @joe-elliott |
Interesting. It seems you both tripped the old issue and found a new (similar) one:
Hopefully this is due to similar circumstances and we can fix it using similar techniques. Previously this occurred when an attribute had a lot of very large, distinct values. Things like ...

Re-reading your log messages above, I'm seeing a few failed flushes and then the panic. My guess is that you were tripping the original bug when flushing in-memory traces to the WAL. This was causing the ingester to attempt larger and larger flushes to the WAL, which results in the second overflow issue.

To research and understand better:
To fix:
Thanks for the report! Please let us know how the analysis/fix above works. We should definitely still work to get the bug fixed but hopefully this will make things work for you in the short term. |
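Judging from the reply below, the mitigation suggested here centered on shrinking the Parquet row group size so that each column's dictionary page stays small. A minimal sketch of that knob, assuming the reporter's config layout; the 10MB value is only illustrative (their original config used ~100MB):

```yaml
storage:
  trace:
    block:
      version: vParquet3
      # smaller row groups -> smaller per-column dictionary pages
      parquet_row_group_size_bytes: 10000000
```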
Hi Joe @joe-elliott, I tried configuring the row group size to 0.1MB, 1MB, 10MB, and 1GB while reducing trace_idle_period and flush_check_period to 1s, but it doesn't work. In our test, the max size of one attribute is 1KB and one trace contains about 100 attributes. Could you please give some other advice? Another thing: I use S3 as storage, but I can't find the WAL dir either locally or in S3. Is that expected? Will this bug be fixed in Tempo 2.3.1 or in 2.3.0, and when will it be released? |
- Can you use the tempo-cli as suggested above?
- Are you cutting any blocks successfully, or is Tempo constantly failing with the errors above?
- Can you investigate your traces for anything that would seem exceptional?
- Can you share a sample trace?
- What volume are you operating at?
There should not be a wal folder in S3, but by default the ingester should create one at
Since we have not identified the issue, I can't give any time frame on fixing it. |
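As a point of reference (a sketch, not taken from this thread's config): the WAL always lives on the ingester's local disk even when the backend is S3, and its location is configured under storage.trace.wal. The /var/tempo/wal path matches what the reporter later finds inside the container:

```yaml
storage:
  trace:
    backend: s3            # completed blocks are written to object storage
    wal:
      path: /var/tempo/wal # local disk inside the ingester pod, never in S3
```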
I use the same trace data to carry out my test. When QPS is low this bug doesn't happen; when QPS rises to 30k/s the panic happens.
My test trace:
I can find the /var/tempo/wal dir in the ingester container, but it does not exist on my server when using S3 as storage. When I use local storage, this dir can be found on my server. Maybe some config is missing in my value.yaml.
OK, waiting for your good news. |
30k/s is quite low. I'm sure there is something exceptional about the test trace you are repeatedly sending, but I need to see it to help.
I'm unsure what you mean by server, but it does sound like the correct wal folder is present in the container. Can you share a complete file listing of the block folder in the wal? There should be a bunch of files named sequentially like
Can you share the complete test trace? Can you share a block that was cut successfully?

Questions from before:
- Can you use the tempo-cli as suggested above?
- Are you cutting any blocks successfully, or is Tempo constantly failing with the errors above?
- Can you investigate your traces for anything that would seem exceptional?

We are not seeing this issue and I am not going to be able to help you unless you can get some of this information back to us. |
This is the most important part of my test code; I think it will help you. It's written in Go.

```go
// testStr: large random-ish string used as an attribute value (~1KB per attribute, per the thread)
var testStr = ...

// childSpan recursively creates child spans carrying the test attributes; body omitted from the paste
func childSpan(ctx context.Context, depth int) {
}
```
/var/tempo/wal $ ls 76c26e49-976e-495f-a6a8-7e1ec9a66dff+single-tenant+vParquet3/ -al
/var/tempo/wal $ ls blocks/single-tenant/0a3b4ba0-b738-497e-9900-e6615a6b577a/ -al
I think most blocks are cut successfully. |
So you are writing 50-span traces, and each span has 3KB+ of random-ish attributes. This is almost certainly hitting a different dictionary buffer under/overflow. Try adding dedicated columns for these attributes: https://grafana.com/docs/tempo/latest/operations/dedicated_columns/ Your block size is large, but not that problematic. The real issue is that your test pattern is blowing up the parquet dictionaries. |
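A sketch of what such a configuration could look like, following the structure described on the linked docs page; the attribute names aaa, bbb, ccc are taken from the follow-up reply below, and the exact keys should be verified against the docs for your Tempo version:

```yaml
storage:
  trace:
    block:
      version: vParquet3
      parquet_dedicated_columns:
        - name: aaa   # attribute names from the reporter's synthetic test
          type: string
          scope: span
        - name: bbb
          type: string
          scope: span
        - name: ccc
          type: string
          scope: span
```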
Hi Joe, I tried adding aaa, bbb, ccc to dedicated columns, but the panic still happens~
storage: global_overrides:
level=info ts=2023-11-30T12:58:20.950741563Z caller=flush.go:253 msg="completing block" userid=single-tenant blockID=e7455dbd-140f-4c27-8122-d9c5c3f9172a goroutine 4499 [running]:
/var/tempo/wal $ ls a5e51aaa-a485-4337-b297-b108c8f0d81d+single-tenant+vParquet3/ -al
/var/tempo/wal-bos/blocks/single-tenant $ ls a5e51aaa-a485-4337-b297-b108c8f0d81d/ -al |
Reviewing your log messages, I believe that a recent fix to prevent integer overflow in dictionaries is preventing complete traces from being written to the wal:
This causes the complete traces to continue to build until a different overflow is hit, which results in the observed panic. We should fix this issue, but it is definitely only triggered by your abnormal, synthetic workload. Try reducing the following settings and see if it's fixed:
This will force tempo to more aggressively flush traces and possibly prevent the dictionaries from growing too large. cc @stoewer wdyt? |
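From the follow-up reply, the settings in question appear to be the ingester's trace_idle_period and flush_check_period, which both default to 10s. A hedged sketch of lowering them; the 5s values are only illustrative:

```yaml
ingester:
  trace_idle_period: 5s   # cut a trace sooner after it stops receiving spans
  flush_check_period: 5s  # sweep instances for cuttable traces more often
```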
I think flush_check_period and trace_idle_period both have a default value of 10s... should I configure them again? |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. |
Describe the bug
Logs from the Tempo Ingester
```
level=warn ts=2023-11-23T14:31:28.732522384Z caller=tcp_transport.go:440 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.233.92.22:7946 err="dial tcp 10.233.92.22:7946: i/o timeout"
level=warn ts=2023-11-23T14:31:33.733599827Z caller=tcp_transport.go:440 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.233.92.22:7946 err="dial tcp 10.233.92.22:7946: i/o timeout"
level=error ts=2023-11-23T14:32:29.950035793Z caller=flush.go:165 org_id=single-tenant msg="failed to cut traces" err="error closing writer: writing dictionary page of row group colum 48: page size limit exceeded: 2546705126>2147483647"
level=error ts=2023-11-23T14:32:58.767688474Z caller=flush.go:165 org_id=single-tenant msg="failed to cut traces" err="error closing writer: writing dictionary page of row group colum 48: page size limit exceeded: 3489495148>2147483647"
panic: runtime error: slice bounds out of range [4294966737:509:]
goroutine 897 [running]:
github.com/parquet-go/parquet-go.(*byteArrayPage).index(...)
/drone/src/vendor/github.com/parquet-go/parquet-go/page.go:983
github.com/parquet-go/parquet-go.(*byteArrayDictionary).lookupString(0x201ae05?, {0xc000804c04?, 0x201e815?, 0xb05554?}, {{0xc3f9b8acf8?, 0x425425?, 0x0?}})
/drone/src/vendor/github.com/parquet-go/parquet-go/dictionary_amd64.go:86 +0xff
github.com/parquet-go/parquet-go.(*byteArrayDictionary).Bounds(0xc0005c5650, {0xc000804a00, 0xe1, 0x19e})
/drone/src/vendor/github.com/parquet-go/parquet-go/dictionary.go:764 +0x235
github.com/parquet-go/parquet-go.(*indexedPage).Bounds(0xc041374390)
/drone/src/vendor/github.com/parquet-go/parquet-go/dictionary.go:1283 +0x8e
github.com/parquet-go/parquet-go.(*repeatedPage).Bounds(0xc3f9b8b368?)
/drone/src/vendor/github.com/parquet-go/parquet-go/page.go:420 +0x5d
github.com/parquet-go/parquet-go.(*writerColumn).recordPageStats(0xc000284500, 0x2ea?, 0xc2006b7170, {0x2dc6300, 0xc200e3e230})
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:1560 +0xce
github.com/parquet-go/parquet-go.(*writerColumn).writeDataPage(0xc000284500, {0x2dc6300, 0xc200e3e230})
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:1458 +0x70b
github.com/parquet-go/parquet-go.(*writerColumn).flush(0xc000284500)
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:1193 +0xa5
github.com/parquet-go/parquet-go.(*GenericWriter[...]).Write.func1(0xc3f9b8b750?)
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:183 +0xe5
github.com/parquet-go/parquet-go.(*writer).writeRows(0xc04124a420, 0x1, 0xc3f9b8b748)
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:1000 +0xac
github.com/parquet-go/parquet-go.(*GenericWriter[...]).Write(0x655f627f655f627f?, {0xc2005b3790?, 0x1f4?, 0xc200dca360?})
/drone/src/vendor/github.com/parquet-go/parquet-go/writer.go:175 +0x5b
github.com/grafana/tempo/tempodb/encoding/vparquet3.(*walBlock).AppendTrace(0xc0420ee180, {0xc2e4aafc00, 0x10, 0x1f4}, 0x1990105?, 0xf9b8bb10?, 0xc3?)
/drone/src/tempodb/encoding/vparquet3/wal_block.go:338 +0x117
github.com/grafana/tempo/tempodb/encoding/vparquet3.(*walBlock).Append(0xc0420ee180, {0xc2e4aafc00, 0x10, 0x1f4}, {0xc261c10000?, 0x11856?, 0x0?}, 0x0?, 0x0?)
/drone/src/tempodb/encoding/vparquet3/wal_block.go:325 +0xc5
github.com/grafana/tempo/modules/ingester.(*instance).writeTraceToHeadBlock(0xc04124a2c0, {0xc2e4aafc00, 0x10, 0x1f4}, {0xc261c10000, 0x11856, 0x11856}, 0x655f6298?, 0x655f6298?)
/drone/src/modules/ingester/instance.go:555 +0x118
github.com/grafana/tempo/modules/ingester.(*instance).CutCompleteTraces(0xc04124a2c0, 0x44e390?, 0x4?)
/drone/src/modules/ingester/instance.go:243 +0x21f
github.com/grafana/tempo/modules/ingester.(*Ingester).sweepInstance(0xc04173e000, 0xc04124a2c0, 0x0)
/drone/src/modules/ingester/flush.go:163 +0x45
github.com/grafana/tempo/modules/ingester.(*Ingester).sweepAllInstances(0xc041223ea0?, 0x44?)
/drone/src/modules/ingester/flush.go:157 +0x45
github.com/grafana/tempo/modules/ingester.(*Ingester).loop(0xc04173e000, {0x2daa640, 0xc041622280})
/drone/src/modules/ingester/ingester.go:138 +0x95
github.com/grafana/dskit/services.(*BasicService).main(0xc0410f9860)
/drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:190 +0x1b4
created by github.com/grafana/dskit/services.(*BasicService).StartAsync.func1 in goroutine 816
/drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:119 +0x105
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment:
Additional Context
Config like:
```yaml
storage:
  trace:
    block:
      version: vParquet3
      parquet_row_group_size_bytes: 100000000
global_overrides:
  ingestion_rate_limit_bytes: 2000000000
  max_traces_per_user: 100000
  max_bytes_per_trace: 200000000
```