Panic on block write in compactor #2731

Open
gebn opened this issue Jul 31, 2023 · 2 comments
Labels
  • keepalive: Label to exempt Issues / PRs from stale workflow
  • type/bug: Something isn't working

Comments

@gebn
Contributor

gebn commented Jul 31, 2023

Describe the bug

Following on from #2690. This currently affects only our compactor, but it could also affect other components depending on volume. Presumably the input blocks are valid; otherwise they would have failed to write as well. 4294952965 is suspiciously close to 4 GiB (2^32 = 4294967296).

I think the first step here is to add recover logic to print out the offending input block IDs.
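
To illustrate the recover idea, here is a minimal sketch, not Tempo's actual code. `compactWithRecover` and `compactFunc` are hypothetical names standing in for whatever wraps the real compaction call; the sketch only shows how a deferred `recover` can turn the panic into an error that names the input blocks.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// compactFunc is a hypothetical stand-in for the real compaction call.
type compactFunc func(blockIDs []string) error

// compactWithRecover runs fn and, if it panics, returns an error that
// includes the input block IDs and the stack trace instead of crashing.
func compactWithRecover(blockIDs []string, fn compactFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic compacting blocks %v: %v\n%s", blockIDs, r, debug.Stack())
		}
	}()
	return fn(blockIDs)
}

func main() {
	// The block IDs here are placeholders for illustration.
	err := compactWithRecover([]string{"block-a", "block-b"}, func([]string) error {
		s := []byte("abc")
		i := 5
		_ = s[i:] // out-of-range slice, panics at runtime
		return nil
	})
	fmt.Println(err)
}
```

Whether to recover and continue or to log and re-panic is a separate design choice; re-panicking keeps the crash visible while still recording the block IDs.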

Environment:

  • Infrastructure: Kubernetes
  • Version: latest tag (cfb6e5429)

Additional Context

panic: runtime error: slice bounds out of range [4294952965:41691:]

goroutine 4841 [running]:
github.com/segmentio/parquet-go.(*byteArrayPage).index(...)
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:983
github.com/segmentio/parquet-go.(*byteArrayDictionary).lookupString(0xc04ec7e9a0?, {0xc04490e904?, 0xc014c801fb?, 0xc000de8a01?}, {{0xc04ec7eb90?, 0x1300000000?, 0xc014c80227?}})
        /drone/src/vendor/github.com/segmentio/parquet-go/dictionary_amd64.go:86 +0xfa
github.com/segmentio/parquet-go.(*byteArrayDictionary).Bounds(0xc0002b7880, {0xc044908000, 0x830b, 0x1e46e})
        /drone/src/vendor/github.com/segmentio/parquet-go/dictionary.go:764 +0x254
github.com/segmentio/parquet-go.(*indexedPage).Bounds(0xc004762a50)
        /drone/src/vendor/github.com/segmentio/parquet-go/dictionary.go:1283 +0x96
github.com/segmentio/parquet-go.(*repeatedPage).Bounds(0x20079e0?)
        /drone/src/vendor/github.com/segmentio/parquet-go/page.go:420 +0x69
github.com/segmentio/parquet-go.(*writerColumn).recordPageStats(0xc000862f00, 0x196ed?, 0xc018d8dbf0, {0x2aa0200, 0xc018f2ee10})
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:1312 +0xdc
github.com/segmentio/parquet-go.(*writerColumn).writeDataPage(0xc000862f00, {0x2aa0200, 0xc018f2ee10})
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:1213 +0x6e5
github.com/segmentio/parquet-go.(*writerColumn).flush(0xc000862f00)
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:951 +0xb2
github.com/segmentio/parquet-go.(*writerColumn).writeRows(0xc000862f00, {0xc010cec000?, 0x1312d0?, 0xc04ec7f5c0?})
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:1084 +0xc5
github.com/segmentio/parquet-go.(*writer).WriteRows.func1(0xc04ec7f700?, 0x1952a2b?)
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:716 +0x170
github.com/segmentio/parquet-go.(*writer).writeRows(0xc0007fac60, 0x1, 0xc04ec7f650)
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:758 +0xbb
github.com/segmentio/parquet-go.(*writer).WriteRows(0x10?, {0xc04ec7f6e8?, 0x3?, 0x0?})
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:695 +0x58
github.com/segmentio/parquet-go.(*Writer).WriteRows(...)
        /drone/src/vendor/github.com/segmentio/parquet-go/writer.go:157
github.com/segmentio/parquet-go.(*GenericWriter[...]).WriteRows(0xc018fe6e50?, {0xc04ec7f6e8?, 0x2aa1b60?, 0xc04ec7f700?})
        /drone/src/vendor/github.com/segmentio/parquet-go/writer_go18.go:176 +0x2b
github.com/grafana/tempo/tempodb/encoding/vparquet2.(*streamingBlock).AddRaw(0xc0002fee00, {0xc005c502d0, 0x10, 0x10}, {0xc04c73c000?, 0x8e, 0x2a91db8?}, 0x7da150?, 0xc0?)
        /drone/src/tempodb/encoding/vparquet2/create.go:148 +0x85
github.com/grafana/tempo/tempodb/encoding/vparquet2.(*Compactor).Compact(0xc000bad800, {0x2a873e8, 0xc0009ae780}, {0x2a6e200, 0xc000790af0}, {0x2a97b60, 0xc0007da140}, 0xc0003f4740, {0xc00078ade0, 0x4, ...})
        /drone/src/tempodb/encoding/vparquet2/compactor.go:174 +0xbca
github.com/grafana/tempo/tempodb.(*readerWriter).compact(0xc000badd40, {0x2a87340, 0xc0002aa5f0}, {0xc00078ade0?, 0x4, 0x4}, {0xc000c17e30, 0xd})
        /drone/src/tempodb/compactor.go:233 +0xa6a
github.com/grafana/tempo/tempodb.(*readerWriter).doCompaction(0xc000badd40, {0x2a87340, 0xc0002aa5f0})
        /drone/src/tempodb/compactor.go:143 +0x54c
github.com/grafana/tempo/tempodb.(*readerWriter).compactionLoop(0xc000badd40, {0x2a87340, 0xc0002aa5f0})
        /drone/src/tempodb/compactor.go:81 +0xd2
created by github.com/grafana/tempo/tempodb.(*readerWriter).EnableCompaction
        /drone/src/tempodb/tempodb.go:402 +0x30c
@joe-elliott
Member

ok, let's try to get our facts in line to help diagnose this.

  • You see regular panics on compaction writes?
  • The blocks that are being read at the time of the compaction panic are fine or corrupt?
  • The panic is deterministic? i.e. it happens every time that Tempo attempts to compact these blocks?

> I think the first step here is to add recover logic to print out the offending input block IDs.

Compactors only do one compaction job at a time. The offending block IDs would have been logged just before this panic.

> 4294952965 is suspiciously close to 4 GiB.

Good catch. I'm wondering if your compaction job is attempting to write a page that is larger than an internal max in parquet-go. Are you aware of anything exceptional about your data? Can you share your data so we can investigate?
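
The arithmetic behind that hunch can be sketched as follows. This assumes, without confirmation from the parquet-go source, that the two slice bounds in the panic message are 32-bit unsigned offsets. The lower bound 4294952965 sits 14331 below 2^32, and adding a value length of 56022 (a number chosen here only because it reproduces the upper bound in the message) wraps around to 41691.

```go
package main

import "fmt"

// nextOffset mimics computing the end offset of a value when offsets are
// stored as uint32. That storage type is an assumption, not a confirmed
// detail of parquet-go.
func nextOffset(start, length uint32) uint32 {
	return start + length // wraps modulo 2^32 on overflow
}

func main() {
	const lower uint32 = 4294952965 // lower bound from the panic message
	const twoTo32 = 1 << 32

	// Distance between the lower bound and 2^32.
	fmt.Println(twoTo32 - int64(lower)) // 14331

	// A hypothetical 56022-byte value starting at that offset wraps to the
	// upper bound seen in the panic message.
	fmt.Println(nextOffset(lower, 56022)) // 41691
}
```

If this is what happened, the page's value buffer grew past 4 GiB before the flush, which would be consistent with the suggestion that the page exceeds an internal maximum.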

@github-actions
Contributor

github-actions bot commented Oct 1, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label Oct 1, 2023
@github-actions github-actions bot closed this as not planned Oct 17, 2023
@joe-elliott joe-elliott added the type/bug and keepalive labels and removed the stale label Oct 17, 2023
@joe-elliott joe-elliott reopened this Oct 17, 2023