Sidecar and Compact: Out of order block uploads lead to overlaps and Compact critical errors #2753
Comments
Interestingly, I think we upload blocks in the order they are presented in the filesystem, so oldest first, but we might want to double-check. 👍 We will look into this this week, thanks. If you have any further pointers on how to "reliably" reproduce overlaps, let us know:
This sounds like an amazing pointer; it will help us a lot.
As per Line 417 in 748740d
I improved readability a bit here: #2764
Yes, that's the plan. I think there is an issue about it.
Ah, this is actually the root cause indeed. It's here: Line 315 in 748740d
This is because if a single block cannot be uploaded, we still want at least a partial upload. I can agree this is not the best decision as Will fix on master
In the meantime you can enable vertical compaction, as it has worked well so far (although it is experimental). As a hack solution you can do it now in 0.12.2 with
Fixed #2765
It should be part of In terms of ULID for partial uploads, we have this ticket: #2470
Thanos, Prometheus and Golang version used:
thanos, version 0.12.2 (branch: HEAD, revision: 52e10c6)
build user: root@c1a6cf60f03d
build date: 20200430-16:24:03
go version: go1.13.6
prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
build user: root@2117a9e64a7e
build date: 20200507-16:51:47
go version: go1.14.2
Object Storage Provider:
Google Cloud Storage
What happened:
Compact ran into an overlapped block error:
Looking at the source system, it appears that it was isolated from the internet for several days due to an upstream network failure. Once the network was accessible again, it began uploading blocks, but failed on 01E9XB4ECJ3NPQ08PBS7XHAXZ6 at 07:52:19. That block was later uploaded at 08:05:51:
In between the first upload failure of 01E9XB4ECJ3NPQ08PBS7XHAXZ6 and the retry, compact compacted the blocks that were present:
Compact is configured with
--consistency-delay=2h --delete-delay=6h
It appears that a transient upload error on one block, among many blocks that were several days old, resulted in a race between the uploading sidecar and compact: a compacted block was formed before all of the blocks that should have been included in it were successfully uploaded.
This overlap issue seems to be at odds with the documented expectation in https://thanos.io/operating/troubleshooting.md/#overlaps
What you expected to happen:
I would expect compact to wait for consistency-delay after a block was uploaded to the object store before making it available for compaction. It appears to rely on the ULID to calculate age rather than the object creation time of metadata.json, which is problematic when the block's age does not correspond to its age in the object store.
How to reproduce it (as minimally and precisely as possible):
Set up thanos compact to run against an objstore bucket. Set up a prometheus install with several hours of uncompacted blocks already present. Move one block out of the prometheus data directory. Start thanos sidecar so that it uploads all existing blocks. Watch compact compact the several-hours-old blocks. Move the reserved block back into the prometheus directory and restart thanos sidecar. The sidecar should then upload the block, and compact will halt.
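Because the report hinges on the ULID carrying the block's creation time, it may help to see how that time can be read back from a block ID. The sketch below decodes the timestamp by hand using the ULID layout (the first 10 Crockford base32 characters hold a 48-bit Unix time in milliseconds); `ulidTimestampMillis` is a hypothetical helper written for this example, not a Thanos or library function.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// crockford is the Crockford base32 alphabet used by ULIDs.
const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// ulidTimestampMillis decodes the first 10 characters of a 26-character ULID,
// which encode a Unix timestamp in milliseconds. The remaining 16 characters
// are randomness and are ignored here.
func ulidTimestampMillis(id string) (int64, error) {
	if len(id) != 26 {
		return 0, fmt.Errorf("ULID must be 26 characters, got %d", len(id))
	}
	var ms int64
	for _, c := range strings.ToUpper(id[:10]) {
		v := strings.IndexRune(crockford, c)
		if v < 0 {
			return 0, fmt.Errorf("invalid ULID character %q", c)
		}
		ms = ms<<5 | int64(v) // each character contributes 5 bits
	}
	return ms, nil
}

func main() {
	// The block ID from the report above.
	ms, err := ulidTimestampMillis("01E9XB4ECJ3NPQ08PBS7XHAXZ6")
	if err != nil {
		panic(err)
	}
	fmt.Println(time.UnixMilli(ms).UTC())
}
```

For the block in this report the decoded time is 2020-06-03 15:00:00.018 UTC, which reflects when the block was created on the source system and says nothing about when it reached the bucket.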