
compactor: Irregular compaction and downsampling #7127

Open
PrayagS opened this issue Feb 9, 2024 · 9 comments
@PrayagS

PrayagS commented Feb 9, 2024

I'm seeing irregular behavior in the compactor.

In particular, I see the following inconsistencies:

  • Compaction is not uniform; block ranges appear to be chosen at random.
  • Blocks are not being downsampled to 1h resolution.

Version: v0.32.5
Configuration:

        - compact
        - --wait
        - --log.level=info
        - --log.format=logfmt
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --data-dir=/var/thanos/compact
        - --retention.resolution-raw=30d
        - --retention.resolution-5m=90d
        - --retention.resolution-1h=180d
        - --delete-delay=48h
        - --compact.cleanup-interval=30m
        - --compact.concurrency=6
        - --downsample.concurrency=6
        - --block-files-concurrency=60
        - --compact.blocks-fetch-concurrency=60
        - --block-meta-fetch-concurrency=60
        - --deduplication.replica-label=prometheus_replica
        - --debug.max-compaction-level=4
        - |-
          --selector.relabel-config=
            - action: keep
              source_labels: ["prometheus"]
              regex: "monitoring/prom-op-kube-prometheus-st-prometheus"

Resources assigned:

        resources:
          requests:
            cpu: 6500m
            memory: 40Gi

There are no errors in the log, only warnings which say:

empty chunks happened, skip series

I could find similar issues (#3711, #3721) but no recent activity there.

Attaching a screenshot of the current status of all the blocks,
[screenshot: block status overview]

@GiedriusS
Member

How large are your index files?

@PrayagS
Author

PrayagS commented Feb 13, 2024

@GiedriusS The size of index file for fresh blocks from the sidecar is ~2GB.

Once they're compacted into 2d blocks, the index size becomes ~20GB.

@GiedriusS
Member

Yeah, that might be a problem. Maybe you could do a recursive listing of the files and check whether you have any no-compaction marks in your remote object storage?
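For anyone following along: Thanos writes a `no-compact-mark.json` file alongside a block's `meta.json` when the block is excluded from compaction, so markers can be spotted by filtering a recursive listing of the bucket. A minimal sketch (the listing below is made-up example data; in practice, substitute the output of your storage provider's recursive-list command):

```python
# Sketch: find blocks carrying a no-compaction marker in a recursive
# object-storage listing. The block ULIDs here are fabricated examples.
listing = """\
01EXAMPLEBLOCKAAAAAAAAAAAA/meta.json
01EXAMPLEBLOCKAAAAAAAAAAAA/index
01EXAMPLEBLOCKBBBBBBBBBBBB/meta.json
01EXAMPLEBLOCKBBBBBBBBBBBB/no-compact-mark.json
"""

# Collect the block IDs (the first path segment) of every object whose
# name ends with the Thanos no-compaction marker filename.
marked = sorted(
    {line.split("/")[0]
     for line in listing.splitlines()
     if line.endswith("no-compact-mark.json")}
)
print(marked)  # ['01EXAMPLEBLOCKBBBBBBBBBBBB']
```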

@PrayagS
Author

PrayagS commented Feb 14, 2024

@GiedriusS Thanks a lot for pointing that out. I had looked around for deletion markers but missed the no compaction markers.

A good number of my blocks have that marker because the compacted block's index size would exceed 64GB (issue #1424).

What's the fix here? Should I decrease the upload interval of the sidecar from 2h to something less?

@PrayagS
Author

PrayagS commented Aug 17, 2024

@GiedriusS Bumping this issue.

I have recently set up new servers which are functionally sharded, so block sizes are much smaller; blocks with a 2-day range are ~500MB. I'm not seeing any no-compaction markers either.

Still, I'm seeing this issue where, after creating a bunch of 2d blocks, the compactor's next output is a block spanning 9 days. Shouldn't it be a 14-day block?

And I've also reached a state where downsampling to 1h resolution is not happening. Screenshot of block state below,

[screenshot: block state]

The only warning logs I see are the following two messages, which seem unrelated to the issue:

  • requested to mark for deletion, but file already exists; this should not happen; investigate
  • empty chunks happened, skip series

The metrics for the compaction backlog and the downsampling backlog are both zero as of now, so it doesn't seem like the compactor is waiting for planned compactions to complete before starting downsampling.

Please let me know if any other data is needed from my end. TIA!

@PrayagS
Author

PrayagS commented Aug 17, 2024

And I've also reached a state where downsampling to 1h resolution is not happening.

Ignore this; it makes sense that downsampling won't start until the whole stream has had one complete compaction iteration, and that isn't the case here since this is just a subset of the stream.
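For context, the eligibility windows documented for Thanos downsampling can be sketched as follows. The thresholds (5m resolution once a raw block spans more than 40 hours, 1h resolution once a 5m block spans more than 10 days) come from the Thanos compactor docs; `next_resolution` is a hypothetical helper for illustration, not a Thanos API:

```python
# Sketch of Thanos downsampling eligibility (thresholds per the docs):
# raw blocks spanning > 40h become eligible for 5m resolution, and
# 5m-resolution blocks spanning > 10 days become eligible for 1h.
HOURS = 3600 * 1000  # block ranges and resolutions are in milliseconds

def next_resolution(resolution_ms: int, range_ms: int):
    """Return the resolution a block is eligible to be downsampled to,
    or None if it is not yet eligible. Hypothetical helper."""
    if resolution_ms == 0 and range_ms > 40 * HOURS:
        return 5 * 60 * 1000       # 5m
    if resolution_ms == 5 * 60 * 1000 and range_ms > 10 * 24 * HOURS:
        return 60 * 60 * 1000      # 1h
    return None

# A 2-day raw block qualifies for 5m, but a 2-day 5m block does not yet
# qualify for 1h -- it must first be compacted past the 10-day mark.
print(next_resolution(0, 48 * HOURS))             # 300000 (5m)
print(next_resolution(300_000, 48 * HOURS))       # None
print(next_resolution(300_000, 14 * 24 * HOURS))  # 3600000 (1h)
```

This matches the behavior described above: a shard that never accumulates more than 10 days of 5m data will never produce 1h blocks.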

@anarcher

anarcher commented Oct 4, 2024

I think I'm in a similar situation. By any chance, do you see a difference between raw data and the 5m down-sampling resolution when querying, like the following?
[screenshot: raw vs. 5m downsampled query results]

@PrayagS
Author

PrayagS commented Oct 5, 2024

do you see a difference between raw data and the 5m down-sampling resolution when querying, like the following?

Not really. The trend in both the raw data and the downsampled data is similar. Maybe you should check the status of your blocks.

Also, we have the query engine set to Prometheus and partial response disabled. Not sure if that affects your results here.

@anarcher

anarcher commented Oct 7, 2024

Not downsampling to 1h resolution.

I'm not entirely sure, but could it be that the 1-hour downsampling blocks are not being created because the --debug.max-compaction-level=4 setting prevents the creation of 14-day blocks?

https://github.com/thanos-io/thanos/blob/main/cmd/thanos/compact.go#L53-L61
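For reference, the linked source defines the default compaction ladder. A sketch of my reading of those ranges (2h sidecar uploads compacted into 8h, then 2d, then 14d blocks); note that how a block's `meta.json` compaction level maps onto this ladder can shift when vertical/deduplicating compactions bump the level, so treat the level annotations below as an approximation:

```python
# Sketch of the default Thanos compaction ranges (in milliseconds),
# as defined in cmd/thanos/compact.go: 2h -> 8h -> 2d -> 14d.
HOUR_MS = 60 * 60 * 1000

RANGES_MS = [
    2 * HOUR_MS,        # 2h:  blocks as uploaded by the sidecar
    8 * HOUR_MS,        # 8h:  first compaction step
    2 * 24 * HOUR_MS,   # 2d:  second step
    14 * 24 * HOUR_MS,  # 14d: third step
]

# Each rung is a multiple of the previous one, so a 14d block only
# appears after the lower rungs have been compacted; capping the
# compaction level can therefore stop the ladder short of 14d.
print([r // HOUR_MS for r in RANGES_MS])  # [2, 8, 48, 336]
```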
