feat: Update retention and concurrency for Thanos #461

Merged
merged 1 commit into from
Aug 6, 2024

Contributor

@schwesig schwesig commented Jun 28, 2024


This PR addresses the retention issues discussed in nerc-project/operations#618 (comment), such as needing more than 30d of raw data.
It updates the retention and concurrency settings for the Thanos Compactor to improve observability and metrics performance.
We stay with the defaults where possible and add remarks noting each default, so that later changes or possible errors are easier to understand.

The changes focus on the needs of class, cost, and invoice analysis, as well as future predictions:

  • Updated retentionResolutionRaw from 30d to 90d (a quarter of high-detail data for deep analysis, especially of GPUs)
  • Updated retentionResolution5m from 90d to 360d (for cost, usage, and invoices; a 15-minute resolution could be enough, but it is not one of the default options)
  • Set retentionResolution1h to 0d (retain forever, following the default and the recommendation)
  • Added blockDuration, cleanupInterval, deleteDelay, retentionInLocal, consistencyDelay, compactConcurrency, and downsampleConcurrency settings (even where a setting stays at its default, listing it makes the option visible for possible future changes)

These changes aim to optimize data retention and resolution for the required use cases and to ensure better performance.
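As a rough illustration of what the new retention values mean, the following Python sketch converts the three settings into days and checks that each one outlives the point at which the compactor would create the next downsampled resolution. The helper names (`to_seconds`, `describe`, `outlives`) are hypothetical and not part of this PR. The 40h and 10d thresholds are an assumption taken from my reading of the Thanos compactor documentation, and `0d` is treated as "retain forever" as stated above.

```python
import re

# Seconds per unit for simple single-unit durations such as "90d" or "48h".
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(duration: str) -> int:
    """Convert a single-unit duration string like '90d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def describe(duration: str) -> str:
    """Render a retention value; 0 is read as 'keep forever'."""
    seconds = to_seconds(duration)
    return "forever" if seconds == 0 else f"{seconds // 86400} days"


# Assumption: 5m downsampling happens for blocks older than 40h,
# and 1h downsampling for blocks older than 10d.
DOWNSAMPLE_5M_AFTER = to_seconds("40h")
DOWNSAMPLE_1H_AFTER = to_seconds("10d")


def outlives(retention: str, threshold: int) -> bool:
    """True if data is kept long enough to reach the given threshold."""
    seconds = to_seconds(retention)
    return seconds == 0 or seconds > threshold


# Values from the PR description.
retention = {
    "retentionResolutionRaw": "90d",
    "retentionResolution5m": "360d",
    "retentionResolution1h": "0d",
}

for name, value in retention.items():
    print(f"{name}: {describe(value)}")

print("raw kept until 5m downsampling:",
      outlives(retention["retentionResolutionRaw"], DOWNSAMPLE_5M_AFTER))
print("5m kept until 1h downsampling:",
      outlives(retention["retentionResolution5m"], DOWNSAMPLE_1H_AFTER))
```

With the values above both checks pass, which is the property to re-check whenever the raw or 5m retention is shortened later.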

References:

  1. Thanos Compact Component
  2. Recommendations for Running Thanos and Prometheus
  3. Red Hat Advanced Cluster Management Observability

@schwesig schwesig added the bug Something isn't working label Jun 28, 2024
@schwesig schwesig self-assigned this Jun 28, 2024
@schwesig schwesig added the enhancement New feature or request label Jun 28, 2024
Member

@larsks larsks left a comment


There is some disagreement between the PR description and the commit message (and do we want to include any of those doc links in the commit message?). There is also a typo in the commit message ("information").

Contributor Author

schwesig commented Jul 1, 2024

There is some disagreement between the PR description and the commit message (and do we want to include any of those doc links in the commit message?). There is also a typo in the commit message ("information").

@larsks, thanks for the feedback. I was too quick and dirty on this, and it was too clear in my head, not so much for the uninvolved reader, though :-)
This topic is more complicated because we had to solve both the use cases and the connection issues. I added some more details.

@schwesig schwesig requested a review from larsks July 1, 2024 14:43
Member

@larsks larsks left a comment


Looks good!

This PR addresses the retention rate issues as discussed in nerc-project/operations#618 (comment) (having more than 30d raw etc.).
The changes include updating the retention and concurrency settings for the Thanos Compactor to enhance observability and metrics performance.
We will stay with the defaults where possible, adding remarks with the defaults to better understand the next changes or possible errors.

The changes focus on the needs of class, cost, and invoice analysis, as well as future predictions:
- Updated `retentionResolutionRaw` from 30d to 90d (a quarter of high-detail data for deep analysis, especially of GPUs)
- Updated `retentionResolution5m` from 90d to 360d (for cost, usage, and invoices; a 15-minute resolution could be enough, but it is not one of the default options)
- Set `retentionResolution1h` to 0d (retain forever, following the default and the recommendation)
- Added `blockDuration`, `cleanupInterval`, `deleteDelay`, `retentionInLocal`, `consistencyDelay`, `compactConcurrency`, and `downsampleConcurrency` settings (even where a setting stays at its default, listing it makes the option visible for possible future changes)

These changes aim to optimize data retention & resolution for needed use cases and ensure better performance.

References:
1. [Thanos Compact Component](https://thanos.io/tip/components/compact.md/)
2. [Recommendations for Running Thanos and Prometheus](https://zapier.com/blog/five-recommendations-when-running-thanos-and-prometheus/)
3. [Red Hat Advanced Cluster Management Observability](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/observability/customizing-observability#adding-advanced-config:~:text=is%20not%20displayed.-,4.3.%C2%A0Adding%20advanced%20configuration%20for%20retention,-Add%20the%20advanced)

Signed-off-by: ​/Thor(sten)?/ Schwesig <89909507+schwesig@users.noreply.github.com>
@larsks larsks force-pushed the 20240628_thanos_retention branch from a494d61 to 02436f5 Compare July 2, 2024 16:01
Member

larsks commented Jul 2, 2024

I rebased this on main to lose the merge commit.

@schwesig schwesig merged commit 0241f65 into OCP-on-NERC:main Aug 6, 2024
2 checks passed
@schwesig schwesig deleted the 20240628_thanos_retention branch August 19, 2024 19:26
Labels
bug Something isn't working enhancement New feature or request