Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boltdb shipper index list cache improvements #6054

Conversation

sandeepsukhani
Copy link
Contributor

What this PR does / why we need it:
To reduce the number of list calls we make to the object store when using boltdb-shipper, we have a cache in place. The cache stays valid for 1 min and is refreshed when a list call is made and it finds that cache needs to be rebuilt. The list call could take a couple of seconds to finish in a large cluster which can add up to the query latency when downloading index at query time.

This PR adds a parameter to list calls to skip the list cache and directly load the objects list from the object store. We will skip the cache only while doing query time download of index which is not too common, we see just up to 8k query time download operations in a day in our largest Loki cluster.

The other change this PR does is to detect the staleness of the list cache. Before explaining the problem, let us keep few things in mind:

  • While running compaction, it would first upload the newly created(compacted) file and then remove the source files. So it is possible that a list call can see the newly compacted file with the old source files while the compactor is still removing them.
  • When we see a 404 error while downloading the index, we just ignore it assuming the compactor would have removed it during compaction.

Now the problem is if compaction happened just after we cached the list of objects, the sync operation on the recently compacted table would try to download old source files that were compacted away but we would just ignore missing files as mentioned above. This means we won't have downloaded either of source index files and the new compacted file, still not know too us due to stale cache. This PR adds a check in sync code to see if we skipped downloading all the files, if so then we would retry the sync after force refreshing the index list cache.

Checklist

  • Tests updated

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sandeepsukhani sandeepsukhani merged commit 2758dc6 into grafana:main May 2, 2022
grafanabot pushed a commit that referenced this pull request May 3, 2022
* bypass index list cache when doing query time downloading of index

* detect and refresh stale index list cache during sync

(cherry picked from commit 2758dc6)
trevorwhitney pushed a commit that referenced this pull request May 3, 2022
* bypass index list cache when doing query time downloading of index

* detect and refresh stale index list cache during sync

(cherry picked from commit 2758dc6)

Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
sandeepsukhani added a commit to sandeepsukhani/loki that referenced this pull request Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants