Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store/cache: drop updates to existing items and dedupe get/set methods #1142

Merged
merged 1 commit into from
May 16, 2019
Merged

Conversation

abursavich
Copy link
Contributor

@abursavich abursavich commented May 14, 2019

  1. When an item is successfully updated, metrics and cache size tracking added the new version of the item but did not remove the old version of the item.

    For example, with a fresh cache, if you set the same item m times with each version having entry size n, the cache would think it had m items with current size m*n instead of 1 item with current size n.

    If you then emptied the cache by evicting every item (only one), the cache would think it had m-1 items with current size (m-1)*n, instead of 0 items with current size 0.

    This would allow the cache to get stuck in an eviction loop which persists even after the cache has been emptied.

  2. When an item is unsuccessfully updated in the cache because the new version is too big, the old version was left in the cache.

Changes

If an item exists in the cache, remove it and then re-add it. It would be possible to update the metrics in place, skipping what is counted as an eviction, but the code would be more complicated and brittle.

Verification

Added tests which previously failed and now pass.

Fixes #651

@abursavich abursavich changed the title pkg/store/cache: fix several bugs when replacing an existing item pkg/store/cache: fix several bugs when updating an existing item May 14, 2019
@GiedriusS
Copy link
Member

GiedriusS commented May 14, 2019

Hmm, makes sense that we could do better when modifying an item but looking at the code and logically this should never happen AFAICT. It seems to me that the problem is that at first we check https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1421 if there is an entry in the cache for that, and only then we add it https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1493 to the cache. So it seems like some locking is missing there. What do you think? (cc @bwplotka)

@abursavich
Copy link
Contributor Author

AFAICT, the "check" and "set" is not atomic so there may be many concurrent "checks" that miss the same items in the cache and then independently load and "set" them.

@GiedriusS
Copy link
Member

Yes so that's my point - there's no reason to perform all of this work of removing and adding the same item again if we can put some locking around that part to make it atomic. What is your opinion on that?

@bwplotka
Copy link
Member

Wow, have you found the actual fix?? Awesome, cannot wait for some time to dive into it and review! ❤️ Thanks!

@abursavich
Copy link
Contributor Author

I think the cache should support setting the same key multiple times. Since it goes out of its way to support concurrent access, I'm less inclined to accept an argument in favor of pushing the problem onto the caller.

The lookup to remove the existing item is O(1), so it's not so bad. You can always add a singleflight in the caller later if you want to prevent duplicated work.

@bwplotka
Copy link
Member

Awesome spotting of this edge case. As you said, Get & Set are not atomic , but we missed that during implementation and testing.

For the fix. You current approach of always removing makes sense. It is simple and effective, but not that efficient as we are removing from LRU something was added milisecond before. Let's see if that's the best what we can do.

Let's agree on one invariant for this: I think this cache items are immutable. So there is nothing like old version of cache or new. Updating is not needed for the sake of it.

I like the @GiedriusS idea of making it atomic, or use singleflight. We check for hit in cache then if not we agreggate + paritition + fetch + split + put in to cache. Quite many things in between. However making it atomic/singleflight might be very beneficial if 100 requests are touching the same postings. However on what thing to lock/singleflight on? Block? Particular posting (too fine grained)? We should look on it create a ticket and design, but let's fix this issue in simpler form like proposed here.

What about simple alternative to this - just try to GetPosting again just before Set in lock. O(1) as well. And this will ensure proper behaviour, but will still fetch postigns unncecessary. Your solution will also help if our invariant is wrong and cache is malformed.

Thoughts @GiedriusS @abursavich ? Happy to discuss offline on slack as well (:

When an item was successfully updated, metrics and cache size tracking added the new version of the item but did not remove the old version of the item, so it was over-counted.

For example, with a fresh cache, if you set the same item m times with each version having entry size n, the cache would think it had m items with current size m*n instead of 1 item with current size n.

If you then emptied the cache by evicting every item (only one), the cache would think it had m-1 items with current size (m-1)*n, instead of 0 items with current size 0.

This would allow the cache to get stuck in an eviction loop which persists even after the cache has been emptied.
@abursavich abursavich changed the title pkg/store/cache: fix several bugs when updating an existing item store/cache: drop updates to existing items and dedupe get/set methods May 15, 2019
@abursavich
Copy link
Contributor Author

I changed it to make cache items immutable by aborting the set if the key already exists (instead of replacing the existing value).

I also refactored the get and set methods to reduce duplication.

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think that this is awesome 👍 thanks for this ❤️

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is perfect. Thanks!

I feel like we could remove the check for ensureFit spinning forever, but not harm in it I guess for now (:

LGTM!

@bwplotka bwplotka merged commit 01f95bb into thanos-io:master May 16, 2019
@bwplotka
Copy link
Member

Forgot about change log 🤦‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines
3 participants