
[hashset feature] Scan shortcut fix #1217

Open · wants to merge 2 commits into base: hashset
Conversation

SoftlyRaining

This is targeting the hashset branch and will update the hashset PR when merged. This fixes the issue I saw where scan can miss elements when we're rehashing: #1186 (comment)

The UT fails without my fix, and passes with my fix 😁

Signed-off-by: Rain Valentine <rsg000@gmail.com>

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 48.27586% with 30 lines in your changes missing coverage. Please review.

Project coverage is 70.34%. Comparing base (5129254) to head (9bf0c39).

Files with missing lines | Patch % | Lines
src/hashset.c            | 48.27%  | 30 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           hashset    #1217      +/-   ##
===========================================
- Coverage    70.77%   70.34%   -0.44%     
===========================================
  Files          115      115              
  Lines        64714    63803     -911     
===========================================
- Hits         45804    44880     -924     
- Misses       18910    18923      +13     
Files with missing lines | Coverage Δ
src/hashset.c            | 40.52% <48.27%> (+0.55%) ⬆️

... and 14 files with indirect coverage changes

Contributor

@zuiderkwast zuiderkwast left a comment


Finally you convinced me that the current logic has a problem!

This PR does indeed solve the issue, and I'm inclined to merge it. I just posted some suggestions.

Letting rehash follow probing chains can add a lot of latency to individual commands though. I'm not sure that's worse than adding very little latency to each scan call.

Some context: When debugging a test case of AOF rewrite, I observed probing chains of length 6000. This is when reading an AOF file. Actually I believe all non-empty buckets of the table were in the same probing chain.

Reading an AOF file means reading commands and executing them. The AOF file created by AOF rewrite has the commands stored per key, and the keys are in iterator order = probing order. Do you see the problem?

All commands that hash to the same bucket in a large table come together in the beginning of the AOF file, creating huge probing chains in small tables before growing kicks in. A SET command takes 300 times longer than normal. I tried to let rehashing follow probing chains, but this produced even worse spikes, with long probe chains plus rehashing of very long chains.

Here's the test case: https://github.com/valkey-io/valkey/pull/1178/files#r1814492725

If we could let the iterator walk in some other order (such as index++) we could avoid this, but that doesn't completely solve the problem either. I guess there's simply a chance of getting very long probe chains, and this causes severe degradation.

So what can we do? Maybe keep track of not only the number of 'everfulls' but also the maximum probe chain length, and grow the table if it gets too long? This might help.
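
For illustration, a minimal sketch of that heuristic. Only hashset, bucket_exp and hashsetIsRehashing appear in the actual code; max_chain, MAX_PROBE_CHAIN and resizeTables are hypothetical names:

#define MAX_PROBE_CHAIN 64 /* hypothetical threshold */

/* Hypothetical: called after probing a chain of chain_len buckets. */
static void noteProbeChain(hashset *s, size_t chain_len) {
    if (chain_len > s->max_chain) s->max_chain = chain_len; /* hypothetical field */
    /* Grow even below the normal fill factor if probing degenerates. */
    if (s->max_chain > MAX_PROBE_CHAIN && !hashsetIsRehashing(s)) {
        resizeTables(s, s->bucket_exp[0] + 1); /* hypothetical helper */
        s->max_chain = 0;
    }
}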

But I think the best solution is Madelyn's idea about chaining instead of probing.

I think we need to implement it. It has other benefits too (a rough sketch follows the list below), like

  • we can have higher fill factor and we can avoid rehashing indefinitely when there's a fork running;
  • scan doesn't need to follow chains wrapping around zero, so it won't return duplicates.
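
A rough sketch of the chained-bucket layout being discussed, with made-up names (ELEMENTS_PER_BUCKET and the fields are illustrative, not the actual hashset.c definitions):

#define ELEMENTS_PER_BUCKET 7 /* assumed: sized so a bucket fits a cache line */

typedef struct bucket {
    uint8_t presence;        /* bitmap of used slots in this bucket */
    struct bucket *overflow; /* overflow chain instead of probing, or NULL */
    void *elements[ELEMENTS_PER_BUCKET];
} bucket;

A lookup walks only its own bucket's overflow chain, so it never spills into neighboring buckets, and scan can emit a whole chain per cursor position without wrapping past index zero.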

src/hashset.c (review thread resolved)
    hash = idx;
} else {
    hash = hashElement(s, element);
    uint8_t prev_bucket_everfull = s->tables[0][prevCursor(idx, expToMask(s->bucket_exp[0]))].everfull;
Contributor

Computing this requires a memory access that's not useful in the growing case. Growing is normally the most performance-sensitive situation.

My guess is that it's cheaper to check it only when we need it...

if (s->bucket_exp[1] <= s->bucket_exp[0] && ...here...)

On the other hand, the previous bucket has most likely been rehashed just before, so it's in the L1 cache already. Anyway, I'm not sure if one is better than the other.
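
For concreteness, a fragment of that suggestion (identifiers taken from the diff above; the surrounding control flow is assumed), relying on && to skip the load whenever we're growing:

if (s->bucket_exp[1] <= s->bucket_exp[0] && /* shrinking: only now pay for the load */
    s->tables[0][prevCursor(idx, expToMask(s->bucket_exp[0]))].everfull) {
    /* previous bucket overflowed: its probe chain spills into this one */
}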

Author
Hmm, good point. I realized we only actually need to check when we're first starting a rehash; in all other cases we know that we ended rehashStep with a bucket without everfull set :)

Comment on lines -1263 to +1282
-/* Mask the start cursor to the bigger of the tables, so we can detect if we
+/* Mask the start cursor to the smaller of the tables, so we can detect if we
  * come back to the start cursor and break the loop. It can happen if enough
  * tombstones (in both tables while rehashing) make us continue scanning. */
-cursor = cursor & (expToMask(s->bucket_exp[0]) | expToMask(s->bucket_exp[1]));
+cursor &= expToMask(s->bucket_exp[0]);
+if (hashsetIsRehashing(s)) {
+    cursor &= expToMask(s->bucket_exp[1]);
+}
Contributor
With this change, I guess there's a small increased risk of returning duplicates.

Time 1:

  • Table size: 16 buckets.
  • Scan order: 0 -> 8 -> 4 -> 12 -> 2 -> 10 -> 6 -> 14 -> 1 -> ... -> 15.
  • SCAN 0 returned new cursor 8.

Time 2:

  • Table sizes 16 and 2. Shrinking started. Scan order: 0 -> 1.
  • User continues previous scan with SCAN 8.
  • Old implementation: SCAN 8 returns elements from small table bucket 0 (masked with small table mask) and large table bucket 8 (masked with large table mask). Large table bucket 0 can safely be skipped.
  • New implementation: SCAN 8 returns elements from small table bucket 0 (masked with small table mask) and large table bucket 0 and 8 (expansion of cursor masked with small table mask).
  • Both implementations: Return new cursor 1.
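
To make the difference concrete, a standalone illustration of the two maskings in this scenario (plain C, not code from the PR):

#include <stdio.h>

int main(void) {
    /* Large table: 16 buckets (mask 15). Small table: 2 buckets (mask 1).
     * The user resumes with SCAN 8 after the shrink has started. */
    unsigned cursor = 8, m_small = 1, m_large = 15;

    /* Old masking keeps the large-table bits of the cursor. */
    printf("old: small bucket %u, large bucket %u\n",
           cursor & m_small, cursor & (m_small | m_large));

    /* New masking collapses the cursor to the small table first, losing
     * the bits recording that large-table bucket 0 was already visited. */
    printf("new: small bucket %u, large expansion restarts at bucket %u\n",
           cursor & m_small, cursor & m_small);
    return 0;
}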

Author
Hmm, I see! Yes, that makes sense. I think it'll be fixed if I don't modify cursor here and instead apply the masks to start_cursor.
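
Sketched, that variant could look like the following fragment (start_cursor is the name from the comment; its placement inside the scan loop is assumed):

/* Leave the live cursor untouched; mask only the saved copy used to
 * detect that the scan has wrapped back to where it started. */
size_t mask = expToMask(s->bucket_exp[0]);
if (hashsetIsRehashing(s)) mask &= expToMask(s->bucket_exp[1]);
size_t start_cursor = cursor & mask;
/* ... advance cursor through the scan loop, then at the bottom: ... */
if ((cursor & mask) == start_cursor) break; /* wrapped around: stop */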

Contributor
Yes, maybe. But it doesn't matter much, especially if we want to change probing to chaining anyway... Let's not spend too much effort on this.

src/hashset.c (two more review threads, resolved)