Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rehash functionality #2266

Closed
wants to merge 2 commits into from
Closed

Fix rehash functionality #2266

wants to merge 2 commits into from

Conversation

MattAlp
Copy link
Contributor

@MattAlp MattAlp commented Feb 21, 2021

Working on #2263- Hash#rehash currently recalculates hashes, but does not remove duplicate as described in the spec. This PR addresses this by unlinking duplicate entries within a hash when a hash collision and equal values are detected.

@graalvmbot
Copy link
Collaborator

Hello Matthew Alp, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address matthew -(dot)- alp -(at)- shopify -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

@eregon eregon added the shopify label Feb 22, 2021
@graalvmbot
Copy link
Collaborator

Matthew Alp has signed the Oracle Contributor Agreement (based on email address matthew -(dot)- alp -(at)- shopify -(dot)- com) so can contribute to this repository.

@eregon
Copy link
Member

eregon commented Feb 22, 2021

Could you untag the spec and commit that change to show which spec now passes?

jt untag spec/ruby/core/hash/rehash_spec.rb 

hashNode.execute(PackedArrayStrategy.getKey(store, n), compareByIdentity));
}
}
PackedArrayStrategy.promoteToBuckets(getContext(), hash, store, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fairly inefficient, it means rehash will represent a Hash with #entries<=3 less efficiently than before. It would be best to keep it as a packed array representation if possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't sure the best way to solve this - multiple nested loops? Could get pretty verbose when exploded? Or a temporary set to see if a value has already been included? Seemed like that was basically the bucket strategy anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nested loop seems OK, there is already no @ExplodeLoop on this method anyway.
I think what's important is not so much the speed of rehash but that the resulting Hash still uses the more efficient packed array representation for other calls later on.

entry.getKey(),
entry.getHashed(),
bucketEntry.getKey(),
bucketEntry.getHashed())) { // If the bucket contains a single entry, we never set the flag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you avoid this duplicated call by changing the loop above to be a do/while loop?

@@ -890,6 +885,7 @@ protected RubyHash rehashBuckets(RubyHash hash,
Arrays.fill(entries, null);

Entry entry = hash.firstInSequence;
Entry previousEntry = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previousEntry typically means the previous Entry in lookup, not in sequence, could you clarify by renaming it to previousInSequence?

bucketEntry.getKey(),
bucketEntry.getHashed())) { // If the bucket contains a single entry, we never set the flag
if (previousEntry != null) {
previousEntry.setNextInSequence(entry.getNextInSequence());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about entry.getNextInSequence().getPreviousInSequence(), won't that still refer to entry?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use BucketsStrategy.removeFromSequenceChain here, then it should be fine.

@@ -13,7 +13,7 @@ Bug fixes:
* Fix `Thread.handle_interrupt` to defer non-pure interrupts until the end of the `handle_interrupt` block (#2219).
* Clear and restore errinfo on entry and normal return from methods in C extensions (#2227).
* Fix extra whitespace in squiggly heredoc with escaped newline (#2238, @wildmaples and @norswap).

* Fix `Hash#rehash` to remove duplicate keys after modifications (#2266, @MattAlp)
Compatibility:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken empty line here.

@eregon eregon linked an issue Feb 22, 2021 that may be closed by this pull request
@bjfish
Copy link
Contributor

bjfish commented Feb 22, 2021

Could you untag the spec and commit that change to show which spec now passes?

jt untag spec/ruby/core/hash/rehash_spec.rb 

I think it would be nice if the spec was updated to also cover the bucket hash rehash if it's not currently covering it.

@chrisseaton
Copy link
Collaborator

We made progress on this today.

@chrisseaton
Copy link
Collaborator

@MattAlp do you have time to work on this soon? Not a problem if you don't! Don't feel pressured! But we'd probably have to move it on a bit ourselves if you don't soon.

@chrisseaton
Copy link
Collaborator

I think it would be nice if the spec was updated to also cover the bucket hash rehash if it's not currently covering it.

I worry about baking in TruffleRuby implementation details to the spec.

Two better ideas:

  1. put it in spec/truffle
  2. rely on running the specs in a loop and using a chaos node to randomise storage to fix this

@eregon
Copy link
Member

eregon commented Apr 6, 2021

I worry about baking in TruffleRuby implementation details to the spec.

I wouldn't worry about that. Consider ruby/spec a test suite, not a specification.
From a test suite point of view most Ruby implementations have different storage strategies for small and large hashes, so one could just use e.g. a Hash with 10 elements to test that and mention in the spec description or a in comment it's specifically to test large hashes which might have a different representation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Hash#rehash
5 participants