Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the raw feature and make RawTable private #546

Merged
merged 1 commit into from
Aug 23, 2024

Conversation

Amanieu
Copy link
Member

@Amanieu Amanieu commented Aug 22, 2024

This will give more freedom for the internal implementation details of hashbrown to evolve without the need for regular releases with breaking changes.

All existing users of RawTable should migrate to the HashTable API which is entirely safe while providing the same flexibility as RawTable.

This also removes the following features which were only exposed under RawTable:

  • RawTable::iter_hash
  • RawIter::reflect_insert and RawIter::reflect_remove
  • RawTable::clone_from_with_hasher
  • RawTable::insert_no_grow and RawTable::try_insert_no_grow
  • RawTable::allocation_info
  • RawTable::try_with_capacity(_in)
  • HashMap::raw_table(_mut) and HashSet::raw_table(_mut)

If anyone was previously relying on this functionaly, please raise a comment. It may be possible to re-introduce it as a safe API in HashTable and/or HashMap.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 22, 2024

cc @clarfonthey

This will give more freedom for the internal implementation details of
hashbrown to evolve without the need for regular releases with breaking
changes.

All existing users of `RawTable` should migrate to the `HashTable` API
which is entirely safe while providing the same flexibility as
`RawTable`.

This also removes the following features which were only exposed under
`RawTable`:
- `RawTable::iter_hash`
- `RawIter::reflect_insert` and `RawIter::reflect_remove`
- `RawTable::clone_from_with_hasher`
- `RawTable::insert_no_grow` and `RawTable::try_insert_no_grow`
- `RawTable::allocation_info`
- `RawTable::try_with_capacity(_in)`
- `HashMap::raw_table(_mut)` and `HashSet::raw_table(_mut)`
@Dandandan
Copy link

Dandandan commented Aug 23, 2024

I think https://github.com/Apache/arrow-datafusion

uses:

  • RawTable::try_insert_no_grow
  • RawTable::allocation_info

@clarfonthey
Copy link
Contributor

clarfonthey commented Aug 23, 2024

Since you're already doing this, would you mind pushing a release after too so rust-lang/rust#128711 can go through? Please and thank you. <3

IMHO the sooner we do this, the better, since it gives people time to migrate away from the API before we start following up on the threat to improve the code. And, since it's a published crate, it's not a big deal, since the old (non-breaking) versions will continue to work correctly.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

I think Apache/arrow-datafusion uses:

Datafusion seem to actually require the full functionality of RawTable since it directly stores bucket indices in a BinaryHeap so that it can delete arbitrary elements later (and rebuilds the heap on rehash to keep the indices valid).

This isn't something that can be supported on HashTable since it relies on low-level implementation details of RawTable such as the fact that buckets are never moved except in a rehash. Perhaps in the future the new raw API that @clarfonthey is planning (see #545) will be publicly exposed, but in the meantime datafusion can just stay on the previous hashbrown version.

@Dandandan
Copy link

FYI @alamb @avantgardnerio

@alamb
Copy link

alamb commented Aug 23, 2024

We have also been discussing implementing our own custom hash table in apache/datafusion#7095, so perhaps this would be another potential reason to pursue that idea

I agree that using the low level details of how the hash table in hashbrown is implemented is not ideal (e.g. it constrains how hashbrown can version releases)

@Amanieu I wonder if you would consider a feature flag on hashbrown like hashbrown_unstable (following the model of tokio_unstable). That would let DataFusion keep using hashbrown internals (and it would be on us to deal with any changes during upgrade)

@avantgardnerio
Copy link

This will require work in DataFusion to fix TopK aggregations, and then after it is fixed it will cause significant (40% IIRC) performance regressions.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

I don't think that's a good idea because every minor release of hashbrown could break users of your crate.

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

@alamb
Copy link

alamb commented Aug 23, 2024

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

Another option is for us to to fork the crate which might be reasonable if we can get more performance by doing so

@Dandandan
Copy link

I think we should create some tickets in DataFusion for moving our usage towards HashTable / own hashtable (e.g. better vectorized) / forked code.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

@bors r+

@bors
Copy link
Contributor

bors commented Aug 23, 2024

📌 Commit 26ef4a1 has been approved by Amanieu

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Aug 23, 2024

⌛ Testing commit 26ef4a1 with merge aa1411b...

@bors
Copy link
Contributor

bors commented Aug 23, 2024

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing aa1411b to master...

@bors bors merged commit aa1411b into rust-lang:master Aug 23, 2024
24 checks passed
@the-mikedavis
Copy link
Contributor

I was using RawTable::iter_hash for a multi-map-like type (kind of like HashMap<K, Vec<T>>) and I don't believe there's a way to look up multiple entries with the same key from HashTable<(K, V)>: HashTable::find stops after the first matching element. Would it be possible to introduce something safe in HashTable to take its place? I can share more code if you'd like more context and maybe work on a PR if you think it's a reasonable addition.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 28, 2024

I think iter_hash makes sense to expose on HashTable. Feel free to send a PR, you can do a partial revert of this one to get iter_hash back in RawTable.

bors added a commit that referenced this pull request Sep 2, 2024
Add `HashTable::iter_hash`, `HashTable::iter_hash_mut`

This is a follow-up to #546 ([comment](#546 (comment))). `iter_hash` from the old raw API can be useful for reading from a "bag" / "multi map" type which allows duplicate key-value pairs. Exposing it safely in `HashTable` takes a fairly small wrapper around `RawIterHash`. This PR partially reverts #546 to restore `RawTable::iter_hash` and its associated types.
Amanieu added a commit to Amanieu/hashbrown that referenced this pull request Sep 12, 2024
This was previously removed from `RawTable` in rust-lang#546. This is now added
as a public API on `HashMap`, `HashSet` and `HashTable`.
@xacrimon
Copy link

xacrimon commented Sep 14, 2024

@Amanieu dashmap relies on RawTable functionality. With this merged we're be effectively locked to 0.14 forever.

@Amanieu
Copy link
Member Author

Amanieu commented Sep 14, 2024

What specific functionality do you need that isn't available though HashTable which is the replacement for RawTable?

@cuviper
Copy link
Member

cuviper commented Sep 14, 2024

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

@clarfonthey
Copy link
Contributor

Added a comment to #545 mentioning to look into dashmap's internals when I go about ripping out the raw table API. It may be the case that we need to offer some kind of lifetime-erased version of the various types to get it to work, but I'm hoping we can get around that.

bors added a commit that referenced this pull request Sep 18, 2024
Re-introduce a way to get the allocation size of a table

This was previously removed from `RawTable` in #546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`.

Fixes #238
Fixes #506
@Amanieu
Copy link
Member Author

Amanieu commented Oct 1, 2024

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

I thought about this a bit and fundamentally it's not a problem with HashTable but rather a limitation of the MutexGuard API. I think you can work around this by using a Mutex<()> and an UnsafeCell<HashTable> where you manage the locking manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants