
Try seize as a replacement of crossbeam-epoch? #385

Open · tatsuya6502 opened this issue Jan 20, 2024 · 13 comments

tatsuya6502 (Member) commented Jan 20, 2024

Try the seize crate as a replacement for the crossbeam-epoch crate.

The moka cache uses crossbeam-epoch in its lock-free concurrent hash table, called cht. crossbeam-epoch is an epoch-based garbage collector that helps implement high-performance concurrent data structures like cht. However, we have been struggling for years with crossbeam-epoch's spec, which does not guarantee that the destructors of pointed-to objects will run.

https://docs.rs/crossbeam-epoch/0.9.18/crossbeam_epoch/struct.Guard.html#method.defer_destroy

There is no guarantee when exactly the destructor will be executed. The only guarantee is that it won’t be executed until all currently pinned threads get unpinned. In theory, the destructor might never run, but the epoch-based garbage collection will make an effort to execute it reasonably soon.

Naturally, many moka users expect cached entries to be dropped as soon as they are evicted from the cache. However, this crossbeam-epoch behavior causes evicted entries to be dropped with some delay, or never dropped at all.
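
For illustration, here is a minimal sketch of this kind of deferred destruction with crossbeam-epoch, assuming a simplified single-slot structure (the Entry type and replace_entry function are made up for this example). The key point is that defer_destroy only schedules the drop:

```rust
use std::sync::atomic::Ordering;

use crossbeam_epoch::{self as epoch, Atomic, Owned};

// Hypothetical entry type, standing in for a cached value.
struct Entry {
    value: String,
}

// Swap in a new entry and schedule the old one for destruction.
fn replace_entry(slot: &Atomic<Entry>, new_value: String) {
    let guard = epoch::pin();
    let old = slot.swap(Owned::new(Entry { value: new_value }), Ordering::AcqRel, &guard);
    if !old.is_null() {
        // This only *schedules* the destructor. crossbeam-epoch does not say
        // when it will run; in theory it might never run at all.
        unsafe { guard.defer_destroy(old) };
    }
}
```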

We have added mitigations for it (notes), and they mostly work, at the cost of some performance degradation. However, there are still some corner cases that seem difficult to solve or mitigate. It might be time to try alternatives like seize, which is based on the Hyaline reclamation scheme and is used in a concurrent hash table called flurry.

tatsuya6502 (Member, Author) commented Jan 20, 2024

Unlike crossbeam-epoch, seize provides the following (a configuration sketch follows the list):

  1. A configurable batch size: Collector::batch_size
    • This controls the tradeoff between throughput and memory efficiency.
    • Maybe the moka cache should allow users to change this value when creating a cache instance.
  2. Protection against stalled threads: Collector::epoch_frequency
    • This ensures that evicted cache entries are eventually dropped.
  3. No global collector:
    • Each moka cache instance will have a dedicated seize collector.
    • This ensures that all cached entries are dropped when the cache instance is dropped.
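
A rough sketch of what a per-cache setup could look like with seize's builder-style Collector API; the Cache struct, its constructor, and the concrete numbers are hypothetical placeholders:

```rust
use std::num::NonZeroU64;

use seize::Collector;

// Hypothetical per-cache wrapper; each instance owns its own collector,
// so dropping the cache drops the collector and any remaining retired entries.
struct Cache {
    collector: Collector,
    // ... hash table and the rest of the cache state would live here ...
}

impl Cache {
    fn new(batch_size: usize) -> Self {
        let collector = Collector::new()
            // Larger batches favor throughput, smaller ones favor memory efficiency.
            .batch_size(batch_size)
            // Advance the epoch periodically so stalled threads cannot block
            // reclamation forever; evicted entries are then eventually dropped.
            .epoch_frequency(NonZeroU64::new(128));
        Self { collector }
    }
}
```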

tatsuya6502 (Member, Author) commented Jan 21, 2024

I am doing some experiments in a separate repository.

tatsuya6502 self-assigned this Jan 21, 2024
tatsuya6502 (Member, Author) commented:

Memo:

Ideally, I want a concurrent memory reclamation library to provide the following (a hypothetical API sketch follows the list):

  1. Control over whether a deferred destruction request is buffered in thread-local storage or not.
    • We want some requests to bypass the buffer, while others are buffered.
  2. An API to trigger a GC explicitly.
  3. When a GC runs, pull deferred destruction requests from the local buffers of all registered threads,
    • instead of letting each thread push them to the global queue.
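
A purely hypothetical Rust sketch of that wish list; none of these names exist in crossbeam-epoch or seize today:

```rust
// Hypothetical trait; every name here is made up to illustrate the wish list.
trait ReclaimControl {
    /// Item 1: defer destruction of `ptr`, and let the caller decide whether
    /// the request may sit in a thread-local buffer or must be made visible
    /// to the collector immediately.
    ///
    /// # Safety
    /// `ptr` must be valid and no longer reachable by new readers.
    unsafe fn defer_destroy(&self, ptr: *mut (), drop_fn: unsafe fn(*mut ()), buffered: bool);

    /// Items 2 and 3: explicitly trigger a GC pass that *pulls* pending
    /// requests from the local buffers of all registered threads, instead of
    /// having each thread push them to a global queue.
    fn collect(&self);
}
```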

ibraheemdev commented Jul 1, 2024

Hi! I happened to come across this issue and can share some details about seize.

One of the biggest benefits of seize over crossbeam-epoch is predictable latency. Epoch-based reclamation has the issue that you have to check whether it is safe to free objects every once in a while when pinning/unpinning a thread, and this check is very expensive as you have to access shared memory from other threads. This means you run into a fundamental tradeoff between latency and memory efficiency. Seize, on the other hand, only performs this check once when retiring a batch (by pushing the objects to other threads), and uses reference counting to free memory.

This is an important consideration for a cache as you want an optimal balance between performance and memory usage.

Control over whether a deferred destruction request is buffered in thread-local storage or not.

This is something I've been thinking about. The problem is that seize requires at least as many objects as the number of active threads to be able to retire a batch, as each object acts as a node in a thread-local linked list. Skipping the thread-local buffer would require allocating a large chunk of memory to perform the retirement process, which may be confusing.

What you can do is try to reclaim by calling Guard::flush after large objects are retired.
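
A minimal sketch of that suggestion, assuming the large object has already been retired through the collector elsewhere; it relies only on Collector::enter and the guard's flush method mentioned above:

```rust
use seize::Collector;

// After retiring something large elsewhere, eagerly ask seize to attempt
// reclamation instead of waiting for the next batch to fill up.
fn flush_after_large_retire(collector: &Collector) {
    let guard = collector.enter();
    guard.flush();
}
```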

ibraheemdev commented:

I also recently released papaya, a hash table built on seize. I'd be interested in porting moka to use papaya and seeing the results; I'd expect significant performance improvements compared to the internal cht, especially for readers. Is that something you'd be open to?

tatsuya6502 (Member, Author) commented:

@ibraheemdev

Is that something you'd be open to?

Thanks. Yes, I am open to it. I was not aware of papaya, but at first glance, it sounds like a good idea to use it in moka!

tatsuya6502 (Member, Author) commented Jan 16, 2025

I briefly checked the papaya API for the first time.

  • A HashMap can have a custom seize collector via the builder. (nice)
    • The moka cache should expose this feature to the users.
  • seize::Guard::flush
    • The moka cache should expose a similar method to the users.
    • The cache would internally call self.map.guard().flush()? (See the sketch after this list.)
  • It just got the Equivalent trait (not yet released at the time of writing; now in v0.1.8).
  • As of v0.1.8, the MSRV is 1.72.0.
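
For example, the first two bullets could be wired together roughly like this; HashMap::builder and its collector option are taken from the description above (the exact method names are assumptions), and flush_cache is a hypothetical stand-in for a user-facing flush method:

```rust
use papaya::HashMap;
use seize::{Collector, Guard as _};

// Build a map with a cache-specific seize collector via the builder.
fn build_map() -> HashMap<String, Vec<u8>> {
    HashMap::builder()
        .collector(Collector::new().batch_size(64))
        .build()
}

// What a user-facing "flush" method on the cache could call internally.
fn flush_cache(map: &HashMap<String, Vec<u8>>) {
    map.guard().flush();
}
```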

In some use cases, it is very important to be able to customize the collector and/or to call flush. Users may store large values in the cache and want them to be dropped as soon as they are removed from the cache:

#125 (comment)

Our cache entries are quite large and not that many so having about ~100 of them hanging around might look like a memory leak to us.

quickwit-oss/quickwit#1735

Unfortunately they do not fit our need.

They target a highly concurrent small k,v use case, and do not strictly respect the limits they are assigned. All data is stored, and eventually GCed by a background task.

In our use case, we get/store very large expensive objects, at a pace of few hundreds per seconds.

tatsuya6502 (Member, Author) commented:

It just got the Equivalent trait (not yet released at the time of writing; now in v0.1.8).

I am adding support for the Equivalent trait to moka:

tatsuya6502 (Member, Author) commented Jan 18, 2025

CC: @ibraheemdev

I started experimenting with papaya in the moka cache using a separate repository.

So far, so good.

  • Replaced the main concurrent hash table in sync caches with papaya.
    • There are four concurrent hash tables used in a cache:
      • The main hash table to store cached entries.
      • One for the waiter locks used by the get_with and entry and_compute methods.
      • One for the key locks used by the eviction listener.
      • One for the predicate store used by the invalidate_entries_if.
  • TODOs
    • Need to replace crossbeam_epoch::pin().flush() with something like self.map.guard().flush().
    • Implement and re-enable the iterator.
    • Remove redundant hash calculations (if any).
    • Do the same things for the future cache.

Some thoughts:

  • I got a good impression of papaya:
    • It has nice APIs.
    • The GC algorithm of seize looks promising.
  • Currently, the resize mode is hard-coded to Blocked; otherwise I get very poor benchmark results from mokabench, including single-threaded ones.
  • Need the ability to shrink the capacity.
  • It would be nice to have a remove_entry_if method in papaya to remove an entry by key based on a predicate.
    • The moka cache's eviction policy relies heavily on this method.
    • Currently, I use the compute method instead of remove_entry_if (code), but it is not ideal because it requires an owned key instead of a borrowed key (see the sketch after this list).
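
A sketch of that workaround, assuming papaya's compute API with the Operation enum; the key/value types and the expiry predicate here are made up for illustration:

```rust
use papaya::{HashMap, Operation};

// Remove an entry only when a predicate on its value holds. Note that
// compute takes the key by value (owned), which is the pain point above.
fn remove_if_expired(map: &HashMap<String, u64>, key: String, now: u64) {
    let _ = map.pin().compute(key, |entry| match entry {
        // Stale: ask papaya to remove the entry.
        Some((_key, &expires_at)) if expires_at <= now => Operation::Remove,
        // Fresh or absent: leave the map unchanged.
        _ => Operation::Abort(()),
    });
}
```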

I need to think about how to transition users to a papaya-based implementation.

  1. Provide a crate feature to switch between the current cht and papaya implementations (a sketch follows below).
    • e.g. legacy-cht, which will be a default feature. In the future, it will become non-default and then eventually be removed.
  2. Or, do the same thing at runtime via a builder option.
    • Choose a concurrent hash table implementation per cache instance.

Perhaps do 2, as it gives more flexibility to the users?
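
For illustration, the crate-feature route in option 1 could boil down to a cfg-gated alias like the one below (option 2 would instead branch inside the cache at construction time); the legacy-cht feature name comes from the list above, while the cht module path is hypothetical:

```rust
// Hypothetical compile-time switch between the two hash-table backends,
// gated on a `legacy-cht` cargo feature (option 1 above). The `cht` path
// is illustrative; moka's real internal layout may differ.
#[cfg(feature = "legacy-cht")]
use crate::cht::SegmentedHashMap as BackendMap;

#[cfg(not(feature = "legacy-cht"))]
use papaya::HashMap as BackendMap;
```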

ibraheemdev commented:

Exciting!

Currently, the resize mode is hard-coded to Blocked; otherwise I get very poor benchmark results from mokabench, including single-threaded ones.

Yeah, this is somewhat expected in microbenchmarks. However, it can be useful in some applications where writes are rare and latency is more important than throughput. Ideally this would be exposed as a configuration option.

Need the ability to shrink the capacity.

We can do this dynamically based on the number of live entries in a table when resizing. Is there a heuristic that you use currently to determine when to shrink?

It would be nice to have a remove_entry_if method in papaya to remove an entry by key based on a predicate.

Ah yeah, using a borrowed key here would be nice. I opened an issue here: ibraheemdev/papaya#53.

I need to think about how to transition users to a papaya-based implementation.

The feature approach sounds good. Ideally, papaya doesn't have any downsides compared to the current implementation, so there shouldn't be any burden on the user to choose the internal data structure.


Another thought I had is that it might be possible for moka to set epoch_frequency to None by default. This can greatly improve performance, and if seize exposed a compile-time feature it could also remove a usize of metadata per entry in the map. The main downside is that you lose protection against stalled threads (a thread that is holding a guard for a long time and preventing reclamation of new objects that are being allocated). However, I'm not sure how useful this is in practice, because moka doesn't hold guards for very long and just clones values. The only time a guard is held for a long time, from what I can tell, is for iterators, which are likely to access new values anyway. I'll leave this up to you, but I think there should at least be a compile-time feature for this due to the memory usage savings. I opened an issue for this here: ibraheemdev/seize#33.

I also opened an issue about force flushing a guard (ibraheemdev/seize#34), because this seems like it might be important to some users.

tatsuya6502 (Member, Author) commented:

Thanks for the feedback and opening the issues!

Need the ability to shrink the capacity.

We can do this dynamically based on the number of live entries in a table when resizing. Is there a heuristic that you use currently to determine when to shrink?

papaya expands the capacity when the number of live entries exceeds half of the capacity, and then it doubles the capacity. So the next allocation will start at 1/4 (25%) full. How about shrinking the capacity when the number of live entries is less than 1/8 (12.5%) or 1/16 (6.25%) of the capacity? Also, I think it should respect the initial capacity when shrinking the capacity; the table should not be shrunk below the initial capacity.

So the new heuristic would be:

```rust
let next_capacity = match cfg!(papaya_stress) {
    true => ...,

    // Double the table capacity if we are at least 50% (1/2) full.
    false if self.len() >= (table.len() >> 1) => table.len() << 1,

    // Halve the table capacity if we are at most 12.5% (1/8) full,
    // but don't halve below the minimum capacity.
    false if self.len() <= (table.len() >> 3) => table.min_cap().max(table.len() >> 1),

    // Otherwise, keep the table capacity the same.
    false => table.len(),
};
```
  • If the table was created with an initial capacity of 0, the minimum capacity min_cap is 1.
  • Otherwise, the min_cap is the initial capacity.
  • The min_cap can be updated when the reserve method is called with an additional capacity.

tatsuya6502 (Member, Author) commented Jan 19, 2025

Another thought I had is that it might be possible for moka to set epoch_frequency to None by default. […]

Thanks for the suggestion. I implemented enough of the moka sync cache to run some micro benchmarks. I compared the elapsed time to process a dataset (a cache trace) between the original moka (with cht) and the papaya version. I set the papaya epoch_frequency to None for this run.

The results are below. moka with papaya clearly outperforms the one with the original cht. Great job!

[Benchmark charts: mokabench1-2025-01-18.png, mokabench2-2025-01-18.png, mokabench3-2025-01-18.png]

  • Ran on a Linux x86_64 PC with an Intel Core i7-12700F (20 hyperthreads).
    • I also tried macOS arm64 on an 8-core Apple M1 chip and got similar results.
  • For a fair comparison, I removed all the crossbeam_epoch::pin().flush() calls in the original moka. (commit)
  • The configuration of the papaya HashMap:
    • The resize mode: Blocked
    • The epoch frequency: None

ibraheemdev commented:

[…] How about shrinking the capacity when the number of live entries is less than 1/8 (12.5%) or 1/16 (6.25%) of the capacity? Also, I think it should respect the initial capacity when shrinking the capacity; the table should not be shrunk below the initial capacity.

Shrinking at 1/8 seems reasonable. My only concern was that in a workload where you continually insert and clear the map, shrinking would lead to unnecessary extra growth. I suppose that respecting the initial capacity is a solution for that.
