[WIP] Mustardwatching epoch- and pointer-based reclamation #221

Open
jeehoonkang opened this issue Nov 7, 2018 · 13 comments

@jeehoonkang
Contributor

jeehoonkang commented Nov 7, 2018

Here is a note on my experiment with mixing epoch-based reclamation (EBR) and pointer-based reclamation (e.g. hazard pointers, HP). My code is here: https://github.com/jeehoonkang/crossbeam/tree/snowflake/crossbeam-epoch Currently it's neither tested nor documented, unfortunately... Contributions of any form---code, documentation, comments, feedback---are very welcome!

Motivation

For safe memory reclamation (SMR) in concurrent data structures, a thread advertises (for experts: synchronizes reads-after-writes) that it is accessing some objects ("hazards") so that other threads do not deallocate them. In the design of SMR schemes, the granularity of hazards is one of the most important choices the designer has to make. Roughly speaking, there are two representative choices: epoch-based reclamation (EBR) and pointer-based reclamation (e.g. hazard pointers, HP).

EBR is coarse-grained in that a thread advertises the epoch (think: timestamp) in which it is accessing the shared memory. The idea is that garbage thrown away in old epochs is no longer accessible from any thread and is safe to deallocate. An epoch can be incremented only when all threads have agreed to release all pointers to shared memory acquired in the previous epoch. EBR is usually fast because a thread needs to advertise only its epoch. On the other hand, it may not collect garbage in a timely fashion because of the coarse granularity: a thread may hold an epoch and disrupt garbage collection indefinitely. Specifically, EBR doesn't work well if (1) there exist long-lived pointers to the shared memory (e.g. map/set or cache); or (2) there are so many threads that each thread cannot easily make progress. Previously, long-lived pointers were handled with reference counting, which is sub-optimal because it writes to memory even on the read path. The second case usually happens when the number of threads exceeds the number of CPU cores, so it can be mitigated by using thread pools.
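
For reference, typical EBR usage with the existing crossbeam-epoch API looks roughly like this (a minimal sketch: a thread advertises only an epoch by pinning, and garbage is deferred rather than freed immediately):

```rust
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    // A shared slot whose old values must not be freed while other threads may read them.
    let slot = Atomic::new(42);

    // Reader side: pinning advertises the current epoch -- that is the only advertisement.
    let guard = epoch::pin();
    let _shared = slot.load(Ordering::Acquire, &guard);
    // `_shared` stays valid at least until `guard` is dropped.

    // Writer side: swap in a new value and defer destruction of the old one
    // until no pinned thread can still be in an epoch that observed it.
    let old = slot.swap(Owned::new(43), Ordering::AcqRel, &guard);
    unsafe { guard.defer_destroy(old) };
}
```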

On the other hand, HP is fine-grained in that a thread advertises the pointers to the hazardous objects it is accessing (hence the name "hazard pointers"), and it can collect garbage quite aggressively thanks to the fine granularity. However, the problem is that it's often slow, because a thread needs to advertise its hazards frequently and there may be many of them.
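
For contrast, the core of a pointer-based scheme is the protect loop: publish the pointer you are about to dereference, then re-check that the source still points to it. A schematic sketch (this is not crossbeam API; `hazard_slot` stands in for a per-thread hazard record that reclaimers scan before freeing anything):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Publish `src`'s current value as a hazard before dereferencing it.
/// A reclaimer must scan all hazard slots and skip any pointer it finds there.
fn protect<T>(hazard_slot: &AtomicPtr<T>, src: &AtomicPtr<T>) -> *mut T {
    loop {
        let p = src.load(Ordering::Acquire);
        // Advertise the pointer we intend to access...
        hazard_slot.store(p, Ordering::SeqCst);
        // ...and re-read the source: if it still points to `p`, the object cannot
        // have been reclaimed after the hazard was published, so it is safe to use.
        if src.load(Ordering::Acquire) == p {
            return p;
        }
        // The pointer changed concurrently; retry with the new value.
    }
}
```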

Now, time for mustardwatching! We want to take advantage of both approaches by mixing them: using EBR when we can properly increment epochs, and using HP otherwise. I implemented a hybrid of EBR and HP on top of crossbeam-epoch. By doing so, we can efficiently support long-lived pointers while retaining the benefits of EBR, simply by turning those pointers into hazard pointers. The corresponding API is Guard::defend().
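
To give a feel for the intended usage, here is a rough sketch (the exact signature of Guard::defend() in the snowflake branch may differ, so the defend-related lines are shown as comments; everything else is the existing crossbeam-epoch API):

```rust
use crossbeam_epoch::{self as epoch, Atomic};
use std::sync::atomic::Ordering;

fn main() {
    let slot = Atomic::new(String::from("long-lived entry"));

    // Short-lived access: ordinary EBR pinning, as cheap as on master.
    let guard = epoch::pin();
    let shared = slot.load(Ordering::Acquire, &guard);

    // Long-lived access (hypothetical, following the issue text): promote the
    // epoch-protected pointer into a hazard pointer, so the guard -- and hence
    // the epoch -- can be released without the pointer blocking reclamation.
    //
    //     let defended = guard.defend(shared);  // exact signature may differ
    //     drop(guard);                          // epoch released
    //     // `defended` remains safe to dereference until it is dropped.

    let _ = shared;
}
```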

Performance

Its performance is comparable to that of the master branch in the absence of hazard pointers. Here's a comparison of the results of cargo +nightly bench:

name                     control ns/iter  variable ns/iter  diff ns/iter   diff %  speedup
multi_alloc_defer_free   2,060,746        2,177,730              116,984    5.68%   x 0.95
multi_defer              1,329,516        1,451,704              122,188    9.19%   x 0.92
multi_flush              12,558,094       28,366,437          15,808,343  125.88%   x 0.44
multi_pin                4,283,460        4,174,085             -109,375   -2.55%   x 1.03
single_alloc_defer_free  34               36                           2    5.88%   x 0.94
single_defer             17               24                           7   41.18%   x 0.71
single_flush             110              608                        498  452.73%   x 0.18
single_pin               7                8                            1   14.29%   x 0.88

I fully expected flush() to become slower: we're now checking hazard pointers in addition to epochs, which is exactly what this benchmark exercises. The performance of pin() is similar. The performance of defer() drops a little bit, but from manual inspection of the generated assembly I think it's unavoidable.

Related Work: Snowflake

I took a lot of inspiration from Microsoft Research's Snowflake (so the branch name), but my implementation differs from Snowflake in that:

  • My implementation properly supports EBR and protects short-lived pointers much more efficiently, while Snowflake only uses the idea of EBR to optimize HP and doesn't expose it to users.
  • My implementation doesn't support ejection of ill-behaved threads from the protocol, a mechanism that guarantees robust garbage collection (i.e. an ill-behaved thread cannot block the deallocation of arbitrarily many resources).

Roughly speaking, my implementation is EBR + HP, while Snowflake is HP (boosted with the EBR idea) + an ejection mechanism. I've tried to design EBR + HP + ejection, but I believe EBR and ejection don't go together well.

@Vtec234
Member

Vtec234 commented Nov 8, 2018

Wow, this is really cool! I skimmed the code a bit, and something that initially stood out is that it keeps hazards in the same bags as EBR deferred objects, which then requires a filter to ignore hazardous objects when dropping bags due to epoch expiration. Perhaps it would be possible to instead store hazards in a separate list and give them a different type (e.g. HazardBags), such that epoch bags can still be dropped safely without searching for hazards? Or am I missing something that prevents this?

@jeehoonkang
Contributor Author

@Vtec234 Thanks for reading the code and giving a comment! Yes, I'm also thinking about that. Actually, putting deferred functions and deferred deallocations together inside an enum type (Garbage) incurs quite a lot of overhead. Splitting them might be beneficial for performance.
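
Roughly, the two shapes under discussion look like this (type names and fields are illustrative, not the actual definitions in the branch):

```rust
// Current shape: one bag holds every kind of deferred work, so dropping a bag on
// epoch expiry must filter out entries that are still hazard-protected.
enum Garbage {
    Deferred(Box<dyn FnOnce()>),                         // deferred function
    Destroy { ptr: *mut u8, dtor: unsafe fn(*mut u8) },  // deferred deallocation
}

struct Bag {
    entries: Vec<Garbage>,
}

// Suggested shape: keep hazard-protected objects in their own list with their own
// type, so epoch bags can be dropped wholesale without scanning for hazards.
struct EpochBag {
    deferred: Vec<Box<dyn FnOnce()>>,
}

struct HazardBag {
    protected: Vec<*mut u8>, // checked against the hazard list before freeing
}
```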

@ghost

ghost commented Dec 3, 2018

This seems great! I feel very optimistic about this approach and believe with some work we could almost completely eliminate the overhead of hazard pointers.

@mjp41

mjp41 commented Dec 3, 2018

I've tried to design EBR + HP + ejection, but I believe EBR and ejection don't go together well.

I think you are correct that if you are using EBR to protect the objects, not just to guarantee consistency of the hazard pointers, then you cannot add ejection.

P.S. Happy to chat about the .NET version we prototyped.

@jeehoonkang
Contributor Author

jeehoonkang commented Jun 16, 2019

A status update: I implemented a series of patches related to this issue in this branch.

  • Using small epoch numbers: the Snowflake paper basically describes how to ensure safety with only 5 epochs (wrapping around). The commit implements that. Now the epochs fit in only 3 bits!

  • Supporting hazard pointers along with EBR. This basically reimplements what's described in this issue. It supports both epoch- and HP-protected accesses to shared memory, and it now uses bloom filters. What's interesting is that epochs are now tagged (as 3 bits) into the pointers to the bloom filters (see the sketch after this list).

  • Supporting hazard pointers and ejection. This drops support for epoch-protected accesses to shared memory, but hazard pointers are still managed with epochs. Instead, it implements an ejection mechanism that can remove a (non-cooperating) thread from EBR in a lock-free manner. As a result, it's robust (think: spatially non-blocking).
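
To make the pointer-tagging idea in the second item concrete, here is a minimal sketch of packing a 3-bit epoch into the low bits of an aligned pointer (the actual layout and names in the branch may differ):

```rust
const EPOCH_BITS: usize = 3;
const EPOCH_MASK: usize = (1 << EPOCH_BITS) - 1; // low 3 bits

/// Pack a (wrapping) 3-bit epoch into a pointer to an 8-byte-aligned object.
fn tag<T>(ptr: *mut T, epoch: usize) -> usize {
    debug_assert_eq!(ptr as usize & EPOCH_MASK, 0, "pointer must be 8-byte aligned");
    (ptr as usize) | (epoch & EPOCH_MASK)
}

/// Split a tagged word back into the pointer and the epoch.
fn untag<T>(word: usize) -> (*mut T, usize) {
    ((word & !EPOCH_MASK) as *mut T, word & EPOCH_MASK)
}

fn main() {
    let p = Box::into_raw(Box::new(0u64)); // u64 is 8-byte aligned
    let word = tag(p, 5);                  // epochs wrap around within 3 bits
    let (q, e) = untag::<u64>(word);
    assert_eq!((q, e), (p, 5));
    unsafe { drop(Box::from_raw(q)) };
}
```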

As @mjp41 suggested, it seems EBR + ejection is impossible. (Though some non-portable schemes achieve this by, e.g., inspecting other threads' register files and stacks.) That's why I couldn't make a scheme with EBR + HP + ejection.

Currently they're not performing very well, and I'm trying to optimize them. I think the first and second patches are worth merging, if suitably optimized, because they'll support long-lived pointers better than the current version. But the third patch is not suitable for merging into crossbeam-epoch, because it changes the API a lot.

By August, I'm planning to write a paper on this, and to write a crossbeam RFC for merging the patches. I will keep reporting the status!

@jeehoonkang
Contributor Author

@tomtomjhj and I just wrote an article on "supporting hazard pointers and ejection": https://cp.kaist.ac.kr/gc/ Comments and feedback are very welcome!

@glaebhoerl

@jeehoonkang Very interesting, thanks for sharing!

Two questions occurred to me:

  1. At one point you write:

    We compare the performance of PEBR with that of the NR (no reclamation) and EBR implementations of Crossbeam. We do not compare it with that of PBR schemes because EBR is known to outperform them by a large margin [51].

    And later:

    Dice et al. propose a variant of HP that uses a compiler fence for shield protection (Shield::protect), which is frequent, and a process-wide memory fence for reclamation (collect), which is less frequent. As a result, unlike HP, their scheme is fast.

    What is the relationship between these two remarks? Is EBR also much faster than the Dice et al. version of PBR? (Or might that one be worth comparing against?)

  2. What criteria are used to determine when a thread should be ejected?

@jeehoonkang
Contributor Author

@glaebhoerl Thank you for your interest in our article.

  1. Yes, it would have been much better to compare the performance of PEBR with that of Dice et al.'s version of HP. @tomtomjhj and I will work on it soon. Thank you for the suggestion!

  2. In crossbeam's terminology, we try_advance once every few calls of defer_destroy (retire in the article), and we force_advance if try_advance fails a pre-defined number of times.
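
In pseudocode, that policy looks roughly like this (the constants and the surrounding structure are illustrative; only try_advance, force_advance, and defer_destroy/retire come from the comment above):

```rust
// Illustrative sketch of the advancement/ejection trigger described above.
const RETIRES_PER_TRY: usize = 64;   // try_advance once every few retires
const TRIES_BEFORE_FORCE: usize = 8; // force_advance after this many failed tries

struct Local {
    retire_count: usize,
    failed_tries: usize,
}

impl Local {
    fn retire(&mut self /* , garbage: ... */) {
        // ... push the garbage into the current bag (defer_destroy / retire) ...
        self.retire_count += 1;

        if self.retire_count % RETIRES_PER_TRY == 0 {
            if self.try_advance() {
                self.failed_tries = 0;
            } else {
                self.failed_tries += 1;
                if self.failed_tries >= TRIES_BEFORE_FORCE {
                    // Eject non-cooperating threads and advance the epoch anyway.
                    self.force_advance();
                    self.failed_tries = 0;
                }
            }
        }
    }

    fn try_advance(&self) -> bool {
        // Check whether every pinned thread has caught up with the global epoch.
        true
    }

    fn force_advance(&self) {
        // Eject stragglers from the protocol, then advance the epoch.
    }
}
```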

@cynecx
Contributor

cynecx commented Feb 22, 2020

@jeehoonkang Any updates on this?

@jeehoonkang
Contributor Author

We were just notified that this work will be published at PLDI 2020 :) But I'm still not sure how we can upstream our effort into Crossbeam (this repository).

@mjp41

mjp41 commented Feb 22, 2020

@jeehoonkang Congratulations on the PLDI paper.

@cynecx
Contributor

cynecx commented Feb 22, 2020

@jeehoonkang That’s awesome! Is there any particular reason it won’t be possible to upstream your efforts to crossbeam? Is it possible to get involved somehow? (I read somewhere that optimizations are still possible).

@Firstyear

It would be really good to have this or hazard pointers in crossbeam, especially for use cases where you have to hold a guard for a longer period (which can cause issues in EBR). :)
