New atomic reference counting algorithm #116173
base: master
Conversation
I don't think rustc itself is a good benchmark, since it doesn't use `Weak` much.
Awesome. Super excited to see how this performs outside of the microbenchmarks in our paper. Please feel free to reach out if you want help.
@bors try
@rust-timer queue
New atomic reference counting algorithm

This implements a new 'wait-free' atomic reference counting algorithm based on the paper [Wait-Free Weak Reference Counting](https://www.microsoft.com/en-us/research/uploads/prod/2023/04/preprint-644bdf17da97d.pdf) by @mjp41, @sylvanc and @bensimner.

The paper contains a way to implement atomic reference counting *wait free*, for five operations:

1. Release weak (= `Weak::drop`)
2. Release strong (= `Arc::drop`)
3. Acquire weak (= `Weak::clone` and `Arc::downgrade`)
4. Acquire strong (= `Arc::clone`)
5. Acquire strong from weak (= `Weak::upgrade`)

`Weak::upgrade` must increase the strong count if it is nonzero. Unfortunately, processors do not have native 'increment if nonzero' instructions. Therefore, it is usually implemented with a CAS loop that increments the strong count only if it is not zero.

The paper shows a way to avoid this CAS loop to make it wait-free. By reserving the least significant bit in the strong counter (by shifting the counter one bit to the left), we can use that extra bit to indicate the final 'dropped' state in which the strong counter is permanently zero. Then `Weak::upgrade` can be implemented as a `fetch_add(2)`, which leaves the 'permanently zero' bit untouched.

This does mean, however, that `Arc::drop` must now do an additional operation. Not only must it decrement the strong counter, it must also use a CAS operation to set the 'permanently zero' bit if the counter is zero.

The paper also shows an optimized version of the algorithm in which an additional bit in the strong counter is reserved to indicate whether there have ever been any weak pointers. When this bit is not set, some steps can be skipped.

However, the algorithm from the paper is unfortunately not something we can directly use for Rust's standard `Arc` and `Weak`, because we have more operations:

1. `Weak::drop`
2. `Arc::drop`
3. `Weak::clone`
4. `Arc::clone`
5. `Weak::upgrade`
6. `Arc::downgrade`
7. `Arc::get_mut`
8. `Arc::try_unwrap`
9. `Arc::into_inner`

Specifically, `Arc::get_mut` requires locking the weak counter to temporarily block `Arc::downgrade` from completing, so we cannot implement `Arc::downgrade` as just `Weak::clone`. Our `Arc::downgrade` implementation is a CAS loop that increments the weak counter if it doesn't hold the special 'locked' (`usize::MAX`) value, similar to how `Weak::upgrade` uses a CAS loop (which is what the paper solves).

I have extended the algorithm by also reserving the lowest bit in the weak counter to represent the 'weak counter locked' state. That way, `Arc::downgrade` can be implemented as a `fetch_add(2)` rather than a CAS loop. It will still have to spin in case of a concurrent call to `Arc::get_mut`, but in the absence of `Arc::get_mut`, this makes `Arc` and `Weak` wait-free.

The paper shows some promising benchmarking results. I have not benchmarked this change yet.
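To make the counter layout concrete, here is a minimal sketch of the strong-counter side of the idea. It is illustrative only and not the PR's actual code: the `Counts` struct, the constant names and the simplified memory orderings are assumptions of the sketch.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

// Illustrative layout: the strong count is stored shifted left by one, and
// the least significant bit marks the final 'permanently zero' state.
const STRONG_DEAD: usize = 1; // lowest bit: the strong count is zero forever
const ONE_STRONG: usize = 2;  // one strong reference, because of the shift

struct Counts {
    strong: AtomicUsize,
}

impl Counts {
    /// Weak::upgrade, wait-free: a single fetch_add instead of a CAS loop.
    /// Once the dead bit is set, the remaining bits are meaningless, so the
    /// stray increment performed after that point is harmless.
    fn try_upgrade(&self) -> bool {
        let prev = self.strong.fetch_add(ONE_STRONG, Relaxed);
        prev & STRONG_DEAD == 0
    }

    /// Arc::drop: decrement, and if this was the last strong reference, try
    /// to set the dead bit with a single CAS. If that CAS fails, a concurrent
    /// upgrade won the race and the value stays alive.
    /// Returns true if the caller must drop the contained value.
    fn release_strong(&self) -> bool {
        if self.strong.fetch_sub(ONE_STRONG, Release) != ONE_STRONG {
            return false; // other strong references remain
        }
        fence(Acquire);
        self.strong
            .compare_exchange(0, STRONG_DEAD, Relaxed, Relaxed)
            .is_ok()
    }
}
```

The real algorithm (and the PR) also has to handle the weak count and the never-had-weak-pointers optimization; this sketch is only meant to show why `Weak::upgrade` no longer needs a CAS loop.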
Once the try build finishes, the rust-timer bot will also run some benchmarks on rustc itself, but those are unlikely to show anything interesting, because rustc doesn't have any meaningful usage of Weak::upgrade.
☀️ Try build successful - checks-actions
@m-ou-se this is really cool to see. Mostly minor, but I think `downgrade` is incorrect. Please let me know if you want more explanation of the +2; it is quite subtle, and not covered well in our paper.
cur = this.inner().weak.load(Relaxed);
continue;
}
let prev = this.inner().weak.fetch_add(ONE_WEAK, Acquire);
In the case where this is locked, you will loop and re-execute this line, which will add multiple weak references.

Also, the paper checks if `weak == 0` and, if it observes this, adds two. This is really important, and not part of your implementation. Without that sequence, it is possible to reach `weak == 0` multiple times, which is very bad.
> In the case where this is locked, you will loop and re-execute this line, which will add multiple weak references.

When the low bit of the weak pointer is set, the rest of the bits are meaningless (just like in the strong counter in your paper). So it doesn't matter that fetch_add is repeated, as long as it leaves the least significant bit set.

> Also, the paper checks if `weak == 0` and, if it observes this, adds two. This is really important, and not part of your implementation. Without that sequence, it is possible to reach `weak == 0` multiple times, which is very bad.

My code does that, but as a separate step: a few lines below here, in the `if prev == 0`, I add an extra `ONE_WEAK`, so we raise the weak counter by two in total if it was zero.
> In the case where this is locked, you will loop and re-execute this line, which will add multiple weak references.
>
> When the low bit of the weak pointer is set, the rest of the bits are meaningless (just like in the strong counter in your paper). So it doesn't matter that fetch_add is repeated, as long as it leaves the least significant bit set.

Nice. That's great. I hadn't got that bit. Thanks for explaining.

> Also, the paper checks if `weak == 0` and, if it observes this, adds two. This is really important, and not part of your implementation. Without that sequence, it is possible to reach `weak == 0` multiple times, which is very bad.
>
> My code does that, but as a separate step: a few lines below here, in the `if prev == 0`, I add an extra `ONE_WEAK`, so we raise the weak counter by two in total if it was zero.
Sorry, I see you increase by two, but doing it in two increments was my concern. I was convinced it had to be done in one go, but now I'm trying to write out the counterexample and can't find one. I am now convinced it is okay.
At some point, I'll try to extend @bensimner's proof to this variation.
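For readers following along, here is a rough sketch of the downgrade shape discussed above, with the locked low bit and the extra increment when the previous value was zero. The constant values, function name and spin loop are illustrative assumptions, not the PR's exact code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed}};

// Illustrative constants: the weak count is shifted left by one and the
// lowest bit means 'weak counter locked by Arc::get_mut'.
const WEAK_LOCKED: usize = 1;
const ONE_WEAK: usize = 2;

// Rough shape of Arc::downgrade's weak-count update. While the locked bit is
// set, the upper bits carry no meaning, so re-executing the fetch_add after
// spinning is harmless, as explained above.
fn acquire_weak(weak: &AtomicUsize) {
    let mut prev = weak.fetch_add(ONE_WEAK, Acquire);
    while prev & WEAK_LOCKED != 0 {
        // A concurrent Arc::get_mut holds the lock; wait for it to release.
        std::hint::spin_loop();
        if weak.load(Relaxed) & WEAK_LOCKED != 0 {
            continue;
        }
        prev = weak.fetch_add(ONE_WEAK, Acquire);
    }
    if prev == 0 {
        // The counter was zero: add a second ONE_WEAK, so it rises by two in
        // total and cannot be observed at zero a second time.
        weak.fetch_add(ONE_WEAK, Relaxed);
    }
}
```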
Thanks for your detailed review! I will try to address your comments. Edit: found a few minutes to respond right now. ^^
Co-authored-by: Matthew Parkinson <mjp41@users.noreply.github.com>
The values are the same, but I used the wrong constants. So, this doesn't change any behaviour. It was just very confusing to read.
Finished benchmarking commit (2104764): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 631.784s -> 629.276s (-0.40%)
This will probably not yield many gains compared to the regressions as long as there is any overhead on the NO_WEAK path (when we never create any weak refs), as most users of Arc don't create weaks. I am working on an approach that gets rid of all overhead on the fast path while maintaining the wait-free nature of this impl (as an experiment).
Which overhead are you referring to? Doesn't this already take a fast path basically everywhere for the 'no weak' case?
These benchmark results look okay. The only potentially significant regression is the compilation time of ripgrep, but that extra time is spent in LLVM and not in rustc itself, so it is unrelated to the performance of Arc. It just means that the new Arc code might result in LLVM spending a bit more time optimizing/transforming the code.
Maybe this concern is completely irrational, but I would assume that, because of the frequent usage of weak-less Arcs, the additional bit op (which is just one additional cycle per strong decrement) could add up and outweigh the benefits from improving the weak impl. But maybe I am just wrong about that.
Maybe we could try to benchmark this with the parallel frontend enabled (I'm not sure if it actually uses Arc heavily though). CC @SparrowLii.
Nvm, I see now that that op can just be applied at compile time (probably). I was talking about the shift in the drop part: if the optimizer understands that it can just modify the RHS of the comparison instead of actually applying the shift to the LHS value, then there won't be any additional ops needed. (The only additional thing on the fast path with simple new + clone + drop operations will be the mov for the argument to drop_slow, but that is probably negligible as it will only be executed once the Arc is dropped.)
So far, I have not found that
Yeah, that compiles down to a single comparison just fine: https://godbolt.org/z/TvMTq5fez (I wrote it as a shift because I think that shows the intent more clearly.)
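For context, the kind of check being discussed reduces to something like the following (a hypothetical standalone function, not the PR's actual drop code):

```rust
// With the count shifted left by one, 'the strong count is zero' is written
// with a shift to keep the intent visible. LLVM folds `value >> 1 == 0` into
// a single unsigned comparison (`value < 2`), so the shift costs nothing at
// run time.
pub fn strong_count_is_zero(value: usize) -> bool {
    value >> 1 == 0
}
```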
Alright, sorry for the confusion 😅
☔ The latest upstream changes (presumably #115546) made this pull request unmergeable. Please resolve the merge conflicts.
@m-ou-se out of interest, what is the bar for this making it into Rust? Do you need an application that shows a win, or just no one showing a loss?
@mjp41 I think we should only continue towards merging this if we can find some real-world (or at least realistic) code that has a clear win. If we only show no performance regressions but no improvements, I don't think we should merge it. This change makes it a bit harder to maintain the Arc code for future maintainers, so it has to be worth it.

@yijunyu sent me a list of all crates on crates.io that call `Weak::upgrade` (e.g. flattiverse_connector-36.1.1). So, all crates whose performance might be increased by this PR should be in this list. It'd be useful to find one or more crates in this list that heavily rely on Weak::upgrade in a hot path.

Another option would be to write some (realistic?) benchmarks that we can include in the standard library's test suite and/or the runtime benchmark suite of rustc-perf.
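As a sketch of what such a benchmark could look like (hypothetical and std-only, not a rigorous harness or the rustc-perf format), something that hammers `Weak::upgrade` from several threads exercises exactly the path this PR changes:

```rust
use std::sync::{Arc, Weak};
use std::time::Instant;

// Hypothetical micro-benchmark sketch: repeatedly upgrade (and drop) a Weak
// from several threads, which is the operation the new algorithm makes
// wait-free.
fn main() {
    let arc = Arc::new(0u64);
    let weak: Weak<u64> = Arc::downgrade(&arc);
    let start = Instant::now();
    std::thread::scope(|s| {
        for _ in 0..4 {
            let weak = weak.clone();
            s.spawn(move || {
                for _ in 0..1_000_000 {
                    // Contended upgrade/drop is the hot path of interest.
                    let strong = weak.upgrade().expect("arc is still alive");
                    drop(strong);
                }
            });
        }
    });
    println!("4 threads x 1M upgrades took {:?}", start.elapsed());
}
```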
Ping from triage: |