GH-40783: [C++] Re-order loads and stores in MemoryPoolStats update #40647
Conversation
@ursabot please benchmark
Benchmark runs are scheduled for commit 3333c48. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 3333c48. There were 10 benchmark results indicating a performance regression:
The full Conbench report has more details.
I like this change, but personally I think relaxed
is enough here (relaxed guarantees "atomic", acq-rel guarantees "happens-before")?
acq-rel might be used in the case below:
#include <atomic>
#include <thread>

int ctx = 0;  // Global Ctx
std::atomic<bool> condition{false};

void Writer() {  // release side: set Ctx, then publish it
  ctx = 42;
  condition.store(true, std::memory_order_release);
}

void Reader() {  // acquire side: wait for the flag, then read Ctx
  while (!condition.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  // The acquire load synchronizes-with the release store, so reading ctx here is safe.
}
For MemoryPool, do we really need an acq-rel here?
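For concreteness, a minimal sketch of the relaxed-only alternative being suggested (the struct and member names here are illustrative, not the actual Arrow members):

#include <atomic>
#include <cstdint>

struct RelaxedStatsSketch {
  std::atomic<int64_t> bytes_allocated{0};
  std::atomic<int64_t> num_allocs{0};

  void DidAllocate(int64_t size) {
    // Relaxed keeps each increment atomic but imposes no ordering with
    // surrounding operations; that is enough if nothing is published
    // through these counters, insufficient if readers expect happens-before.
    bytes_allocated.fetch_add(size, std::memory_order_relaxed);
    num_allocs.fetch_add(1, std::memory_order_relaxed);
  }
};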
Does it actually change anything for x86?
If you're really interested in reducing the contention costs for MemoryPool statistics, then I would suggest taking a look at https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
Nothing, unless the compiler decides to re-order the loads and the stores. Which it didn't in this case. I suspect my change regarding
Not just the contention (the least of the issues really), but the time it takes to perform all the memory operations.
Wouldn't acq-rel be cheaper than the original total ordering?
Only on architectures with a weaker memory model (e.g. ARM). x86 guarantees that all the stores are ordered (not immediately visible, but they are ordered). [1] explains this much better than I ever could.
I pushed some re-ordering of loads and stores that I believe can work better on CPUs with higher latency on the memory system [1]. Note that my code updates

When benchmarking on an Intel CPU, compared to the baseline, this version doesn't cost less, but it should improve the situation on CPUs that are spending more time [1] in the load issued by

Annotated (perf; # samples and percent per instruction):
       │        push %rbp
│ push %r15
│ push %r14
│ push %r13
│ push %r12
│ push %rbx
│ push %rax
│ mov %r8,%r13
│ mov %rcx,%rbx
│ mov %rdx,%r12
│ mov %rsi,%r15
│ mov %rdi,%r14
│ incb 0x13f6583(%rip) # 61bc853 <__TMC_END__+0x13935b>
1 5.31 │ xor %edi,%edi
│ mov %rdx,%rsi
│ → call __sanitizer_cov_trace_const_cmp8@plt
│ test %r12,%r12
│ ↓ js 6c
1 5.56 │ incb 0x13f6570(%rip) # 61bc855 <__TMC_END__+0x13935d>
│ mov %rsp,%rdi
│ mov %r12,%rsi
│ mov %rbx,%rdx
│ mov %r13,%rcx
│ → call arrow::memory_pool::internal::JemallocAllocator::AllocateAligned@plt
│ test %r14,%r14
│ ↓ je 136
│ incb 0x13f6552(%rip) # 61bc857 <__TMC_END__+0x13935f>
│ mov (%rsp),%rax
│ mov %rax,(%r14)
│ test %rax,%rax
│ ↓ je 8b
│ incb 0x13f6541(%rip) # 61bc858 <__TMC_END__+0x139360>
│ ↓ jmp 11e
│ 6c: incb 0x13f6532(%rip) # 61bc854 <__TMC_END__+0x13935c>
│ lea typeinfo name for arrow::json::TableReader+0x22a,%rdx
│ mov %r14,%rdi
│ mov $0x4,%esi
│ → call arrow::Status::FromArgs<char const (&) [21]>@plt
│ ↓ jmp 11e
│ 8b: incb 0x13f6518(%rip) # 61bc859 <__TMC_END__+0x139361>
1 5.54 │ mov 0x40(%r15),%rbx
│ mov %r12,%rsi
4 22.34 │ lock xadd %rsi,0x48(%r15)
4 22.47 │ lock add %r12,0x50(%r15)
6 33.09 │ lock incq 0x58(%r15)
│ mov %rsi,%r13
│ add %r12,%r13
│ ↓ jo 14a
│ incb 0x13f64f1(%rip) # 61bc85b <__TMC_END__+0x139363>
│ mov %rbx,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ cmp %r13,%rbx
1 5.69 │ ↓ jge 103
│ incb 0x13f64dd(%rip) # 61bc85d <__TMC_END__+0x139365>
│ d0: incb 0x13f64d8(%rip) # 61bc85e <__TMC_END__+0x139366>
│ mov %rbx,%rax
│ lock cmpxchg %r13,0x40(%r15)
│ sete %bpl
│ mov %rax,%rbx
│ mov %rax,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ test %bpl,%bpl
│ ↓ jne 10b
│ cmp %r13,%rbx
│ ↓ jge 10b
│ incb 0x13f64af(%rip) # 61bc860 <__TMC_END__+0x139368>
│ ↑ jmp d0
│103: incb 0x13f64a3(%rip) # 61bc85c <__TMC_END__+0x139364>
│ ↓ jmp 111
│10b: incb 0x13f649e(%rip) # 61bc85f <__TMC_END__+0x139367>
│111: incb 0x13f649a(%rip) # 61bc861 <__TMC_END__+0x139369>
│ movq $0x0,(%r14)
│11e: incb 0x13f648e(%rip) # 61bc862 <__TMC_END__+0x13936a>
│ mov %r14,%rax
│ add $0x8,%rsp
│ pop %rbx
│ pop %r12
│ pop %r13
│ pop %r14
│ pop %r15
│ pop %rbp
       │      ← ret

Benchmarks (based on the old code) on a Zen 3 CPU show that the CPU can get stuck waiting for the value produced by

 0.18 | lock xadd %rax,(%rdi)
80.73 | add %rsi,%rax

Doing useful work that doesn't depend on

From [1]:
Force-pushed from 971f660 to 66d4454.
Which benchmark should I run? I'd like to test on my M1 Pro.
I'm still not sure I understand: on x86, even though the instruction is the same (this might help: https://darkcoding.net/software/rust-atomics-on-x86/), relaxed could still allow compiler reordering. Besides, see the cases above:
I think just relaxed is enough here.
+1
@mapleFU I paste the CSV output of the benchmarks I invoke manually into Excel, then I make a chart based on that data.
total_allocated_bytes_.fetch_add(size, std::memory_order_acq_rel);
num_allocs_.fetch_add(1, std::memory_order_acq_rel);

// If other threads are updating max_memory_ concurrently we leave the loop without
(Actually, this might make the program a bit slower while making max_memory_
more precise? My M1 Pro benchmark gets a bit slower in some cases; I guess it might be related.)
I sampled with perf at 10000 Hz and this was almost never hit by the sampler (I pasted the disassembly with counts and percentages on the main thread).
If threads are competing for max_memory_, whoever wins and pushes the update through the memory system frees all the other threads from having to update the value. Whereas before, every thread would race to apply its store to this cache line.
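In code, the early-exit behaviour being described looks roughly like this (a sketch using the member names from this thread, not the exact Arrow implementation):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> max_memory_{0};

void UpdateMaxMemory(int64_t allocated) {
  int64_t current_max = max_memory_.load(std::memory_order_acquire);
  while (current_max < allocated &&
         !max_memory_.compare_exchange_weak(current_max, allocated,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
    // On failure, compare_exchange_weak reloads current_max with the value
    // another thread just stored; once that value is already >= allocated,
    // the winner has done the work and we leave the loop without storing.
  }
}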
// with execution. When done, max_memory and old_bytes_allocated have
// a higher chance of being available on CPU registers. This also has the
// nice side-effect of putting 3 atomic stores close to each other in the
// instruction stream.
This may be a stupid question, but may I ask why we are not using std::memory_order_release
for the store instruction here?
fetch_add is not a store, it's a load+store. The ALU needs the latest value to correctly calculate the sum, which then gets stored to the memory location.
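One way to picture the load half and the store half explicitly (illustrative only; a real fetch_add is a single atomic read-modify-write, not a CAS loop):

#include <atomic>
#include <cstdint>

int64_t FetchAddSketch(std::atomic<int64_t>& value, int64_t n) {
  int64_t old = value.load(std::memory_order_relaxed);  // the load that the "acq" half orders
  while (!value.compare_exchange_weak(old, old + n,
                                      std::memory_order_acq_rel,  // the store half carries the "rel"
                                      std::memory_order_relaxed)) {
    // compare_exchange_weak refreshed `old`; retry the add with the new value.
  }
  return old;
}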
https://godbolt.org/z/bfKe9vG6P It seems a fetch_add whose return value is unused compiles to lock add, while one whose return value is used compiles to lock xadd; I don't know whether it does a load.
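A sketch of what that Godbolt experiment compares (assuming x86-64 gcc/clang; the instruction choices in the comments are the ones reported in this thread):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

void AddDiscard(int64_t n) {
  // Return value unused: compilers can emit `lock add`.
  counter.fetch_add(n, std::memory_order_acq_rel);
}

int64_t AddAndUse(int64_t n) {
  // Return value used: compilers emit `lock xadd`, which also yields the old value.
  return counter.fetch_add(n, std::memory_order_acq_rel);
}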
(Maybe a sample with Release only: https://github.com/rust-lang/rust/blob/60b5ca62752ecc25d578066c8b82e1a4887267d4/library/std/src/sync/mpmc/list.rs#L241.) I'm not sure here.
At the semantic level, there is a load happening. That's what the acq in acq_rel refers to. And at the micro-architectural level, the CPU is loading the value before doing the add in the ALU and storing the result.
The fact that lock add and lock xadd both guarantee that load+add+store will happen atomically is an internal concern of the compiler backend when it's doing x86 instruction selection.
The memory model of the C++ language is much richer than the actual implementation in x86. acq_rel (and relaxed, and seq_cst) are the technically correct [1] orders to use in read-modify-write operations, even though passing release here would be a synonym for acq_rel on x86.
I get your point here. acq is used to synchronize-with the other increment commands rather than just the reader. Though personally I think we only need to sync with the reader? I think we can keep it as it is.
- Update max_memory correctly and more efficiently
- Re-order loads and stores: loads ASAP, grouped stores w/ less branching
- Store the atomics of MemoryPoolStats in a single cache line
Force-pushed from f9b2494 to e342c0b.
…tions in the stats
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e3b0bd1. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Issue loads as soon as possible so the latency of waiting for memory is masked by doing other operations.
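A minimal sketch of that idea (the struct layout, the DidAllocate name, and the exact bodies are illustrative assumptions, not the actual Arrow implementation; only the member names and the general load-early, grouped-stores ordering follow this PR):

#include <atomic>
#include <cstdint>

struct MemoryPoolStatsSketch {
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocs_{0};

  void DidAllocate(int64_t size) {
    // Issue the load of the current maximum first so its latency overlaps
    // with the read-modify-writes that follow.
    int64_t max_memory = max_memory_.load(std::memory_order_acquire);
    int64_t old_bytes = bytes_allocated_.fetch_add(size, std::memory_order_acq_rel);
    total_allocated_bytes_.fetch_add(size, std::memory_order_acq_rel);
    num_allocs_.fetch_add(1, std::memory_order_acq_rel);

    int64_t allocated = old_bytes + size;
    // Only the thread that wins the CAS publishes the new maximum; the
    // losers leave the loop as soon as the observed maximum is large enough.
    while (max_memory < allocated &&
           !max_memory_.compare_exchange_weak(max_memory, allocated,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
    }
  }
};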
What changes are included in this PR?
- Atomic updates use memory_order_acq_rel; loads and stores use memory_order_acquire/release respectively
- Re-ordered UpdateAllocatedBytes so bytes_allocated_ can be updated without waiting for the load of the old value

Are these changes tested?
By existing tests.