
GH-40783: [C++] Re-order loads and stores in MemoryPoolStats update #40647

Merged

felipecrv merged 8 commits into apache:main from memory_stats_atomics on Mar 26, 2024

Conversation

felipecrv
Contributor

@felipecrv felipecrv commented Mar 18, 2024

Rationale for this change

Issue loads as soon as possible so that the latency of waiting on the memory system is masked by other operations.

What changes are included in this PR?

  • Make all the read-modify-write operations use memory_order_acq_rel
  • Make all the loads and stores use memory_order_acquire/release respectively
  • Statically specialize the implementation of UpdateAllocatedBytes so bytes_allocated_ can be updated without waiting for the load of the old value (see the sketch below)
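
For readers skimming the PR, here is a minimal sketch of what the reordered update path looks like, assuming the member names that appear in the review diff further down (bytes_allocated_, max_memory_, total_allocated_bytes_, num_allocs_); the IsAlloc template parameter is a hypothetical stand-in for the static specialization, not the exact merged code:

#include <atomic>
#include <cstdint>

class MemoryPoolStats {
 public:
  // IsAlloc is the static specialization: the allocation path knows at
  // compile time that diff > 0 and that the extra counters must be bumped.
  template <bool IsAlloc>
  void UpdateAllocatedBytes(int64_t diff) {
    // Issue the load of max_memory_ as early as possible so the memory
    // system can start working on it while the RMWs below execute.
    int64_t max_memory = max_memory_.load(std::memory_order_acquire);
    // The fetch_add result is not consumed immediately; the independent
    // RMWs below can execute while the loaded old value is still in flight.
    int64_t old_bytes = bytes_allocated_.fetch_add(diff, std::memory_order_acq_rel);
    if constexpr (IsAlloc) {
      total_allocated_bytes_.fetch_add(diff, std::memory_order_acq_rel);
      num_allocs_.fetch_add(1, std::memory_order_acq_rel);
    }
    const int64_t allocated = old_bytes + diff;
    // Raise the high-water mark, giving up if another thread already
    // pushed it past `allocated`.
    while (max_memory < allocated &&
           !max_memory_.compare_exchange_weak(max_memory, allocated,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
    }
  }

 private:
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocs_{0};
};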

Are these changes tested?

By existing tests.

@felipecrv felipecrv changed the title Memory stats atomics GH-40646: [C++] Use Acquire-Release for loads and stores on MemoryPool statistics Mar 18, 2024
@apache apache deleted a comment from github-actions bot Mar 18, 2024
@amoeba
Member

amoeba commented Mar 18, 2024

@ursabot please benchmark

@ursabot

ursabot commented Mar 18, 2024

Benchmark runs are scheduled for commit 3333c48. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.


Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 3333c48.

There were 10 benchmark results indicating a performance regression:

The full Conbench report has more details.

Member

@mapleFU mapleFU left a comment

I like this change, but personally I think relaxed is enough here (relaxed guarantees "atomic", acq-rel guarantees "happens-before")?

acq-rel might be used in the case below:

#include <atomic>
#include <thread>

struct Ctx { /* some shared, non-atomic state */ };
Ctx t;                               // global context
std::atomic<bool> condition{false};

void Release() {                     // writer: publish the context
  /* set Ctx */
  condition.store(true, std::memory_order_release);
}

void Acquire() {                     // reader: wait, then consume the context
  while (!condition.load(std::memory_order_acquire)) std::this_thread::yield();
  /* read Ctx */
}

For MemoryPool, do we really need acq-rel here?

@pitrou
Member

pitrou commented Mar 19, 2024

Does it actually change anything for x86?

@pitrou
Member

pitrou commented Mar 19, 2024

If you're really interested in reducing the contention costs for MemoryPool statistics, then I would suggest taking a look at https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

@felipecrv
Contributor Author

Does it actually change anything for x86?

Nothing, unless the compiler decides to re-order the loads and the stores, which it didn't in this case.

I suspect my change regarding max_memory_ didn't lead to a cheaper sequence of instructions — I assumed max_memory_.store(allocated, seq_cst) (written as max_memory_ = allocated; in the code) was lock-prefixed and more expensive (combined with the seq_cst load) than the compare_exchange I added.

If you're really interested in reducing the contention costs for MemoryPool statistics, then I would suggest taking a look at https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

Not just the contention (the least of the issues really), but the time it takes to perform all the memory operations.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 19, 2024
@mapleFU
Member

mapleFU commented Mar 20, 2024

Does it actually change anything for x86?

Wouldn't acq-rel be cheaper than the original total ordering?

@felipecrv
Contributor Author

Does it actually change anything for x86?

Wouldn't acq-rel be cheaper than the original total ordering?

Only on architectures with a weaker memory model (e.g. ARM). x86 guarantees that all the stores are ordered (not immediately visible, but they are ordered). [1] explains this much better than I ever could.

acq-rel can enable the compiler to re-order operations if it decides that can unlock optimizations. Not the case here, so I'm manually experimenting with different orderings to mask the latency of fetch/load operations on atomic variables.

[1] https://research.swtch.com/hwmm#x86
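
As a concrete illustration of that last point (an assumption about typical GCC/Clang codegen on x86-64, not something taken from this PR): the ordering argument does not change the instruction emitted for a read-modify-write, only which compiler re-orderings around it remain legal.

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// On x86-64 both functions typically compile to the same lock-prefixed
// instruction; relaxed vs. acq_rel only constrains the compiler.
void bump_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
void bump_acq_rel() { counter.fetch_add(1, std::memory_order_acq_rel); }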

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 20, 2024
@felipecrv
Contributor Author

I pushed some re-ordering of loads and stores that I believe can work better on CPUs with higher latency in the memory system [1]. Note that my code updates max_memory_ correctly (I removed the comment about "no need to be rigorous"). The lock cmpxchg that I introduced never appears in profiles, so I'm keeping it. I also suspect it would be beneficial on contended workloads because we can give up on updating max_memory_ if another thread increases it before the current thread does.

When benchmarking on an Intel CPU, compared to the baseline, this version doesn't cost less, but it should improve the situation on CPUs that spend more time [1] in the load issued by bytes_allocated_.fetch_add(diff) (the lock xadd instruction on x86). My hypothesis is that by not immediately using the result of the xadd, the CPU can wait for the memory system in the background while performing other operations. It also clusters all the lock-prefixed instructions together, which is recommended practice.

Annotated perf report of arrow::memory_pool::internal::JemallocAllocator::AllocateAligned after these changes:

  Percent │         push %rbp
          │         push %r15
          │         push %r14
          │         push %r13
          │         push %r12
          │         push %rbx
          │         push %rax
          │         mov  %r8,%r13
          │         mov  %rcx,%rbx
          │         mov  %rdx,%r12
          │         mov  %rsi,%r15
          │         mov  %rdi,%r14
          │         incb 0x13f6583(%rip)        # 61bc853 <__TMC_END__+0x13935b>
1    5.31 │         xor  %edi,%edi
          │         mov  %rdx,%rsi
          │       → call __sanitizer_cov_trace_const_cmp8@plt
          │         test %r12,%r12
          │       ↓ js   6c
1    5.56 │         incb 0x13f6570(%rip)        # 61bc855 <__TMC_END__+0x13935d>
          │         mov  %rsp,%rdi
          │         mov  %r12,%rsi
          │         mov  %rbx,%rdx
          │         mov  %r13,%rcx
          │       → call arrow::memory_pool::internal::JemallocAllocator::AllocateAligned@plt
          │         test %r14,%r14
          │       ↓ je   136
          │         incb 0x13f6552(%rip)        # 61bc857 <__TMC_END__+0x13935f>
          │         mov  (%rsp),%rax
          │         mov  %rax,(%r14)
          │         test %rax,%rax
          │       ↓ je   8b
          │         incb 0x13f6541(%rip)        # 61bc858 <__TMC_END__+0x139360>
          │       ↓ jmp  11e
          │   6c:   incb 0x13f6532(%rip)        # 61bc854 <__TMC_END__+0x13935c>
          │         lea  typeinfo name for arrow::json::TableReader+0x22a,%rdx
          │         mov  %r14,%rdi
          │         mov  $0x4,%esi
          │       → call arrow::Status::FromArgs<char const (&) [21]>@plt
          │       ↓ jmp  11e
          │   8b:   incb 0x13f6518(%rip)        # 61bc859 <__TMC_END__+0x139361>
1    5.54 │         mov  0x40(%r15),%rbx
          │         mov  %r12,%rsi
4   22.34 │         lock xadd %rsi,0x48(%r15)
4   22.47 │         lock add  %r12,0x50(%r15)
6   33.09 │         lock incq 0x58(%r15)
          │         mov  %rsi,%r13
          │         add  %r12,%r13
          │       ↓ jo   14a
          │         incb 0x13f64f1(%rip)        # 61bc85b <__TMC_END__+0x139363>
          │         mov  %rbx,%rdi
          │         mov  %r13,%rsi
          │       → call __sanitizer_cov_trace_cmp8@plt
          │         cmp  %r13,%rbx
1    5.69 │       ↓ jge  103
          │         incb 0x13f64dd(%rip)        # 61bc85d <__TMC_END__+0x139365>
          │   d0:   incb 0x13f64d8(%rip)        # 61bc85e <__TMC_END__+0x139366>
          │         mov  %rbx,%rax
          │         lock cmpxchg %r13,0x40(%r15)
          │         sete %bpl
          │         mov  %rax,%rbx
          │         mov  %rax,%rdi
          │         mov  %r13,%rsi
          │       → call __sanitizer_cov_trace_cmp8@plt
          │         test %bpl,%bpl
          │       ↓ jne  10b
          │         cmp  %r13,%rbx
          │       ↓ jge  10b
          │         incb 0x13f64af(%rip)        # 61bc860 <__TMC_END__+0x139368>
          │       ↑ jmp  d0
          │  103:   incb 0x13f64a3(%rip)        # 61bc85c <__TMC_END__+0x139364>
          │       ↓ jmp  111
          │  10b:   incb 0x13f649e(%rip)        # 61bc85f <__TMC_END__+0x139367>
          │  111:   incb 0x13f649a(%rip)        # 61bc861 <__TMC_END__+0x139369>
          │         movq $0x0,(%r14)
          │  11e:   incb 0x13f648e(%rip)        # 61bc862 <__TMC_END__+0x13936a>
          │         mov  %r14,%rax
          │         add  $0x8,%rsp
          │         pop  %rbx
          │         pop  %r12
          │         pop  %r13
          │         pop  %r14
          │         pop  %r15
          │         pop  %rbp
          │       ← ret

Benchmarks (based on the old code) on a Zen 3 CPU show that the CPU can get stuck waiting for the value produced by lock xadd instead of progressing:

    0.18 |    lock xadd %rax,(%rdi)
   80.73 |    add    %rsi,%rax

Doing useful work that doesn't depend on %rax (the result of lock xadd) should mask the latency of the memory load across the memory system.

From [1]:

In Zen 3, a single 32MB L3 cache pool is shared among all 8 cores in a chiplet, vs. Zen 2's two 16MB pools each shared among 4 cores in a core complex, of which there were two per chiplet. This new arrangement improves the cache hit rate as well as performance in situations that require cache data to be exchanged among cores, but increases cache latency from 39 cycles in Zen 2 to 46 clock cycles and halves per-core cache bandwidth, although both problems are partially mitigated by higher clock speeds.

[1] https://en.wikipedia.org/wiki/Zen_3#Features

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 20, 2024
@felipecrv felipecrv force-pushed the memory_stats_atomics branch from 971f660 to 66d4454 Compare March 20, 2024 02:16
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 20, 2024
@mapleFU
Member

mapleFU commented Mar 20, 2024

Which benchmark should I run? I'd like to test this on my M1 Pro.

@mapleFU
Member

mapleFU commented Mar 20, 2024

I'm still not understanding this: on x86 the instructions are the same (this might help: https://darkcoding.net/software/rust-atomics-on-x86/), though relaxed might allow compiler re-ordering. Besides, see these cases:

  1. https://github.com/apache/kudu/blob/647726ad6b2aab0c6a6d34e16e027debd8a827eb/src/kudu/util/high_water_mark.h#L29
  2. https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/monitoring/histogram.cc#L76

I think just relaxed is enough here.
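
For reference, a minimal sketch of the relaxed-only high-water-mark pattern those two links use, written as an illustration of the alternative being suggested rather than as code from kudu or RocksDB:

#include <atomic>
#include <cstdint>

class HighWaterMark {
 public:
  void Add(int64_t diff) {
    int64_t current = current_.fetch_add(diff, std::memory_order_relaxed) + diff;
    int64_t max = max_.load(std::memory_order_relaxed);
    // Relaxed CAS loop: only atomicity is needed, no happens-before edges.
    while (current > max &&
           !max_.compare_exchange_weak(max, current, std::memory_order_relaxed)) {
    }
  }

 private:
  std::atomic<int64_t> current_{0};
  std::atomic<int64_t> max_{0};
};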

cpp/src/arrow/memory_pool.h (outdated review thread, resolved)
@github-actions github-actions bot removed the awaiting change review Awaiting change review label Mar 20, 2024
Contributor

@zanmato1984 zanmato1984 left a comment

+1

@mapleFU
Member

mapleFU commented Mar 25, 2024

[image: reorderloadstores]

+1 Also, just curious: how did you draw this?

@felipecrv
Contributor Author

+1 Also, just curious: how did you draw this?

@mapleFU I paste the CSV output of the benchmarks (which I invoke manually) into Excel, then I make a chart based on that data.

total_allocated_bytes_.fetch_add(size, std::memory_order_acq_rel);
num_allocs_.fetch_add(1, std::memory_order_acq_rel);

// If other threads are updating max_memory_ concurrently we leave the loop without
Member

@mapleFU mapleFU Mar 25, 2024

(Actually, this might make the program a bit slower while making max_memory_ more precise? My M1 Pro benchmark got a bit slower in some cases; I guess it might be related.)

Contributor Author

I sampled with perf at 10000 Hertz and this was almost never hit by the sampler (I pasted the disassembly with counts and percentages on the main thread).

If threads are competing for max_memory_, whoever wins and pushes the update through the memory system frees all the other threads from having to update the value, whereas before every thread would race to apply its store to this cache line.
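
The give-up behavior described above corresponds to a loop of roughly this shape (a sketch using the max_memory_ name from the diff context, not necessarily the exact merged code):

#include <atomic>
#include <cstdint>

// Raise the high-water mark, giving up if another thread already raised it
// past `allocated`. compare_exchange_weak refreshes `max_memory` on failure,
// so a losing thread re-checks the condition and can leave the loop without
// ever storing to the cache line.
void MaybeRaiseMax(std::atomic<int64_t>& max_memory_, int64_t allocated) {
  int64_t max_memory = max_memory_.load(std::memory_order_acquire);
  while (max_memory < allocated &&
         !max_memory_.compare_exchange_weak(max_memory, allocated,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
  }
}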

// with execution. When done, max_memory and old_bytes_allocated have
// a higher chance of being available on CPU registers. This also has the
// nice side-effect of putting 3 atomic stores close to each other in the
// instruction stream.
Member

@mapleFU mapleFU Mar 25, 2024

This may be a stupid question, but may I ask why std::memory_order_release is not used for the store instruction here?

Contributor Author

fetch_add is not a store, it's a load+store. The ALU needs the latest value to correctly calculate the sum, which is then stored back to the memory location.

Member

https://godbolt.org/z/bfKe9vG6P It seems fetch_add compiles to lock add when the return value is unused and to lock xadd when it is used; I don't know whether it performs a load.
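
For reference, a small example of the two shapes being compared (the comments describe typical GCC/Clang x86-64 output, stated as an assumption rather than a quote of the linked Godbolt snippet):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// Result discarded: compilers typically emit `lock add`.
void add_only() { counter.fetch_add(1, std::memory_order_acq_rel); }

// Result used: compilers typically emit `lock xadd`, which yields the
// previous value, i.e. the read half of the read-modify-write.
int64_t add_and_read() { return counter.fetch_add(1, std::memory_order_acq_rel); }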

Contributor Author

At the semantic level, there is a load happening. That's what the acq in acq_rel refers to. And at micro-architectural level, the CPU is loading the value before doing the add in the ALU and storing the result.

The fact that lock add and lock xadd both guarantee that load+add+store will happen atomically is an internal concern of the compiler backend when it's doing x86 instruction selection.

The memory model of the C++ language is much richer than the actual implementation in x86. acq_rel (and relaxed, and seq_cst) are the technically correct [1] orders to use in read-modify-write operations, even though passing release here would be synonymous with acq_rel on x86.

[1] https://en.cppreference.com/w/c/atomic/memory_order

Member

I get your point here: acq is used to synchronize-with the other increment commands. Though personally I think we could just sync with the reader? I think we can keep it here.

 - Update max_memory correctly and more efficiently
 - Re-order loads and stores: loads ASAP, grouped stores w/ less branching
 - Store the atomics of MemoryPoolStats in a single cache line (see the sketch below)
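
A minimal sketch of what the last commit message describes, i.e. keeping the statistics counters together so they can share one cache line (field names assumed from the diff context; the 64-byte alignment is an assumption about common cache-line sizes, not a detail confirmed in this thread):

#include <atomic>
#include <cstdint>

// Four 8-byte atomics occupy 32 bytes, so aligning the struct to 64 bytes
// keeps them all on a single cache line: a thread updating the statistics
// touches one line instead of several.
struct alignas(64) MemoryPoolStats {
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocs_{0};
};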
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 25, 2024
@felipecrv felipecrv changed the title GH-40646: [C++] Re-order loads and stores in MemoryPoolStats update GH-40783: [C++] Re-order loads and stores in MemoryPoolStats update Mar 25, 2024
@apache apache deleted a comment from github-actions bot Mar 25, 2024

⚠️ GitHub issue #40783 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv merged commit e3b0bd1 into apache:main Mar 26, 2024
40 of 41 checks passed
@felipecrv felipecrv removed the awaiting change review Awaiting change review label Mar 26, 2024
@felipecrv felipecrv deleted the memory_stats_atomics branch March 26, 2024 02:14

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e3b0bd1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
