
GH-40783: [C++] Re-order loads and stores in MemoryPoolStats update #40647

Merged

felipecrv merged 8 commits into apache:main from memory_stats_atomics on Mar 26, 2024

Conversation

felipecrv
Contributor

@felipecrv felipecrv commented Mar 18, 2024

Rationale for this change

Issue loads as soon as possible so that the latency of waiting on the memory system is masked by other operations.

What changes are included in this PR?

  • Make all the read-modify-write operations use memory_order_acq_rel
  • Make all the loads and stores use memory_order_acquire/release respectively
  • Statically specialize the implementation of UpdateAllocatedBytes so bytes_allocated_ can be updated without waiting for the load of the old value (see the sketch below)
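
For readers skimming the PR, here is a minimal sketch of what the reordered update path looks like, assuming the member names that appear in the review diff further down (bytes_allocated_, max_memory_, total_allocated_bytes_, num_allocs_); the IsAlloc template parameter is a hypothetical stand-in for the static specialization, not the exact merged code:

#include <atomic>
#include <cstdint>

class MemoryPoolStats {
 public:
  // IsAlloc is the static specialization: the allocation path knows at
  // compile time that diff > 0 and that the extra counters must be bumped.
  template <bool IsAlloc>
  void UpdateAllocatedBytes(int64_t diff) {
    // Issue the load of max_memory_ as early as possible so the memory
    // system can start working on it while the RMWs below execute.
    int64_t max_memory = max_memory_.load(std::memory_order_acquire);
    // The fetch_add result is not consumed immediately; the independent
    // RMWs below can execute while the loaded old value is still in flight.
    int64_t old_bytes = bytes_allocated_.fetch_add(diff, std::memory_order_acq_rel);
    if constexpr (IsAlloc) {
      total_allocated_bytes_.fetch_add(diff, std::memory_order_acq_rel);
      num_allocs_.fetch_add(1, std::memory_order_acq_rel);
    }
    const int64_t allocated = old_bytes + diff;
    // Raise the high-water mark, giving up if another thread already
    // pushed it past `allocated`.
    while (max_memory < allocated &&
           !max_memory_.compare_exchange_weak(max_memory, allocated,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
    }
  }

 private:
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocs_{0};
};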

Are these changes tested?

By existing tests.

@felipecrv felipecrv changed the title Memory stats atomics GH-40646: [C++] Use Acquire-Release for loads and stores on MemoryPool statistics Mar 18, 2024
@apache apache deleted a comment from github-actions bot Mar 18, 2024
@amoeba
Member

amoeba commented Mar 18, 2024

@ursabot please benchmark

@ursabot

ursabot commented Mar 18, 2024

Benchmark runs are scheduled for commit 3333c48. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.


Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 3333c48.

There were 10 benchmark results indicating a performance regression:

The full Conbench report has more details.

Member

@mapleFU mapleFU left a comment

I like this change, but personally I think relaxed is enough here (relaxed guarantees "atomic", acq-rel guarantees "happens-before")?

acq-rel might be used in the case below:

#include <atomic>
#include <thread>

struct Ctx { /* some shared, non-atomic state */ };
Ctx t;                               // global context
std::atomic<bool> condition{false};

void Release() {                     // writer: publish the context
  /* set Ctx */
  condition.store(true, std::memory_order_release);
}

void Acquire() {                     // reader: wait, then consume the context
  while (!condition.load(std::memory_order_acquire)) std::this_thread::yield();
  /* read Ctx */
}

For MemoryPool, do we really need acq-rel here?

@pitrou
Member

pitrou commented Mar 19, 2024

Does it actually change anything for x86?

@pitrou
Member

pitrou commented Mar 19, 2024

If you're really interested in reducing the contention costs for MemoryPool statistics, then I would suggest taking a look at https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

@felipecrv
Contributor Author

Does it actually change anything for x86?

Nothing, unless the compiler decides to re-order the loads and the stores, which it didn't in this case.

I suspect my change regarding max_memory_ didn't lead to a cheaper sequence of instructions — I assumed max_memory_.store(allocated, seq_cst) (written as max_memory_ = allocated; in the code) was lock-prefixed and more expensive (combined with the seq_cst load) than the compare_exchange I added.

If you're really interested in reducing the contention costs for MemoryPool statistics, then I would suggest taking a look at https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

Not just the contention (the least of the issues really), but the time it takes to perform all the memory operations.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 19, 2024
@mapleFU
Member

mapleFU commented Mar 20, 2024

Does it actually change anything for x86?

Wouldn't acq-rel be cheaper than the original total ordering?

@felipecrv
Contributor Author

Does it actually change anything for x86?

Wouldn't acq-rel be cheaper than the original total ordering?

Only on architectures with a weaker memory model (e.g. ARM). x86 guarantees that all the stores are ordered (not immediately visible, but they are ordered). [1] explains this much better than I ever could.

acq-rel can enable the compiler to re-order operations if it decides that can unlock optimizations. Not the case here, so I'm manually experimenting with different orderings to mask the latency of fetch/load operations on atomic variables.

[1] https://research.swtch.com/hwmm#x86
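
As a concrete illustration of that last point (an assumption about typical GCC/Clang codegen on x86-64, not something taken from this PR): the ordering argument does not change the instruction emitted for a read-modify-write, only which compiler re-orderings around it remain legal.

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// On x86-64 both functions typically compile to the same lock-prefixed
// instruction; relaxed vs. acq_rel only constrains the compiler.
void bump_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
void bump_acq_rel() { counter.fetch_add(1, std::memory_order_acq_rel); }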

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 20, 2024
@felipecrv
Contributor Author

I pushed some re-ordering of loads and stores that I believe can work better on CPUs with higher latency in the memory system [1]. Note that my code updates max_memory_ correctly (I removed the comment about "no need to be rigorous"). The lock cmpxchg that I introduced never appears in profiles, so I'm keeping it. I also suspect it would be beneficial on contended workloads because we can give up on updating max_memory_ if another thread increases it before the current thread does.

When benchmarking on an Intel CPU, compared to the baseline, this version doesn't cost less, but it should improve the situation on CPUs that spend more time [1] in the load issued by bytes_allocated_.fetch_add(diff) (the lock xadd instruction on x86). My hypothesis is that by not immediately using the result of the xadd, the CPU can wait for the memory system in the background while performing other operations. It also clusters all the lock-prefixed instructions together, which is recommended practice.

Annotated perf report of arrow::memory_pool::internal::JemallocAllocator::AllocateAligned after these changes:

  Percent │         push %rbp
          │         push %r15
          │         push %r14
          │         push %r13
          │         push %r12
          │         push %rbx
          │         push %rax
          │         mov  %r8,%r13
          │         mov  %rcx,%rbx
          │         mov  %rdx,%r12
          │         mov  %rsi,%r15
          │         mov  %rdi,%r14
          │         incb 0x13f6583(%rip)        # 61bc853 <__TMC_END__+0x13935b>
1    5.31 │         xor  %edi,%edi
          │         mov  %rdx,%rsi
          │       → call __sanitizer_cov_trace_const_cmp8@plt
          │         test %r12,%r12
          │       ↓ js   6c
1    5.56 │         incb 0x13f6570(%rip)        # 61bc855 <__TMC_END__+0x13935d>
          │         mov  %rsp,%rdi
          │         mov  %r12,%rsi
          │         mov  %rbx,%rdx
          │         mov  %r13,%rcx
          │       → call arrow::memory_pool::internal::JemallocAllocator::AllocateAligned@plt
          │         test %r14,%r14
          │       ↓ je   136
          │         incb 0x13f6552(%rip)        # 61bc857 <__TMC_END__+0x13935f>
          │         mov  (%rsp),%rax
          │         mov  %rax,(%r14)
          │         test %rax,%rax
          │       ↓ je   8b
          │         incb 0x13f6541(%rip)        # 61bc858 <__TMC_END__+0x139360>
          │       ↓ jmp  11e
          │   6c:   incb 0x13f6532(%rip)        # 61bc854 <__TMC_END__+0x13935c>
          │         lea  typeinfo name for arrow::json::TableReader+0x22a,%rdx
          │         mov  %r14,%rdi
          │         mov  $0x4,%esi
          │       → call arrow::Status::FromArgs<char const (&) [21]>@plt
          │       ↓ jmp  11e
          │   8b:   incb 0x13f6518(%rip)        # 61bc859 <__TMC_END__+0x139361>
1    5.54 │         mov  0x40(%r15),%rbx
          │         mov  %r12,%rsi
4   22.34 │         lock xadd %rsi,0x48(%r15)
4   22.47 │         lock add  %r12,0x50(%r15)
6   33.09 │         lock incq 0x58(%r15)
          │         mov  %rsi,%r13
          │         add  %r12,%r13
          │       ↓ jo   14a
          │         incb 0x13f64f1(%rip)        # 61bc85b <__TMC_END__+0x139363>
          │         mov  %rbx,%rdi
          │         mov  %r13,%rsi
          │       → call __sanitizer_cov_trace_cmp8@plt
          │         cmp  %r13,%rbx
1    5.69 │       ↓ jge  103
          │         incb 0x13f64dd(%rip)        # 61bc85d <__TMC_END__+0x139365>
          │   d0:   incb 0x13f64d8(%rip)        # 61bc85e <__TMC_END__+0x139366>
          │         mov  %rbx,%rax
          │         lock cmpxchg %r13,0x40(%r15)
          │         sete %bpl
          │         mov  %rax,%rbx
          │         mov  %rax,%rdi
          │         mov  %r13,%rsi
          │       → call __sanitizer_cov_trace_cmp8@plt
          │         test %bpl,%bpl
          │       ↓ jne  10b
          │         cmp  %r13,%rbx
          │       ↓ jge  10b
          │         incb 0x13f64af(%rip)        # 61bc860 <__TMC_END__+0x139368>
          │       ↑ jmp  d0
          │  103:   incb 0x13f64a3(%rip)        # 61bc85c <__TMC_END__+0x139364>
          │       ↓ jmp  111
          │  10b:   incb 0x13f649e(%rip)        # 61bc85f <__TMC_END__+0x139367>
          │  111:   incb 0x13f649a(%rip)        # 61bc861 <__TMC_END__+0x139369>
          │         movq $0x0,(%r14)
          │  11e:   incb 0x13f648e(%rip)        # 61bc862 <__TMC_END__+0x13936a>
          │         mov  %r14,%rax
          │         add  $0x8,%rsp
          │         pop  %rbx
          │         pop  %r12
          │         pop  %r13
          │         pop  %r14
          │         pop  %r15
          │         pop  %rbp
          │       ← ret

Benchmarks (based on the old code) on a Zen 3 CPU show that the CPU can get stuck waiting for the value produced by lock xadd instead of progressing:

    0.18 |    lock xadd %rax,(%rdi)
   80.73 |    add    %rsi,%rax

Doing useful work that doesn't depend on %rax (the result of lock xadd) should mask the latency of the memory load across the memory system.

From [1]:

In Zen 3, a single 32MB L3 cache pool is shared among all 8 cores in a chiplet, vs. Zen 2's two 16MB pools each shared among 4 cores in a core complex, of which there were two per chiplet. This new arrangement improves the cache hit rate as well as performance in situations that require cache data to be exchanged among cores, but increases cache latency from 39 cycles in Zen 2 to 46 clock cycles and halves per-core cache bandwidth, although both problems are partially mitigated by higher clock speeds.

[1] https://en.wikipedia.org/wiki/Zen_3#Features

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 20, 2024
@felipecrv felipecrv force-pushed the memory_stats_atomics branch from 971f660 to 66d4454 Compare March 20, 2024 02:16
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 20, 2024
@mapleFU
Member

mapleFU commented Mar 20, 2024

Which benchmark should I run? I'd like to test this on my M1 Pro.

@mapleFU
Member

mapleFU commented Mar 20, 2024

I'm still not understanding this: on x86 the instructions are the same (this might help: https://darkcoding.net/software/rust-atomics-on-x86/), though relaxed might allow compiler re-ordering. Besides, see these cases:

  1. https://github.com/apache/kudu/blob/647726ad6b2aab0c6a6d34e16e027debd8a827eb/src/kudu/util/high_water_mark.h#L29
  2. https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/monitoring/histogram.cc#L76

I think just relaxed is enough here.
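
For reference, a minimal sketch of the relaxed-only high-water-mark pattern those two links use, written as an illustration of the alternative being suggested rather than as code from kudu or RocksDB:

#include <atomic>
#include <cstdint>

class HighWaterMark {
 public:
  void Add(int64_t diff) {
    int64_t current = current_.fetch_add(diff, std::memory_order_relaxed) + diff;
    int64_t max = max_.load(std::memory_order_relaxed);
    // Relaxed CAS loop: only atomicity is needed, no happens-before edges.
    while (current > max &&
           !max_.compare_exchange_weak(max, current, std::memory_order_relaxed)) {
    }
  }

 private:
  std::atomic<int64_t> current_{0};
  std::atomic<int64_t> max_{0};
};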

cpp/src/arrow/memory_pool.h (outdated review thread, resolved)
@github-actions github-actions bot removed the awaiting change review Awaiting change review label Mar 20, 2024
Contributor

@zanmato1984 zanmato1984 left a comment

+1

@mapleFU
Member

mapleFU commented Mar 25, 2024

[image: reorderloadstores]

+1 Also, just curious: how did you draw this?

@felipecrv
Contributor Author

+1 Also, just curious: how did you draw this?

@mapleFU I paste the CSV output of the benchmarks (which I invoke manually) into Excel, then I make a chart based on that data.

total_allocated_bytes_.fetch_add(size, std::memory_order_acq_rel);
num_allocs_.fetch_add(1, std::memory_order_acq_rel);

// If other threads are updating max_memory_ concurrently we leave the loop without
Member

@mapleFU mapleFU Mar 25, 2024

(Actually, this might make the program a bit slower while making max_memory_ more precise? My M1 Pro benchmark got a bit slower in some cases; I guess it might be related.)

Contributor Author

I sampled with perf at 10000 Hertz and this was almost never hit by the sampler (I pasted the disassembly with counts and percentages on the main thread).

If threads are competing for max_memory_, whoever wins and pushes the update through the memory system frees all the other threads from having to update the value, whereas before every thread would race to apply its store to this cache line.
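
The give-up behavior described above corresponds to a loop of roughly this shape (a sketch using the max_memory_ name from the diff context, not necessarily the exact merged code):

#include <atomic>
#include <cstdint>

// Raise the high-water mark, giving up if another thread already raised it
// past `allocated`. compare_exchange_weak refreshes `max_memory` on failure,
// so a losing thread re-checks the condition and can leave the loop without
// ever storing to the cache line.
void MaybeRaiseMax(std::atomic<int64_t>& max_memory_, int64_t allocated) {
  int64_t max_memory = max_memory_.load(std::memory_order_acquire);
  while (max_memory < allocated &&
         !max_memory_.compare_exchange_weak(max_memory, allocated,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
  }
}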

// with execution. When done, max_memory and old_bytes_allocated have
// a higher chance of being available on CPU registers. This also has the
// nice side-effect of putting 3 atomic stores close to each other in the
// instruction stream.
Member

@mapleFU mapleFU Mar 25, 2024

This may be a stupid question, but may I ask why std::memory_order_release is not used for the store instruction here?

Contributor Author

fetch_add is not a store, it's a load+store. The ALU needs the latest value to correctly calculate the sum, which is then stored back to the memory location.

Member

https://godbolt.org/z/bfKe9vG6P It seems fetch_add compiles to lock add when the return value is unused and to lock xadd when it is used; I don't know whether it performs a load.
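
For reference, a small example of the two shapes being compared (the comments describe typical GCC/Clang x86-64 output, stated as an assumption rather than a quote of the linked Godbolt snippet):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// Result discarded: compilers typically emit `lock add`.
void add_only() { counter.fetch_add(1, std::memory_order_acq_rel); }

// Result used: compilers typically emit `lock xadd`, which yields the
// previous value, i.e. the read half of the read-modify-write.
int64_t add_and_read() { return counter.fetch_add(1, std::memory_order_acq_rel); }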

Contributor Author

At the semantic level, there is a load happening. That's what the acq in acq_rel refers to. And at micro-architectural level, the CPU is loading the value before doing the add in the ALU and storing the result.

The fact that lock add and lock xadd both guarantee that load+add+store will happen atomically is an internal concern of the compiler backend when it's doing x86 instruction selection.

The memory model of the C++ language is much richer than the actual implementation in x86. acq_rel (and relaxed, and seq_cst) are the technically correct [1] orders to use in read-modify-write operations, even though passing release here would be synonymous with acq_rel on x86.

[1] https://en.cppreference.com/w/c/atomic/memory_order

Member

I get your point here: acq is used to synchronize-with the other increment commands. Though personally I think we could just sync with the reader? I think we can keep it here.

 - Update max_memory correctly and more efficiently
 - Re-order loads and stores: loads ASAP, grouped stores w/ less branching
 - Store the atomics of MemoryPoolStats in a single cache line (see the sketch below)
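
A minimal sketch of what the last commit message describes, i.e. keeping the statistics counters together so they can share one cache line (field names assumed from the diff context; the 64-byte alignment is an assumption about common cache-line sizes, not a detail confirmed in this thread):

#include <atomic>
#include <cstdint>

// Four 8-byte atomics occupy 32 bytes, so aligning the struct to 64 bytes
// keeps them all on a single cache line: a thread updating the statistics
// touches one line instead of several.
struct alignas(64) MemoryPoolStats {
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocs_{0};
};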
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 25, 2024
@felipecrv felipecrv changed the title GH-40646: [C++] Re-order loads and stores in MemoryPoolStats update GH-40783: [C++] Re-order loads and stores in MemoryPoolStats update Mar 25, 2024
@apache apache deleted a comment from github-actions bot Mar 25, 2024

⚠️ GitHub issue #40783 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv merged commit e3b0bd1 into apache:main Mar 26, 2024
40 of 41 checks passed
@felipecrv felipecrv removed the awaiting change review Awaiting change review label Mar 26, 2024
@felipecrv felipecrv deleted the memory_stats_atomics branch March 26, 2024 02:14

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e3b0bd1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
