Remove useless `#[global_allocator]` from rustc and rustdoc. #92222

nnethercote · 2021-12-23T06:25:46Z

This was added in #83152, which has several errors in its comments.

This commit also fix up the comments, which are quite wrong and
misleading.

This was added in rust-lang#83152, which has several errors in its comments. This commit also fix up the comments, which are quite wrong and misleading.

nnethercote · 2021-12-23T06:28:27Z

The knowledge embodied in this PR was hard-earned: I spent more than a day trying to do some jemalloc experiments, horribly confused as to why the tikv-jemallocator crate's impl of GlobalAlloc wasn't being used even though the #[global_allocator] was present.

I have confirmed on Linux that jemalloc is still being used (when jemalloc=true is specified in config.toml) for rustc and rustdoc after this change. I did this by looking at Cachegrind profiles and seeing jemalloc symbols in them. It also makes sense that these aren't necessary, given that they weren't present before #83152 added them.

nnethercote · 2021-12-23T09:59:08Z

BTW, this PR touches on the reasons why rustc does not and cannot use jemalloc via GlobalAlloc. But if there is some hack that wouldn't work in general but would work for local experiments, I'd love to hear about it. E.g. "add a #[global_allocator] that points to tikv-jemallocator in rustc and rustc_driver and std", or something like that. That would allow quantifying the effect of sized deallocation, among other things.

alexcrichton

@bors: r+ rollup=never

I figure it's good to keep this out of rollups to ensure it's got a separate perf measurement, but I wouldn't really expect any measurement changes as a result of this.

I was gonna write up how I think we could hook into sized deallocation but turns out I already did that, yay!

alexcrichton · 2021-12-23T21:11:46Z

compiler/rustc/src/main.rs

+// A note about jemalloc: rustc uses jemalloc when built for CI and
+// distribution. The obvious way to do this is with the `#[global_allocator]`
+// mechanism. However, for complicated reasons (see
+// https://github.com/rust-lang/rust/pull/81782#issuecomment-784438001 for some


I thought I had written up a comment on some of this awhile back and lo-and-behold, thanks for linking this here!

alexcrichton · 2021-12-23T21:13:11Z

@bors: r+ rollup=never

bors · 2021-12-23T21:13:13Z

📌 Commit bb23bfc has been approved by alexcrichton

bors · 2021-12-23T22:34:21Z

⌛ Testing commit bb23bfc with merge d6d12b6...

nnethercote · 2021-12-23T23:34:32Z

Thank you for the fast review.

I was gonna write up how I think we could hook into sized deallocation but turns out I already did that, yay!

Unfortunately I'm stuck on step 1: "First probably measure the impact to see whether it's at all worth it." Hence my question above.

alexcrichton · 2021-12-24T00:11:46Z

Oh dear sorry I didn't read close enough! I think you may actually be able to hack around things with Linux symbol resolution in dynamic libraries. For example this program:

use std::ptr;
fn main() {
    let _x = Box::new(1);
}
#[no_mangle]
pub unsafe extern "C" fn __rust_alloc(size: usize, align: usize) -> *mut u8 {
    eprintln!("hello");
    ptr::null_mut()
}

yields:

$ rustc foo.rs -C prefer-dynamic -C rpath
$ ./foo
hello
memory allocation of 5 bytes failed
zsh: abort      ./foo

so I think if you define these symbols in the rustc executable it might work? Your implementation when then forward to tikv_jemallocator_sys most likely.

Note that this certainly won't work for Windows, may or may not work for macOS (unsure about the dynamic symbol resolution rules there), and is kind of a weird hack for Linux. I think this might work for experimenting but I wouldn't necessarily recommend landing it!

bors · 2021-12-24T01:54:51Z

☀️ Test successful - checks-actions
Approved by: alexcrichton
Pushing d6d12b6 to master...

rust-timer · 2021-12-24T03:23:20Z

Finished benchmarking commit (d6d12b6): comparison url.

Summary: This benchmark run did not return any relevant changes.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

nnethercote · 2021-12-24T04:59:44Z

I think you may actually be able to hack around things with Linux symbol resolution in dynamic libraries.

Your suggestion worked! Thank you. For posterity, here is my diff:

diff --git a/compiler/rustc/src/main.rs b/compiler/rustc/src/main.rs
index c80fab99496..2d8cfa2531b 100644
--- a/compiler/rustc/src/main.rs
+++ b/compiler/rustc/src/main.rs
@@ -11,6 +11,34 @@
 #[cfg(feature = "tikv-jemalloc-sys")]
 use tikv_jemalloc_sys as jemalloc_sys;

+use std::alloc::{GlobalAlloc, Layout};
+use tikv_jemallocator::Jemalloc;
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_alloc(size: usize, align: usize) -> *mut u8 {
+    Jemalloc.alloc(Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_alloc_zeroed(size: usize, align: usize) -> *mut u8 {
+    Jemalloc.alloc_zeroed(Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_dealloc(ptr: *mut u8, size: usize, align: usize) {
+    Jemalloc.dealloc(ptr, Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_realloc(
+    ptr: *mut u8,
+    old_size: usize,
+    align: usize,
+    new_size: usize,
+) -> *mut u8 {
+    Jemalloc.realloc(ptr, Layout::from_size_align_unchecked(old_size, align), new_size)
+}
+
 fn main() {
     // Pull in jemalloc when enabled.
     //

I used this to measure the effect of sized deallocations. It ended up being a slowdown. Here's the top part of the instruction counts table:

many-assoc-items check	full	3.39%	16.94x
many-assoc-items debug	full	3.37%	16.84x
many-assoc-items opt	full	3.36%	16.82x
derive check	full	2.51%	12.56x
repro_crate check	full	2.45%	12.23x
derive debug	full	2.43%	12.13x
repro_crate debug	full	2.31%	11.56x
repro_crate opt	full	2.31%	11.55x
deep-vector check	full	2.31%	11.55x
stm32f4 debug	full	2.16%	10.79x
match-stress-exhaustive_patterns debug	full	2.16%	10.79x
match-stress-exhaustive_patterns opt	full	2.13%	10.64x
match-stress-exhaustive_patterns check	full	2.12%	10.62x
stm32f4 opt	full	2.09%	10.47x
syn check	full	2.05%	10.23x
stm32f4 check	full	2.04%	10.18x
derive opt	full	1.96%	9.82x
encoding check	full	1.94%	9.71x
ripgrep check	full	1.94%	9.69x
regex check	full	1.93%	9.67x
webrender check	full	1.90%	9.51x
futures check	full	1.87%	9.37x
hyper-2 check	full	1.84%	9.22x

There were lots more entries in the 0.3-1.8% range after that.

Here is the top of the cachegrind diff (lightly edited) for a many-assoc-items check build, which showed the biggest slowdown:

--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
163,593,248 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                    file:function
--------------------------------------------------------------------------------
108,268,032 (66.18%)  git.luolix.top-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:tikv_jemallocator::layout_to_flags
-72,929,492 (-44.6%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/cache_bin.h:free
 72,929,485 (44.58%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/cache_bin.h:sdallocx
 56,023,307 (34.25%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/rtree.h:sdallocx
-56,023,307 (-34.2%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/rtree.h:free
 47,350,395 (28.94%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/src/jemalloc.c:sdallocx
 38,798,730 (23.72%)  git.luolix.top-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:__rg_dealloc
-38,728,458 (-23.7%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/src/jemalloc.c:free
 28,254,584 (17.27%)  git.luolix.top-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:__rg_alloc
-24,722,761 (-15.1%)  library/std/src/sys/unix/alloc.rs:__rdl_alloc
 21,554,850 (13.18%)  compiler/rustc/src/main.rs:__rg_dealloc
  8,621,940 ( 5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/atomic.h:sdallocx
 -8,621,940 (-5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/atomic.h:free
  8,620,856 ( 5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/ticker.h:sdallocx
 -8,620,856 (-5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/ticker.h:free
 -7,791,590 (-4.76%)  library/std/src/sys/unix/alloc.rs:__rdl_alloc_zeroed
  7,063,646 ( 4.32%)  compiler/rustc/src/main.rs:__rg_alloc
 -7,063,646 (-4.32%)  library/std/src/alloc.rs:__rdl_alloc

It's clear from this that the hack worked -- you can see negative counts for free and positive counts for sdallocx, showing that the second run switched from using free to using sdallocx.

The main source of additional instructions is the layout_to_flags function within tikv-jemallocator, which imposes a cost for opting into the sized deallocation.

         .           fn layout_to_flags(align: usize, size: usize) -> c_int {
99,245,696 ( 1.97%)      if align <= ALIGNOF_MAX_ALIGN_T && align <= size {
         .                   0
         .               } else {
         .                   ffi::MALLOCX_ALIGN(align)
         .               }
 9,022,336 ( 0.18%)  }

The other significant source of additional instructions is the 47,350,395 from sdallocx itself. There's no evidence of sdallocx providing any kind of improvement over free.

alexcrichton · 2021-12-24T05:32:20Z

Nice! I'm glad the test was done first before the design in this case... A bit of a surprising result, though. Even if the layout_to_flags went away it seems like it didn't really have that huge improvements, which is unexpected!

guswynn · 2021-12-28T20:15:24Z

Nice! I'm glad the test was done first before the design in this case... A bit of a surprising result, though. Even if the layout_to_flags went away it seems like it didn't really have that huge improvements, which is unexpected!

So surprising in fact, that perhaps we should tell the jemalloc devs over at https://github.com/jemalloc/jemalloc to investigate further...

nnethercote · 2022-01-04T10:42:06Z

I used this to measure the effect of sized deallocations. It ended up being a slowdown.

I double checked this in #92548 and got very similar results on CI to what I measured locally, which was a good sanity check.

LifeIsStrange · 2022-01-04T11:41:16Z

@davidtgoldblatt @interwq
Hi dear Jemalloc devs, Rust has finally been tested using the sized_dealloc api and it surprisingly incurs a slowdown, which is why we wanted to signal this strange result to you.

interwq · 2022-01-04T19:43:54Z

Thanks for letting us know @LifeIsStrange. A few factors might explain the slowdown:

non-opt builds, i.e. does the many-assoc-items check use jemalloc-debug -- if that's the case, the slowdown is not surprising since the added sanity checks will cancel the speedup from sdallocx.
is the benchmark doing tight loop-ish workload w/o much data access? In that case, the benefit from sized dealloc is also minimal, since the speedup comes from avoiding cache-miss / metadata-lookups (that is required by the non-sized dealloc, to figure out the size). Before our internal sized-dealloc rollout, the metadata lookup instructions account for ~30% of the total free path cost in our production workload. And during the sized-dealloc rollout, we observed a matching ~30% win on the dealloc fastpath. However it won't surprise me if many benchmarks don't show the same change, since their workload may not be cache intensive.
(less likely to be the case) is the jemalloc configured with --enable-prof, and at the same time, the tests do mainly 4K aligned allocations? We consider this an edge case where the sized dealloc will have to check additional metadata to tell if the pointer was profiling-sampled (also losing the benefit of sdallocx) https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc.c#L3069

cc: @Lapenkov

nnethercote · 2022-01-04T20:53:26Z

@interwq As far as I know none of the things you suggested are happening.

rustc is fairly allocation heavy. Like any compiler, it has lots of heterogenous tree structures.

This comment has some analysis in its second half of the sized deallocation change. Much of the slowdown is caused by an additional layer of Rust code in tikv-jemallocator being run in front of each allocation and deallocation, which isn't jemalloc's fault and isn't that surprising. What is more surprising is that sdallocx appears to be no faster, and perhaps slightly worse, than free. There's no sign of fewer metadata lookups.

interwq · 2022-01-04T22:25:45Z

Agree that the sdallocx not being faster part is unexpected. Can you grab a perf annotate of sdallocx (i.e. the instruction level cost within sdallocx) when running the benchmark? That way we can know for sure if the metadata-lookup is bypassed.

nnethercote · 2022-01-04T23:20:38Z

@interwq: I have attached the Cachegrind annotations of jemalloc.c for the two runs.

Using free: je0.txt
Using sdallocx: je1.txt

Both profiles show identical paths through free_fastpath, which suggests that config_cache_oblivious is true. And I found that JEMALLOC_CACHE_OBLIVIOUS is defined in jemalloc_internal_defs.h in our build.

Does that make sense? Is setting JEMALLOC_CACHE_OBLIVIOUS reasonable?

interwq · 2022-01-04T23:38:59Z

@nnethercote : thanks for the info! Yes that explains it -- sorry that I forgot in the last release CACHE_OBLIVIOUS also plays a role here. That requirement got optimized away in jemalloc/jemalloc#1749

If you'd like, feel free to try the current jemalloc dev branch: jemalloc/jemalloc@f509703

In the meantime we are preparing the upcoming 5.3 release, which will be based on the commit above.

re: --enable-cache-oblivious: it's still enabled by default but worth testing without it -- the feature provides address / cache index randomization for large allocations (by default >=16K), at the cost of one extra physical page per large allocation. In our workload we haven't observed CPU wins from it recently, but the extra memory cost is concrete. It's enabled by default more to limit worst case behavior, i.e. when it matters on some architectures (I suspect older ones), it matters a lot.

nnethercote · 2022-01-17T02:34:04Z

If you'd like, feel free to try the current jemalloc dev branch: jemalloc/jemalloc@f509703

In the meantime we are preparing the upcoming 5.3 release, which will be based on the commit above.

I tried the current jemalloc dev branch. Compared to 5.2 it shows a small but clear reduction in instruction counts for rustc, up to 1.44%. Good to know for the future.

I also tried doing the sized deallocation thing. Results were better than before but still a net slowdown, mostly because of the cost of the layout_to_flags functions within the Rust code.

LifeIsStrange · 2022-05-06T20:24:14Z

@nnethercote @guswynn Jemalloc 5.3.0 has been officially released as stable! https://github.com/jemalloc/jemalloc/releases/tag/5.3.0
So tikv-jemalloc and the rustc version it ships should be updated accordingly :)

lqd · 2022-05-07T08:30:29Z

Note: tikv-jemalloc-sys does not yet have a release for the 5.3 line, but they do have a WIP update to the 5.3 RC. I have tried it in the draft #96790, it looks good and once the release happens we can update that to the final 5.3.0.

nnethercote · 2022-05-07T09:39:03Z

Thanks, @lqd !

Remove useless #[global_allocator] from rustc and rustdoc.

bb23bfc

This was added in rust-lang#83152, which has several errors in its comments. This commit also fix up the comments, which are quite wrong and misleading.

rust-highfive assigned alexcrichton Dec 23, 2021

rustbot added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Dec 23, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 23, 2021

nnethercote mentioned this pull request Dec 23, 2021

Use tikv-jemallocator in rustc/rustdoc in addition to jemalloc-sys when enabled. #83152

Merged

alexcrichton reviewed Dec 23, 2021

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 23, 2021

bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 24, 2021

bors merged commit d6d12b6 into rust-lang:master Dec 24, 2021

rustbot added this to the 1.59.0 milestone Dec 24, 2021

nnethercote deleted the rm-global_allocator-rustc-rustdoc branch December 24, 2021 03:38

nnethercote mentioned this pull request Jan 4, 2022

[DO NOT MERGE] Test jemalloc's sized deallocation path. #92548

Closed

jyn514 mentioned this pull request Feb 24, 2022

Clippy crash (version 1.59 on macOS) rust-lang/rust-clippy#8470

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove useless `#[global_allocator]` from rustc and rustdoc. #92222

Remove useless `#[global_allocator]` from rustc and rustdoc. #92222

nnethercote commented Dec 23, 2021

nnethercote commented Dec 23, 2021

nnethercote commented Dec 23, 2021

alexcrichton left a comment

alexcrichton Dec 23, 2021

alexcrichton commented Dec 23, 2021

bors commented Dec 23, 2021

bors commented Dec 23, 2021

nnethercote commented Dec 23, 2021

alexcrichton commented Dec 24, 2021

bors commented Dec 24, 2021

rust-timer commented Dec 24, 2021

nnethercote commented Dec 24, 2021

alexcrichton commented Dec 24, 2021

guswynn commented Dec 28, 2021

nnethercote commented Jan 4, 2022

LifeIsStrange commented Jan 4, 2022

interwq commented Jan 4, 2022 •

edited

Loading

nnethercote commented Jan 4, 2022 •

edited

Loading

interwq commented Jan 4, 2022

nnethercote commented Jan 4, 2022

interwq commented Jan 4, 2022

nnethercote commented Jan 17, 2022

LifeIsStrange commented May 6, 2022

lqd commented May 7, 2022

nnethercote commented May 7, 2022

Remove useless #[global_allocator] from rustc and rustdoc. #92222

Remove useless #[global_allocator] from rustc and rustdoc. #92222

Conversation

nnethercote commented Dec 23, 2021

nnethercote commented Dec 23, 2021

nnethercote commented Dec 23, 2021

alexcrichton left a comment

Choose a reason for hiding this comment

alexcrichton Dec 23, 2021

Choose a reason for hiding this comment

alexcrichton commented Dec 23, 2021

bors commented Dec 23, 2021

bors commented Dec 23, 2021

nnethercote commented Dec 23, 2021

alexcrichton commented Dec 24, 2021

bors commented Dec 24, 2021

rust-timer commented Dec 24, 2021

nnethercote commented Dec 24, 2021

alexcrichton commented Dec 24, 2021

guswynn commented Dec 28, 2021

nnethercote commented Jan 4, 2022

LifeIsStrange commented Jan 4, 2022

interwq commented Jan 4, 2022 • edited Loading

nnethercote commented Jan 4, 2022 • edited Loading

interwq commented Jan 4, 2022

nnethercote commented Jan 4, 2022

interwq commented Jan 4, 2022

nnethercote commented Jan 17, 2022

LifeIsStrange commented May 6, 2022

lqd commented May 7, 2022

nnethercote commented May 7, 2022

Remove useless `#[global_allocator]` from rustc and rustdoc. #92222

Remove useless `#[global_allocator]` from rustc and rustdoc. #92222

interwq commented Jan 4, 2022 •

edited

Loading

nnethercote commented Jan 4, 2022 •

edited

Loading