
Remove useless #[global_allocator] from rustc and rustdoc. #92222

Merged

Conversation

nnethercote
Contributor

This was added in #83152, which has several errors in its comments.

This commit also fixes up the comments, which were quite wrong and misleading.

r? @alexcrichton

@rustbot rustbot added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Dec 23, 2021
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 23, 2021
@nnethercote
Contributor Author

The knowledge embodied in this PR was hard-earned: I spent more than a day trying to do some jemalloc experiments, horribly confused as to why the tikv-jemallocator crate's impl of GlobalAlloc wasn't being used even though the #[global_allocator] was present.

I have confirmed on Linux that jemalloc is still being used (when jemalloc=true is specified in config.toml) for rustc and rustdoc after this change. I did this by looking at Cachegrind profiles and seeing jemalloc symbols in them. It also makes sense that these aren't necessary, given that they weren't present before #83152 added them.

@nnethercote
Contributor Author

BTW, this PR touches on the reasons why rustc does not and cannot use jemalloc via GlobalAlloc. But if there is some hack that wouldn't work in general but would work for local experiments, I'd love to hear about it. E.g. "add a #[global_allocator] that points to tikv-jemallocator in rustc and rustc_driver and std", or something like that. That would allow quantifying the effect of sized deallocation, among other things.
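For concreteness, here is a minimal sketch (not part of this PR) of the #[global_allocator] mechanism being referred to, as it would appear in an ordinary binary crate that depends on tikv-jemallocator. This is the "obvious" approach that, for the reasons discussed in this thread, does not take effect for rustc itself:

use tikv_jemallocator::Jemalloc;

// Route all Rust heap allocations in this binary through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // With the attribute above, Vec/Box/String allocations go through
    // tikv-jemallocator's GlobalAlloc impl instead of the system allocator.
    let v: Vec<u32> = (0..1024).collect();
    println!("allocated {} elements via jemalloc", v.len());
}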

Member

@alexcrichton left a comment

@bors: r+ rollup=never

I figure it's good to keep this out of rollups to ensure it's got a separate perf measurement, but I wouldn't really expect any measurement changes as a result of this.

I was gonna write up how I think we could hook into sized deallocation but turns out I already did that, yay!

// A note about jemalloc: rustc uses jemalloc when built for CI and
// distribution. The obvious way to do this is with the `#[global_allocator]`
// mechanism. However, for complicated reasons (see
// https://github.com/rust-lang/rust/pull/81782#issuecomment-784438001 for some

I thought I had written up a comment on some of this a while back and, lo and behold, I had -- thanks for linking this here!

@alexcrichton
Member

@bors: r+ rollup=never

@bors
Contributor

bors commented Dec 23, 2021

📌 Commit bb23bfc has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 23, 2021
@bors
Contributor

bors commented Dec 23, 2021

⌛ Testing commit bb23bfc with merge d6d12b6...

@nnethercote
Contributor Author

Thank you for the fast review.

I was gonna write up how I think we could hook into sized deallocation but turns out I already did that, yay!

Unfortunately I'm stuck on step 1: "First probably measure the impact to see whether it's at all worth it." Hence my question above.

@alexcrichton
Member

Oh dear, sorry, I didn't read closely enough! I think you may actually be able to hack around things with Linux symbol resolution in dynamic libraries. For example, this program:

use std::ptr;
fn main() {
    let _x = Box::new(1);
}
#[no_mangle]
pub unsafe extern "C" fn __rust_alloc(size: usize, align: usize) -> *mut u8 {
    eprintln!("hello");
    ptr::null_mut()
}

yields:

$ rustc foo.rs -C prefer-dynamic -C rpath
$ ./foo
hello
memory allocation of 5 bytes failed
zsh: abort      ./foo

so I think if you define these symbols in the rustc executable it might work? Your implementation would then forward to tikv_jemalloc_sys, most likely.

Note that this certainly won't work for Windows, may or may not work for macOS (unsure about the dynamic symbol resolution rules there), and is kind of a weird hack for Linux. I think this might work for experimenting but I wouldn't necessarily recommend landing it!
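For concreteness, here is a rough sketch of that suggestion, assuming the binary depends on tikv-jemalloc-sys (which exposes jemalloc's non-standard mallocx/sdallocx API); the 16-byte threshold stands in for jemalloc's ALIGNOF_MAX_ALIGN_T and is an assumption here. The actual diff further down forwards through tikv-jemallocator's GlobalAlloc impl instead:

use std::os::raw::c_int;
use tikv_jemalloc_sys as ffi;

// Compute jemalloc flags from an allocation's alignment: small, natural
// alignments need no MALLOCX_ALIGN flag.
fn flags_for(align: usize, size: usize) -> c_int {
    if align <= 16 && align <= size { 0 } else { ffi::MALLOCX_ALIGN(align) }
}

// Defining these symbols in the executable shadows the ones resolved from
// libstd, per the dynamic symbol resolution behaviour shown above (Linux only).
#[no_mangle]
pub unsafe extern "C" fn __rust_alloc(size: usize, align: usize) -> *mut u8 {
    ffi::mallocx(size, flags_for(align, size)) as *mut u8
}

#[no_mangle]
pub unsafe extern "C" fn __rust_dealloc(ptr: *mut u8, size: usize, align: usize) {
    // sdallocx is jemalloc's sized-deallocation entry point.
    ffi::sdallocx(ptr as *mut _, size, flags_for(align, size))
}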

@bors
Contributor

bors commented Dec 24, 2021

☀️ Test successful - checks-actions
Approved by: alexcrichton
Pushing d6d12b6 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 24, 2021
@bors bors merged commit d6d12b6 into rust-lang:master Dec 24, 2021
@rustbot rustbot added this to the 1.59.0 milestone Dec 24, 2021
@rust-timer
Collaborator

Finished benchmarking commit (d6d12b6): comparison url.

Summary: This benchmark run did not return any relevant changes.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

@nnethercote nnethercote deleted the rm-global_allocator-rustc-rustdoc branch December 24, 2021 03:38
@nnethercote
Contributor Author

I think you may actually be able to hack around things with Linux symbol resolution in dynamic libraries.

Your suggestion worked! Thank you. For posterity, here is my diff:

diff --git a/compiler/rustc/src/main.rs b/compiler/rustc/src/main.rs
index c80fab99496..2d8cfa2531b 100644
--- a/compiler/rustc/src/main.rs
+++ b/compiler/rustc/src/main.rs
@@ -11,6 +11,34 @@
 #[cfg(feature = "tikv-jemalloc-sys")]
 use tikv_jemalloc_sys as jemalloc_sys;

+use std::alloc::{GlobalAlloc, Layout};
+use tikv_jemallocator::Jemalloc;
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_alloc(size: usize, align: usize) -> *mut u8 {
+    Jemalloc.alloc(Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_alloc_zeroed(size: usize, align: usize) -> *mut u8 {
+    Jemalloc.alloc_zeroed(Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_dealloc(ptr: *mut u8, size: usize, align: usize) {
+    Jemalloc.dealloc(ptr, Layout::from_size_align_unchecked(size, align))
+}
+
+#[no_mangle]
+pub unsafe extern "C" fn __rust_realloc(
+    ptr: *mut u8,
+    old_size: usize,
+    align: usize,
+    new_size: usize,
+) -> *mut u8 {
+    Jemalloc.realloc(ptr, Layout::from_size_align_unchecked(old_size, align), new_size)
+}
+
 fn main() {
     // Pull in jemalloc when enabled.
     //

I used this to measure the effect of sized deallocations. It ended up being a slowdown. Here's the top part of the instruction counts table (columns: benchmark, profile, scenario, % change, significance factor):

many-assoc-items check full 3.39% 16.94x
many-assoc-items debug full 3.37% 16.84x
many-assoc-items opt full 3.36% 16.82x
derive check full 2.51% 12.56x
repro_crate check full 2.45% 12.23x
derive debug full 2.43% 12.13x
repro_crate debug full 2.31% 11.56x
repro_crate opt full 2.31% 11.55x
deep-vector check full 2.31% 11.55x
stm32f4 debug full 2.16% 10.79x
match-stress-exhaustive_patterns debug full 2.16% 10.79x
match-stress-exhaustive_patterns opt full 2.13% 10.64x
match-stress-exhaustive_patterns check full 2.12% 10.62x
stm32f4 opt full 2.09% 10.47x
syn check full 2.05% 10.23x
stm32f4 check full 2.04% 10.18x
derive opt full 1.96% 9.82x
encoding check full 1.94% 9.71x
ripgrep check full 1.94% 9.69x
regex check full 1.93% 9.67x
webrender check full 1.90% 9.51x
futures check full 1.87% 9.37x
hyper-2 check full 1.84% 9.22x

There were lots more entries in the 0.3-1.8% range after that.

Here is the top of the cachegrind diff (lightly edited) for a many-assoc-items check build, which showed the biggest slowdown:

--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
163,593,248 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                    file:function
--------------------------------------------------------------------------------
108,268,032 (66.18%)  github.com-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:tikv_jemallocator::layout_to_flags
-72,929,492 (-44.6%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/cache_bin.h:free
 72,929,485 (44.58%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/cache_bin.h:sdallocx
 56,023,307 (34.25%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/rtree.h:sdallocx
-56,023,307 (-34.2%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/rtree.h:free
 47,350,395 (28.94%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/src/jemalloc.c:sdallocx
 38,798,730 (23.72%)  github.com-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:__rg_dealloc
-38,728,458 (-23.7%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/src/jemalloc.c:free
 28,254,584 (17.27%)  github.com-1ecc6299db9ec823/tikv-jemallocator-0.4.1/src/lib.rs:__rg_alloc
-24,722,761 (-15.1%)  library/std/src/sys/unix/alloc.rs:__rdl_alloc
 21,554,850 (13.18%)  compiler/rustc/src/main.rs:__rg_dealloc
  8,621,940 ( 5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/atomic.h:sdallocx
 -8,621,940 (-5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/atomic.h:free
  8,620,856 ( 5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/ticker.h:sdallocx
 -8,620,856 (-5.27%)  build/tikv-jemalloc-sys-c536e82b571990b4/out/build/include/jemalloc/internal/ticker.h:free
 -7,791,590 (-4.76%)  library/std/src/sys/unix/alloc.rs:__rdl_alloc_zeroed
  7,063,646 ( 4.32%)  compiler/rustc/src/main.rs:__rg_alloc
 -7,063,646 (-4.32%)  library/std/src/alloc.rs:__rdl_alloc

It's clear from this that the hack worked -- you can see negative counts for free and positive counts for sdallocx, showing that the second run switched from using free to using sdallocx.

The main source of additional instructions is the layout_to_flags function within tikv-jemallocator, which imposes a cost for opting into sized deallocation.

         .           fn layout_to_flags(align: usize, size: usize) -> c_int {
99,245,696 ( 1.97%)      if align <= ALIGNOF_MAX_ALIGN_T && align <= size {
         .                   0
         .               } else {
         .                   ffi::MALLOCX_ALIGN(align)
         .               }
 9,022,336 ( 0.18%)  }

The other significant source of additional instructions is the 47,350,395 from sdallocx itself. There's no evidence of sdallocx providing any kind of improvement over free.
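For readers not familiar with jemalloc's non-standard API, here is a minimal sketch (assuming the raw tikv-jemalloc-sys bindings) of the two deallocation paths compared in the profile above: free has to recover the allocation's size from jemalloc's own metadata, while sdallocx is handed the size by the caller, which is what should let it skip that lookup.

use tikv_jemalloc_sys as ffi;

fn main() {
    const SIZE: usize = 64;
    unsafe {
        // Unsized path: free() must figure out the allocation's size from
        // jemalloc's internal metadata.
        let p = ffi::mallocx(SIZE, 0);
        ffi::free(p);

        // Sized path: the caller supplies the size. This is what __rg_dealloc
        // ends up calling once the hack above is in place; the discussion
        // below covers why it wasn't faster here.
        let q = ffi::mallocx(SIZE, 0);
        ffi::sdallocx(q, SIZE, 0);
    }
}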

@alexcrichton
Member

Nice! I'm glad the test was done first, before the design, in this case... A bit of a surprising result, though: even if the layout_to_flags overhead went away, it seems like there wouldn't be a huge improvement, which is unexpected!

@guswynn
Contributor

guswynn commented Dec 28, 2021

Nice! I'm glad the test was done first, before the design, in this case... A bit of a surprising result, though: even if the layout_to_flags overhead went away, it seems like there wouldn't be a huge improvement, which is unexpected!

So surprising, in fact, that perhaps we should tell the jemalloc devs over at https://github.com/jemalloc/jemalloc to investigate further...

@nnethercote
Contributor Author

I used this to measure the effect of sized deallocations. It ended up being a slowdown.

I double checked this in #92548 and got very similar results on CI to what I measured locally, which was a good sanity check.

@LifeIsStrange

@davidtgoldblatt @interwq
Hi dear jemalloc devs, Rust has finally been tested with the sized-deallocation API, and it surprisingly incurs a slowdown, which is why we wanted to flag this strange result to you.

@interwq

interwq commented Jan 4, 2022

Thanks for letting us know @LifeIsStrange. A few factors might explain the slowdown:

  1. Non-opt builds, i.e. does the many-assoc-items check use a jemalloc debug build? If so, the slowdown is not surprising, since the added sanity checks will cancel out the speedup from sdallocx.
  2. Is the benchmark doing a tight, loop-ish workload without much data access? In that case the benefit from sized dealloc is also minimal, since the speedup comes from avoiding the cache misses / metadata lookups that non-sized dealloc needs in order to figure out the size. Before our internal sized-dealloc rollout, the metadata-lookup instructions accounted for ~30% of the total free-path cost in our production workload, and during the rollout we observed a matching ~30% win on the dealloc fast path. However, it wouldn't surprise me if many benchmarks don't show the same change, since their workloads may not be cache intensive.
  3. (Less likely to be the case.) Is jemalloc configured with --enable-prof and, at the same time, do the tests do mainly 4K-aligned allocations? We consider this an edge case where sized dealloc has to check additional metadata to tell whether the pointer was profiling-sampled (also losing the benefit of sdallocx): https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc.c#L3069

cc: @Lapenkov

@nnethercote
Contributor Author

nnethercote commented Jan 4, 2022

@interwq As far as I know none of the things you suggested are happening.

rustc is fairly allocation-heavy. Like any compiler, it has lots of heterogeneous tree structures.

This comment has, in its second half, some analysis of the sized-deallocation change. Much of the slowdown is caused by an additional layer of Rust code in tikv-jemallocator being run in front of each allocation and deallocation, which isn't jemalloc's fault and isn't that surprising. What is more surprising is that sdallocx appears to be no faster, and perhaps slightly slower, than free. There's no sign of fewer metadata lookups.

@interwq

interwq commented Jan 4, 2022

Agreed that the sdallocx-not-being-faster part is unexpected. Can you grab a perf annotate of sdallocx (i.e. the instruction-level cost within sdallocx) when running the benchmark? That way we can know for sure whether the metadata lookup is bypassed.

@nnethercote
Contributor Author

@interwq: I have attached the Cachegrind annotations of jemalloc.c for the two runs.

Both profiles show identical paths through free_fastpath, which suggests that config_cache_oblivious is true. And I found that JEMALLOC_CACHE_OBLIVIOUS is defined in jemalloc_internal_defs.h in our build.

Does that make sense? Is setting JEMALLOC_CACHE_OBLIVIOUS reasonable?

@interwq

interwq commented Jan 4, 2022

@nnethercote: thanks for the info! Yes, that explains it -- sorry, I forgot that in the last release CACHE_OBLIVIOUS also plays a role here. That requirement was optimized away in jemalloc/jemalloc#1749.

If you'd like, feel free to try the current jemalloc dev branch: jemalloc/jemalloc@f509703

In the meantime we are preparing the upcoming 5.3 release, which will be based on the commit above.

re: --enable-cache-oblivious: it's still enabled by default but worth testing without it -- the feature provides address / cache index randomization for large allocations (by default >=16K), at the cost of one extra physical page per large allocation. In our workload we haven't observed CPU wins from it recently, but the extra memory cost is concrete. It's enabled by default more to limit worst case behavior, i.e. when it matters on some architectures (I suspect older ones), it matters a lot.

@nnethercote
Contributor Author

If you'd like, feel free to try the current jemalloc dev branch: jemalloc/jemalloc@f509703

In the meantime we are preparing the upcoming 5.3 release, which will be based on the commit above.

I tried the current jemalloc dev branch. Compared to 5.2 it shows a small but clear reduction in instruction counts for rustc, up to 1.44%. Good to know for the future.

I also tried doing the sized deallocation thing. The results were better than before, but still a net slowdown, mostly because of the cost of the layout_to_flags function within the Rust code.

@LifeIsStrange

@nnethercote @guswynn Jemalloc 5.3.0 has been officially released as stable! https://github.com/jemalloc/jemalloc/releases/tag/5.3.0
So tikv-jemalloc, and the version that rustc ships, should be updated accordingly :)

@lqd
Member

lqd commented May 7, 2022

Note: tikv-jemalloc-sys does not yet have a release for the 5.3 line, but they do have a WIP update to the 5.3 RC. I have tried it in draft #96790; it looks good, and once the release happens we can update that to the final 5.3.0.

@nnethercote
Contributor Author

Thanks, @lqd !
