Compilation of large project taking much longer after 1.84 (monomorphization) #135477
While the bisection results as discussed in the other issue will be helpful, I'm sure you know this but we won't be able to do much without some code to reproduce and analyze the issue. |
@lqd, I work with @lsunsi. Right now, I am thinking that because it uses a nightly instead of a "stable build", it does not really reproduce the same behaviour as if I was running a stable build. Super thanks! |
Here, you should try to go back further than 1.83.0 to see if the difference starts to appear at some earlier point.
Interesting. Normally this should be quite unusual, because nightlies and stable have basically the same code, only gated differently. You could try using the stable release but unlocking the gating so it behaves as a nightly, with the env var `RUSTC_BOOTSTRAP=1`. If that behaves differently than the nightly that was promoted to 1.83.0, it will be quite strange. And if it behaves differently than without the env var, it will also be quite strange. Both cases would help you narrow down the source of the issue. |
And also the other way around: you could try running the nightly as if it were a stable release, to see whether it then matches the stable behaviour. |
Ok, I have tried again starting at 1.81.0, and here is the full log, but the summary is:
Which is a bit strange for me, considering that 1.83.0 is "fast" and we noticed the problem only with the 1.84.0 release of last week. And probably my perception that "a nightly does not reproduce the problem like a stable" does not make sense; it was just that I was not looking at nightlies old enough. Sorry for that hallucination :) |
So is 1.83.0 fast? And what's the difference with 1.84.0? Let me try to summarize:
We know of issues from #126024 in 1.82.0, but that has since improved a lot with #132625 in 1.83.0. It would be easier if you post the timings of all the milestones, from 1.81.0 onwards. |
Nope, 1.84.0 is slower than 1.83.0 (that is why we only opened this issue now, when we updated from 1.83.0 to 1.84.0). Here are the timings for each version:
But again, I could not bisect between 1.83.0 and 1.84.0 to find what happened using the nightlies, because all of them were taking more than 600 secs to compile:
|
Given that the likely relevant #132625, which might have made 1.83 fast again, made it to 1.83 only as a beta-backport, in the nightlies you’ll only see its effect in nightlies after it was merged (November 7). So make sure to test versions around that point, i.e. the nightlies right before and after that date.

Let’s assume that a different change, that’s part of 1.84 too, made it slower again. If that different change happened after #132625, there's a chance you can find it with bisections then. If not, that’s perhaps harder to identify. Though assuming the 600+s vs 800+s difference is significant, this can probably also be found.

So anyways… if you test the nightlies right before and after that merge date, and then also compare their speeds with 1.83.0 and 1.84.0, that should show whether #132625 helps your case and whether a separate, later change made things slower again. |
Yes that’s it, I forgot that when landing backports the milestone in the PR is changed to the earlier version, making it harder to match the nightly picked by cargo-bisect-rustc. For this very PR though, you can also test its artifacts directly to see its isolated behavior instead of nightlies, with https://github.com/kennytm/rustup-toolchain-install-master, by comparing the commit where it landed with its parent commit. It should have a noticeable effect either way, and then the other nightlies like steffahn described should show where things regressed more, to piece together all the PRs that had a positive and negative impact in that range. |
Ok, I followed @steffahn's instructions (thanks btw) and I could find interesting information. First, I ran the builds against the relevant nightlies:
So, it is clear that #132625 really fixed/helped. But this test also showed that something else happened between nightly-2024-10-13 and nightly-2024-11-07. I then bisected that range; the bisect executed well and pointed to a regression in 662180b. So, what do you think? |
@lochetti Thank you so much for looking into this further! Great to see we are finding some more conclusive data here! Wow, >4000s is quite the number… (for a project that used to compile in <300s) the different slowdowns almost seem to multiply o.O

Just to double-check, because my personal experience with the bisection tool is that sometimes one can easily overlook something in the automated testing because of its limited output (and also it would be nice to rule out the relevance of yet another different change): could you possibly also post the concrete timings from running the commits right before and after the one the bisection pointed to? If it’s the same ~670s vs ~4115s difference as in the bisection, I’m not doubting the results too much, necessarily. After all, it looks like #131949 comes with known (small) compile-time performance regressions on some benchmarks, and those are somewhat type&trait-heavy-ish, too, so it would fit the picture. Still, given y'all are the only ones who can find out more with your code, it would be very nice to get the concrete numbers here :-)
|
@steffahn, sure! I am glad to help :) Here they go:
|
Now we just need someone to come up with an explanation of why #131949 can have such an impact :-) |
We can't easily come up with an explanation without code to analyze, so a reproducer, MCVE, whatever would help. Though there have been reports of other regressions in fxhash2 that caused people to downgrade (maybe like rust-lang/rustc-hash#45) -- maybe some of these hit us in certain cases, previously unseen in our benchmarks. We're due a benchmark upgrade soon so maybe we can recheck the two hashes on the updated dataset, but the new version looked like a wash perf wise, so we may not lose all that much if we reverted back to v1. |
Interesting. I wonder if @orlp has any extra insight into why the new hasher caused this large regression. |
We have broken our project down into several crates and noticed that the significant increase in compile time is in the crate where we heavily use Diesel DSL types. This is likely due to the complexity and size of the types resulting from Diesel's usage. I will attempt to create a smaller, open project simulating similar code to verify whether we can observe the difference in compile time between both commits. |
@Noratrieb Without a reproducer I can run to look at the relevant hashes & inputs I don't know. Both the old and new rustc-hashes have known bad input patterns which can create mass collisions, although the new one should be harder to hit. |
@steffahn That could be rewritten using a custom `Hash` impl, but I don't know all the details and I'll let other people look at it; I'm just here to solve hash function woes :) |
ah, my edit raced your new comment. anyways, I don't see how |
I did this exact change on master and it seems to make no difference at first glance. |
I'm trying this change now, can you be more specific? The Hash derive is probably hashing both fields ptr and marker, right? I tried copying the @orlp suggestion verbatim but compilation fails. Do you mind typing the exact Hash impl I should give to GenericArg? I'll try it and post results right away! |
@lsunsi, this isn’t surprising, given I already suspected as much. The analogous change to `GenericArg` would be:

```rust
#[derive(Copy, Clone, PartialEq, Eq)] // <- no more `Hash` here
pub struct GenericArg<'tcx> {
    ptr: NonNull<()>,
    marker: PhantomData<(Ty<'tcx>, ty::Region<'tcx>, ty::Const<'tcx>)>,
}

// this is added instead
impl<'tcx> std::hash::Hash for GenericArg<'tcx> {
    #[inline]
    fn hash<H: std::hash::Hasher>(&self, s: &mut H) {
        let mut addr = self.ptr.addr().get() as u64;
        addr ^= addr >> 32;
        addr = addr.wrapping_mul(0x9e3779b97f4a7c15);
        addr ^= addr >> 32;
        addr.hash(s);
    }
}
```

You could try whether this makes a difference. |
^^ wrote this before reading the last 2 replies; looks like this should exactly answer your question |
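As a side note on why that pre-mixing step can matter, here is a minimal sketch. It assumes the problematic inputs are 8-byte-aligned pointer addresses (so their low three bits are always zero); the base address below is made up for illustration and is not from the thread.

```rust
use std::collections::HashSet;

// The xor-shift/multiply mix from the impl above, applied to a plain u64.
fn premix(addr: u64) -> u64 {
    let mut a = addr;
    a ^= a >> 32;
    a = a.wrapping_mul(0x9e3779b97f4a7c15);
    a ^= a >> 32;
    a
}

fn main() {
    let mut raw_low_bits = HashSet::new();
    let mut mixed_low_bits = HashSet::new();
    for i in 0..1_000u64 {
        let addr = 0x7f00_0000_0000u64 + i * 8; // aligned: low 3 bits are 0b000
        raw_low_bits.insert(addr & 0b111);
        mixed_low_bits.insert(premix(addr) & 0b111);
    }
    // The raw addresses only ever show a single low-bit pattern; after mixing,
    // all 8 patterns appear, so a downstream hasher no longer sees stuck bits.
    println!("raw: {}, mixed: {}", raw_low_bits.len(), mixed_low_bits.len());
}
```

The final `a ^= a >> 32` is the step doing the work here: it folds the already well-mixed high half back into the low bits.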
@lsunsi So it looks like there's a mass collision happening with the new rustc-hash for pointers on your machine. There are still other cases where this happens. Could you modify the `Hash` impl so that it also logs the raw addresses, like this:

```rust
impl<'tcx> std::hash::Hash for GenericArg<'tcx> {
    #[inline]
    fn hash<H: std::hash::Hasher>(&self, s: &mut H) {
        use std::sync::{LazyLock, Mutex};
        use std::io::Write;
        use std::fs::File;
        use std::time::SystemTime;

        static HASH_LOG_FILE: LazyLock<Mutex<File>> = LazyLock::new(|| {
            let ts = SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap();
            let path = format!("/some/path/of/your/choosing-{}.txt", ts.as_nanos());
            Mutex::new(File::create(path).unwrap())
        });
        writeln!(HASH_LOG_FILE.lock().unwrap(), "{:x}", self.ptr.addr().get() as u64).unwrap();

        let mut addr = self.ptr.addr().get() as u64;
        addr ^= addr >> 32;
        addr = addr.wrapping_mul(0x9e3779b97f4a7c15);
        addr ^= addr >> 32;
        addr.hash(s);
    }
}
```

Then if you could upload (some of) the generated files, I can take a look and see why these pointers are causing problems on your system/project. |
@orlp Hey! Yeah definitely, I just did that. Problem is I got some huge files (I guess it was to be expected...). Here it is, just a heads up that upon decompression you should expect around 6GB (which in itself is kind of amazing, because the compressed file is like 250MB). |
The only immediate pattern I see in your data is that the top 32 bits are all the same, as are the bottom 3 bits. I'm not seeing a trivial hash collapse; if anything, the new rustc-hash seems better distributed in both the top 7 bits (hashbrown tag) and the bottom 7 bits (bucket index if the hash table has 128 buckets), yet it creates more collisions. For everyone's convenience (and because Google Drive links tend to rot) I've taken the largest file (5GB) and uniquified all the addresses: uniq_addr.zip. The resulting file is only 30MB, so much more manageable to study. |
I must say I'm very confused. I wrote this little test program:

```rust
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault};
use std::sync::atomic::{AtomicU64, Ordering};
use rustc_hash::FxHasher;

static COLLISIONS: AtomicU64 = AtomicU64::new(0);

#[derive(Hash, Eq)]
struct CountCollisions(u64);

impl PartialEq for CountCollisions {
    fn eq(&self, other: &Self) -> bool {
        COLLISIONS.fetch_add(1, Ordering::Relaxed);
        self.0 == other.0
    }
}

fn entropy(hist: &[usize]) -> f64 {
    let s: usize = hist.iter().sum();
    if s == 0 { return 0.0; }
    hist.iter().map(|n| {
        let p = *n as f64 / s as f64;
        if *n > 0 { -p * p.log2() } else { 0.0 }
    }).sum()
}

fn main() {
    const NUM_BUCKETS: usize = 1<<4;
    let mut tag_counts = [0; 128];
    let mut bucket_counts = [0; NUM_BUCKETS];
    let mut combo_counts = [0; NUM_BUCKETS * 128];
    let hasher = BuildHasherDefault::<FxHasher>::new();
    let mut collider = HashSet::with_hasher(hasher.clone());
    let file = std::fs::read_to_string("uniq_addr.txt").unwrap();
    for line in file.lines() {
        let ptr = u64::from_str_radix(line, 16).unwrap();
        let h = hasher.hash_one(ptr);
        let tag = (h >> (64 - 7)) as usize;
        let bucket = h as usize % NUM_BUCKETS;
        tag_counts[tag] += 1;
        bucket_counts[bucket] += 1;
        combo_counts[(bucket << 7) | tag] += 1;
        collider.insert(CountCollisions(ptr));
    }
    println!("collisions: {}", COLLISIONS.load(Ordering::Relaxed));
    println!("tag entropy: {:.5} bits", entropy(&tag_counts));
    println!("bucket entropy: {:.5} bits", entropy(&bucket_counts));
    println!("combined entropy: {:.5} bits", entropy(&combo_counts));
}
```

I got the following results:
I don't understand why. It really seems rustc-hash 2.1.0 is better distributed both in the tag and in the lower 4 bits (I've also checked 8 bits and 16 bits, same result). Also when the two are combined it is better distributed, so it doesn't seem to be a correlation issue between tag and bucket index either... Yet despite this, 2.1.0 has many more calls to `eq`. |
It is not some strange effect of re-hashing order, as doing the following gives roughly the same results:

```rust
let len = file.lines().count();
let mut collider = HashSet::with_capacity_and_hasher(len + 1, hasher.clone());
```
|
@orlp How many unique addresses are there in your test? Maybe it's using a lot more than 16 bits even? |
@orlp I think I get it, the example you posted shows that despite "better" hashing, the counter counts more collisions against the biggest files, which we can consider a reproduction of the slow build, is that right? |
@steffahn Together with @purplesyringa we figured it out. The new rustc-hash finalizer is the problem:

```rust
self.hash.rotate_left(20) as u64
```

I did not anticipate tables could grow to 2^20 and beyond in the Rust compiler, so in the event that the low bits have 0 entropy (which is the case here, as the bottom 3 bits are all zero), we start only using a portion of the hash table buckets when the total number of buckets grows beyond 2^20. A new finalizer for rustc-hash that avoids this is needed. |
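To make that failure mode concrete, here is a minimal sketch. It is not the real rustc-hash code: the integer path is modeled as a single multiply by an odd constant followed by the `rotate_left(20)` finalizer, the constant and base address are made up for illustration, and the bucket index is taken from the low bits of the hash the way a power-of-two-sized hashbrown table does.

```rust
use std::collections::HashSet;

// Toy model of a multiply-then-rotate hash; the constant is arbitrary but odd.
const K: u64 = 0xf1357aea2e62a9c5;

fn toy_hash(addr: u64) -> u64 {
    // Multiplying by an odd constant keeps zero low bits zero...
    let h = addr.wrapping_mul(K);
    // ...and rotating left by 20 parks those dead bits at positions 20..22.
    h.rotate_left(20)
}

fn main() {
    let num_buckets: u64 = 1 << 23; // a table that has grown past 2^20 buckets
    let mut used = HashSet::new();
    for i in 0..1_000_000u64 {
        // 8-byte-aligned "pointer addresses": the bottom 3 bits are always zero.
        let addr = 0x7f00_0000_0000u64 + i * 8;
        let bucket = toy_hash(addr) % num_buckets; // low bits pick the bucket
        used.insert(bucket);
    }
    // At most 2^20 of the 2^23 buckets (an eighth) can ever be hit, because
    // bits 20..22 of the finalized hash are stuck at zero for these inputs.
    println!("buckets used: {} of {}", used.len(), num_buckets);
}
```

With 2^21 buckets only half of the table is reachable, with 2^22 a quarter, and from 2^23 on an eighth, which is where the extra `eq` calls and the slowdown come from.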
A problem that only appears with millions of entries - so much for searching for a minimal example 😂

@rustbot label -S-needs-repro

Are there perhaps any sufficiently cheap solutions that can spread the entropy more evenly? (As an alternative to the approach of just increasing that number 20.) Magic numbers, reliance on HashMap implementation details, the possibility of subtle performance regressions that only show up in huge compilations... IMO this isn't a satisfactory outcome, and not something any legitimate hashing algorithm should get away with. |
We've found a few satisfactory solutions (I'll let @orlp elaborate on that, they're still experimenting), but this whole mess with complicated hash functions stems from a suboptimal implementation detail on the hash table side of things; ideally we could update that instead. |
@steffahn It's a trade-off when using the rotate. The larger you make that rotation, the more table sizes you can support; however, it also means the top bits are more likely to not affect anything. E.g. for hash tables with 256 buckets, in the current system the top 20 - 8 = 12 bits are effectively ignored. If I increased the rotation to 32 bits, the top 24 bits would be ignored in the same scenario. |
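A small sketch of that arithmetic (assuming, as in the test program above, that the bucket index comes from the low bits of the finalized hash and the control tag from its top 7 bits; the sample hash value is arbitrary):

```rust
// With a rotate_left(20) finalizer and a 256-bucket table, flipping any of the
// top 12 bits of the pre-rotation hash changes neither the bucket index
// (low 8 bits after rotation) nor the 7-bit control tag (top bits after rotation).
fn bucket_and_tag(pre_rotation_hash: u64) -> (u64, u64) {
    let h = pre_rotation_hash.rotate_left(20);
    (h % 256, h >> (64 - 7))
}

fn main() {
    let h = 0x0123_4567_89ab_cdef_u64;
    for bit in 52..64 {
        assert_eq!(bucket_and_tag(h), bucket_and_tag(h ^ (1u64 << bit)));
    }
    println!("bits 52..63 of the pre-rotation hash affect neither bucket nor tag");
}
```

With a 32-bit rotation the same check would pass for the top 24 bits instead, which is the trade-off being described.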
We could always add another generic argument to |
@orlp For the specific crate we were testing before, here is the profile for two significant commits:

- 3a6da61357aca9fbf7b3017ed9d795cab46b57dd: 1m42s

The results were so exciting that I wanted to test against the whole project. |
@lsunsi How does 90b35a6 compare? |
1.83 stable, just one crate. EDIT: sorry, I was already running against 1.83 public before I read your message properly. I'll build 90b35a6 and run it, just a sec. |
@lsunsi FYI your 'whole project' profiles are missing debug symbols again. |
90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf (1.83), just one crate: 1m43s. @orlp It seems like the parameters for the public release really made a difference, great catch. The times from your fixed branch are basically identical! |
There's no real need for that, because it didn't aim to fix any of your performance in the first place. It should be just as bad as nightly. (The point was to prove/check that it didn't become any worse.) |
Code
I have a big private project and we try to stay current on Rust versions. Upon trying 1.84 I saw compilation times grow about 3 times. I have already seen this behavior in other Rust versions, which made me have to skip them, for example 1.82.
I self-profiled the compile in 1.83 and 1.84 and diffed, so I got to this (https://gist.github.com/lsunsi/7d301c7e332f50a734647d3aff0efbdc).
I'm not sure how useful it is, but there we go. I can post the prof data as well if it's useful.
Further, I'll try to bisect and get back with more information.
Version it worked on
It most recently worked on: 1.83
Version with regression
rustc --version --verbose
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged