
stable rustc --version hangs forever #56736

Closed
fuchsnj opened this issue Dec 12, 2018 · 18 comments
Labels
regression-from-stable-to-nightly: Performance or correctness regression from stable to nightly.
T-infra: Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@fuchsnj commented Dec 12, 2018

Starting with nightly-2018-11-04 and anything later, just checking the version or doing anything with rustc causes the process to hang forever. It seems to be waiting on a lock (no CPU usage).

Stable Rust works fine, as does any nightly version before this one.

OS is Ubuntu 18.04.1 LTS
Rust stable/nightly versions were installed with rustup

nathan@nathan-Precision-7510:~$ rustc +nightly-2018-11-04 --version
^C
nathan@nathan-Precision-7510:~$ rustc +nightly-2018-11-03 --version
rustc 1.32.0-nightly (8b096314a 2018-11-02)

The last few lines from strace rustc +nightly-2018-11-04 --version are

...
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
readlink("/etc/malloc.conf", 0x7ffd8dd4a900, 4096) = -1 ENOENT (No such file or directory)
open("/proc/sys/vm/overcommit_memory", O_RDONLY) = 3
futex(0x7fbdfd2510c8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7fbdfe014228, FUTEX_WAIT_PRIVATE, 2, NULL

It hangs on the FUTEX_WAIT call forever.

@ehuss (Contributor) commented Dec 12, 2018

Another user reported the same issue here: rust-lang/cargo#6384

That is the first release that removed jemalloc. I would suspect that is related, but I don't have any ideas on how to reproduce.

@fuchsnj (Author) commented Dec 12, 2018

Here is a stack trace of the stuck process:
https://gist.github.com/fuchsnj/5612923e3613a915b65aece0dd920149

This was captured with gdb on a stuck process running rustc +nightly --version
I just updated my nightly version, so it should be running the latest. (Couldn't really tell you which version that was though...)
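
For anyone else wanting to capture the same thing, something along these lines should work (treat it as a sketch, and substitute the PID of the hung rustc process):

gdb -p <pid-of-hung-rustc> -batch -ex "thread apply all bt"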

@ehuss (Contributor) commented Dec 12, 2018

Do you have ESET antivirus installed (or any other security software)?

I installed ESET and I'm able to reproduce it. It looks like jemalloc is getting stuck recursively trying to initialize itself.

@alexcrichton My guess is that something about how jemalloc 5 initializes has changed, maybe?

@fuchsnj (Author) commented Dec 12, 2018

Yes, I have ESET antivirus installed.

@alexcrichton (Member) commented

cc @gnzlbg, do you know if jemalloc has a fix for this perhaps?

@ehuss (Contributor) commented Dec 12, 2018

It appears to be an issue with ESET, jemalloc 5, and rustc being built for an old kernel.

It looks like jemalloc 5 has started to use CLOEXEC. Since rustc is built against a very old Linux kernel, it has to use fcntl (here) instead of just passing O_CLOEXEC (which requires kernel 2.6.23). fcntl is intercepted by ESET, which attempts to find the open symbol with dlsym. dlsym requires calling calloc, which hangs since jemalloc is in the middle of initializing.

I have confirmed, by building rustc locally (with jemalloc), that it does not hang, presumably because it is using O_CLOEXEC.

I don't offhand see any workarounds (other than using a newer kernel).
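
For reference, the fallback pattern being described looks roughly like this; a paraphrased C sketch of the idea, not jemalloc's actual source (the real code decides based on the kernel it targets rather than a preprocessor check):

/* On new kernels the close-on-exec flag can be passed atomically at open
 * time; on the old kernels the rustc binaries target, it has to be set
 * afterwards with fcntl. That extra fcntl call is the one ESET intercepts,
 * and its interceptor calls dlsym -> calloc while jemalloc is still
 * mid-initialization, so everything deadlocks. */
#include <fcntl.h>

static int open_cloexec(const char *path) {
#ifdef O_CLOEXEC
    return open(path, O_RDONLY | O_CLOEXEC);    /* single syscall, nothing extra to intercept */
#else
    int fd = open(path, O_RDONLY);
    if (fd != -1)
        fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) | FD_CLOEXEC);    /* the intercepted fallback */
    return fd;
#endif
}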

@gnzlbg (Contributor) commented Dec 12, 2018

I don't offhand see any workarounds (other than using a newer kernel).

The PR that started using CLOEXEC was jemalloc/jemalloc#872, which fixed jemalloc/jemalloc#528. We could patch jemalloc to not use CLOEXEC when built for Rust, but... it looks to me that jemalloc is doing the right thing here, and that this is a corner case that should be handled on ESET's side.

We should open a bug with ESET about their jemalloc 5 / fcntl / Rust support; maybe they can roll out a fix quickly. Depending on their timeline, patching jemalloc to not use CLOEXEC shouldn't be hard: when will the first stable Rust version with this issue land? I think we should consider it a regression.

@alexcrichton (Member) commented

Opening a bug (if we can) with ESET sounds good to me for now, but if that doesn't pan out we can probably work around this and just not use cloexec there, as it's a short-lived fd anyway.

@estebank added the regression-from-stable-to-nightly label Dec 12, 2018
@fuchsnj changed the title from "nightly rustc --version hangs forever (regression)" to "nightly rustc --version hangs forever" Dec 12, 2018
@pnkfelix (Member) commented Dec 14, 2018

Something I think deserves clarification here: as of PR #55238 (resolving issue #36963), builds of rustc stopped linking in jemalloc by default; however, if I am correctly reading the documentation and commit messages of that PR, the rustc built via CI opts back into having jemalloc linked in (on Linux and Mac OS X), and thus the nightly you get via rustup, or by otherwise downloading CI-built executables, will link to jemalloc.

It's a pretty confusing situation, IMO, since anyone attempting to locally replicate the behavior described here via a local build of rustc would need to turn that flag back on. (I think @ehuss is saying in their comment above that they took care to opt back into jemalloc in their local build, but it is easily overlooked.)

(Also: the CI's opting back into using jemalloc affects not just the nightly builds but also the beta and stable ones...? I'm a bit flummoxed as to why we would want the out-of-the-box local build to differ in this way from what we deploy. At the very least I would expect more prominent documentation on how to properly recreate the CI's build.)

@gnzlbg (Contributor) commented Dec 14, 2018

the rustc built via CI opts back into having jemalloc linked to rustc.

IIUC the intent was for rustc to always depend on jemalloc by default, since that was the status quo before that change, but to allow people to build it without jemalloc, e.g., if they want to use it on a system where jemalloc is not available. It might be that this did not fully materialize.

@pnkfelix (Member) commented

Yes, I too thought that was the intent. But...:

rust/config.toml.example

Lines 399 to 401 in f4b07e0

# Link the compiler against `jemalloc`, where on Linux and OSX it should
# override the default allocator for rustc and LLVM.
#jemalloc = false


It could very well be that my expectations are wrong, and that anyone who wants to replicate the CI build product should take care to actually run configure with args taken from, e.g.:

RUST_CONFIGURE_ARGS="--enable-extended --enable-profiler --enable-lldb --set rust.jemalloc"

(or with configure args taken from https://github.com/rust-lang/rust/blob/master/appveyor.yml, as appropriate to one's platform)
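
If I am reading those option names right, the local config.toml equivalent would be something along these lines (an untested sketch on my part, matching the --set rust.jemalloc flag above):

[rust]
# Opt back into linking jemalloc, as the CI-built compilers do.
jemalloc = true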

@pnkfelix (Member) commented

I'm just going to open a separate issue about this discrepancy between the CI vs local builds, rather than continuing to clutter up this issue's comment thread. Sorry for the noise!

@pnkfelix (Member) commented Dec 20, 2018

T-compiler triage. This issue is tagged as a regression but has no T-label, so no team has default responsibility for it. Based on the comments in this issue, I do not think T-compiler is in a position to fix it; it seems more likely to be a T-infra problem? (And one that T-infra may well choose to close as "wont-fix".)

@nagisa added the T-compiler and T-infra labels, then removed the T-compiler label, Jan 3, 2019
@turboladen commented

FWIW, we started getting this after updating to 1.31 on Ubuntu 14.04.5 LTS, but we are not running ESET. So far it's only on one instance in our AWS stack, but that instance is responsible for a whole feature set in our beta environment. We've tried 1.31.0 and 1.31.1 so far, and have also reinstalled rustup + rust. So far the behavior is pretty consistent. We don't, however, get this on the same feature set's staging instance, so we're trying to track down the differences. Hopefully we'll find something, as this is currently hanging up the dev+QA cycle right before a scheduled app release.

@fuchsnj (Author) commented Jan 26, 2019

As of Rust 1.32 (Jan 17th) this now affects the latest Rust stable version.

@fuchsnj changed the title from "nightly rustc --version hangs forever" to "stable rustc --version hangs forever" Jan 26, 2019
@aidanhs (Member) commented Feb 4, 2019

We discussed this in the infra team meeting a few weeks ago and basically decided (given that this appears to be a strange interaction between jemalloc and ESET that we're stuck in the middle of) to wait until beta (and stable!) to see if more people reported the issue.

Given that we've not seen more reports, unfortunately this isn't going to be something we prioritise; our hope is that either jemalloc or ESET fixes things.

That said - @turboladen, we're interested in your report. Did you manage to track anything down, e.g. via strace?

@turboladen commented

@aidanhs unfortunately we didn't really get time to get much info on it. We ended up cloning over the AWS image we had for our same app from a different environment (beta was having the problem described in this ticket, staging was not); I do believe, however, that we had jemalloc installed on the beta instance (trying to help speed up Rails), but did not have jemalloc on the staging instance.

@antekone commented Apr 1, 2019

Just FYI, ESET has fixed the issue in version 4.5.14.0. So if anyone is using an older product version and suffering from this hang, please try updating to 4.5.14.0 or later.

@fuchsnj closed this as completed Apr 4, 2019