Repeated segfaults on Windows integration tests #47957
I will also add that I have been running the test suite with
It sounds like you likely have a data race in the code, and either need to disable threading or try ThreadSanitizer (on Linux) to see if it can catch it.
I never see an issue on Linux though, only Windows. Is there a tool to find data races on Windows? (Or maybe to see if the Linux run also experiences the data race, but simply does not segfault over it?) I can't immediately think of anywhere there could be a race condition in the current code, but I will have a closer look. For the most part:
I suppose there could be an issue of objects not being copied when passed to a thread, so perhaps the head worker tries to access the same object before fetching...? Is there anything else you can think of other than a data race?
Edit: I found what look like a couple of opportunities for data races: MilesCranmer/SymbolicRegression.jl@538c402. Let's see if that helps!
Edit 2: Nope, still getting segfaults even after that fix: https://github.com/MilesCranmer/SymbolicRegression.jl/actions/runs/3759279230/jobs/6388650870#step:7:880.
Are there any binaries with ThreadSanitizer built-in? I'm building from source and it's taking quite a while compared to a normal build... nearly 24 hours building now.
@vtjnash I seem to be unable to build Julia with ThreadSanitizer. Would you happen to have any advice for using it? I can build with AddressSanitizer just fine (following this), but with ThreadSanitizer I encounter various problems. That page has much more detailed instructions for AddressSanitizer, so I am probably missing some flags which are not mentioned.
Rocky Linux 8.7
First building the toolchain (following https://docs.julialang.org/en/v1/devdocs/sanitizers/#Example-setup), then building with `make debug` for Julia 1.8.3 with the following `Make.user`
gives me the following error:
I also tried the following alternative env variables, with the same segfault:
Each of these I commented out or left as is. Same error. I ran
macOS Ventura 13.1 (M1 Pro)
This takes over 24 hours to complete a single build, with the same combinations as above. It gets a bit further, but in the end, I segfault on building
Building just with ASAN
If I follow the tutorial on https://docs.julialang.org/en/v1/devdocs/sanitizers/#Example-setup exactly, for ASAN, I can actually build everything. It is only when I attempt to build with TSAN that I get an error. Are there any flags not described in the docs which I am missing? Thanks.
Is it possible to build the current version of Julia with thread sanitizer? Here's a Docker container which gives a reproducible segfault during the build:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
wget \
ca-certificates \
curl \
gpg-agent \
software-properties-common \
python3 \
python3-dev \
tar \
xz-utils \
gfortran
RUN wget https://apt.llvm.org/llvm.sh && chmod +x llvm.sh && ./llvm.sh 13
WORKDIR /toolchain
ENV TOOLCHAIN_WORKTREE=/toolchain
ARG JLVERSION=1.8.3
ARG PROCS=4
RUN git clone https://github.com/JuliaLang/julia ${TOOLCHAIN_WORKTREE} && \
cd ${TOOLCHAIN_WORKTREE} && \
git checkout v${JLVERSION}
# Build the toolchain
RUN echo "USE_BINARYBUILDER_LLVM=1" > ${TOOLCHAIN_WORKTREE}/Make.user && \
echo "BUILD_LLVM_CLANG=1" >> ${TOOLCHAIN_WORKTREE}/Make.user
RUN cd ${TOOLCHAIN_WORKTREE} && make -j ${PROCS} -C deps install-llvm install-clang install-llvm-tools
WORKDIR /julia
ENV BUILDDIR=/julia
RUN git clone https://github.com/JuliaLang/julia ${BUILDDIR} && \
cd ${BUILDDIR} && \
git checkout v${JLVERSION}
# Put the above commands into /julia/Make.user:
RUN echo "USECLANG=1" > ${BUILDDIR}/Make.user && \
echo "TOOLCHAIN_WORKTREE=/toolchain" >> ${BUILDDIR}/Make.user && \
echo "TOOLCHAIN=\$(TOOLCHAIN_WORKTREE)/usr/tools" >> ${BUILDDIR}/Make.user && \
echo "override CC=\$(TOOLCHAIN)/clang" >> ${BUILDDIR}/Make.user && \
echo "override CXX=\$(TOOLCHAIN)/clang++" >> ${BUILDDIR}/Make.user && \
echo "export ASAN_SYMBOLIZER_PATH=\$(TOOLCHAIN)/llvm-symbolizer" >> ${BUILDDIR}/Make.user && \
echo "USE_BINARYBUILDER_LLVM=1" >> ${BUILDDIR}/Make.user && \
echo "override SANITIZE=1" >> ${BUILDDIR}/Make.user && \
echo "override SANITIZE_THREAD=1" >> ${BUILDDIR}/Make.user && \
echo "override JULIA_BUILD_MODE=debug" >> ${BUILDDIR}/Make.user && \
echo "JULIA_PRECOMPILE=1" >> ${BUILDDIR}/Make.user && \
echo "export LBT_USE_RTLD_DEEPBIND=0" >> ${BUILDDIR}/Make.user
# Build:
RUN make -j ${PROCS} debug
You can run with, e.g.,
Okay, I finally got it working with TSAN after a couple of weeks of trying to build it. However, TSAN does not raise a single warning when running my code, so it seems there are no data races after all. Do you have any other tips for trying to debug this?
Sorry, I take that back. I was running Julia with 1 thread! Looks like there are indeed some data races. Here are the outputs from a run of
Edit: more outputs
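The single-thread mistake above is easy to repeat, because a race detector has nothing to observe when only one thread runs. A minimal sketch of a guard that could sit in front of such a run is given here. `JULIA_NUM_THREADS` is Julia's standard thread-count variable; the function name `require_multithreaded` is hypothetical and not part of Julia or of this project.

```shell
#!/bin/sh
# Hypothetical helper (not part of Julia): refuse to start a race-detection
# run unless JULIA_NUM_THREADS requests at least two threads.
require_multithreaded() {
    # Julia uses a single thread when the variable is unset.
    threads="${JULIA_NUM_THREADS:-1}"
    case "$threads" in
        *[!0-9]*)
            # Non-numeric values such as "auto" are passed through unchecked.
            echo "JULIA_NUM_THREADS=$threads (not checked)"
            return 0
            ;;
    esac
    if [ "$threads" -lt 2 ]; then
        echo "JULIA_NUM_THREADS=$threads: races between threads cannot be observed" >&2
        return 1
    fi
    echo "JULIA_NUM_THREADS=$threads"
}
```

A possible use, assuming a sanitizer-enabled `julia` binary in the current directory, would be `JULIA_NUM_THREADS=4 require_multithreaded && ./julia -e 'println(Threads.nthreads())'`, which also prints the thread count Julia actually started with.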
Your best bet now is to run it under rr and figure out where those $PC were allocated and what functions they correspond to. There are some debugging tips in the devdocs manual, and there are often people on Slack who can answer questions in #internals or #multithreading.
Thanks, will try. Curiously, I tried running the entire test suite under rr chaos mode, and there were no problems at all. It's only Windows where things actually crash (but sadly no rr for Windows).
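Since the crash is intermittent, one clean pass under chaos mode says little; repeating the recording many times raises the odds of catching a bad schedule. A rough sketch of such a loop is below. The helper name `run_until_failure` is hypothetical, and the command in the comment assumes `rr` and a multithreaded `julia` are on the `PATH` and that the test entry point is `test/runtests.jl`.

```shell
#!/bin/sh
# Hypothetical helper: repeat a command until it fails or the attempt
# budget is used up, reporting which attempt failed.
run_until_failure() {
    max="$1"
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on attempt $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
}

# Assumed invocation (not executed here):
#   run_until_failure 20 rr record --chaos julia --threads 4 test/runtests.jl
```

If an attempt fails, the most recent recording can then be inspected with `rr replay`.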
I was never able to track down the bug. However, it does seem like it's related to Thus this may potentially be fixed by setting
I have been seeing segfaults on my Windows CI of SymbolicRegression.jl for maybe ~6 months now, and I am finally throwing in the towel and submitting a bug report.
What seems to be happening is that the test suite segfaults at a random point during the integration test sets, which run after the unit tests. The integration tests use multiprocessing, multithreading, and various other compute options, but they are not too strenuous, and the Ubuntu and macOS tests always seem to pass fine.
I cannot reproduce these segfaults on a local copy of Windows; I only see them on GitHub Actions' `windows-latest` machines. I usually see them on Julia 1.6.7 and Julia 1.7.3, although I have seen them on Julia 1.8.2 as well (but less frequently). If you have any recommendations for how I can get better traces of these segfaults, I would love to hear them. I know of the `rr` option on Linux, but it seems like there is no good equivalent for Windows.
Essentially, the Windows tests will randomly segfault partway through the integration tests. Here are a few examples:
- `windows-latest`, Julia 1.6.7, commit 367d155. Segfaults at this test (multi-threading; with a few different search settings) - https://github.com/MilesCranmer/SymbolicRegression.jl/blob/367d155f26c5a7f0faf26bf529b95f097f1f7f22/test/test_mixed.jl#L39.
- `windows-latest`, Julia 1.7.3, commit 367d155. Segfaults at the same test, but a little later on. At this commit, the test passes for Julia 1.8.2. All other operating systems pass.
- `windows-latest`, Julia 1.6.7, commit 81f9544: same error as above.
- `windows-latest`, Julia 1.7.3, commit 81f9544. This one lasts longer than before. I think that this one segfaults here, which is the suite after the above test.
Any help would be much appreciated.
These may or may not be related to these segfaults in the PyJulia frontend: MilesCranmer/PySR#238.