Windows test failure in tests: sparse.jl #9185
Which version of Windows is this? 32-bit or 64-bit? |
Sorry for forgetting to mention that: 32-bit. Thanks a lot. André Rifaut |
Does it also fail if you run only the sparse test, via |
Okay despite my pretty thorough testing last week while backporting things to release-0.3, I can reproduce this with the following using the 0.3.3 release binaries.
Now let's see if I can reproduce it in a source build so I can bisect to find which backport caused the issue. |
Can you rerun the tests to see if this is reproducible? |
Looks like it's processor dependent. I can reproduce it on a 32-bit XP desktop with a Penryn processor, but not a 64-bit Windows 7 Sandy Bridge laptop (not even building and running 32 bit Julia). Instinct tells me this is probably either a suitesparse or openblas bug. Unfortunately I can't build Julia from source on the XP computer that I can reproduce the problem on. Maybe I'll be able to get it to happen in a VM. |
I rebooted my machine. I used the REPL twice (see below).
D:\ari\Tools\julia\Julia-0.3.3\share\julia\test>call
Please submit a bug report with steps to reproduce this fault, and any
A fresh approach to technical computing
julia> Base.runtests("sparse")
Please submit a bug report with steps to reproduce this fault, and any
julia>
A fresh approach to technical computing
julia> Base.runtests("sparse")
Please submit a bug report with steps to reproduce this fault, and any
julia> |
See below my answer to Tony Kelman.
----- Forwarded by André Rifaut/PRO/SSI/TUDOR on 28/11/2014 08:16 -----
From: André Rifaut/PRO/SSI/TUDOR Date: 28/11/2014 08:11
I rebooted my machine. I used the REPL twice (see below).
D:\ari\Tools\julia\Julia-0.3.3\share\julia\test>call
Please submit a bug report with steps to reproduce this fault, and any
A fresh approach to technical computing
julia> Base.runtests("sparse")
Please submit a bug report with steps to reproduce this fault, and any
julia>
A fresh approach to technical computing
julia> Base.runtests("sparse")
Please submit a bug report with steps to reproduce this fault, and any
julia> |
Although it doesn't quite look like it, these are all failing at the same place. The important part to note is the message
Yes that's intended, it's the slightly confusing name for the 32-bit compiler that we use to build Julia for Windows. For 64 bit Julia the name is After some more testing, this might have something to do with the |
I ran the tests in the sparse file manually. Everything is OK before the
So, I can just confirm what you detected: I get a core dump with your
I hope this helps. |
Yikes, yikes yikes. I don't understand why this would happen at all, but according to bisect the first bad commit is 478ac06 - which doesn't seem like it has anything to do with sparse matrices at all. |
Yeah, I thought it was bizarre too. I'm running those two commits over again several times and it's pretty consistent that 478ac06 fails |
I can't reproduce this on appveyor by setting |
Done. Thanks for the reminder. |
No prob, do the files need a permissions tweak? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.55/job/jlu54f72skx7cwtl edit: or do I need to add some |
It seems like the bug must be somewhere in suitesparse but 478ac06 just triggers it somehow. If I rebuild SuiteSparse without having |
use https://status.julialang.org set MARCH on appveyor (cherry picked from commit ce29d78) [ci skip] since this is expected to fail on win32 right now due to #9185
Bumping to the very latest SuiteSparse version doesn't fix this. It seems we have 3 options:
Opinions? cc @JeffBezanson if you're not too busy. |
Given that all Windows binaries up to and including 0.3.2 did not have I suspect mingw defaults to conservative instruction sets to ensure portable binaries, so setting |
It's possible that the only reason not setting We should probably add |
Yeah already did. ce29d78 Jeff may have some theories about the deeper issues here, in which case messing with the flags is also a band aid. #9188 (comment) Seems as though we would have seen these sooner. Nightlies won't help this one since I think it might be specific to release-0.3, but it might help with #9189 |
@andre-rifaut in the meantime while we figure out a proper solution to this, you can try a build I just made of 0.3.3 without setting |
This might be a x87-vs-SSE problem, since setting march to i686 for building the deps but JULIA_CPU_TARGET to pentium4 for JIT code appears to work properly. Need to verify that this doesn't bring back any of the i686 issues on 32 bit linux though.
Thanks for your help. I actually want to start with bigint. I'm not in a hurry. I downloaded http://sourceforge.net/projects/juliadeps-win/files/julia-0.3.3-i686.exe but there is some problem in the script for running the tests. See below.
D:\julia\Julia-0.3.3b\share\julia\test>call "D:\ari\Tools\julia\Julia-0.3.3b\bin\julia.exe" runtests.jl all |
That's strange. What are your And is D: a network drive or local? |
I rebooted my machine and re-installed julia with http://sourceforge.net/projects/juliadeps-win/files/julia-0.3.3-i686.exe
D:\julia\Julia-0.3.3\share\julia\test>call "D:\julia\Julia-0.3.3\bin\julia.exe" runtests.jl all
D:\julia\Julia-0.3.3\share\julia\test>endlocal |
Strange that a reboot would matter, but yeah the warnings are normal and that's a good sign. Still don't know why changing suitesparse compilation flags would cause things to segfault. |
Potentially work around JuliaLang/julia#9185
if we see this again on master, it should come with a more useful / valid stacktrace. please post it here so we can have a somewhat better idea where the error is coming from |
Thanks to @staticfloat we have rebuilt official 0.3.3 binaries that used different flags for SuiteSparse and do not exhibit this bug. Should we keep this open as something's still failing when SuiteSparse is built with |
Maybe file a bug against SuiteSparse? |
If strange seemingly unrelated changes in Julia didn't make this problem come and go, then I would think about trying to write a C reproduction case. I suspect the problem is more likely in LLVM, something having to do with ccall, or something in our finalizers, rather than a problem that could be tracked to suitesparse's source code. |
given that this happens when you set MARCH=pentium4, that means it happens when you enable SSE instructions (default is MARCH=i686). so this almost certainly means that either the stack or some malloc'd data object (typically the stack) is not 16-byte aligned when we arrive in suitesparse. gcc assumes the stack will be 16-byte aligned. windows and llvm do not (they provide 4-byte alignment).
to fix this same bug in openblas, we added the appropriate build flags to openblas: Line 914 in c70ae4c
to fix this everywhere, we set the appropriate flag in llvm: Line 3693 in c70ae4c
the only remaining question then, is why that comment and code says 8-byte and 2-byte, when the correct statement to avoid this bug should be 16-byte and 4-byte? |
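The alignment argument in the comment above comes down to simple address arithmetic, which can be sketched in a few lines of Python. This is only an illustration, not code from Julia, LLVM, or SuiteSparse: the helper names `is_aligned` and `align_down` and the sample stack pointer value are made up for the example.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """Return True if address is a multiple of alignment (a power of two)."""
    return address & (alignment - 1) == 0


def align_down(address: int, alignment: int) -> int:
    """Round an address down to the nearest multiple of alignment.

    The x86 stack grows toward lower addresses, so realigning a stack
    pointer means moving it down, never up.
    """
    return address & ~(alignment - 1)


# Hypothetical stack pointer that satisfies only a 4-byte guarantee.
sp = 0x0028FF3C

print(is_aligned(sp, 4))    # meets the 4-byte guarantee
print(is_aligned(sp, 16))   # but not the 16-byte alignment gcc assumes

# What a realigning function prologue would effectively compute.
realigned = align_down(sp, 16)
print(hex(realigned), is_aligned(realigned, 16))
```

A caller that only promises 4-byte alignment can hand over a stack pointer like the one above, and code compiled on the assumption of 16-byte alignment may then use aligned SSE loads and stores on stack slots that are not actually aligned, which faults.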
@tkelman or @staticfloat can you confirm and backport this fix? |
Sure I'll check |
Confirmed - at least on master, this makes the workaround at JuliaCI/julia-buildbot#10 no longer necessary. Win32 segfaults in the umfpack test at 364cd0c when SuiteSparse is built with |
backported in 09ecd06 |
"Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks."
... this is the bug report (80 columns wide).