Windows test failure in tests: sparse.jl #9185

Closed
andre-rifaut opened this issue Nov 27, 2014 · 36 comments
Labels
system:windows Affects only Windows

Comments

@andre-rifaut

"Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks."
... this is the bug report (80 columns wide).

D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>SET tests=
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>IF NOT ! == ! GOTO RUN
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>SET tests=all
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>pushd D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>setlocal enableextensions enabledelayedexpansion
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>call "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\prepare-julia-env.bat" all
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set SYS_PATH=C:\Program Files\......
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set PATH=D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\;D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\bin;D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\usr\bin;D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\usr\bin;D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\..\usr\bin;C:\Program Files\......
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set JULIA_EXE=julia.exe
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>for %A in (julia.exe) do set JULIA_HOME=%~dp$PATH:A
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set JULIA_HOME=D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set JULIA=D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\julia.exe
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set PATH=C:\Program Files\.....
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>set private_libdir=bin
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>if not exist "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\lib\julia\sys.ji" (
echo "Preparing Julia for first launch. This may take a while"   && echo "You may see two git related errors. This is completely normal"   && cd "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\share\julia\base"   && "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\julia.exe" --build "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\lib\julia\sys0" sysimg.jl && ^
 "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\julia.exe" --build "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\lib\julia\sys" -J sys0.ji sysimg.jl   && popd   && pushd "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin"
)
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>cd "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\..\share\julia\test"
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\share\julia\test>call "D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin\julia.exe" runtests.jl all
        From worker 3:       * linalg2
        From worker 2:       * linalg1
        From worker 4:       * linalg3
        From worker 5:       * linalg4
        From worker 4:       * core
        From worker 4:       * keywordargs
        From worker 4:       * numbers
        From worker 5:       * strings
        From worker 5:       * collections
        From worker 4:       * hashing
        From worker 5:       * remote
        From worker 5:       * iobuffer
        From worker 4:       * arrayops
        From worker 5:       * reduce
        From worker 5:       * reducedim
        From worker 5:       * simdloop
        From worker 5:       * blas
        From worker 5:       * fft
        From worker 5:       * dsp
        From worker 4:       * sparse
        From worker 5:       * bitarray

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0xcd6afb8 -- unknown function (ip: 215396280)
unknown function (ip: 215396280)
unknown function (ip: 1829579430)
unknown function (ip: 2009870311)
unknown function (ip: 1829579430)
unknown function (ip: 2009869544)
unknown function (ip: 2006816973)
Worker 4 terminated.
ERROR: ProcessExitedException()
while loading D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\share\julia\test\runtests.jl, in expression starting on line 39

D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\share\julia\test>endlocal
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>popd
D:\julia\julia-0.3.3-win32\julia-installer\$_OUTDIR\bin>pause
Press any key to continue . . .
@pao changed the title from "bug report of your test script on pc windows" to "Windows test failure in sparse.jl" on Nov 28, 2014
@pao changed the title from "Windows test failure in sparse.jl" to "Windows test failure in tests: sparse.jl" on Nov 28, 2014
@pao added the "system:windows Affects only Windows" label on Nov 28, 2014
@pao
Member

pao commented Nov 28, 2014

Which version of Windows is this? 32-bit or 64-bit?

@andre-rifaut
Author

Sorry, I forgot to mention that.

32-bit
Intel Core i5 CPU M520
Windows 7 Professional, Service Pack 1

Thanks a lot.

André Rifaut


@tkelman
Contributor

tkelman commented Nov 28, 2014

Does it also fail if you run only the sparse test, via Base.runtests("sparse")? If so, could you run using Base.Test and then execute each line of https://github.com/JuliaLang/julia/blob/76e2e1d973bb8ffae5885fa1a476a8f3490827ae/test/sparse.jl in the REPL, to let us know exactly where it fails?

@tkelman
Contributor

tkelman commented Nov 28, 2014

Okay despite my pretty thorough testing last week while backporting things to release-0.3, I can reproduce this with the following using the 0.3.3 release binaries.

a = speye(5) + 0.1*sprandn(5, 5, 0.2)
b = randn(5,3) + im*randn(5,3)
(maximum(abs(a\b - full(a)\b)) < 1000*eps())

Now let's see if I can reproduce it in a source build so I can bisect to find which backport caused the issue.

@ivarne
Member

ivarne commented Nov 28, 2014

Can you rerun the tests to see if this is reproducible?
If it isn't, this will be really difficult (impossible?) to track down.

@tkelman
Contributor

tkelman commented Nov 28, 2014

Looks like it's processor dependent. I can reproduce it on a 32-bit XP desktop with a Penryn processor, but not a 64-bit Windows 7 Sandy Bridge laptop (not even building and running 32 bit Julia). Instinct tells me this is probably either a suitesparse or openblas bug.

Unfortunately I can't build Julia from source on the XP computer that I can reproduce the problem on. Maybe I'll be able to get it to happen in a VM.

@andre-rifaut
Author

I rebooted my machine and reinstalled, and the error is not the same.
Actually, before the first message I sent you yesterday, I had an error
that stopped the list of workers at the message "* random".
Now it stops at "bigint". (Your tests run concurrently on 4 processors.)

I ran the sparse test TWICE in the REPL (see below).
Do you notice the "i686-w64-mingw32" in the banner of the REPL? (Lines
are split at 80 columns.)

D:\ari\Tools\julia\Julia-0.3.3\share\julia\test>call "D:\ari\Tools\julia\Julia-0.3.3\bin\julia.exe" runtests.jl all
From worker 4: * linalg3
From worker 3: * linalg2
From worker 5: * linalg4
From worker 2: * linalg1
From worker 4: * core
From worker 4: * keywordargs
From worker 4: * numbers
From worker 5: * strings
From worker 4: * collections
From worker 5: * hashing
From worker 4: * remote
From worker 4: * iobuffer
From worker 4: * arrayops
From worker 5: * reduce
From worker 5: * reducedim
From worker 5: * simdloop
From worker 5: * blas
From worker 4: * fft
From worker 5: * dsp
From worker 4: * sparse
From worker 3: * bitarray
From worker 5: * random
From worker 5: * math
From worker 2: * functional
From worker 2: * bigint

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0xd1bafb8 -- unknown function (ip: 219918264)
unknown function (ip: 219918264)
unknown function (ip: 1829579430)
unknown function (ip: 2009870311)
unknown function (ip: 1829579430)
unknown function (ip: 2009869544)
unknown function (ip: 2006816973)
Worker 4 terminated.
ERROR: ProcessExitedException()
while loading D:\ari\Tools\julia\Julia-0.3.3\share\julia\test\runtests.jl, in expression starting on line 39

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.3 (2014-11-23 20:19 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  i686-w64-mingw32

julia> Base.runtests("sparse")
* sparse

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0xd2fafb8 -- unknown function (ip: 221228984)
unknown function (ip: 221228984)
unknown function (ip: 1829579430)
unknown function (ip: 2009870311)
unknown function (ip: 1829579430)
unknown function (ip: 2009869544)
unknown function (ip: 2006816973)
ERROR: A test has failed. Please submit a bug report including error messages above and the output of versioninfo():
Julia Version 0.3.3
Commit b24213b* (2014-11-23 20:19 UTC)
Platform Info:
System: Windows (i686-w64-mingw32)
CPU: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
WORD_SIZE: 32
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia>

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.3 (2014-11-23 20:19 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  i686-w64-mingw32

julia> Base.runtests("sparse")
* sparse

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0xd3aafb8 -- unknown function (ip: 221949880)
unknown function (ip: 221949880)
unknown function (ip: 1829579430)
unknown function (ip: 1998532583)
unknown function (ip: 1829579430)
unknown function (ip: 1998531816)
unknown function (ip: 1999804621)
unknown function (ip: 1999804634)
ERROR: A test has failed. Please submit a bug report including error messages above and the output of versioninfo():
Julia Version 0.3.3
Commit b24213b* (2014-11-23 20:19 UTC)
Platform Info:
System: Windows (i686-w64-mingw32)
CPU: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
WORD_SIZE: 32
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia>

julia>



@tkelman
Contributor

tkelman commented Nov 28, 2014

Although it doesn't quite look like it, these are all failing at the same place. The important part to note is the message Worker 4 terminated. At that point, worker 4 was running the sparse test, so it looks like it's always the same underlying failure.

Do you notice the "i686-w64-mingw32" in the banner of the REPL

Yes that's intended, it's the slightly confusing name for the 32-bit compiler that we use to build Julia for Windows. For 64 bit Julia the name is x86_64-w64-mingw32.

After some more testing, this might have something to do with the MARCH=pentium4 flag that we only recently started building the binaries with. I'm bisecting in a source build on my Sandy Bridge laptop now that I can get this to happen there.
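
The reading of the logs described in this comment (find the "Worker N terminated." line, then look up the last test that worker announced) can be sketched in a few lines of Python. This is only an illustration, assuming the output keeps the "From worker N: * name" shape pasted in this thread; the function name and the abbreviated sample log are made up for the example and are not part of Julia's test tooling.

```python
import re

# Abbreviated sample in the format pasted in this thread (assumption: the
# real runner output keeps this "From worker N: * name" shape).
SAMPLE_LOG = """\
From worker 4: * arrayops
From worker 5: * dsp
From worker 4: * sparse
From worker 3: * bitarray
From worker 5: * random
From worker 2: * bigint
Exception: EXCEPTION_ACCESS_VIOLATION at 0xd1bafb8 -- unknown function (ip: 219918264)
Worker 4 terminated.
"""

START_RE = re.compile(r"From worker (\d+):\s+\* (\S+)")
TERMINATED_RE = re.compile(r"Worker (\d+) terminated\.")


def last_test_of_terminated_worker(log_text):
    """Return (worker_id, test_name) for the first terminated worker, or None."""
    last_started = {}  # worker id -> most recent test it announced
    for line in log_text.splitlines():
        started = START_RE.search(line)
        if started:
            last_started[int(started.group(1))] = started.group(2)
            continue
        terminated = TERMINATED_RE.search(line)
        if terminated:
            worker = int(terminated.group(1))
            return worker, last_started.get(worker)
    return None


print(last_test_of_terminated_worker(SAMPLE_LOG))
```

The last test name printed before the crash (bigint here, from worker 2) is a red herring; what matters is the last test started by the worker that died.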

@andre-rifaut
Author

I ran the tests in the file sparse.jl manually. Everything is OK before the
lines you mentioned.

So I can just confirm what you detected: I get a core dump with the
3 lines you sent before:
a = speye(5) + 0.1*sprandn(5, 5, 0.2)
b = randn(5,3) + im*randn(5,3)
@test (maximum(abs(a\b - full(a)\b)) < 1000*eps())

I hope this helps.


@tkelman
Contributor

tkelman commented Nov 28, 2014

Yikes, yikes yikes. I don't understand why this would happen at all, but according to bisect the first bad commit is 478ac06 - which doesn't seem like it has anything to do with sparse matrices at all.

@ivarne
Member

ivarne commented Nov 28, 2014

@tkelman Can you try manually with 478ac06 and the parent 48facaf one more time? isimmutable is only used once in the entire repo, so it seems unlikely that it is the cause.

@tkelman
Contributor

tkelman commented Nov 28, 2014

Yeah, I thought it was bizarre too. I'm running those two commits over again several times and it's pretty consistent that 478ac06 fails make cleanall && make -j4 test-sparse, 48facaf succeeds. The sparse matrix code does involve a few different finalizers in terms of how it interfaces with umfpack IIRC, so there might be something to this. It's also worth noting that this doesn't seem to happen on master right now.
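
For readers unfamiliar with the bisection used here, the idea can be sketched in Python as below. This is a simplified illustration, not what git bisect does internally: it assumes a linear history, and every commit name except 48facaf and 478ac06 is a placeholder. The `is_bad` callback stands in for running make cleanall && make -j4 test-sparse at that commit, and it is repeated a few times because a crash like this one could be intermittent.

```python
def first_bad_commit(commits, is_bad, repeats=3):
    """Binary-search an ordered history (oldest first) for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad. A commit
    counts as bad if any of `repeats` runs fails, which guards a little
    against intermittent crashes.
    """
    def bad(commit):
        return any(is_bad(commit) for _ in range(repeats))

    good_idx, bad_idx = 0, len(commits) - 1
    while bad_idx - good_idx > 1:
        mid = (good_idx + bad_idx) // 2
        if bad(commits[mid]):
            bad_idx = mid
        else:
            good_idx = mid
    return commits[bad_idx]


# Placeholder history; only 48facaf and 478ac06 come from this thread.
history = ["aaaaaaa", "bbbbbbb", "48facaf", "478ac06", "ccccccc", "ddddddd"]


def fake_test_fails(commit):
    """Stand-in for the real test run: everything from 478ac06 on fails."""
    return history.index(commit) >= history.index("478ac06")


print(first_bad_commit(history, fake_test_fails))
```

Bisection only reports where the behaviour changes. As the rest of this thread shows, that commit may merely expose a problem that lives elsewhere.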

@tkelman
Contributor

tkelman commented Nov 28, 2014

I can't reproduce this on appveyor by setting MARCH=pentium4 on the 32 bit build of release-0.3 (https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.53/job/2am18yr4nfb4kt76), but I think that's probably because status.julialang.org/stable/win32 is still downloading 0.3.2. @staticfloat when you get a chance to update the stable links, let me know (ref missing item from the checklist that needed to be backported 2734406).

@staticfloat
Member

Done. Thanks for the reminder.

@tkelman
Contributor

tkelman commented Nov 28, 2014

No prob, do the files need a permissions tweak? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.55/job/jlu54f72skx7cwtl

edit: or do I need to add some --retry flags to my curl invocations in contrib/windows/msys_build.sh ?

@tkelman
Contributor

tkelman commented Nov 28, 2014

It seems like the bug must be somewhere in suitesparse but 478ac06 just triggers it somehow. If I rebuild SuiteSparse without having MARCH set, then build base Julia with MARCH=pentium4 set at 478ac06, there's no error. I'll see if backporting the suitesparse version bump from a month or so ago changes anything.

tkelman added a commit that referenced this issue Nov 28, 2014
use https://status.julialang.org

set MARCH on appveyor

(cherry picked from commit ce29d78)

[ci skip] since this is expected to fail on win32 right now due to #9185
@tkelman
Contributor

tkelman commented Nov 28, 2014

Bumping to the very latest SuiteSparse version doesn't fix this. It seems we have 3 options:

  1. Revert 478ac06 on release-0.3. It seems like the bug it fixed was fairly minor, certainly lower-priority than segfaulting on sparse a\b.
  2. Mess with the build flags so that -march=pentium4 does not get sent to SuiteSparse.
  3. Dig into what's going on with the finalizers here. If I run through the contents of each function that a\b calls in the REPL, then everything appears to work, but segfaults when I exit() julia. So I think maybe something's getting confused with which finalizer gets called?

Opinions? cc @JeffBezanson if you're not too busy.

tkelman added a commit that referenced this issue Nov 28, 2014
This reverts commit 478ac06
in order to fix issue #9185.
@tkelman
Contributor

tkelman commented Nov 28, 2014

Given that all Windows binaries up to and including 0.3.2 did not have MARCH set during compilation (though @ihnorton can correct me if I'm wrong here), and the fact that we don't distribute sys.dll right now, I think the best option would be removing the MARCH flag for building the Windows binaries. Have there been any instances yet of complaints that Windows binaries prior to 0.3.3 do not work on old hardware?

I suspect mingw defaults to conservative instruction sets to ensure portable binaries, so setting -march in all the dependencies is just exposing us to strange bugs. When we've resolved all the backtrace problems with sys.dll we can look at setting JULIA_CPU_TARGET (Jameson is making some good progress), but unless someone has a better solution or knows what's going on here it seems setting -march for deps, at least on Windows, causes more problems than it's worth.

@staticfloat
Member

It's possible that the only reason not setting MARCH didn't cause problems before is because the host it was being compiled on had a very conservative set of CPU instructions available to it. I can't guarantee that we'll have the same luck on the buildbots. If you'd like, I'll have the nightlies stop adding MARCH and we can at least see if they work on Appveyor.

We should probably add --retry to curl.

@tkelman
Contributor

tkelman commented Nov 28, 2014

We should probably add --retry to curl.

Yeah already did. ce29d78

Jeff may have some theories about the deeper issues here, in which case messing with the flags is also a band aid. #9188 (comment) Seems as though we would have seen these sooner.

Nightlies won't help this one since I think it might be specific to release-0.3, but it might help with #9189

@tkelman
Contributor

tkelman commented Nov 29, 2014

@andre-rifaut in the meantime while we figure out a proper solution to this, you can try a build I just made of 0.3.3 without setting MARCH, that does not exhibit this problem. http://sourceforge.net/projects/juliadeps-win/files/julia-0.3.3-i686.exe

tkelman added a commit to tkelman/julia-buildbot that referenced this issue Nov 29, 2014
This might be a x87-vs-SSE problem, since setting march to i686 for building the deps
but JULIA_CPU_TARGET to pentium4 for JIT code appears to work properly.
Need to verify that this doesn't bring back any of the i686 issues on 32 bit linux though.
@andre-rifaut
Author

Thanks for your help. I actually want to start with bigint. I'm not in a hurry. I downloaded http://sourceforge.net/projects/juliadeps-win/files/julia-0.3.3-i686.exe but there is some problem in the script for running the tests. See below.

D:\julia\Julia-0.3.3b\share\julia\test>call "D:\ari\Tools\julia\Julia-0.3.3b\bin\julia.exe" runtests.jl all
ERROR: stat: connection reset by peer (ECONNRESET)
D:\julia\Julia-0.3.3b\share\julia\test>endlocal

@tkelman
Contributor

tkelman commented Nov 29, 2014

That's strange. What are your %HOMEDRIVE% and %HOMEPATH% environment variables set to?

And is D: a network drive or local?

@andre-rifaut
Author

I rebooted my machine and re-installed julia with http://sourceforge.net/projects/juliadeps-win/files/julia-0.3.3-i686.exe
Everything seems ok. See warnings in transcript below.
Thanks a lot.

D:\julia\Julia-0.3.3\share\julia\test>call "D:\julia\Julia-0.3.3\bin\julia.exe" runtests.jl all
From worker 2: * linalg1
From worker 5: * linalg4
From worker 3: * linalg2
From worker 4: * linalg3
From worker 4: * core
From worker 4: * keywordargs
From worker 4: * numbers
From worker 5: * strings
From worker 5: * collections
From worker 5: * hashing
From worker 4: * remote
From worker 4: * iobuffer
From worker 4: * arrayops
From worker 5: * reduce
From worker 5: * reducedim
From worker 5: * simdloop
From worker 5: * blas
From worker 5: * fft
From worker 5: * dsp
From worker 4: * sparse
From worker 5: * bitarray
From worker 2: * random
From worker 4: * math
From worker 2: * functional
From worker 2: * bigint
From worker 2: * sorting
From worker 4: * statistics
From worker 2: * spawn
From worker 2: [stdio passthrough ok]
bash.exe: warning: could not find /tmp, please create!
bash.exe: warning: could not find /tmp, please create!
From worker 4: * backtrace
From worker 4: * priorityqueue
From worker 3: * arpack
From worker 4: * file
From worker 5: * suitesparse
From worker 5: * version
From worker 5: * resolve
From worker 4: * pollfd
From worker 4: * mpfr
From worker 4: * broadcast
From worker 3: * complex
From worker 2: * socket
From worker 5: * floatapprox
From worker 5: * readdlm
From worker 3: * regex
From worker 3: * float16
From worker 3: * combinatorics
From worker 2: * sysinfo
From worker 2: * rounding
From worker 2: * ranges
From worker 3: * mod2pi
From worker 3: * euler
From worker 3: * show
From worker 4: * lineedit
From worker 5: * replcompletions
From worker 4: * repl
From worker 3: * test
From worker 3: * goto
From worker 3: * examples
From worker 5: * parallel
SUCCESS
WARNING: Forcibly interrupting busy workers
WARNING: rmprocs: process 1 not removed

D:\julia\Julia-0.3.3\share\julia\test>endlocal

@tkelman
Contributor

tkelman commented Nov 29, 2014

Strange that a reboot would matter, but yeah the warnings are normal and that's a good sign.

Still don't know why changing suitesparse compilation flags would cause things to segfault.

staticfloat added a commit to JuliaCI/julia-buildbot that referenced this issue Nov 30, 2014
@vtjnash
Member

vtjnash commented Nov 30, 2014

if we see this again on master, it should come with a more useful / valid stacktrace. please post it here so we can have a somewhat better idea where the error is coming from
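
Until a symbolized trace is available, the raw numbers in the reports earlier in this thread can at least be compared across runs. The Python sketch below (helper names are made up for illustration) checks that the hexadecimal address after "at" and the decimal ip of the first frame are the same value, and lists which instruction pointers recur in two of the posted crashes. The values are copied from the first report and from the second REPL run.

```python
# Values copied from two of the crash reports in this thread.
crash_a = {
    "at": "0xcd6afb8",
    "ips": [215396280, 1829579430, 2009870311,
            1829579430, 2009869544, 2006816973],
}
crash_b = {
    "at": "0xd3aafb8",
    "ips": [221949880, 1829579430, 1998532583,
            1829579430, 1998531816, 1999804621, 1999804634],
}


def fault_matches_first_frame(crash):
    """True if the hex address after 'at' equals the first decimal ip."""
    return int(crash["at"], 16) == crash["ips"][0]


def shared_frames(a, b):
    """Instruction pointers that occur in both traces, as hex strings."""
    return sorted(hex(ip) for ip in set(a["ips"]) & set(b["ips"]))


print(fault_matches_first_frame(crash_a), fault_matches_first_frame(crash_b))
print(shared_frames(crash_a, crash_b))
```

In other words, the address in the exception line is the faulting instruction pointer itself, not the memory address being accessed, so it says nothing directly about the alignment of the data involved.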

@tkelman
Contributor

tkelman commented Nov 30, 2014

This doesn't happen on master more recently than 11d4dde. Only release-0.3 since 478ac06

@tkelman
Contributor

tkelman commented Dec 1, 2014

Thanks to @staticfloat we have rebuilt official 0.3.3 binaries that used different flags for SuiteSparse and do not exhibit this bug. Should we keep this open as something's still failing when SuiteSparse is built with -march=pentium4, or close it since we've worked around the problem for now?

@nalimilan
Member

Maybe file a bug against SuiteSparse?

@tkelman
Contributor

tkelman commented Dec 1, 2014

If strange seemingly unrelated changes in Julia didn't make this problem come and go, then I would think about trying to write a C reproduction case. I suspect the problem is more likely in LLVM, something having to do with ccall, or something in our finalizers, rather than a problem that could be tracked to suitesparse's source code.

@vtjnash
Member

vtjnash commented Dec 21, 2014

given that this happens when you set MARCH=pentium4, that means it happens when you enable SSE instructions (default is MARCH=i686). so this almost certainly means that either the stack or some malloc'd data object (typically the stack) is not 16-byte aligned when we arrive in suitesparse. gcc assumes the stack will be 16-byte aligned. windows and llvm do not (they provide 4-byte alignment)
http://msdn.microsoft.com/en-us/library/aa290049(v=vs.71).aspx

to fix this same bug in openblas, we added the appropriate build flags to openblas:

OPENBLAS_BUILD_OPTS += CFLAGS="$(CFLAGS) -mincoming-stack-boundary=2"
[edited by tkelman to add hash, hit "y" for code links]

to fix this everywhere, we set the appropriate flag in llvm:

attr->addStackAlignmentAttr(8);
[same]

the only remaining question then, is why that comment and code says 8-byte and 2-byte, when the correct statement to avoid this bug should be 16-byte and 4-byte?
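
On the "2-byte" half of that question: GCC documents the argument of -mincoming-stack-boundary as an exponent (the incoming stack is assumed to be aligned to 2 raised to that number of bytes), so =2 does mean 4 bytes, and the documented default of 4 for -mpreferred-stack-boundary means 16 bytes. The arithmetic, and why 4-byte alignment is not enough for code that assumes 16, can be sketched in Python as follows. This is an illustration only: the stack-pointer values are made up, and how addStackAlignmentAttr(8) relates to the 16-byte figure is not settled in this thread.

```python
def boundary_bytes(exponent):
    """Bytes of alignment implied by a GCC stack-boundary exponent (2**n)."""
    return 2 ** exponent


def is_aligned(address, boundary):
    """True if `address` is a multiple of `boundary` bytes."""
    return address % boundary == 0


incoming = boundary_bytes(2)  # -mincoming-stack-boundary=2
assumed = boundary_bytes(4)   # GCC's documented default for -mpreferred-stack-boundary

# Hypothetical stack pointers, each 4-byte aligned as a 32-bit Windows caller
# is said to guarantee in the comment above.
for sp in (0x0028FB80, 0x0028FB84, 0x0028FB88, 0x0028FB8C):
    print(hex(sp), is_aligned(sp, incoming), is_aligned(sp, assumed))
```

Every value passes the 4-byte check but only one of the four passes the 16-byte check, which is consistent with code compiled under the 16-byte assumption faulting when it is entered from a caller that guarantees only 4.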

@vtjnash
Member

vtjnash commented Dec 21, 2014

@tkelman or @staticfloat can you confirm and backport this fix?

@tkelman
Contributor

tkelman commented Dec 21, 2014

Sure I'll check

@tkelman
Contributor

tkelman commented Dec 21, 2014

Confirmed - at least on master, this makes the workaround at JuliaCI/julia-buildbot#10 no longer necessary. Win32 segfaults in the umfpack test at 364cd0c when SuiteSparse is built with -march=pentium4, but passes at 04893a1

vtjnash added a commit that referenced this issue Dec 21, 2014
@tkelman
Contributor

tkelman commented Dec 21, 2014

backported in 09ecd06

eschnett pushed a commit to eschnett/julia that referenced this issue Dec 21, 2014
7 participants