Recurrent intermittent travis test failure #4016
I can get this on my OSX box as well if I just run the tests. If there's anything I can do to help debug this, let me know.
So, had I paid more attention, I would have noticed that the problem occurs during the DSP tests, where the worker terminates. The following is sufficient to cause a segfault on two Linux systems that I tried:
julia> ;cd test
/home/kmsquire/Source/julia/test
julia> using Base.Test
julia> while true
include("dsp.jl")
end
Segmentation fault (core dumped)
It might be worth valgrinding this one with MEMDEBUG enabled. I did that earlier today (unrelatedly) and saw a fair number of invalid reads/writes, though I can't rule out that those were caused by my own changes.
@loladiro, will do. Right now, in the debugger, I can see that there's memory corruption.
julia> while true
include("dsp.jl")
end
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
489 p->freelist = p->freelist->next;
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) backtrace
#0 0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
#1 0x00007ffff72540ef in allocobj (sz=368) at gc.c:981
#2 0x00007ffff7242008 in _new_array (atype=0x67fa60, ndims=1, dims=0x7fffffffb7f0) at array.c:80
#3 0x00007ffff7242c95 in jl_alloc_array_1d (atype=0x67fa60, nr=40) at array.c:297
#4 0x00007ffff0e382a9 in ?? ()
#5 0x00007fffffffb930 in ?? ()
#6 0x01007ffff7242008 in ?? ()
#7 0x0000000003c48350 in ?? ()
#8 0x0000004e00000000 in ?? ()
#9 0x000000000000000b in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000000000b in ?? ()
#12 0x00007fffffffb960 in ?? ()
#13 0x0000000200000100 in ?? ()
#14 0x0000000000ad90c0 in ?? ()
#15 0x0000000000000580 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) print p
$1 = (pool_t *) 0x7ffff7fcdb68
(gdb) print *p
$2 = {osize = 384, pages = 0x3e5c380, freelist = 0x4009000000000000}
(gdb) print *(p.freelist)
Cannot access memory at address 0x4009000000000000
(gdb)
Here you go - https://gist.github.com/amitmurthy/6218188
Yup, that's the one I saw earlier today as well. I'm not quite sure, but I think it might be related to the size of the work array in gesdd that was changed recently.
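If it helps narrow this down, here is a minimal sketch (on a current Julia, where the LAPACK wrappers live in LinearAlgebra; the matrix size is illustrative only, not the one from the failing test) that hammers the gesdd path directly:

# Hypothetical stress loop: LAPACK.gesdd! overwrites its input and dispatches
# to ZGESDD for complex matrices, which is where the suspect work-array
# sizing lives.
using LinearAlgebra

for _ in 1:1000
    A = randn(ComplexF64, 40, 11)
    LAPACK.gesdd!('S', A)
end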
FWIW, I reported this upstream back when this problem originally appeared, and the documentation of ZGESDD was recently fixed (although the fix won't appear in an LAPACK release until sometime this summer).
Happens both on clang (here and here) and gcc (here).
It's also unclear where task.jl:797 is, since task.jl only has 164 lines; it is possibly from stream.jl:797. Other backtrace locations are iobuffer.jl:68 and stream.jl:609.
I was looking to see if there might be a race condition in IOBuffer, e.g., where isopen() becomes false in wait_nb before data is written, or the readnotify condition is notified before the buffer is filled, etc., but didn't see anything obvious.
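For illustration, here is a minimal sketch of the kind of ordering bug I was looking for, using a PipeBuffer and a plain Condition in place of the internal wait_nb/readnotify machinery (the names buf, cond, and reader are hypothetical, not code from stream.jl):

# Hypothetical sketch: if the reader is woken before the writer has appended
# its data, it observes an empty buffer even though the data arrives
# immediately afterwards.
buf  = PipeBuffer()
cond = Condition()

reader = @async begin
    wait(cond)              # woken by notify() below
    bytesavailable(buf)     # how many bytes the reader actually sees
end
yield()                     # let the reader task start waiting

notify(cond)                # wrong order: wake the reader first...
yield()                     # ...let it run before any data is written...
write(buf, UInt8[1, 2, 3])  # ...and only then fill the buffer

println(fetch(reader))      # prints 0: the reader woke to an empty buffer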