Segfault using Pkg on ARM #8314

Closed
ViralBShah opened this issue Sep 11, 2014 · 34 comments
Labels: bug, system:arm

Comments

@ViralBShah
Member

Doing Pkg operations crashes julia on ARM. Perhaps this is a sign of something else that needs to be addressed as part of the port.

julia> Pkg.status()

Program received signal SIGSEGV, Segmentation fault.
pool_alloc (p=0xb6fa9e30 <norm_pools+168>) at gc.c:507
507         p->freelist = p->freelist->next;
(gdb) bt
#0  pool_alloc (p=0xb6fa9e30 <norm_pools+168>) at gc.c:507
#1  allocb (sz=68, sz@entry=64) at gc.c:1023
#2  0xb625acc4 in array_resize_buffer (offs=<optimized out>, oldlen=<optimized out>, newlen=16, a=0x3045950) at array.c:562
#3  jl_array_grow_end (a=0x3045950, inc=1) at array.c:608
#4  0xb50075a4 in julia_without_linenums22522 () at inference.jl:1907
#5  0xb621ecdc in jl_apply (nargs=1, args=0xbeffc6c0, f=<optimized out>) at julia.h:987
#6  jl_apply_generic (F=0x16e2b70, args=0xbeffc6c0, nargs=1) at gf.c:1573
#7  0xb500b69c in julia_inlineable22537 () at inference.jl:2149
#8  0xb621ecdc in jl_apply (nargs=5, args=0xbeffc900, f=<optimized out>) at julia.h:987
#9  jl_apply_generic (F=0xb43b20, args=0xbeffc900, nargs=5) at gf.c:1573
#10 0xb5015610 in julia_inlining_pass22567 () at inference.jl:2654
#11 0xb50145d8 in julia_inlining_pass22567 () at inference.jl:2590
#12 0xb621ecdc in jl_apply (nargs=3, args=0xbeffcce8, f=<optimized out>) at julia.h:987
#13 jl_apply_generic (F=0x16046f0, args=0xbeffcce8, nargs=3) at gf.c:1573
#14 0xb5014c48 in julia_inlining_pass22567 () at inference.jl:2550
#15 0xb621ecdc in jl_apply (nargs=3, args=0xbeffd0fc, f=<optimized out>) at julia.h:987
#16 jl_apply_generic (F=0x16046f0, args=0xbeffd0fc, nargs=3) at gf.c:1573
#17 0xb503be10 in julia_typeinf22857 () at inference.jl:1549
#18 0xb503dd9c in jlcall_typeinf22857 () from /home/viral/julia/usr/bin/../lib/julia/sys.so
#19 0xb621ecdc in jl_apply (nargs=5, args=0xbeffd4c8, f=<optimized out>) at julia.h:987
#20 jl_apply_generic (F=0xa34eb0, args=0xbeffd4c8, nargs=5) at gf.c:1573
ViralBShah added the bug and system:arm labels Sep 11, 2014
@jakebolewski
Member

Are you able to shell out at all? It looks like there is some memory corruption happening in the GC.

@ViralBShah
Member Author

Yes, I am able to do basic things, including a git pull by shelling out using ;.

@ihnorton
Member

Pkg.status works fine with MEMDEBUG on. I didn't get much out of valgrind the first time, but I'm going to run it again.

@ViralBShah
Member Author

Cc: @dmbates

@ihnorton
Member

Link to some valgrind issues that make memcheck look inordinately scary on ARM at the moment:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=758905
https://bugs.kde.org/show_bug.cgi?id=338615

@ihnorton
Member

The freelist pointer is consistently corrupted to the same address (0x41d504a0) in gdb and valgrind. I get an invalid read and then write preceding the segfault:

==5094== Invalid read of size 4
==5094==    at 0x496D32A: pool_alloc (gc.c:507)
==5094==    by 0x496EEE3: allocb (gc.c:1023)
==5094==    by 0x495E34F: array_resize_buffer (array.c:562)
==5094==    by 0x495E58D: jl_array_grow_end (array.c:608)
==5094==    by 0x6D091A7: julia_without_linenums22694 (inference.jl:1907)
==5094==    by 0x48D9BCB: jl_apply (julia.h:987)
==5094==    by 0x48DD149: jl_apply_generic (gf.c:1573)
==5094==    by 0x6D0D29F: julia_inlineable22709 (inference.jl:2149)
==5094==    by 0x48D9BCB: jl_apply (julia.h:987)
==5094==    by 0x48DD149: jl_apply_generic (gf.c:1573)
==5094==    by 0x6D17213: julia_inlining_pass22739 (inference.jl:2654)
==5094==    by 0x6D161DB: julia_inlining_pass22739 (inference.jl:2590)
==5094==  Address 0x41d504a0 is not stack'd, malloc'd or (recently) free'd
==5094==
==5094== Invalid write of size 4
==5094==    at 0x496D334: pool_alloc (gc.c:508)
==5094==    by 0x496EEE3: allocb (gc.c:1023)
==5094==    by 0x495E34F: array_resize_buffer (array.c:562)
==5094==    by 0x495E58D: jl_array_grow_end (array.c:608)
==5094==    by 0x6D091A7: julia_without_linenums22694 (inference.jl:1907)
==5094==    by 0x48D9BCB: jl_apply (julia.h:987)
==5094==    by 0x48DD149: jl_apply_generic (gf.c:1573)
==5094==    by 0x6D0D29F: julia_inlineable22709 (inference.jl:2149)
==5094==    by 0x48D9BCB: jl_apply (julia.h:987)
==5094==    by 0x48DD149: jl_apply_generic (gf.c:1573)
==5094==    by 0x6D17213: julia_inlining_pass22739 (inference.jl:2654)
==5094==    by 0x6D161DB: julia_inlining_pass22739 (inference.jl:2590)
==5094==  Address 0x41d504a0 is not stack'd, malloc'd or (recently) free'd

Which is something I already know - the pool is already corrupted at that point.

The crash still occurs if gc_disable() is called at the prompt, so the only thing I can think of is that the pool is getting corrupted during startup. I tried stopping early in startup and manually disabling gc from gdb, but that just crashes.

It seems very odd that the pointer is always set to the same (incorrect) value: allocations don't usually feel that deterministic. I must be missing some simple explanation for that.
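
For readers unfamiliar with pool allocators, here is a minimal sketch of the freelist pop that the faulting line (`p->freelist = p->freelist->next`) performs. It is not the code from gc.c: `cell_t`, `pool_t`, `pool_init`, and `pool_pop` are hypothetical names, and the real collector does more bookkeeping. It may help show why a bad link written into a free cell only faults at a later allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the GC's cell and pool types (names are assumptions). */
typedef struct cell {
    struct cell *next;   /* link stored inside a free cell */
} cell_t;

typedef struct {
    cell_t *freelist;    /* head of the list of free cells */
} pool_t;

/* Allocate `count` cells of `cell_size` bytes and thread them into a free list.
   Returns the base address (the caller frees it), or NULL if malloc fails. */
static void *pool_init(pool_t *p, size_t cell_size, size_t count)
{
    assert(count > 0 && cell_size >= sizeof(cell_t));
    assert(cell_size % sizeof(void *) == 0);   /* keeps every link aligned */
    char *base = malloc(cell_size * count);
    p->freelist = (cell_t *)base;
    if (base == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        cell_t *c = (cell_t *)(base + i * cell_size);
        c->next = (i + 1 < count) ? (cell_t *)(base + (i + 1) * cell_size) : NULL;
    }
    return base;
}

/* Pop the head cell. The new head is read from the popped cell's own link. */
static void *pool_pop(pool_t *p)
{
    cell_t *c = p->freelist;
    if (c == NULL)
        return NULL;
    p->freelist = c->next;
    return c;
}
```

If anything overwrites the `next` field of a cell while it sits on the list, the pop that returns that cell copies the garbage into `freelist`, and the following pop dereferences it. That would be consistent with the crash showing up in `pool_alloc` rather than at the write that caused it.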

@ViralBShah
Member Author

I linked with electric fence and it crashes during startup.

(gdb) r
Starting program: /home/viral/julia/usr/bin/julia-debug 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

  Electric Fence 2.2 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>

ElectricFence Aborting: free(f9ce8): address not from malloc().

Program received signal SIGILL, Illegal instruction.
0xb6046296 in kill () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0xb6046296 in kill () at ../sysdeps/unix/syscall-template.S:81
#1  0xb6a0e010 in do_abort () from /home/viral/julia/usr/bin/../lib/libjulia-debug.so
#2  0xb6a0e05a in EF_Abortv () from /home/viral/julia/usr/bin/../lib/libjulia-debug.so
#3  0xb6a0e078 in EF_Abort () from /home/viral/julia/usr/bin/../lib/libjulia-debug.so
#4  0xb6a0d760 in free_locked () from /home/viral/julia/usr/bin/../lib/libjulia-debug.so
#5  0xb6a0ddd0 in free () from /home/viral/julia/usr/bin/../lib/libjulia-debug.so
#6  0xb627dab0 in uv__dlerror (lib=<optimized out>) at src/unix/dl.c:71
#7  0xb6241938 in jl_dlsym_e (handle=0xaa693ff8, symbol=0xb3a787a8 "MKL_Set_Num_Threads") at dlload.c:168
#8  0xb61e6128 in emit_cglobal (args=0xaa600a8c, nargs=2, ctx=0xbeffd624) at ccall.cpp:560
#9  0xb61ecf52 in emit_intrinsic (f=JL_I::cglobal, args=0xaa600a8c, nargs=2, ctx=0xbeffd624)
    at intrinsics.cpp:873
#10 0xb61f5600 in emit_known_call (ff=0xb2d22ba8, args=0xaa600a8c, nargs=2, ctx=0xbeffd624, 
    theFptr=0xbeffd24c, theF=0xbeffd250, expr=0xaa7a2070) at codegen.cpp:1663
#11 0xb61f8b1e in emit_call (args=0xaa600a8c, arglen=3, ctx=0xbeffd624, expr=0xaa7a2070) at codegen.cpp:2285
#12 0xb61fa674 in emit_expr (expr=0xaa7a2070, ctx=0xbeffd624, isboxed=false, valuepos=false)
    at codegen.cpp:2750
#13 0xb61ffc3c in emit_function (lam=0xaffcead0, cstyle=false) at codegen.cpp:3971
#14 0xb61e11fa in to_function (li=0xaffcead0, cstyle=false) at codegen.cpp:573
#15 0xb61e17b4 in jl_compile (f=0xaa7aa9b0) at codegen.cpp:694
#16 0xb61c5af0 in jl_get_specialization (f=0xb0c94190, types=0xb3a42ff0) at gf.c:1380
#17 0xb61f573a in emit_known_call (ff=0xb0c94190, args=0xaa7c106c, nargs=0, ctx=0xbeffe394, 
    theFptr=0xbeffde0c, theF=0xbeffde10, expr=0xaa7a1a70) at codegen.cpp:1686
#18 0xb61f8b1e in emit_call (args=0xaa7c106c, arglen=1, ctx=0xbeffe394, expr=0xaa7a1a70) at codegen.cpp:2285
#19 0xb61fa674 in emit_expr (expr=0xaa7a1a70, ctx=0xbeffe394, isboxed=true, valuepos=true) at codegen.cpp:2750
#20 0xb61f9b16 in emit_assignment (l=0xb3a78628, r=0xaa7a1a70, ctx=0xbeffe394) at codegen.cpp:2567
#21 0xb61fa6a6 in emit_expr (expr=0xaa7a1a60, ctx=0xbeffe394, isboxed=false, valuepos=false)
    at codegen.cpp:2754
#22 0xb61ffc3c in emit_function (lam=0xaffcea30, cstyle=false) at codegen.cpp:3971
#23 0xb61e11fa in to_function (li=0xaffcea30, cstyle=false) at codegen.cpp:573
#24 0xb61e17b4 in jl_compile (f=0xaa7aa980) at codegen.cpp:694
#25 0xb61c5af0 in jl_get_specialization (f=0xb1581e50, types=0xb3a42ff0) at gf.c:1380
#26 0xb61f573a in emit_known_call (ff=0xb1581e50, args=0xaae7bfac, nargs=0, ctx=0xbeffef54, 
    theFptr=0xbeffeb7c, theF=0xbeffeb80, expr=0xaa7a18e0) at codegen.cpp:1686
#27 0xb61f8b1e in emit_call (args=0xaae7bfac, arglen=1, ctx=0xbeffef54, expr=0xaa7a18e0) at codegen.cpp:2285
#28 0xb61fa674 in emit_expr (expr=0xaa7a18e0, ctx=0xbeffef54, isboxed=false, valuepos=false)
    at codegen.cpp:2750
#29 0xb61ffc3c in emit_function (lam=0xaffce620, cstyle=false) at codegen.cpp:3971
#30 0xb61e11fa in to_function (li=0xaffce620, cstyle=false) at codegen.cpp:573
#31 0xb61e17b4 in jl_compile (f=0xaae066d0) at codegen.cpp:694
#32 0xb61cca54 in jl_trampoline_compile_function (f=0xaae066d0, always_infer=0, sig=0xb3a3eff0)
    at builtins.c:805
#33 0xb61ccb28 in jl_trampoline (F=0xaae066d0, args=0x0, nargs=0) at builtins.c:819
#34 0xb61c2c94 in jl_apply (f=0xaae066d0, args=0x0, nargs=0) at julia.h:987
#35 0xb61c62bc in jl_apply_generic (F=0xb2736250, args=0x0, nargs=0) at gf.c:1592
#36 0xb61cf404 in jl_apply (f=0xb2736250, args=0x0, nargs=0) at julia.h:987
#37 0xb61d081a in jl_module_run_initializer (m=0xb29a37d0) at module.c:440
#38 0xb624acfc in jl_init_restored_modules () at dump.c:1108
#39 0xb6243ff6 in julia_init (imageFile=0x165c0 "/home/viral/julia/usr/bin/../lib/julia/sys.ji") at init.c:970
#40 0x00009e92 in main (argc=0, argv=0xbefffbd8) at repl.c:378

@ViralBShah
Member Author

It seems that emit_cglobal is generating something that cannot be freed when a library lookup fails - since MKL is not installed here.

@ViralBShah
Member Author

Here's a cglobal example that produces a segfault, which is reported in #8464.

julia> x = cglobal((:errno, :libc), Int32)
Ptr{Int32} @0xb6feb4d0

julia> cglobal((:errno, :libc), Int32)
julia: cgutils.cpp:1501: jl_value_t* static_constant_instance(llvm::Constant*, jl_value_t*): Assertion `((((jl_value_t*)(jt))->type)==(jl_value_t*)(jl_tuple_type))' failed.

signal (6): Aborted
Aborted (core dumped)

@ihnorton
Member

How are you linking EF? If it is statically linked into Julia, then maybe malloc and free are unhooked in some invocations, so EF throws the error when it sees a free without a corresponding malloc - even though malloc was actually called. (in other words, I think this might be a red herring)

I believe the recommended way to do this is to LD_PRELOAD EF as a shared library: http://www.fifi.org/doc/electric-fence/README.gdb

The problem is that makes everything glacially slow (I've been running Pkg.status() for over an hour now after doing that via gdb).

cc @Keno

@ViralBShah
Member Author

I just added the libefence.a to DEBUG_LIBS. I don't think it is a red herring because the cglobal example above crashes even without using libefence. With libefence, I cannot even start Julia and it crashes in startup in emit_cglobal. Without libefence, I can start julia, run the cglobal example and get a crash - probably because the LLVM assertions are turned on. I guess that could be happening because of other corruption, and perhaps this is not conclusive.

@ViralBShah
Member Author

It is surprising that you were able to start Julia with efence loaded. That does not happen for me - perhaps for the reasons you mentioned?

@ihnorton
Member

efence is supposed to crash on the first untracked free that you hit, so if it is not seeing a malloc then it will crash even if a free is otherwise safe. Also: the same uv__dlerror crash happens on amd64 linux.

The crash starts here:
#6 0x00007ffff69f5ec1 in uv__dlerror (lib=<optimized out>) at src/unix/dl.c:71

where libuv is trying to free(lib->errmsg), which I think should only be allocated by the strdup in the next block. I'm not following how that is related to the cglobal error.

(I can't start Julia either; was prematurely testing with julia -e)
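
The untracked-free behaviour described in this comment can be illustrated with a toy tracking allocator. This is only a sketch of the idea, not Electric Fence's implementation: `hooked_malloc`, `hooked_free`, and the fixed-size table are assumptions made for illustration, and the sketch reports the condition instead of aborting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_TRACKED 64

static void *tracked[MAX_TRACKED];   /* pointers handed out by the hooked path */
static size_t n_tracked;

/* Allocate through the "hooked" path: the pointer is recorded. */
static void *hooked_malloc(size_t n)
{
    assert(n_tracked < MAX_TRACKED);
    void *p = malloc(n);
    if (p != NULL)
        tracked[n_tracked++] = p;
    return p;
}

/* Returns true if p was recorded and has now been released. A false result
   mirrors the "address not from malloc()" condition: the pointer may be
   perfectly valid, it was just allocated where the hook could not see it. */
static bool hooked_free(void *p)
{
    for (size_t i = 0; i < n_tracked; i++) {
        if (tracked[i] == p) {
            tracked[i] = tracked[--n_tracked];
            free(p);
            return true;
        }
    }
    return false;
}
```

A block obtained from plain `malloc` (standing in for a statically linked library whose allocator calls were not intercepted) is rejected by `hooked_free` even though freeing it would be safe, which is the red-herring scenario suggested earlier in the thread.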

@ihnorton
Member

See explanation here: https://bugzilla.mozilla.org/show_bug.cgi?id=760227#c1

@ihnorton
Member

You can get past that efence stop by dynamically linking libuv - and then LLVM. But after that, it seems that there is a semaphore block between efence and LLVM.

@ViralBShah
Member Author

Ok - so the search continues then.

@ViralBShah
Member Author

cc: @vtjnash

@ihnorton
Member

I believe I have isolated this to jlcall_stat. So far I can't see exactly where it is happening because when I set a watch on the address, gdb gets stuck in between jlcall_stat and dl_open_workers (I think).

@ihnorton
Member

julia> stat("METADATA/AffineTransforms/url")

signal (11): Segmentation fault
Segmentation fault

@vtjnash
Member

vtjnash commented Sep 25, 2014

that tends to indicate that you've messed up your compiler flags between libuv and julia, resulting in the wrong size (stack-allocated) stat_t struct.

@ihnorton
Member

Never mind, I was looking at the size of the kernel stat structs, which obviously don't match libuv's stat struct.

@ViralBShah
Member Author

So, is this a libuv issue or a julia compiler flags issue?

@ViralBShah
Member Author

In options.h, if I enable MEMDEBUG, I do not get these segfaults.

localhost:~/julia $ ./usr/bin/julia-debug
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+1181 (2014-10-22 18:33 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e9bb3e7* (0 days old master)
|__/                   |  arm-linux-gnueabihf

julia> stat("/home/viral/julia")
StatStruct(mode=040775, size=4096)

julia> Pkg.status()
No packages installed

@ihnorton
Member

I don't think the allocator uses a freelist pool with MEMDEBUG, so the corruption might not show up in the same place, if at all. The size 72 freelist is consistently corrupted when using Pkg.status(), but watchpoints and electric fence both hang the process in jlcall_stat when watching the relevant address.

@vtjnash as far as I can tell, the struct size is consistently 104 (as it should be, since we compile with -D_FILE_OFFSET_BITS=64 and that is passed in to libuv).
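
One way to check a flag mismatch like the one suggested earlier is to have each component report the struct size it was compiled with. The sketch below is an assumption about how such a check could look, not something from the Julia or libuv sources: `stat_struct_size`, `stat_offset_width`, and `stat_sizes_agree` are hypothetical helpers, and the value they report depends on the platform and flags of the build (the 104 figure above is the one quoted for arm-linux-gnueabihf).

```c
#define _FILE_OFFSET_BITS 64   /* must come before every system header */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Size of struct stat as seen by this translation unit. */
static size_t stat_struct_size(void)
{
    return sizeof(struct stat);
}

/* Width of off_t; 8 when 64-bit file offsets are in effect. */
static size_t stat_offset_width(void)
{
    return sizeof(off_t);
}

/* Compare against the size another component reports it was built with. */
static int stat_sizes_agree(size_t other_component_size)
{
    assert(other_component_size > 0);
    return other_component_size == sizeof(struct stat);
}
```

Compiling the same helpers into both libraries and comparing the two results at startup would show whether a caller could be stack-allocating a struct of the wrong size.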

@ViralBShah
Member Author

It gave me some comfort that the system image is building ok, and the issue is localized to the allocator, rather than some other corruption manifesting here.

@ihnorton
Member

I stepped through with valgrind's gdb server and got some useful info (vgdb forces sequential execution so it is slow but at least doesn't hang).

The corruption happens here (see this gist for more context)

(gdb) p pools[14]->freelist->next
$6 = (struct _gcval_t *) 0x20641b0
(gdb) si

    0x750b5ad0 <jlcall_stat38271+172>       vstr   d9, [r0, #68]   ; 0x44

0x750b5ad4 in jlcall_stat38271 () from /home/user/dev/julia/usr/bin/../lib/julia/sys.so
(gdb) p pools[14]->freelist->next
$7 = (struct _gcval_t *) 0x41d504a0

where r0 holds the address we just got from allocobj(sz=68) to hold a StatStruct.

(gdb) p (void*)$r0
$10 = (void *) 0x2064120

Note that the corrupted memory is the next (currently free) block at $r0 + 72, which leads to the segfault the next time we try to allocate something, because ->next points to garbage.

(gdb) p pools[14]->freelist
$8 = (gcval_t *) 0x2064168   # <---  0x2064120 + 72

If I am reading the assembly and the ARM instruction docs correctly, this:

vstr d9, [r0, #68] ; 0x44

does a 64-bit store at r0+68. That shouldn't happen because the StatStruct that is represented at that address should only be 68 bytes total.
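
The arithmetic behind this observation can be checked with a small sketch. It assumes the figures quoted in the thread (72-byte pool cells, an 8-byte store at offset 68, 32-bit links on arm32). `spill_bytes` and `store_clobbers_next_link` are illustrative names, and the byte buffer only models the pool; it is not the GC's data structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bytes of an n-byte store at `offset` that land beyond a cell of `cell_size` bytes. */
static unsigned spill_bytes(unsigned offset, unsigned n, unsigned cell_size)
{
    assert(cell_size > 0);
    unsigned end = offset + n;
    return end > cell_size ? end - cell_size : 0;
}

/* Model two adjacent 72-byte cells. The second (free) cell starts with a
   32-bit link, as on arm32. Perform an 8-byte store at offset 68 of the
   first cell and report whether the neighbour's link changed. */
static int store_clobbers_next_link(uint32_t link_before, uint32_t *link_after)
{
    enum { CELL = 72 };
    unsigned char pool[2 * CELL];
    memset(pool, 0, sizeof pool);
    memcpy(pool + CELL, &link_before, sizeof link_before);

    /* Arbitrary 8-byte payload standing in for the contents of d9. */
    const unsigned char payload[8] = {0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA};
    memcpy(pool + 68, payload, sizeof payload);   /* models: vstr d9, [r0, #68] */

    memcpy(link_after, pool + CELL, sizeof *link_after);
    return *link_after != link_before;
}
```

With these inputs the store covers bytes 68 through 75, so `spill_bytes(68, 8, 72)` is 4: exactly the width of a 32-bit `next` pointer at the start of the neighbouring cell. The same store at offset 64 would stay inside the cell.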

Here is the IR for that function:

define %jl_value_t* @jlcall_stat38271(%jl_value_t*, %jl_value_t**, i32) {
top:
  ; call the actual function
  %3 = getelementptr %jl_value_t** %1, i32 0
  %4 = load %jl_value_t** %3
  %5 = call %StatStruct.11 @julia_stat38271(%jl_value_t* %4)
  %6 = load %jl_value_t** @"+Main.Base.StatStruct37767"
  ; allocate storage
  %7 = call %jl_value_t* @allocobj(i32 72)
  %8 = bitcast %jl_value_t* %7 to %jl_value_t**
  store %jl_value_t* %6, %jl_value_t** %8
  %9 = bitcast %jl_value_t* %7 to %jl_value_t**
  %10 = getelementptr %jl_value_t** %9, i32 1
  %11 = bitcast %jl_value_t** %10 to %StatStruct.11*
  ; copy the local into the alloc'd space
  store %StatStruct.11 %5, %StatStruct.11* %11
  ret %jl_value_t* %7
}

which seems correct. So if the above analysis is accurate, there may be something wrong with how LLVM is calculating the offsets for the vstr. On the other hand, as @ViralBShah says, the whole thing bootstraps and starts so it would be weird to have such a deep problem. Hopefully I am misreading this and it is something simpler.

One thing I'm not sure about is what the ; 0x44 means after the vstr instruction. In AT&T assembly that should just be a comment, right?
Also, I'm not entirely sure why there are two definitions in the IR for StatStruct: one under its own name and one as StatStruct.11, which is the one used in the functions above.

@ViralBShah
Member Author

Should we try with llvm-svn?

@ihnorton
Member

Already using it - pulled master two days ago.

@vtjnash
Member

vtjnash commented Oct 27, 2014

@ihnorton what is the code_native for that function?

the comment 0x44 == 68, so it's just giving the hex representation of a number earlier on the line (in a comment)

@vtjnash
Member

vtjnash commented Oct 27, 2014

it looks like it may have double-counted the getelementptr %jl_value_t** %9, i32 1 during the transition from storing items from registers vs. the stack – notice that the store to #32 is missing from that disassembly

@ihnorton
Member

@vtjnash, thanks for that observation, that is extremely helpful. I can't access the system right now but I'll post the full disassembly later and also look at what Clang thinks the struct alignment should be on ARM (based on your hint and some very brief reading on ARM alignment rules, I think we might be under-allocating).

@vtjnash
Member

vtjnash commented Oct 27, 2014

Alignment should be 4, if I read the doc for the instruction right, which aligns with Julia's assumption for alignment. Regardless, #32 seems arbitrary to decide to realign, if it was our issue

ihnorton added a commit to ihnorton/julia that referenced this issue Oct 28, 2014
@ihnorton
Member

@vtjnash
Member

vtjnash commented Oct 28, 2014

@JeffBezanson is there an assumption in allocobj that MAX_ALIGN <= sizeof(void*)?
e.g. how would it pad the following on arm32 or win32:

type Z
  data::Float64
end

we would probably want:

struct Z {
  int32_t type;
  int32_t pad;
  double data;
};

or instead, so that the data offset is constant:

struct Z {
  union {
    void *type;
    MAX_ALIGN_TYPE pad;
  };
  double data;
};

not the following, which is what I assume it does now

struct Z {
  int32_t *type;
  double data;
};

of course, in some respects, this is just a question of whether it is valid for the compiler to choose aligned SIMD instructions (vmovsd, etc) on x64, which would expect a 16-byte aligned data field.
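
The layouts being compared can be inspected with `offsetof`. The sketch below is an illustration under stated assumptions, not Julia's object layout code: the type tag is written as `uint32_t` so that the 32-bit case can be examined on any host, `z_unpadded`, `z_padded`, and `round_up` are hypothetical names, and `__attribute__((packed))` is a GCC/Clang extension used only to force the tag-then-data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit tag followed immediately by the payload: `data` lands at offset 4. */
struct z_unpadded {
    uint32_t type;
    double data;
} __attribute__((packed));

/* Explicit pad keeps `data` at a fixed offset that is a multiple of 8. */
struct z_padded {
    uint32_t type;
    uint32_t pad;
    double data;
};

/* Round `offset` up to the next multiple of `align` (a power of two). */
static size_t round_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}
```

Here `offsetof(struct z_unpadded, data)` is 4 while `offsetof(struct z_padded, data)` is 8, which is `round_up(4, 8)`. Whether the 4-byte offset is acceptable depends on the alignment the generated loads and stores assume, which is the open question in this comment.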
