Bootstrap failure on ARMv7 #10602

Closed
nalimilan opened this issue Mar 22, 2015 · 45 comments
Labels
system:arm ARMv7 and AArch64

Comments

@nalimilan
Member

When trying to build a nightly RPM on Fedora 22 armv7hl, I get failures during the bootstrap step. Depending on the value of JULIA_CPU_TARGET I use, the error is different:

  • With native (JULIA_CPU_TARGET not passed):
osutils.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 82: LoadError(at "osutils.jl" line 38: BoundsError(a=Array{Any, 1}[(), ()], i=(0,))))
Makefile:168: recipe for target '/builddir/build/BUILD/julia/build/usr/lib/julia/sys0.o' failed

https://kojipkgs.fedoraproject.org//work/tasks/26/9290026/build.log

  • With cortex-a8:
osutils.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 82: LoadError(at "osutils.jl" line 38: Base.TypeError(func=:getfield, context="", expected=DataType, got=<?::false>)))
signal (11): Segmentation fault
/bin/sh: line 1:  7563 Segmentation fault      (core dumped) /builddir/build/BUILD/julia/build/usr/bin/julia -C cortex-a8 --build /builddir/build/BUILD/julia/build/usr/lib/julia/sys0 sysimg.jl
Makefile:168: recipe for target '/builddir/build/BUILD/julia/build/usr/lib/julia/sys0.o' failed

https://kojipkgs.fedoraproject.org//work/tasks/1460/9291460/build.log

I'm not familiar with ARM at all, so I'm not sure which one is the more reasonable (maybe neither). Note that I'm using USE_SYSTEM_LLVM=1, which is LLVM 3.5. A difficulty is that gcc, which is used in the build, uses -march=armv7-a on Fedora armv7hl, but LLVM does not accept this as a target. This is why I tried JULIA_CPU_TARGET=cortex-a8. Fedora's LLVM is built using these settings:

  --with-cpu=cortex-a8 \
  --with-tune=cortex-a8 \
  --with-arch=armv7-a \
  --with-float=hard \
  --with-fpu=vfpv3-d16 \
  --with-abi=aapcs-vfp

http://pkgs.fedoraproject.org/cgit/llvm.git/tree/llvm.spec?h=f22&id=5aea06bdf020fd2fc750286d397e51e01a94a765#n404

I see that ARM.inc advises --with-cpu=cortex-a9 --with-fpu=neon instead. Do you think that can be an issue?

BTW, note JCFLAGS includes -fsigned-char.

@nalimilan nalimilan added the system:arm ARMv7 and AArch64 label Mar 22, 2015
@vtjnash
Member

vtjnash commented Mar 22, 2015

you must match the --with-float and --with-abi flags exactly between the compilers (system libraries, gcc, llvm, and julia) or you will get random data corruption and segfaults. the rest (cpu, tune, arch) generally don't matter and simply describe the available instruction set.
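One way to check the compiler side of this is to look at the macros it predefines. The sketch below classifies a macro dump read from stdin. It assumes GCC's `__ARM_PCS_VFP` (hard-float calling convention) and `__SOFTFP__` (soft float) predefines; the function name `float_abi_from_macros` is made up for illustration, and the sample input is hand-written.

```shell
# Classify the float ABI from a compiler's predefined-macro dump on stdin.
# Assumption: __ARM_PCS_VFP marks the hard-float calling convention and
# __SOFTFP__ marks soft float; anything else is reported as undetermined.
float_abi_from_macros() {
    macros=$(cat)
    case "$macros" in
        *__ARM_PCS_VFP*) echo "hard" ;;
        *__SOFTFP__*)    echo "soft" ;;
        *)               echo "softfp or non-ARM target" ;;
    esac
}

# Intended use on the build machine:
#   gcc -dM -E -x c /dev/null | float_abi_from_macros

# Demonstration on a hand-written macro dump:
printf '#define __ARM_ARCH 7\n#define __ARM_PCS_VFP 1\n' | float_abi_from_macros
```

Running the same check with each compiler involved in the build should give the same answer if the ABIs really match.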

@nalimilan
Member Author

@vtjnash That's the case AFAICT, as the values of these two flags passed to gcc when building Julia are exactly the same as those used to build LLVM (that's the point of a distribution, after all). But I'm not sure how code generation from Julia works: what flags are used in that case? Those used to build LLVM? Some other default? Can that be the problem?

Another surprising fact is that changing the value of JULIA_CPU_TARGET changes the error, which IIUC should not happen if, as you say, it doesn't matter. Or maybe it just changes the symptom of an underlying bug due to something else.

@nalimilan
Member Author

I get yet another error on Fedora rawhide (gcc 5 with _GLIBCXX_USE_CXX11_ABI=1):

osutils.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 82: LoadError(at "osutils.jl" line 38: Base.ArgumentError(msg="set must be non-empty")))
Makefile:168: recipe for target '/builddir/build/BUILD/julia/build/usr/lib/julia/sys0.o' failed

https://kojipkgs.fedoraproject.org//work/tasks/7279/9357279/build.log
https://kojipkgs.fedoraproject.org//work/tasks/7338/9357338/build.log

@ViralBShah Any suggestions?

EDIT: JULIA_CPU_TARGET=cortex-a9 does not make any difference.

@ViralBShah
Member

I suspect it is a julia bug, and we should run it through valgrind or something to figure things out. It is quite common for the build to die in the system image phase in different places on different ARM architectures.

Cc: @ihnorton

@ViralBShah
Member

I wonder if Jameson's patches for ppc will help - especially the bit relating to cpuid stuff.

@vtjnash
Member

vtjnash commented Mar 30, 2015

i didn't patch any cpuid stuff, just deleted some dead code to make it easier to build.

fwiw, a little known fact is that the sys.ji file is actually fairly portable. if you have one from a similar build SHA1 on a similar machine (same OS and WORD_SIZE), you can just copy it over. For example, I'm currently using a sys.ji from my normal x86_64 ubuntu build to debug and test the PPC64le build.

@ViralBShah
Member

Yes, I did mean disabling it. The contents suggest hard coded values for armv7, which probably don't work out well on other arm processors. I will try using sys.ji from elsewhere.

@vtjnash
Member

vtjnash commented Mar 30, 2015

i don't think those values were actually read anywhere that mattered.

you might also try enabling MEMDEBUG (in options.h) and see if that helps

@nalimilan
Member Author

I ran the bootstrap inside Valgrind, with MEMDEBUG enabled. It's been running for almost a day now, and it's still not finished. So far Valgrind has uncovered a few things, but the traces are very poor. I've tried running julia-debug instead of julia, and still the same. What should I do to fix this? Is there any value in this kind of information?

 cd base && valgrind --smc-check=all-non-file --suppressions=../contrib/valgrind-julia.supp /builddir/build/BUILD/julia/build/usr/bin/julia-debug -C cortex-a8 --build /builddir/build/BUILD/julia/build/usr/lib/julia/sys0 sysimg.jl
==3822== Memcheck, a memory error detector
==3822== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==3822== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==3822== Command: /builddir/build/BUILD/julia/build/usr/bin/julia-debug -C cortex-a8 --build /builddir/build/BUILD/julia/build/usr/lib/julia/sys0 sysimg.jl
==3822== 
==3822== Conditional jump or move depends on uninitialised value(s)
==3822==    at 0x401B984: index (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Conditional jump or move depends on uninitialised value(s)
==3822==    at 0x401B988: index (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Conditional jump or move depends on uninitialised value(s)
==3822==    at 0x4008A5C: fillin_rpath (in /usr/lib/ld-2.21.90.so)
==3822==    by 0x4009283: _dl_init_paths (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Conditional jump or move depends on uninitialised value(s)
==3822==    at 0x401BE40: strcpy (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Use of uninitialised value of size 4
==3822==    at 0x401BD78: strcpy (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Use of uninitialised value of size 4
==3822==    at 0x401BD84: strcpy (in /usr/lib/ld-2.21.90.so)
==3822== 
==3822== Invalid read of size 4
==3822==    at 0x401BD20: strcpy (in /usr/lib/ld-2.21.90.so)
==3822==  Address 0x7777ac0 is 16 bytes inside a block of size 19 alloc'd
==3822==    at 0x4848AAC: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==3822== 
exports.jl
base.jl
reflection.jl
build_h.jl
version_git.jl
c.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
operators.jl
pointer.jl
refpointer.jl
rounding.jl
float.jl
==3822== Invalid read of size 4
==3822==    at 0x401BE6C: strcpy (in /usr/lib/ld-2.21.90.so)
==3822==  Address 0x7e45a80 is 24 bytes inside a block of size 25 alloc'd
==3822==    at 0x4848AAC: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==3822== 

https://kojipkgs.fedoraproject.org//work/tasks/4521/9414521/build.log

@ViralBShah
Member

I am working on setting up a machine at scaleway.com. That should hopefully allow people to get in and fix things quickly.

@nalimilan
Member Author

Cool. Though the Fedora build system is not that bad, I can easily make some local changes and send them for build. (Not sure about the speed of the build machine compared with Scaleway's.)

@nalimilan
Member Author

Crap, the build timed out after 24h. I tried using gdb instead of Valgrind, but a bug prevents it from working on Fedora arm at the moment.

So if you have a Fedora/RHEL image on Scaleway, I'd be happy to try there.

@sbromberger
Contributor

I'll repeat the offer I made on julia-users: if someone wants a shell account on a Raspberry Pi 2 to test builds, I'm happy to provide one (or more). Just let me know. I have been unsuccessful in getting Julia to build (see #10235) and would welcome someone with some expertise here.

@ViralBShah
Member

My chromebook environment got wiped away due to an error on my part during reboot. It will be a while for me to set it up again. This may not be sooner than next week.

@ViralBShah
Member

I am guessing this one is also the ARM architecture not being detected correctly as in #10917

@nalimilan
Member Author

@ViralBShah Shouldn't passing JULIA_CPU_TARGET=cortex-a9 be enough if the automatic detection fails?

@ViralBShah
Member

I would have thought so, but could you try JULIA_CPU_ARCH=arm1176jzf-s? That is the one from Raspberry Pi 1, that ought to be conservative enough.

Can you post the /proc/cpuinfo for the build machine? @vtjnash should be able to say more.

@PallHaraldsson
Contributor

"could you try JULIA_CPU_ARCH=arm1176jzf-s? That is the one from Raspberry Pi 1, that ought to be conservative enough."

Not sure. This is for an ARM11 chip, which is pre-ARMv7 (yes, I know it's confusing). It seems from the Wikipedia RPi article that only a few operating systems work on the RPi 2, not all of those that work on the original. [Still, the GPU, and I think everything else besides the ARM core, is the same.]

ARM cores are not always fully upward compatible. E.g., the 26-bit addressing mode was dropped at some point. [And ARMv8-A, e.g., isn't compatible with ARMv7 in kernel space, but is in user space.]

I'm not sure I can help more, not even sure what "I get failures during the bootstrap step" means.. and do not have either Pi/ARM-chip except in my ARMv7 phone I'm using and ARMv6 phone I'm not using, but could root/whatever.. and my original Acorn Archimedes (kind of broken..), think I had some programmers ref./ARM? manuals at some point..

vtjnash added a commit that referenced this issue Jul 8, 2015
this seems likely to be a better default for the majority of users. plus the failure mode is a SIGILL, rather than random data getting returned from floating point computations
@ViralBShah
Member

@nalimilan Can you try once again?

@ViralBShah
Member

Please reopen if reproducible.

@yuyichao
Contributor

@ViralBShah Did you mean to close this?

@ViralBShah
Member

Yes - thank you.

@nalimilan
Member Author

@ViralBShah Sorry for not replying earlier. Now I have more time to investigate this. The error has moved with 0.4.0. With LLVM 3.7, the build now stops in inference.jl:

 cd base && /builddir/build/BUILD/julia/build/usr/bin/julia -C cortex-a8 --output-ji /builddir/build/BUILD/julia/build/usr/lib/julia/inference0.ji -f coreimg.jl
essentials.jl
reflection.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
operators.jl
pointer.jl
abstractarray.jl
array.jl
hashing.jl
nofloat_hashing.jl
functors.jl
reduce.jl
intset.jl
dict.jl
iterator.jl
inference.jl
error during bootstrap:
UndefVarError(var=:depwarn)
Makefile:176: recipe for target '/builddir/build/BUILD/julia/build/usr/lib/julia/inference0.ji' failed
make[1]: Leaving directory '/builddir/build/BUILD/julia-0.4.0'
Makefile:66: recipe for target 'julia-inference' failed
make[1]: *** [/builddir/build/BUILD/julia/build/usr/lib/julia/inference0.ji] Error 1
make: *** [julia-inference] Error 2

https://kojipkgs.fedoraproject.org//work/tasks/1325/11491325/build.log

I get the same error with JULIA_CPU_TARGET=arm1176jzf-s:
http://koji.fedoraproject.org/koji/watchlogs?taskID=11491415

With LLVM 3.3, it fails even earlier: https://kojipkgs.fedoraproject.org//work/tasks/1327/11491327/build.log

/proc/cpuinfo says:

model name  : ARMv7 Processor rev 0 (v7l)
BogoMIPS    : 2795.11
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part    : 0xc09
CPU revision    : 0
Hardware    : Highbank
Revision    : 0000
Serial      : 0000000000000000

If you can give me access to an ARM machine with a Fedora image I can try to debug this further (though I'm not really the best person to do that -- I can give anybody simple instructions to create a similar build only from a Fedora image and the standard Julia sources).
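For reference, the `CPU part` field in the dump above can be mapped to a core name. This is a minimal sketch covering only a few well-known ARM Ltd. parts (implementer 0x41). The function name `arm_part_name` is hypothetical and the table is far from complete.

```shell
# Map an ARM "CPU part" value from /proc/cpuinfo to a core name.
# Only a handful of ARM Ltd. (implementer 0x41) parts are listed;
# anything else is reported as unknown.
arm_part_name() {
    case "$1" in
        0xb76) echo "ARM1176" ;;
        0xc07) echo "Cortex-A7" ;;
        0xc08) echo "Cortex-A8" ;;
        0xc09) echo "Cortex-A9" ;;
        0xc0f) echo "Cortex-A15" ;;
        *)     echo "unknown" ;;
    esac
}

arm_part_name 0xc09   # the part number reported by the build machine
```

If this table is right, 0xc09 is a Cortex-A9, so `-C cortex-a8` names a different core than the one the build machine reports.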

@nalimilan nalimilan reopened this Oct 17, 2015
@nalimilan
Member Author

I've just got direct SSH access to the same kind of machine. It turns out that building Julia with USE_SYSTEM_LLVM=0 (and still LLVM 3.7.0) gets inference building fine, though it fails later:

primes.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 145: LoadError(at "primes.jl" line 118: InexactError()))

*** This error is usually fixed by running `make clean`. If the error persists, try `make cleanall`. ***
Makefile:230: recipe for target '/home/fedora/nalimilan/julia/usr/lib/julia/sys.o' failed
make[1]: *** [/home/fedora/nalimilan/julia/usr/lib/julia/sys.o] Error 1
make[1]: Leaving directory '/home/fedora/nalimilan/julia'
Makefile:96: recipe for target 'julia-sysimg-release' failed
make: *** [julia-sysimg-release] Error 2

primes.jl:118 is this

const PRIMES = primes(2^16)

Please give me any instructions you can think of to debug this, both with and without USE_SYSTEM_LLVM=1. Could the fact that the build fails earlier when I pass it indicate that Julia relies on LLVM being built with a certain ABI?

@nalimilan
Member Author

I've found a shorter way of reproducing the bug:

    list = Int[]
    sizehint!(list, floor(Int, 2^16 / log(2^16)))

but calling floor(Int, 2^16 / log(2^16)) apparently isn't enough to trigger it.

If I change the call to const PRIMES = primes(7), the build continues until random.jl, where I get a segfault.
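As a sanity check on the arithmetic, the argument passed to sizehint! can be reproduced outside Julia. The sketch below uses awk, which is an assumption on my part since any double-precision calculator would do, and `expected_sizehint` is a hypothetical helper name.

```shell
# Reproduce floor(Int, 2^16 / log(2^16)) with awk's double-precision arithmetic.
# log() is the natural logarithm in awk, as in Julia. int() truncates toward
# zero, which equals floor for this positive value.
expected_sizehint() {
    awk 'BEGIN { printf "%d\n", int(2^16 / log(2^16)) }'
}

expected_sizehint   # prints 5909
```

With correct floating-point results the value is an ordinary small integer, so an InexactError at this point suggests that the float reaching the conversion is not the one the source computes.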

@nalimilan
Member Author

Interesting. If I add x = floor(Int, 2^16 / log(2^16)) to primes.jl and simplify the call to primes to const PRIMES = primes(7), then the InexactError occurs later at dSFMT.jl:12 in const N = floor(Int, ((MEXP - 128) / 104 + 1)).

But that second failure doesn't happen if I do not add x = floor(Int, 2^16 / log(2^16)) to primes.jl. In that case, I get a segfault in random.jl, with the following assertion when calling make julia-debug:

julia-debug: /home/fedora/nalimilan/julia/src/gc.c :420 : find_region:  assertion "maybe && "find_region failed"" failed.

@nalimilan
Member Author

I think I nailed one of the culprits: when building with MEMDEBUG2=1, I get this assertion failure:

 ./flisp/flisp-debug /home/fedora/nalimilan/julia/src/mk_julia_flisp_boot.scm /home/fedora/nalimilan/julia/src/ jlfrontend.scm julia_flisp.boot
flisp-debug: flisp.c :403 : alloc_words:  assertion "(((uptrint_t)(first))&0x7)==0 && "flisp requires malloc to return 8-aligned pointers"" failed.
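The condition in that assertion is a plain mask test on the pointer value. Here is a small sketch of the same predicate in shell arithmetic. The helper name `is_8_aligned` and both addresses are invented for illustration and do not come from the build.

```shell
# Succeeds when the low three bits of the address are zero, i.e. the
# address is a multiple of 8 (the same test as `& 0x7) == 0` in the assertion).
is_8_aligned() {
    [ $(( $1 & 0x7 )) -eq 0 ]
}

# Two made-up addresses: the first is a multiple of 8, the second is not.
for addr in 0x1000 0x1004; do
    if is_8_aligned "$addr"; then
        echo "$addr is 8-aligned"
    else
        echo "$addr is not 8-aligned"
    fi
done
```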

@nalimilan
Member Author

Is there any solution to check memory accesses when building random.jl without running the full bootstrap phase under Valgrind? I feel like it's going to take the whole week... (and all that for possibly a broken trace!)

@ViralBShah
Member

Also, if it builds, the random tests fail.

@ViralBShah
Member

I wonder if malloc does not return 8-aligned pointers by default on arm, and if that could possibly be the issue here. Cc @ihnorton

@ihnorton
Member

> I wonder if malloc does not return 8-aligned pointers by default on arm, and if that could possibly be the issue here.

The CHECK_ALIGN8 assert is independent of MEMDEBUG2, so I think you would see the same error without MEMDEBUG2 if that was the case.

> Please give me any instructions you can think of to debug this

There are some tips for debugging bootstrap errors, here: http://docs.julialang.org/en/release-0.4/devdocs/debuggingtips/#debugging-during-julia-s-build-process-bootstrap

@nalimilan
Member Author

@ihnorton @ViralBShah Thanks. So I've made some progress.

When building using the system LLVM, the "undefined depwarn" error comes from having stored 0 in an IntSet. When I remove the offending call to depwarn, which arises more precisely in pop!, I get an error later about indexing an empty array with 0. Adding a call to jl_breakpoint in pop! in place of the original depwarn call, I get this trace:

#0  jl_breakpoint (v=0x9551c580)
    at /home/fedora/nalimilan/julia2/src/builtins.c:1679
#1  0x94d5635c in julia_pop!_1129 ()
#2  0xb61b5edc in jl_apply (f=0x9551a0f0, args=0xbef95844, nargs=3)
    at /home/fedora/nalimilan/julia2/src/julia.h:1328
#3  0xb61b60bc in jl_apply_unspecialized (meth=0x9551a0e0, args=0xbef95844,
    nargs=3) at /home/fedora/nalimilan/julia2/src/gf.c:32
#4  0xb61bcb58 in jl_apply_generic (F=0x95210560, args=0xbef95844, nargs=3)
    at /home/fedora/nalimilan/julia2/src/gf.c:1683
#5  0x94d5a080 in julia_delete!_1126 ()
#6  0xb61b5edc in jl_apply (f=0x95519ba0, args=0xbef95c5c, nargs=2)
    at /home/fedora/nalimilan/julia2/src/julia.h:1328
#7  0xb61b60bc in jl_apply_unspecialized (meth=0x95519b90, args=0xbef95c5c,
    nargs=2) at /home/fedora/nalimilan/julia2/src/gf.c:32
#8  0xb61bcb58 in jl_apply_generic (F=0x9533a210, args=0xbef95c5c, nargs=2)
    at /home/fedora/nalimilan/julia2/src/gf.c:1683
#9  0x94e63414 in julia_typeinf_uncached_778 ()
#10 0xb61b5edc in jl_apply (f=0x954b8ed0, args=0xbef95e0c, nargs=7)
    at /home/fedora/nalimilan/julia2/src/julia.h:1328
[...]
(gdb) p jl_(v) # I've put the IntSet in v
Core.Inference.IntSet(bits=Array{UInt32, 1}[0x00000002, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000], limit=256, fill1s=false)

There's only one call to delete! in typeinf_uncached, so it looks like at least we know where it happens. What is interesting is that there doesn't seem to be any call to push! inserting a zero value to the set; maybe that's a bug with the initialization? Any ideas about what to do next welcome!
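To read the dumped bits field, it may help to decode it by hand. The sketch below assumes that member n is stored at bit n mod 32 of word n div 32. That is how the 0.4-era IntSet appears to lay out its bits, though it is not confirmed in this thread. `intset_members` is a hypothetical helper.

```shell
# List the members encoded in a bit vector of 32-bit words, given as arguments.
# Assumption: member n lives at bit (n mod 32) of word (n div 32).
intset_members() {
    word_index=0
    for word in "$@"; do
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ $(( ($word >> $bit) & 1 )) -eq 1 ]; then
                echo $(( word_index * 32 + bit ))
            fi
            bit=$(( bit + 1 ))
        done
        word_index=$(( word_index + 1 ))
    done
}

# The first two words of the dump above; the remaining words are all zero.
intset_members 0x00000002 0x00000000
```

Under that assumption the dumped set holds only the element 1, and no 0.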


When building with USE_SYSTEM_LLVM=0 LLVM_VER=3.7.0, after disabling all code that triggers errors, I get to the end of the bootstrap, but then I get this error (a bit approximate, I no longer have it offhand):

ld: error: sys-debug.so uses VFP register arguments, sys-debug.o does not

Under what circumstances wouldn't Julia use the hard float ABI, despite all the system being configured to do that? Can it come from a CPU detection issue in LLVM? I've tried passing MARCH=armv7-a and JULIA_CPU_TARGET=cortex-a8, and it doesn't make any difference. I'm currently rebuilding LLVM with explicit LLVM_FLAGS.
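The linker message refers to the Tag_ABI_VFP_args build attribute, which `readelf -A` prints for ARM objects. Here is a hedged sketch for comparing the two files. The helper name `vfp_args_abi` is invented, and the sample attribute lines are hand-written rather than taken from this build.

```shell
# Classify an ARM ELF file's float calling convention from `readelf -A`
# output read on stdin, so the helper can be tried on any machine.
vfp_args_abi() {
    if grep -q 'Tag_ABI_VFP_args: VFP registers'; then
        echo "hard"
    else
        echo "soft-or-unspecified"
    fi
}

# Intended use on the build machine (file names as in the ld error):
#   readelf -A sys-debug.o  | vfp_args_abi
#   readelf -A sys-debug.so | vfp_args_abi

# Demonstration on hand-written attribute excerpts:
printf '  Tag_CPU_arch: v7\n  Tag_ABI_VFP_args: VFP registers\n' | vfp_args_abi
printf '  Tag_CPU_arch: v7\n' | vfp_args_abi
```

If the object file emitted by Julia reports no VFP argument tag while the system libraries do, the mismatch would be in the code Julia generates rather than in how gcc was invoked.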

@nalimilan
Member Author

Unfortunately, I get the same ld error about VFP when building LLVM and Julia using LLVM_FLAGS="--with-cpu=cortex-a8 --with-tune=cortex-a8 --with-arch=armv7-a --with-float=hard --with-abi=aapcs-linux --with-fpu=vfpv3-d16". So the issue is deeper than that.

@nalimilan
Member Author

@ihnorton Most of those memory alignment assertions are only enabled when MEMDEBUG2=1. This situation also happens on i686 (#10942). It doesn't seem to be a real issue, as it only happens with MEMDEBUG2=1.

Regarding the ld error about VFP, I've checked by calling sys::getHostCPUFeatures(), that neon, fp16 and vfp3 are true. So it looks like CPU detection is not the issue here.

@nalimilan
Member Author

The Valgrind run stopped for an unknown reason (probably OOM) after days, before reaching random.jl. Pasting the results here just in case, but they don't seem to contain anything interesting. Looks like I'll need another approach, for now I'll just skip random.jl.

$ cd /home/fedora/nalimilan/julia/base && valgrind --smc-check=all-non-file --suppressions=../contrib/valgrind-julia.supp /home/fedora/nalimilan/julia/usr/bin/julia-debug -C cortex-a8 --output-o /home/fedora/nalimilan/julia/usr/lib/julia/sys-debug.o  -f -J /home/fedora/nalimilan/julia/usr/lib/julia/inference.ji sysimg.jl
==4009== Memcheck, a memory error detector
==4009== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==4009== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==4009== Command: /home/fedora/nalimilan/julia/usr/bin/julia-debug -C cortex-a8 --output-o /home/fedora/nalimilan/julia/usr/lib/julia/sys-debug.o -f -J /home/fedora/nalimilan/julia/usr/lib/julia/inference.ji sysimg.jl
==4009==
==4009== Invalid read of size 4
==4009==    at 0x401A54C: strcpy (in /usr/lib/ld-2.21.so)
==4009==  Address 0x7a5aa00 is 32 bytes inside a block of size 35 alloc'd
==4009==    at 0x4845AE0: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==4009==
==4009== Invalid read of size 4
==4009==    at 0x401A480: strcpy (in /usr/lib/ld-2.21.so)
==4009==  Address 0x7a5ab3c is 28 bytes inside a block of size 29 alloc'd
==4009==    at 0x4845AE0: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==4009==
==4009== Invalid read of size 4
==4009==    at 0x401A5CC: strcpy (in /usr/lib/ld-2.21.so)
==4009==  Address 0x7e6ca2c is 0 bytes after a block of size 36 alloc'd
==4009==    at 0x4845AE0: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==4009==
essentials.jl
[...]
env.jl
==4009== Conditional jump or move depends on uninitialised value(s)
==4009==    at 0x401A0E4: index (in /usr/lib/ld-2.21.so)
==4009==
==4009== Conditional jump or move depends on uninitialised value(s)
==4009==    at 0x401A0E8: index (in /usr/lib/ld-2.21.so)
==4009==
==4009== Conditional jump or move depends on uninitialised value(s)
==4009==    at 0x4014AD4: dl_open_worker (in /usr/lib/ld-2.21.so)
==4009==
==4009== Conditional jump or move depends on uninitialised value(s)
==4009==    at 0x4008A90: _dl_map_object (in /usr/lib/ld-2.21.so)
==4009==    by 0x4014BDB: dl_open_worker (in /usr/lib/ld-2.21.so)
==4009==
[...]
uv_constants.jl
Killed

@nalimilan
Member Author

I've just tried with latest master, and I still get the same error with USE_SYSTEM_LLVM=0:

primes.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 168: LoadError(at "primes.jl" line 96: InexactError()))

With USE_SYSTEM_LLVM=1, the build fails earlier at inference.jl:

error during bootstrap:
UndefVarError(var=:ArgumentError)

The gdb backtrace when breaking at jl_throw isn't particularly interesting:

(gdb) ba
#0  jl_throw (e=0x10e0960)
    at /home/fedora/nalimilan/rpmbuild/BUILD/julia/src/task.c:865
#1  0xb6e15dcc in jl_undefined_var_error (var=var@entry=0x123664)
    at /home/fedora/nalimilan/rpmbuild/BUILD/julia/src/builtins.c:122
#2  0xb6e1e4ac in jl_get_binding_or_error (m=<optimized out>, var=0x123664)
    at /home/fedora/nalimilan/rpmbuild/BUILD/julia/src/module.c:247
#3  0xb3d131e0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Here's what Valgrind (with -DMEMDEBUG) gives. Does it indicate a real bug?

$ JULIA_CPU_TARGET=cortex-a8 valgrind --tool=memcheck --smc-check=all-non-file --trace-children=yes --suppressions=../contrib/valgrind-julia.supp /home/fedora/nalimilan/rpmbuild/BUILD/julia/build/usr/bin/julia -C cortex-a8 --output-ji /home/fedora/nalimilan/rpmbuild/BUILD/julia/build/usr/lib/julia/inference0.ji -f coreimg.jl
==7566== Memcheck, a memory error detector
==7566== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7566== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==7566== Command: /home/fedora/nalimilan/rpmbuild/BUILD/julia/build/usr/bin/julia -C cortex-a8 --output-ji /home/fedora/nalimilan/rpmbuild/BUILD/julia/build/usr/lib/julia/inference0.ji -f coreimg.jl
==7566== 
==7566== Conditional jump or move depends on uninitialised value(s)
==7566==    at 0x401ABBC: strcpy (in /usr/lib/ld-2.22.so)
==7566== 
==7566== Invalid read of size 1
==7566==    at 0x492E09C: JuliaGCAllocator::allocate_frame() (llvm-gcroot.cpp:619)
==7566==    by 0x492FB8F: jl_codegen_finalize_temp_arg(llvm::CallInst*, llvm::Type*) (llvm-gcroot.cpp:756)
==7566==    by 0x490DADB: finalize_gc_frame(llvm::Function*) [clone .constprop.1255] (codegen.cpp:3664)
==7566==    by 0x490DC47: finalize_gc_frame(llvm::Module*) (codegen.cpp:3706)
==7566==    by 0x490DDAF: jl_finalize_module(llvm::Module*) (codegen.cpp:987)
==7566==    by 0x490F157: jl_generate_fptr (codegen.cpp:1095)
==7566==    by 0x48D0013: jl_call_method_internal (julia_internal.h:66)
==7566==    by 0x48D0013: jl_toplevel_eval_flex.part.7 (toplevel.c:556)
==7566==    by 0x48AAE5F: jl_parse_eval_all (ast.c:784)
==7566==    by 0x48D09AB: jl_load (toplevel.c:583)
==7566==    by 0x48BF373: _julia_init (init.c:644)
==7566==    by 0x48C05FF: julia_init (task.c:275)
==7566==    by 0x111DB: main (repl.c:655)
==7566==  Address 0x8435ec4 is 36 bytes inside a block of size 68 free'd
==7566==    at 0x48480EC: operator delete(void*) (vg_replace_malloc.c:576)
==7566==    by 0x5360C53: llvm::CallInst::~CallInst() (in /usr/lib/llvm37/libLLVM-3.7.so)
==7566==    by 0x535EC5B: llvm::Instruction::eraseFromParent() (in /usr/lib/llvm37/libLLVM-3.7.so)
==7566==    by 0x492E09B: JuliaGCAllocator::allocate_frame() (llvm-gcroot.cpp:619)
==7566==    by 0x492FB8F: jl_codegen_finalize_temp_arg(llvm::CallInst*, llvm::Type*) (llvm-gcroot.cpp:756)
==7566==    by 0x490DADB: finalize_gc_frame(llvm::Function*) [clone .constprop.1255] (codegen.cpp:3664)
==7566==    by 0x490DC47: finalize_gc_frame(llvm::Module*) (codegen.cpp:3706)
==7566==    by 0x490DDAF: jl_finalize_module(llvm::Module*) (codegen.cpp:987)
==7566==    by 0x490F157: jl_generate_fptr (codegen.cpp:1095)
==7566==    by 0x48D0013: jl_call_method_internal (julia_internal.h:66)
==7566==    by 0x48D0013: jl_toplevel_eval_flex.part.7 (toplevel.c:556)
==7566==    by 0x48AAE5F: jl_parse_eval_all (ast.c:784)
==7566==    by 0x48D09AB: jl_load (toplevel.c:583)
==7566==  Block was alloc'd at
==7566==    at 0x4846D18: operator new(unsigned int) (vg_replace_malloc.c:328)
==7566==    by 0x53A2FFF: llvm::User::operator new(unsigned int, unsigned int) (in /usr/lib/llvm37/libLLVM-3.7.so)
==7566==    by 0x536D84F: llvm::CallInst::cloneImpl() const (in /usr/lib/llvm37/libLLVM-3.7.so)
==7566== 
essentials.jl
generator.jl
reflection.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
operators.jl
pointer.jl
abstractarray.jl
array.jl
hashing.jl
nofloat_hashing.jl
functors.jl
reduce.jl
intset.jl
dict.jl
iterator.jl
inference.jl
error during bootstrap:
UndefVarError(var=:ArgumentError)

==7566== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==7566==    at 0x4BC2B50: syscall (in /usr/lib/libc-2.22.so)
==7566==    by 0x495D2E7: uv__epoll_ctl (linux-syscalls.c:300)
==7566==    by 0xBDF58DB7: ???
==7566==  Address 0xbdf58dbc is on thread 1's stack
==7566== 
==7566== 
==7566== HEAP SUMMARY:
==7566==     in use at exit: 22,007,613 bytes in 195,257 blocks
==7566==   total heap usage: 1,393,098 allocs, 1,197,841 frees, 445,408,972 bytes allocated
==7566== 
==7566== LEAK SUMMARY:
==7566==    definitely lost: 12,032 bytes in 47 blocks
==7566==    indirectly lost: 0 bytes in 0 blocks
==7566==      possibly lost: 1,567,832 bytes in 26,544 blocks
==7566==    still reachable: 20,427,749 bytes in 168,666 blocks
==7566==                       of which reachable via heuristic:
==7566==                         newarray           : 36 bytes in 1 blocks
==7566==                         multipleinheritance: 768 bytes in 3 blocks
==7566==         suppressed: 0 bytes in 0 blocks
==7566== Rerun with --leak-check=full to see details of leaked memory
==7566== 
==7566== For counts of detected and suppressed errors, rerun with: -v
==7566== Use --track-origins=yes to see where uninitialised values come from
==7566== ERROR SUMMARY: 518 errors from 3 contexts (suppressed: 0 from 0)

@yuyichao
Contributor

Have you checked if this is a stackoverflow?

@nalimilan
Member Author

Adding -fstack-check and -fstack-protector-all does not make any difference (if that was your question).

@yuyichao
Contributor

No, I mean ulimit -s

@nalimilan
Member Author

Ah. No, ulimit -s 100000 doesn't fix it either.

@yuyichao
Contributor

Yeah, assuming this is the same system as #13752 (comment), I wouldn't be too surprised if the ABI issue could mess up bootstrap...

@waTeim
Contributor

waTeim commented Apr 2, 2016

In addition to the failures indicated above in primes.jl and random.jl (segfault), the following happens. Both random.jl and dSFMT.jl are left out of sysimg.jl in order to proceed. Various other files are modified to remove any references to functions provided by Random, including sparse/sparsematrix.jl, sparse/sparsevector.jl, deprecated.jl and precompile.jl.

dSFMT.jl
error during bootstrap: 
LoadError(at "sysimg.jl" line 216: LoadError(at "dSFMT.jl" line 12: InexactError()))

further on...

error during bootstrap:
LoadError(at "sysimg.jl" line 281: LoadError(at "irrationals.jl" line 95: Base.AssertionError(msg="Float64(π) == Float64(big(π))")))

caused by REPL documentation

error during bootstrap:
LoadError(at "sysimg.jl" line 353: LoadError(at "/home/jeffw/src/julia/base/precompile.jl" line 494: InexactError()))

and finally, the full message

/home/jeffw/src/julia/base/precompile.jl
    LINK usr/lib/julia/sys.so
/usr/bin/ld: error: /home/jeffw/src/julia/usr/lib/julia/sys.so uses VFP register arguments, /home/jeffw/src/julia/usr/lib/julia/sys.o does not
/usr/bin/ld: failed to merge target specific data of file /home/jeffw/src/julia/usr/lib/julia/sys.o
collect2: error: ld returned 1 exit status

So sys.so both does and does not use VFP?

processor   : 0
model name  : ARMv7 Processor rev 1 (v7l)
BogoMIPS    : 125.00
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 1

Hardware    : Generic DT based system
Revision    : 0000
Serial      : 0000000000000000

@ViralBShah
Member

Is this still an issue?

@nalimilan
Member Author

Yes...

@ViralBShah
Member

Probably should open a new issue (since so much has changed) if still a problem on the same machine.


8 participants