Issue in serializing RandomDevice objects #16451

Closed
PythonNut opened this issue May 19, 2016 · 4 comments

Comments

@PythonNut
Contributor

PythonNut commented May 19, 2016

The following incantation may produce segmentation faults:

# start with julia -p 1
rng = RandomDevice()
rand(rng)
@parallel (+) for _ in 1:1 rand(rng) end

I've reproduced this so far on three systems:

Main development laptop

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.3

Output

signal (11): Segmentation fault
jl_ios_get_nbyte_int at /usr/bin/../lib/julia/libjulia.so (unknown line)
rand at random.jl:229
jlcall_rand_21321 at  (unknown line)
jl_apply_generic at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at none:1
jl_f_apply at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:923
run_work_thunk at multi.jl:661
jlcall_run_work_thunk_21277 at  (unknown line)
jl_apply_generic at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:923
unknown function (ip: 0x7ff5ed22c743)
unknown function (ip: (nil))
Worker 2 terminated.
ERROR: ProcessExitedException()
 in preduce at multi.jl:1533
 [inlined code] from multi.jl:1542
 in anonymous at expr.jl:113
ERROR (unhandled task failure): EOFError: read end of file

Crusty old laptop

Julia Version 0.4.5
Commit 2ac304d* (2016-03-18 00:58 UTC)
Platform Info:
  System: Linux (i686-redhat-linux)
  CPU: Genuine Intel(R) CPU           T2500  @ 2.00GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Banias)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Output

signal (11): Segmentation fault
jl_ios_get_nbyte_int at /usr/bin/../lib/julia/libjulia.so (unknown line)
rand at random.jl:229
jlcall_rand_21130 at  (unknown line)
jl_trampoline at /usr/bin/../lib/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at none:1
jl_trampoline at /usr/bin/../lib/julia/libjulia.so (unknown line)
jl_f_apply at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:923
jl_trampoline at /usr/bin/../lib/julia/libjulia.so (unknown line)
run_work_thunk at multi.jl:661
jlcall_run_work_thunk_21088 at  (unknown line)
jl_apply_generic at /usr/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:923
jl_trampoline at /usr/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0xb76254ef)
Worker 2 terminated.
ERROR: ProcessExitedException()
 in preduce at multi.jl:1533
 [inlined code] from multi.jl:1542
 in anonymous at expr.jl:113
ERROR (unhandled task failure): EOFError: read end of file

Raspberry Pi 2

Julia Version 0.5.0-dev+4124
Commit a717fbe (2016-05-16 22:35 UTC)
Platform Info:
  System: Linux (arm-linux-gnueabihf)
  CPU: ARMv7 Processor rev 5 (v7l)
  WORD_SIZE: 32
  BLAS: libopenblas (NO_AFFINITY ARMV7)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.7.1 (ORCJIT, generic)

Output

signal (11): Segmentation fault
while loading no file, in expression starting on line 0
Allocations: 2284736 (Pool: 2283773; Big: 963); GC: 7
Worker 2 terminated.
ERROR: ProcessExitedException()
ERROR (unhandled task failure): EOFError: read end of file

Other notes

If you omit the initial call to rand(rng), the following error is produced instead of the segmentation fault:

ERROR: On worker 2:
EOFError: read end of file
 in rand at random.jl:229
 [inlined code] from none:1
 in anonymous at no file:0
 in anonymous at multi.jl:923
 in run_work_thunk at multi.jl:661
 [inlined code] from multi.jl:923
 in anonymous at task.jl:63
 in preduce at multi.jl:1533
 [inlined code] from multi.jl:1542
 in anonymous at expr.jl:113

Also, if @everywhere rng = RandomDevice() is used instead, everything happens as expected.
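
The same idea can be sketched for a current Julia release. This assumes Julia 1.x, where `@parallel` has been renamed `@distributed` and lives in the Distributed standard library, and where `RandomDevice` is exported from Random; none of that applies to the 0.4/0.5 sessions shown in this issue. Rather than defining `rng` once on the master, each iteration opens the device on the worker that runs it, so no `RandomDevice` has to cross a process boundary:

```julia
using Distributed, Random

addprocs(1)                 # comparable to starting with `julia -p 1`
@everywhere using Random    # make RandomDevice visible on every process

# The loop body runs on a worker, so RandomDevice() is constructed there
# and nothing holding an open file handle is serialized from the master.
total = @distributed (+) for _ in 1:4
    rand(RandomDevice())
end

println(total)              # a Float64 in [0, 4); the value varies per run
```

Opening the device on every iteration is wasteful for long loops; it is used here only to keep the sketch short.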

I realize that omitting the @everywhere is probably bad form, although it hasn't caused trouble for me until now (I previously thought it only applied to functions).

# this works fine
a = 1
@parallel for _ in 1:1 a end

I understand that this is most likely an issue of telling the user where they've messed up, not making the broken code run.

Also, I apologize for the long post. Do I need to shorten it?

@amitmurthy
Contributor

The "not working" part is expected. rng is not available on the workers, hence the errors. The @everywhere defines rng in global scope on every process, so it works as expected.

On the current master, while it does not error out or segfault, the return value is wrong (it returns a 0).

Changing issue description to reflect the correct issue.

@amitmurthy changed the title from "RandomDevice + @parallel - @everywhere = Segmentation Fault" to "@parallel does not catch and propagate remote exceptions" on May 20, 2016
@PythonNut
Contributor Author

PythonNut commented May 20, 2016

@amitmurthy thanks. That's largely what I expected. However, I am a bit confused: if @everywhere is required for a variable to be defined on all workers, why does the following work?

# this works fine
a = 1
@assert 10000 == @parallel (+) for _ in 1:10000 a end

@amitmurthy
Contributor

a is copied into the closure that is serialized to and executed on the workers. I suspect RandomDevice() does not lend itself to a safe serialize and deserialize.

In my local testing, I sometimes see a silent failure (no exception, but the returned value is wrong) and sometimes an error due to a faulty deserialization of rng.
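
The case that works can be imitated in a single process. The sketch below assumes a current Julia with the Serialization standard library; `roundtrip` and `capture` are hypothetical helper names, and an in-memory buffer stands in for the transfer from the master to a worker. A closure that captures a plain integer carries that value with it and comes back intact:

```julia
using Serialization

# Serialize to an in-memory buffer and read the value straight back,
# standing in for the master -> worker hop.
function roundtrip(x)
    buf = IOBuffer()
    serialize(buf, x)
    seekstart(buf)
    return deserialize(buf)
end

capture(a) = () -> a          # closure that captures the value of `a`

f = roundtrip(capture(1))
println(f())                  # the captured integer travels with the closure
```

An object that wraps an open stream has no such self-contained value to copy, which is consistent with the suspicion above about RandomDevice.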

@tkelman added the "parallelism" (Parallel or distributed computation) label on May 20, 2016
@amitmurthy changed the title from "@parallel does not catch and propagate remote exceptions" to "Issue in serializing RandomDevice objects" on May 30, 2016
@amitmurthy removed the "parallelism" (Parallel or distributed computation) label on May 30, 2016
@amitmurthy
Contributor

The issue here is that RandomDevice objects serialize/deserialize differently depending on whether they have been accessed once or not.

Serializing a freshly created RandomDevice() throws an exception, as expected:

julia> rng = RandomDevice()
RandomDevice(IOStream(<file /dev/urandom>))

julia> remotecall_fetch(rand, 2, rng)
ERROR: On worker 2:
EOFError: read end of file
 in read at ./iostream.jl:160
 in rand at ./random.jl:54 [inlined]
 in rand at ./random.jl:57 [inlined]
 in rand at ./random.jl:220
 in #306 at ./multi.jl:1062
 in run_work_thunk at ./multi.jl:769
 in macro expansion at ./multi.jl:1062 [inlined]

However, one that has already been used to generate a random number just returns 0 after deserialization:

julia> rng = RandomDevice()
RandomDevice(IOStream(<file /dev/urandom>))

julia> rand(rng)
0.7916044177391832

julia> remotecall_fetch(rand, 2, rng)
0.0

Will submit a PR.
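
One way a type that wraps an open stream could be made safe to ship between processes is to send only what is needed to reopen the stream, and to reopen it on arrival. The following is a sketch under that assumption, written against the Serialization standard library of a current Julia; `ReopenableDevice` is a hypothetical type invented for illustration, and this is not necessarily what the PR does:

```julia
using Serialization

# Hypothetical wrapper: an open stream plus the path needed to reopen it.
struct ReopenableDevice
    path::String
    io::IOStream
end
ReopenableDevice(path::AbstractString) =
    ReopenableDevice(String(path), open(path, "r"))

# Write only the type tag and the path; the file handle itself is
# meaningless outside the process that opened it.
function Serialization.serialize(s::Serialization.AbstractSerializer,
                                 d::ReopenableDevice)
    Serialization.serialize_type(s, ReopenableDevice)
    serialize(s, d.path)
end

# Rebuild the object by opening the path again on the receiving side.
function Serialization.deserialize(s::Serialization.AbstractSerializer,
                                   ::Type{ReopenableDevice})
    path = deserialize(s)
    return ReopenableDevice(path)
end

# Round trip through a buffer (assumes /dev/urandom exists, as on Linux).
buf = IOBuffer()
serialize(buf, ReopenableDevice("/dev/urandom"))
seekstart(buf)
restored = deserialize(buf)
byte = read(restored.io, UInt8)   # read from the freshly opened stream
```

With this shape, a used and an unused device serialize identically, because no stream state is ever written.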
