Ray issue on Odroid-XU4 board #1008

akzare · 2017-09-24T03:38:38Z

I've built Ray on an Odroid-XU4 board (http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825). When I try to run a simple application on it, Ray reports the following issue:

The attached Ray_Issue_XU4.log contains the Ray log.
Ray_Issue_XU4.log

robertnishihara · 2017-09-24T17:59:21Z

I'm a little surprised to see this error

/ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4

since that value has never changed from 0, see https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.h#L34

It's possible that we're doing the arithmetic incorrectly somewhere in this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L94-L109 and this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L63-L69.

E.g., maybe one of the types has the wrong size or something or there is a mismatch between the two blocks.
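
To illustrate the kind of mismatch suspected here, the following is a minimal Python sketch, not the actual Plasma code. It assumes a header of three 8-byte integers (version, type, length) followed by the payload; the function names and the 4-byte reader are hypothetical and only stand in for "one side uses a narrower type than the other".

```python
import io
import struct

PROTOCOL_VERSION = 0  # the thread notes this constant has always been 0


def write_message(buf, msg_type, payload):
    # Writer: three 8-byte little-endian signed integers, then the payload.
    buf.write(struct.pack("<qqq", PROTOCOL_VERSION, msg_type, len(payload)))
    buf.write(payload)


def read_message(buf, length_format):
    # Reader: version and type are read as 8-byte integers; the length field
    # is read with `length_format`, "<q" (8 bytes) or "<I" (4 bytes).
    version, msg_type = struct.unpack("<qq", buf.read(16))
    size = struct.calcsize(length_format)
    (length,) = struct.unpack(length_format, buf.read(size))
    payload = buf.read(length)
    return version, msg_type, payload


def versions_seen(length_format):
    # Write two messages back to back, then read them with the given reader
    # and report the version field that each read observed.
    buf = io.BytesIO()
    write_message(buf, 4, b"abcd")
    write_message(buf, 7, b"xy")
    buf.seek(0)
    first = read_message(buf, length_format)
    second = read_message(buf, length_format)
    return first[0], second[0]
```

With a matching 8-byte reader both messages report version 0. With the 4-byte reader the first message still looks fine, but the stream is then misaligned and the second "version" is read from unrelated bytes, which is the sort of symptom a failed version check would show.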

robertnishihara · 2017-09-24T18:02:19Z

Another thing to verify is that you can start the plasma store by hand without any trouble. In your case probably

/usr/local/lib/python2.7/dist-packages/ray-0.2.0-py2.7-linux-armv7l.egg/ray/plasma/../core/src/plasma/plasma_store -s /tmp/s1 -m 1000000

If that works, then try connecting a plasma manager. E.g., check out the instructions in this comment #108 (comment).
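
One quick way to confirm that a manually started store is actually listening is to try connecting to its socket. The sketch below is a generic check, assuming the store listens on a Unix stream socket at the path passed with -s; the helper name is hypothetical and it does not speak the Plasma protocol.

```python
import socket


def unix_socket_accepts(path, timeout=1.0):
    """Return True if something accepts connections on the Unix socket at `path`."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and a timeout.
        return False
    finally:
        client.close()
```

For example, `unix_socket_accepts("/tmp/s1")` could be run in a second terminal after starting the store with the command above.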

arvindc95 · 2017-10-04T13:55:30Z

I have the same issue on a different platform (Ubuntu 16.04 VM running on Windows 7). I followed the instructions for connecting a plasma manager, and was able to start a plasma store, but when I tried to start a plasma manager, I received a /ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4 error thrown from the plasma store, and /ray/src/plasma/plasma_manager.cc483 Check failed: _s.ok() Bad status: IOError: Broken pipe thrown from the plasma manager. Any advice on how to proceed?

robertnishihara · 2017-10-05T05:26:22Z

@arvindc95 @akzare could you try cherry-picking this commit apache/arrow#1172, recompiling Arrow, and see if it fixes the problem? I just looked through the code in that file and spotted that potential bug.

Let me know if you have questions about how to do this.

If that doesn't work, then I think we'll just need to add a lot of print statements (e.g., in this function https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L90) and print the actual bytes that are being sent and see if we can infer anything from that.
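
When comparing dumped bytes from both sides, it can help to group them in 4-byte units so that 32-bit and 64-bit field boundaries are easy to spot. The helper below is a hypothetical Python sketch for inspecting bytes that have already been captured (for example, copied out of print statements); it is not part of Ray or Arrow.

```python
def hex_groups(raw, width=4):
    """Split `raw` bytes into hex strings of `width` bytes each."""
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]
```

An 8-byte little-endian field then shows up as two consecutive groups, so a reader that consumes only one group for that field is easy to identify.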

arvindc95 · 2017-10-05T18:13:44Z

@robertnishihara thanks for the help, that commit helped me get ray initialized; I'm able to put and get objects from the plasma store, and use the remote function when there's nothing to be parallelized, but when I try running the time.sleep example in the documentation (http://ray.readthedocs.io/en/latest/tutorial.html#remote-functions), I get a segmentation fault thrown from the local scheduler. Do you have any ideas how I can debug this? Are there log files generated by the scheduler?

robertnishihara · 2017-10-05T18:37:18Z

Glad to hear it, and thanks for trying it out! Sounds like there's a bug in the local scheduler (perhaps similar to the previous bug).

You're rebuilding all of Ray, right? Because the local scheduler also communicates with the plasma store, so it probably needs the same fix from apache/arrow#1172.

Some processes log to /tmp/raylogs, so it's worth looking at the most recent files in there to see if anything turns up. However, if you're starting Ray with ray.init(), the local scheduler's STDERR/STDOUT will just go to the terminal.
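
A small sketch for finding the most recently written log files, assuming the logs live directly in a single directory such as /tmp/raylogs; the helper name is hypothetical.

```python
import os


def newest_files(directory, count=5):
    """Return up to `count` file paths in `directory`, most recently modified first."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    files.sort(key=os.path.getmtime, reverse=True)
    return files[:count]
```

Calling `newest_files("/tmp/raylogs")` right after a crash lists the files that are worth opening first.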

What I would suggest is trying to run the same workload that is causing the crash, but to start the local scheduler in gdb. To do that, you could do something like the following.

First modify

ray/python/ray/local_scheduler/local_scheduler_services.py

Line 122 in aebe9f9

pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)

to be something like
```
import IPython
IPython.embed()
# pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
pid = 9999
```
Then start Python and do import ray and ray.init(). This will open up IPython when it tries to start the local scheduler. Run print(command) in the IPython shell to print the command that Ray wants to use to start the local scheduler.
Then go to a different terminal window, and do
```
gdb ray/python/ray/core/src/local_scheduler/local_scheduler
```
Then do run followed by the command printed by print(command) to start the local scheduler in gdb. However, you'll need to drop the initial executable from the command, AND you'll need to add quotes around the full argument to the -w flag, which is pretty long. Otherwise you'll get an error saying unknown flag or something like that.
Then go back to the IPython shell and do exit()
Then run your workload and see what errors are caught in gdb.
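
The manual rewriting of the command (dropping the executable and quoting the -w argument) can also be done in the IPython shell. This is a minimal sketch assuming `command` is a list of strings, as passed to subprocess.Popen; the function name is hypothetical, and the example arguments in the usage note are made up for illustration.

```python
import shlex


def gdb_run_line(command):
    """Build the line to type at the gdb prompt from a Popen-style argument list."""
    # Drop the executable (gdb was already started with it) and quote every
    # remaining argument so that arguments containing spaces stay intact.
    args = [shlex.quote(arg) for arg in command[1:]]
    return "run " + " ".join(args)
```

For instance, `gdb_run_line(["local_scheduler", "-s", "/tmp/sched", "-w", "python worker.py"])` returns a run line in which only the -w argument is wrapped in quotes. On Python 2.7, as used earlier in this thread, `pipes.quote` plays the same role as `shlex.quote`.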

Note that if the error is uninformative, we may need to recompile Ray with more debug information. E.g., maybe add a -g to the line

ray/src/common/CMakeLists.txt

Line 9 in aebe9f9

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")

arvindc95 · 2017-10-10T16:17:18Z

@robertnishihara I tried the steps you outlined for using gdb, but when I tried to run my workload I kept getting an exception when defining a function with the @ray.remote decorator; I've attached the error thrown:
decorator_error.txt

Also, when making the fix you referenced, I made the code change in the arrow code and then reran python setup.py install in order to rebuild Ray. Let me know if this procedure is incomplete for rebuilding Ray (I also ran this after changing local_scheduler_services.py because the IPython shell wasn't showing up)

Thanks again for your help!

robertnishihara · 2017-10-10T22:29:28Z

@arvindc95 interesting, that seems like the same error as #394.

You could try using IPython instead of Python, since #394 was only an issue in the regular Python interpreter.

It's also possible that when you reran python setup.py install, it undid your changes to Arrow. Can you check that your changes were unaffected? Or perhaps comment out this line

ray/src/thirdparty/download_thirdparty.sh

Line 16 in b1660c4

git checkout 988338c544580ffd367a5540f1061dd7b0fccc0e

Also, instead of using python setup.py install, I'd suggest using python setup.py develop because that way whenever you change the Python code, you won't need to rerun setup.py, the changes will automatically be used.

arvindc95 · 2017-10-11T20:12:38Z

@robertnishihara Using IPython helped; my workload runs successfully, but the debugger throws the following error immediately after the workload completes:

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.

Would this segfault be from local_scheduler_services.py or any of the functions it calls?

robertnishihara · 2017-10-11T22:10:57Z

If you do bt in gdb, does that print anything?

This error looks similar https://groups.google.com/forum/#!topic/jansson-users/u78eGC15itw.

cc @atumanov

arvindc95 · 2017-10-12T15:45:25Z

Yes, here's the output:

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
(gdb) bt
#0 __strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
#1 0x080965ce in redisvFormatCommand (target=0xbfffe8b8,
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe928 "") at hiredis.c:262
#2 0x0809b91c in redisvAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe920 "\340\316\f\b\036")
at async.c:654
#3 0x0809b99c in redisAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
format=0x809fd45 "ZADD %b %s %b") at async.c:669
#4 0x0806c621 in RayLogger_log_event (db=0x80c7de0,
key=0x80ccee0 "event_log:\213\313\363\265O\312\300Է#\206Ɍ\234\274\301\332T\222\004", key_length=30,
value=0x80cc8e8 "[[1507822835.414047, "ray:get_task", 1, {}], [1507822914.265168, "ray:import_function_to_run", 1, {}], [1507822914.265763, "ray:import_function_to_run", 2, {}], [1507822914.266116, "ray:import_functio"...,
value_length=1520, timestamp=1507822920.4063809)
at /home/achand/ray/src/common/logging.cc:100
#5 0x08056c35 in process_message(aeEventLoop*, int, void*, int) ()
#6 0x0807bcbd in aeProcessEvents (eventLoop=0x80bea38, flags=3)
at /home/achand/ray/src/common/thirdparty/ae/ae.c:412
#7 0x0807c19b in aeMain (eventLoop=0x80bea38)
at /home/achand/ray/src/common/thirdparty/ae/ae.c:455
#8 0x0805f8f8 in event_loop_run (loop=0x80bea38)
at /home/achand/ray/src/common/event_loop.cc:58

pcmoritz · 2017-10-13T18:23:24Z

It looks like it is using SSE2 instructions which probably aren't available on ARM. Could it be that there is some issue with the (cross-)compilation?

arvindc95 · 2017-10-13T20:34:11Z

@pcmoritz I checked the instruction sets supported in the VM guest and SSE2 is one of them (it's supported in the VM host as well)

pcmoritz · 2017-10-13T21:00:45Z

@arvindc95 I created a PR here: #1122. Could you try both of the commits in the PR and see if one of them makes it work? Both fix potential problems here. Thanks!

robertnishihara · 2017-10-13T21:05:19Z

In particular, we'd be interested in knowing which of the two commits fixes it (assuming one of them does in fact fix it).

arvindc95 · 2017-10-13T21:58:16Z

seg_fault_fix_logging.txt
seg_fault_add_casts.txt

Both failed the same way as before; the segfault happened after the results of foo.remote() were returned. I made the logging code change, ran python setup.py develop, then tried running a workload, and repeated the process for the static cast addition as well. The gdb logs show the updated logging code change, and the lines referenced are slightly different between the two logs, so I think the changes were compiled; let me know if I missed anything.

arvindc95 · 2017-10-13T22:04:05Z

Also, I've been manually starting ray because when I don't, the plasma store never initializes. I changed the socket name from /tmp/s1 to /tmp/s2 in case the same socket was being reused every time I manually started the store, but the store was still being initialized, so I'm not sure why it doesn't get made when I don't manually start the store.

pcmoritz · 2017-10-13T22:14:11Z

Hm, thanks for trying it out. Is there any chance you can share your VirtualBox image together with instructions to reproduce the problem with us or an EC2 AMI if you have one so we can dig deeper into this?

atumanov · 2017-10-14T08:10:14Z

I was able to reproduce this on 32-bit Ubuntu 16.04 and fix it. I put together a quick PR that fixes it for me. Could you please try out #1126? Thanks.

arvindc95 · 2017-10-16T15:16:45Z

@atumanov The changes from your PR worked, thanks so much!
Also, thanks to @pcmoritz and @robertnishihara for your help resolving this as well! Would you still like me to post my VirtualBox image?

atumanov · 2017-10-16T16:12:23Z

@arvindc95, awesome, glad to hear it! The VirtualBox image will be helpful for testing, in case we need to reproduce any other problems you encounter. If you are in a position to provide us with the ODROID platform for testing purposes as well, even better :)

robertnishihara · 2018-02-02T07:53:12Z

Closing for now since a lot of things have changed.

robertnishihara mentioned this issue Oct 5, 2017

Turn compiler warnings into errors. #1087

Closed

atumanov mentioned this issue Oct 14, 2017

Make Ray work on 32 bit Linux #1127

Closed

robertnishihara closed this as completed Feb 2, 2018
