Ray issue on Odroid-XU4 board #1008

Closed

akzare opened this issue Sep 24, 2017 · 22 comments

Comments

@akzare

akzare commented Sep 24, 2017

I've built Ray on the Odroid-XU4 board (http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825). When I try to run a simple application on it, Ray reports the following issue:

Attached Ray_Issue_XU4.log represents the Ray log.
Ray_Issue_XU4.log

@robertnishihara
Collaborator

I'm a little surprised to see this error

/ray/src/thirdparty/arrow/cpp/src/plasma/io.cc:98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4

since that value has never changed from 0, see https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.h#L34

It's possible that we're doing the arithmetic incorrectly somewhere in this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L94-L109 and this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L63-L69.

E.g., maybe one of the types has the wrong size, or there is a mismatch between the two blocks.
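To make that concrete, here is a minimal Python sketch (illustrative only, not the actual Arrow/plasma code) of how a disagreement about a header field's width between the writer and the reader desynchronises the stream, so that a later "version" read lands on unrelated bytes:

import io
import struct

PLASMA_PROTOCOL_VERSION = 0

def write_message(stream, msg_type, payload):
    # Header: version, type, length, each as a little-endian 64-bit integer, then the payload.
    stream.write(struct.pack("<qqq", PLASMA_PROTOCOL_VERSION, msg_type, len(payload)))
    stream.write(payload)

def read_message_broken(stream):
    # Broken reader: assumes 32-bit header fields, so it consumes only 12 of the
    # 24 header bytes and then reads the wrong number of payload bytes.
    version, msg_type, length = struct.unpack("<iii", stream.read(12))
    payload = stream.read(length)
    return version, msg_type, payload

buf = io.BytesIO()
write_message(buf, msg_type=1, payload=b"abc")
write_message(buf, msg_type=4, payload=b"xyz")
buf.seek(0)

print(read_message_broken(buf))  # the first read already leaves the stream out of sync
print(read_message_broken(buf))  # this "version" no longer lines up with what was written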

@robertnishihara
Collaborator

Another thing to verify is that you can start the plasma store by hand without any trouble. In your case probably

/usr/local/lib/python2.7/dist-packages/ray-0.2.0-py2.7-linux-armv7l.egg/ray/plasma/../core/src/plasma/plasma_store -s /tmp/s1 -m 1000000

If that works, then try connecting a plasma manager. E.g., check out the instructions in this comment #108 (comment).
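If the store does come up, one way to sanity-check it is a small create/seal against the store socket. This assumes you have the pyarrow plasma bindings available, which may or may not be true for your install, and the plasma.connect() signature has changed across Arrow versions, so treat it as a rough sketch:

import pyarrow.plasma as plasma

# Connect to the manually started store: store socket, manager socket, release delay.
client = plasma.connect("/tmp/s1", "", 0)
object_id = plasma.ObjectID(20 * b"a")
client.create(object_id, 100)      # reserve a 100-byte buffer for the object
client.seal(object_id)             # make the object visible to other clients
print(client.contains(object_id))  # should print True if the store is healthy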

@arvindc95

I have the same issue on a different platform (Ubuntu 16.04 VM running on Windows 7). I followed the instructions for connecting a plasma manager and was able to start a plasma store, but when I tried to start a plasma manager, the plasma store threw /ray/src/thirdparty/arrow/cpp/src/plasma/io.cc:98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4 and the plasma manager threw /ray/src/plasma/plasma_manager.cc:483 Check failed: _s.ok() Bad status: IOError: Broken pipe. Any advice on how to proceed?

@robertnishihara
Collaborator

robertnishihara commented Oct 5, 2017

@arvindc95 @akzare could you try cherry-picking this commit apache/arrow#1172, recompiling Arrow, and see if it fixes the problem? I just looked through the code in that file and spotted that potential bug.

Let me know if you have questions about how to do this.

If that doesn't work, then I think we'll just need to add a lot of print statements (e.g., in this function https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L90) and print the actual bytes that are being sent and see if we can infer anything from that.

@arvindc95

arvindc95 commented Oct 5, 2017

@robertnishihara thanks for the help; that commit helped me get Ray initialized. I'm able to put and get objects from the plasma store and use remote functions when there's nothing to parallelize, but when I try running the time.sleep example in the documentation (http://ray.readthedocs.io/en/latest/tutorial.html#remote-functions), I get a segmentation fault from the local scheduler. Do you have any ideas how I can debug this? Are there log files generated by the scheduler?
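For reference, the tutorial example in question is roughly the following (paraphrased from the linked docs rather than copied verbatim):

import time
import ray

ray.init()

@ray.remote
def f():
    time.sleep(1)
    return 1

# The four tasks should run in parallel; ray.get blocks until all results are ready.
results = ray.get([f.remote() for _ in range(4)])
print(results)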

@robertnishihara
Collaborator

robertnishihara commented Oct 5, 2017

Glad to hear it, and thanks for trying it out! Sounds like there's a bug in the local scheduler (perhaps similar to the previous bug).

You're rebuilding all of Ray, right? Because the local scheduler also communicates with the plasma store, so it probably needs the same fix from apache/arrow#1172.

Some processes log to /tmp/raylogs, so it's worth looking at the most recent files in there and seeing if anything turns up, but if you're starting Ray with ray.init(), then the local scheduler STDERR/STDOUT will just go to the terminal.
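If it helps, something like the following (just standard-library calls; the directory name is taken from the comment above) will surface the most recently written logs:

import glob
import os

# List the most recently modified files under /tmp/raylogs and print their tails.
logs = sorted(glob.glob("/tmp/raylogs/*"), key=os.path.getmtime, reverse=True)
for path in logs[:10]:
    print(path)
    with open(path) as f:
        print("".join(f.readlines()[-5:]))  # last few lines of each recent log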

What I would suggest is trying to run the same workload that is causing the crash, but to start the local scheduler in gdb. To do that, you could do something like the following.

  1. First modify (in local_scheduler_services.py)

    pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)

    to be something like

    import IPython
    IPython.embed()
    # pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
    pid = 9999
  2. Then start Python and do import ray and ray.init(). This will open up IPython when it tries to start the local scheduler. Run print(command) in the IPython shell to print the command that Ray wants to use to start the local scheduler.

  3. Then go to a different terminal window, and do

    gdb ray/python/ray/core/src/local_scheduler/local_scheduler
    

    Then do run followed by the command printed by print(command) to start the local scheduler in gdb. However, you'll need to drop the initial executable from the command, and you'll need to add quotes around the full argument to the -w flag, which is pretty long; otherwise you'll get an error saying unknown flag or something like that. (See the sketch after this list.)

  4. Then go back to the IPython shell and do exit()

  5. Then run your workload and see what errors are caught in gdb.

    Note that if the error is uninformative, we may need to recompile Ray with more debug information. E.g., maybe add a -g to the line

    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")
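As a concrete illustration of step 3, here is a rough sketch (a hypothetical helper, not something that ships with Ray) of turning the printed command list into the gdb run line, dropping the executable and quoting the long -w argument:

def to_gdb_run_args(command):
    # Drop the local_scheduler executable itself; gdb already knows which binary to run.
    args = list(command[1:])
    # Quote the (long) argument that follows the -w flag so gdb's run parses it as one token.
    for i, arg in enumerate(args):
        if arg == "-w" and i + 1 < len(args):
            args[i + 1] = '"' + args[i + 1] + '"'
    return "run " + " ".join(args)

# Paste the output of print(command) in place of this placeholder list.
print(to_gdb_run_args(["/path/to/local_scheduler", "-s", "/tmp/s1", "-w", "long worker start command ..."]))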

@arvindc95

@robertnishihara I tried the steps you outlined for using gdb, but when I tried to run my workload I kept getting an exception when defining a function with the @ray.remote decorator; I've attached the error thrown:
decorator_error.txt

Also, when making the fix you referenced, I made the code change in the Arrow code and then reran python setup.py install in order to rebuild Ray. Let me know if this procedure is incomplete for rebuilding Ray (I also ran it after changing local_scheduler_services.py because the IPython shell wasn't showing up).

Thanks again for your help!

@robertnishihara
Collaborator

@arvindc95 interesting, that seems like the same error as #394.

You could try using IPython instead of Python, since #394 was only an issue in the regular Python interpreter.

It's also possible that when you reran python setup.py install, it undid your changes to Arrow. Can you check that your changes were unaffected? Or perhaps comment out this line

git checkout 988338c544580ffd367a5540f1061dd7b0fccc0e

Also, instead of using python setup.py install, I'd suggest using python setup.py develop, because that way whenever you change the Python code you won't need to rerun setup.py; the changes will automatically be picked up.

@arvindc95

@robertnishihara Using IPython helped; my workload runs successfully, but the debugger throws the following error immediately after the workload completes:

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.

Would this segfault be from local_scheduler_services.py or any of the functions it calls?

@robertnishihara
Collaborator

If you do bt in gdb, does that print anything?

This error looks similar to https://groups.google.com/forum/#!topic/jansson-users/u78eGC15itw.

cc @atumanov

@arvindc95

arvindc95 commented Oct 12, 2017

Yes, here's the output:

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
(gdb) bt
#0 __strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
#1 0x080965ce in redisvFormatCommand (target=0xbfffe8b8,
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe928 "") at hiredis.c:262
#2 0x0809b91c in redisvAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe920 "\340\316\f\b\036")
at async.c:654
#3 0x0809b99c in redisAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
format=0x809fd45 "ZADD %b %s %b") at async.c:669
#4 0x0806c621 in RayLogger_log_event (db=0x80c7de0,
key=0x80ccee0 "event_log:\213\313\363\265O\312\300Է#\206Ɍ\234\274\301\332T\222\004", key_length=30,
value=0x80cc8e8 "[[1507822835.414047, "ray:get_task", 1, {}], [1507822914.265168, "ray:import_function_to_run", 1, {}], [1507822914.265763, "ray:import_function_to_run", 2, {}], [1507822914.266116, "ray:import_functio"...,
value_length=1520, timestamp=1507822920.4063809)
at /home/achand/ray/src/common/logging.cc:100
#5 0x08056c35 in process_message(aeEventLoop*, int, void*, int) ()
#6 0x0807bcbd in aeProcessEvents (eventLoop=0x80bea38, flags=3)
at /home/achand/ray/src/common/thirdparty/ae/ae.c:412
#7 0x0807c19b in aeMain (eventLoop=0x80bea38)
at /home/achand/ray/src/common/thirdparty/ae/ae.c:455
#8 0x0805f8f8 in event_loop_run (loop=0x80bea38)
at /home/achand/ray/src/common/event_loop.cc:58

@pcmoritz
Contributor

It looks like it is using SSE2 instructions, which probably aren't available on ARM. Could it be that there is some issue with the (cross-)compilation?

@arvindc95

@pcmoritz I checked the instruction sets supported in the VM guest, and SSE2 is one of them (it's supported in the VM host as well).
capture

@pcmoritz
Contributor

@arvindc95 I created a PR here: #1122. Could you try both of the commits in the PR and see if one of them makes it work? They are both fixing potential problems here. Thanks!

@robertnishihara
Collaborator

In particular, we'd be interested in knowing which of the two commits fixes it (assuming one of them does in fact fix it).

@arvindc95

seg_fault_fix_logging.txt
seg_fault_add_casts.txt

Both failed the same way as before; the segfault happened after the results of foo.remote() were returned. I made the logging code change, ran python setup.py develop, then tried running a workload, and repeated the process for the static cast addition as well. The gdb logs show the updated logging code change, and the lines referenced are slightly different between the two logs, so I think the changes were compiled; let me know if I missed anything.

@arvindc95

arvindc95 commented Oct 13, 2017

Also, I've been manually starting Ray because when I don't, the plasma store never initializes. I changed the socket name from /tmp/s1 to /tmp/s2 in case the same socket was being reused every time I manually started the store, but the store still initialized fine either way, so I'm not sure why it doesn't get created when I don't manually start it.

@pcmoritz
Contributor

pcmoritz commented Oct 13, 2017

Hm, thanks for trying it out. Is there any chance you could share your VirtualBox image (or an EC2 AMI if you have one) together with instructions to reproduce the problem, so we can dig deeper into this?

@atumanov
Contributor

I was able to reproduce this on 32-bit Ubuntu 16.04 and fix it. I put together a quick PR that fixes it for me. Could you please try out #1126? Thanks.

@arvindc95

@atumanov The changes from your PR worked, thanks so much!
Also, thanks to @pcmoritz and @robertnishihara for your help resolving this as well! Would you still like me to post my VirtualBox image?

@atumanov
Contributor

@arvindc95, awesome, glad to hear! The VirtualBox image will be helpful for testing, in case we need to reproduce any other problems you encounter. If you are in a position to provide us with the ODROID platform for testing purposes as well, even better :)

@robertnishihara
Collaborator

Closing for now since a lot of things have changed.
