Ray issue on Odroid-XU4 board #1008
Comments
I'm a little surprised to see this error
since that value has never changed from
It's possible that we're doing the arithmetic incorrectly somewhere in this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L94-L109 and this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L63-L69. For example, one of the types may have the wrong size, or there may be a mismatch between the two blocks. |
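To illustrate the kind of size mismatch suspected here, the following is a minimal Python sketch of a length-prefixed message protocol, not the actual Plasma code. It assumes a header of three signed 64-bit integers (version, type, payload length); the names `HEADER`, `PROTOCOL_VERSION`, `write_message`, `read_exact`, and `read_message` are hypothetical. The point is that writer and reader must agree on fixed-width fields: if either side used a platform-sized integer instead, a 32-bit build would read the header at the wrong offsets.

```python
import socket
import struct

# Assumed header layout: three little-endian signed 64-bit integers
# (protocol version, message type, payload length). The "<" prefix disables
# native sizing and alignment, so the header is 24 bytes on every platform.
HEADER = struct.Struct("<qqq")
PROTOCOL_VERSION = 0  # placeholder value, not taken from the Arrow source


def write_message(sock, msg_type, payload):
    """Send one framed message: fixed-size header followed by the payload."""
    header = HEADER.pack(PROTOCOL_VERSION, msg_type, len(payload))
    sock.sendall(header + payload)


def read_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed with %d bytes unread" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_message(sock):
    """Read one framed message and return (msg_type, payload)."""
    version, msg_type, length = HEADER.unpack(read_exact(sock, HEADER.size))
    if version != PROTOCOL_VERSION:
        raise ValueError("unexpected protocol version %d" % version)
    return msg_type, read_exact(sock, length)
```

A round trip over `socket.socketpair()` (write on one end, read on the other) is a quick way to check that both sides agree on the header layout.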
Another thing to verify is that you can start the plasma store by hand without any trouble. In your case probably
If that works, then try connecting a plasma manager. E.g., check out the instructions in this comment #108 (comment). |
I have the same issue on a different platform (Ubuntu 16.04 VM running on Windows 7). I followed the instructions for connecting a plasma manager, and was able to start a plasma store, but when I tried to start a plasma manager, I received a |
@arvindc95 @akzare could you try cherry-picking this commit apache/arrow#1172, recompiling Arrow, and see if it fixes the problem? I just looked through the code in that file and spotted that potential bug. Let me know if you have questions about how to do this. If that doesn't work, then I think we'll just need to add a lot of print statements (e.g., in this function https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L90) and print the actual bytes that are being sent and see if we can infer anything from that. |
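As a rough illustration of "printing the actual bytes", here is a small hex dump helper, sketched in Python for brevity (the real print statements would go into the C++ source). The function name `hexdump` and its output format are assumptions made for this example, not anything taken from Arrow or Ray.

```python
def hexdump(data, width=16):
    """Return a multi-line hex dump of a bytes object, `width` bytes per row."""
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        # Hex column: two hex digits per byte, separated by spaces.
        hex_part = " ".join("%02x" % b for b in row)
        # Text column: printable ASCII as-is, everything else as ".".
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(
            "{:08x}  {}  {}".format(offset, hex_part.ljust(width * 3 - 1), text_part)
        )
    return "\n".join(lines)
```

Dumping the same message on the sending and the receiving side makes it easier to see whether a header field is being written with one width and read with another.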
@robertnishihara thanks for the help, that commit helped me get ray initialized; I'm able to put and get objects from the plasma store, and use the remote function when there's nothing to be parallelized, but when I try running the time.sleep example in the documentation (http://ray.readthedocs.io/en/latest/tutorial.html#remote-functions), I get a segmentation fault thrown from the local scheduler. Do you have any ideas how I can debug this? Are there log files generated by the scheduler? |
Glad to hear it, and thanks for trying it out! Sounds like there's a bug in the local scheduler (perhaps similar to the previous bug). You're rebuilding all of Ray, right? The local scheduler also communicates with the plasma store, so it probably needs the same fix from apache/arrow#1172. Some processes log to
What I would suggest is running the same workload that is causing the crash, but starting the local scheduler in gdb. To do that, you could do something like the following.
|
@robertnishihara I tried the steps you outlined for using gdb, but when I tried to run my workload I kept getting an exception when defining a function with the Also, when making the fix you referenced, I made the code change in the arrow code and then reran Thanks again for your help! |
@arvindc95 interesting, that seems like the same error as #394. You could try using IPython instead of Python, since #394 was only an issue in the regular Python interpreter. It's also possible that when you reran ray/src/thirdparty/download_thirdparty.sh Line 16 in b1660c4
Also, instead of using |
@robertnishihara Using IPython helped; my workload runs successfully, but the debugger throws the following error immediately after the workload completes: |
If you do This error looks similar https://groups.google.com/forum/#!topic/jansson-users/u78eGC15itw. cc @atumanov |
Yes, here's the output: Program received signal SIGSEGV, Segmentation fault. |
It looks like it is using SSE2 instructions which probably aren't available on ARM. Could it be that there is some issue with the (cross-)compilation? |
@pcmoritz I checked the instruction sets supported in the VM guest and SSE2 is one of them (it's supported in the VM host as well) |
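One way to check which instruction set extensions a Linux machine reports is to parse `/proc/cpuinfo`. The sketch below is only an illustration; the function name `cpu_flags` is hypothetical. It relies on x86 kernels listing extensions under `flags` and ARM kernels listing them under `Features`.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature names from /proc/cpuinfo-style text.

    x86 kernels list them under "flags"; ARM kernels use "Features".
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "key : value" with a matching key count.
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags
```

On Linux, passing the contents of `/proc/cpuinfo` to `cpu_flags` and testing for `"sse2"` in the result would confirm whether the guest advertises SSE2; on an ARM board the set would instead contain names such as `neon`.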
@arvindc95 I created a PR here: #1122 Could you try both the commits in the PR and see if one of them makes it work? These are both fixing potential problems here. Thanks! |
In particular, we'd be interested in knowing which of the two commits fixes it (assuming one of them does in fact fix it). |
seg_fault_fix_logging.txt Both failed the same way as before; the segfault happened after the results of foo.remote() were returned. I made the logging code change, ran |
Also, I've been manually starting ray because when I don't, the plasma store never initializes. I changed the socket name from |
Hm, thanks for trying it out. Is there any chance you could share your VirtualBox image (or an EC2 AMI, if you have one) together with instructions to reproduce the problem, so we can dig deeper into this? |
I was able to reproduce this on 32-bit Ubuntu 16.04 and fix it. I put together a quick PR that fixes it for me. Could you please try out #1126? Thanks. |
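Since the problem reproduced on a 32-bit system, it can help to confirm whether a given environment is 32-bit before comparing results. A minimal sketch, with the hypothetical helper name `pointer_bits`, that reports the pointer width of the running Python interpreter:

```python
import struct


def pointer_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a native void pointer.
    return struct.calcsize("P") * 8
```

Note that this describes the interpreter build, which on a 64-bit OS may still be a 32-bit binary.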
@atumanov The changes from your PR worked, thanks so much! |
@arvindc95, awesome, glad to hear! The VirtualBox image will be helpful for testing, in case we need to reproduce any other problems you encounter. If you are in a position to provide us with the ODROID platform for testing purposes as well, even better :) |
Closing for now since a lot of things have changed. |
I've built Ray on an Odroid-XU4 board (http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825). When I try to run a simple application on it, Ray reports the following issue:
The attached Ray_Issue_XU4.log contains the Ray log.
Ray_Issue_XU4.log