Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simulation Runner still STATIC #341

Closed
stonier opened this issue Apr 4, 2018 · 21 comments · Fixed by #437
Closed

Simulation Runner still STATIC #341

stonier opened this issue Apr 4, 2018 · 21 comments · Fixed by #437

Comments

@stonier
Copy link
Collaborator

stonier commented Apr 4, 2018

Drops out at the gil_scoped_acquire. Haven't been able to quickly fathom why. Multiply called?

Easy to test, just drop the STATIC keyword in the add_library(simulation_runner and comment out the PIC code, then run keyboard_controlled_simulation.py. A backtrace.

@apojomovsky are you the right fellow to look at this?

* <p> will pause the simulation if unpaused and viceversa. *
* <s> will step the simulation once if paused.             *
* <q> will stop the simulation and quit the demo.          *
************************************************************

[New Thread 0x7fffccdc8700 (LWP 28880)]

Thread 15 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffccdc8700 (LWP 28880)]
0x00000000004aeac0 in ?? ()
(gdb) bt
#0  0x00000000004aeac0 in ?? ()
#1  0x00000000004c0654 in PyEval_GetBuiltins ()
#2  0x00007ffff644d381 in pybind11::detail::get_internals () at /mnt/mervin/workspaces/delphyne/install/include/pybind11/detail/internals.h:144
#3  0x00007ffff644d8e9 in pybind11::gil_scoped_acquire::gil_scoped_acquire (this=0x7fffccdc7e30)
    at /mnt/mervin/workspaces/delphyne/install/include/pybind11/pybind11.h:2150
#4  0x00007ffff643f951 in delphyne::backend::SimulatorRunner::RunSimulationStep (this=this@entry=0xed61c0)
    at /mnt/mervin/workspaces/delphyne/src/delphyne/backend/simulation_runner.cc:228
#5  0x00007ffff643fb70 in delphyne::backend::SimulatorRunner::RunSimulationLoopFor(double, std::function<void ()>) (this=0xed61c0, 
    duration=<optimised out>, callback=...) at /mnt/mervin/workspaces/delphyne/src/delphyne/backend/simulation_runner.cc:184
#6  0x00007ffff643fcc9 in delphyne::backend::SimulatorRunner::<lambda()>::operator() (__closure=<optimised out>)
    at /mnt/mervin/workspaces/delphyne/src/delphyne/backend/simulation_runner.cc:166
#7  std::_Bind_simple<delphyne::backend::SimulatorRunner::RunAsyncFor(double, std::function<void()>)::<lambda()>()>::_M_invoke<> (this=<optimised out>)
    at /usr/include/c++/5/functional:1531
#8  std::_Bind_simple<delphyne::backend::SimulatorRunner::RunAsyncFor(double, std::function<void()>)::<lambda()>()>::operator() (this=<optimised out>)
    at /usr/include/c++/5/functional:1520
#9  std::thread::_Impl<std::_Bind_simple<delphyne::backend::SimulatorRunner::RunAsyncFor(double, std::function<void()>)::<lambda()>()> >::_M_run(void) (
    this=0xf43260) at /usr/include/c++/5/thread:115
#10 0x00007ffff443bc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff7bc16ba in start_thread (arg=0x7fffccdc8700) at pthread_create.c:333
#12 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) quit
@stonier stonier assigned stonier and apojomovsky and unassigned stonier Apr 4, 2018
@stonier stonier added this to the Milestone #2 milestone Apr 4, 2018
@apojomovsky
Copy link

Sorry for the late response @stonier , didn't see my mention at first glance.
Yes, I definitely can take a look at this.

@apojomovsky
Copy link

apojomovsky commented Apr 4, 2018

@stonier , sorry for bothering. I'm having problems to reproduce the issue.
Just to be sure I understand correctly, I assume you are suggesting to drop this and this other lines, but that fails at link time with the following:

CMakeFiles/automotive-demo.dir/automotive_demo.cc.o: In function `delphyne::backend::(anonymous namespace)::main(int, char**) [clone .isra.526]':
/home/alexis/2_delphyne_ws/src/delphyne/backend/automotive_demo.cc:189: undefined reference to `delphyne::backend::SimulatorRunner::SimulatorRunner(std::unique_ptr<delphyne::backend::AutomotiveSimulator<double>, std::default_delete<delphyne::backend::AutomotiveSimulator<double> > >, double)'
/home/alexis/2_delphyne_ws/src/delphyne/backend/automotive_demo.cc:190: undefined reference to `delphyne::backend::SimulatorRunner::Start()'
/home/alexis/2_delphyne_ws/src/delphyne/backend/automotive_demo.cc:193: undefined reference to `delphyne::backend::WaitForShutdown()'
/home/alexis/2_delphyne_ws/src/delphyne/backend/automotive_demo.cc:189: undefined reference to `delphyne::backend::SimulatorRunner::~SimulatorRunner()'
/home/alexis/2_delphyne_ws/src/delphyne/backend/automotive_demo.cc:189: undefined reference to `delphyne::backend::SimulatorRunner::~SimulatorRunner()'
collect2: error: ld returned 1 exit status
backend/CMakeFiles/automotive-demo.dir/build.make:123: recipe for target 'backend/automotive-demo' failed
make[2]: *** [backend/automotive-demo] Error 1
CMakeFiles/Makefile2:340: recipe for target 'backend/CMakeFiles/automotive-demo.dir/all' failed
make[1]: *** [backend/CMakeFiles/automotive-demo.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

May I be missing another step? I'm currently working on top of stonier/mostly_unwind_static.

@stonier
Copy link
Collaborator Author

stonier commented Apr 5, 2018

That looks like it's still hiding visibility (hence the undefined references). Did you try clearing out build/delphyne and rebuilding? I wonder if the flags are cached.

@stonier
Copy link
Collaborator Author

stonier commented Apr 5, 2018

Just removed those two lines from master, cleaned, rebuilt, and no linking problems on my end.

This was referenced Apr 6, 2018
@apojomovsky
Copy link

apojomovsky commented Apr 13, 2018

Hi @EricCousineau-TRI , sorry for bothering.
We are experimenting a very specific issue related to pybind11’s, gil_scoped_acquire. So after banging my head against the wall for a while, it sounded like a good moment to ask for help.
I’ll try to expose the idea as clearly as possible.

Original working state:

  • Compiling simulation_runner library as STATIC, with fPIC activated. (See here for details).

Experimental, non working state:

  • Compiling simulation_runner library as SHARED (fPIC disabled).

Problem:

  • A Segmentation Fault is generated at the moment the py::gil_scoped_acquire acquire; is called (commenting out this line avoids it but sadly that is not an option, as we need to keep the communication between python and C++ thread safe).

Background:

  • The simulation_runner class have a member variable called step_callbacks_ consisting on a vector of std::function<void()>. This vector is loaded from within python with python functions at runtime (example here) (usually lambdas, but the same issue has been proven to occur with regular functions too).
  • A SimulationRunner::RunInteractiveSimulationLoopStep() function is regularly called in the C++ side where, after some simulation-related function calls, the py::gil_scoped_acquire acquire; is called, followed by each python callback. Both run together on a clean scope. No explicit call to the py::gil_scoped_release release is made. (see here)

Any suggestions or comments that may shed some light on this issue would be highly appreciated (:

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Apr 17, 2018

Sorry for the late response! I had briefly tried it out on delphyne on my machine on Friday, and could reproduce the same behavior as you: using Daniel's Makefile, running ./src/delphyne/test/run_tests.sh would cause a segfault to arise in simulation_runner_py_test.py.

I unfortunately cannot pinpoint where it's going awry, or why it's due to shared linking rather than static linking.

One thing I did try was running the STATIC build under a debug build of Python 2.7, which indicates the following error after running the test "successfully":

...
Ran 1 test in 0.154s

OK
[88491 refs]
Fatal Python error: PyThreadState_Delete: NULL interp

This indicates that there may be a problem with how it is at present with STATIC linking that's making itself more apparent when building + running with SHARED linking.

My suggestion is to completely rebuild everything (make clean on the workspace - drake included) using a debug+valgrind build of Python2.7. I've used it to debug some NumPy issues, and it seems to work well:
cpython_dbg_valgrind.sh

Then use this to try and eliminate the NULL interp error in the STATIC build before moving to SHARED.

(Note: You may need to re-run pip install for your deps when you call fhs-extend or whatever.)

@apojomovsky
Copy link

Not a problem at all. Thank you for taking the time to give it a try and for all these pointers!
I'll be giving them a try and let you know if we get it solved (:

@apojomovsky
Copy link

apojomovsky commented Apr 24, 2018

Hi @EricCousineau-TRI , how are you doing?
Sorry for bothering on this issue again.
I’ve built a clean docker environment to reproduce the debugging environment you suggested here. With this I’ve been able to reproduce the steps from your bash script -after some issues with initial dependencies and stuff- but in the end, I managed to get all the tools built and installed in the desired paths.
Now, after having run the functions to pre-append the paths to the debug versions of the tools, I built the whole dephyne environment from scratch, including drake and all the ignition dependencies.
But now I’m having issues to build delphyne, specifically when linking to pybind11-dependent libraries.
Do you have any hint of what could be the issue here?
These are the current env variables that were set by the script, and that were used for each step of the build:

$PYTHONPATH /home/delphyne/.local/python/2.7.12-dbg/lib:/home/delphyne/.local/python/2.7.12-dbg/lib/python2.7/dist-packages:/home/delphyne/.local/python/2.7.12-dbg/lib/python2.7/site-packages:
$PATH /home/delphyne/.local/python/2.7.12-dbg/bin:/home/delphyne/bin:/home/delphyne/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
$LD_LIBRARY_PATH /home/delphyne/.local/python/2.7.12-dbg/lib:
$PKG_CONFIG_PATH /home/delphyne/.local/python/2.7.12-dbg/lib/pkgconfig:/home/delphyne/.local/python/2.7.12-dbg/share/pkgconfig:

This is the error I'm getting at buildtime:

[ 20%] Built target time_conversion
[ 28%] Built target drake_ign_translation_systems
[ 31%] Built target automotive_simulator
[ 36%] Built target simulation_runner
[ 37%] Linking CXX executable automotive-demo
libdelphyne-simulation-runner.a(simulation_runner.cc.o): In function `pybind11::handle::dec_ref() const &':
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_RefTotal'
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_Dealloc'
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_NegativeRefcount'
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_RefTotal'
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_Dealloc'
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_NegativeRefcount'
libdelphyne-simulation-runner.a(simulation_runner.cc.o): In function `pybind11::handle::inc_ref() const &':
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:159: undefined reference to `_Py_RefTotal'
libdelphyne-simulation-runner.a(simulation_runner.cc.o): In function `pybind11::str::operator std::__cxx11::basic_strin
g<char, std::char_traits<char>, std::allocator<char> >[abi:cxx11]() const':
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:869: undefined reference to `PyUnicodeUCS2_AsUTF8String'
libdelphyne-simulation-runner.a(simulation_runner.cc.o): In function `pybind11::handle::dec_ref() const &':
/home/delphyne/delphyne_ws/install/include/pybind11/pytypes.h:166: undefined reference to `_Py_RefTotal'
(...)

@EricCousineau-TRI
Copy link
Contributor

Doing well!
When you configured the CMake program, did you specify -DPYTHON_EXECUTABLE=$(which python)?
For some reason, FindPythonLib.cmake does not always seem to query ${PATH}, but rather just uses /usr/bin/python.

@apojomovsky
Copy link

Yay! That did the trick! Thanks a lot @EricCousineau-TRI !!
Now digging into the PyThreadState_Delete: NULL interp thing.

@apojomovsky
Copy link

apojomovsky commented Apr 25, 2018

Hi @EricCousineau-TRI ,
Is there any way the issue described above could be related to this https://bitbucket.org/cffi/cffi/issues/362/crash-on-thread-destruction-in ? I mean, the description of the situation looks quite similar, so do the symptoms. According to the author: "That's yet another issue with the threading API in CPython" so, do you think there's any chance this could be an issue on CPython's side?

Independently from that, I've tried the following without success (all the tests are mutually independant).

  • Removing the Py_Initialize and PyEval_InitThreads (since they seemed to be inherited from the boost::python version).
  • Calling an explicit gil_scoped_release in the object's destructor
  • Calling a Py_Finalize() in the object's destructor

I've found that this error only shows up in case we have at least one python-registered callback.
Please let me know if any of this rings you any bell. Thanks in advance!

@EricCousineau-TRI
Copy link
Contributor

Unfortunately, I'm not sure exactly what's going on... When I briefly tried debugging, I tried a subset of what you tried, but couldn't trace down the exact failure.

Perhaps the most definitive way (but the most work) would be to try and reproduce this error in a non-Delphyne project, with as few moving pieces as possible; using the current fork of pybind11, and see if you can reduce this to as minimal of an example as possible (and perhaps have compile-time flags, such as enabling / disabling threading, when to release GIL, etc.).
I had used this to develop some of the pybind11 features, since they were relatively generic.

Even if you're not able to find the issue, it'd probably help in filing an upstream issue, either with CPython or with pybind11 (if it's not a bug I had introduced in our fork...).

That's unfortunately the most helpful thing I can think of at the moment...

@apojomovsky
Copy link

apojomovsky commented Apr 25, 2018

Will try that, for sure! Thanks for the suggestions!

@stonier
Copy link
Collaborator Author

stonier commented Jun 1, 2018

@apojomovsky @basicNew

I just tried to reproduce this error this morning, and oddly enough, it has gone away. Can you confirm?

@apojomovsky
Copy link

Wow, that's great! but weird at the same time, since I tried it just a couple of days ago and the python tests segfaulted.
Let me try it with a completely fresh environment.
Will post my findings here after that.

@stonier
Copy link
Collaborator Author

stonier commented Jun 1, 2018

For reference (from @basicNew), c7c1e901 shows how it was done with the non-pybind11 api in case we want to drop the pybind11 dependency in the backend or work around a potential problem in pybind11 there.

@apojomovsky
Copy link

I've just checked and in my case, both the python tests and the demos keep failing after dropping the STATIC keyword in the simulation runner and commenting out the PIC code in the CMakeLists file.

@stonier
Copy link
Collaborator Author

stonier commented Jun 1, 2018

Maybe I'm subconsciously doing something to create a mystery - been watching too much of the BBC Sherlock series in the last two weeks. I'll update here if I can work out what.

@apojomovsky
Copy link

haha love that show! :)
I'm currently attempting to isolate the error discovered by Eric when attempting the run delphyne as-is but with the gdb-enabled version of cpython, just as he suggested in his last comment. Will post as soon as I have any update.

@stonier
Copy link
Collaborator Author

stonier commented Jun 1, 2018

Ok, deleted the build/install folders, cleaned out, rebuilt all. Failing on python tests, keyboard_controlled_simulation.py, ... but, working for all_cars_in_dragway.py and maliput_racing_circuit.py.

@apojomovsky
Copy link

apojomovsky commented Jun 1, 2018

Yes, it only fails when there is a callback registered from python. Both all_cars_in_dragway.py and maliput_racing_circuit.py don't have any.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants