free(): invalid pointer error in pydrake after importing pytorch. #12073
Comments
@gizatt ran into some issues with imports on his system, and he had this version of Python on Anaconda, whereas system Python 3 was 3.6.8. Are you using Anaconda? If so, would it be possible for you to try it? Slack discussion: https://drakedevelopers.slack.com/archives/C3YB3EV5W/p1568745853008800
EDIT: If it's not Anaconda, it might be that we're hitting up against #7856, which should be fixable!
BTW I recall experiencing this error across different Python versions (iirc it was in both my 2.7 and 3.6 environments) when using Pytorch + Python from conda -- I found that importing pydrake first, and then torch immediately afterwards, was enough to work around the problem.
Thanks for the help. You are correct, I was using conda. However, I tried the script above with system python and a virtual env, with the same results. So the three paths (via print(sys.executable)) were:
Exactly the same error as before in each case. Per @gizatt's suggestion though, simply importing torch after all the pydrake imports seems to work just fine, and that's an easy enough workaround. Also, I don't think I have access to that slack channel.
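The import-order workaround described above can be sketched as follows. pydrake and torch are the module names from the thread; the import_order helper is a hypothetical diagnostic that relies on sys.modules preserving insertion order (CPython 3.7+), demonstrated here with stdlib stand-ins since pydrake/torch may not be installed:

```python
import sys

def import_order(*names):
    """Return the given module names in the order they were first imported,
    using the insertion order of sys.modules (CPython 3.7+)."""
    return [m for m in sys.modules if m in names]

# The workaround reported in the thread (library names from the comments):
#   import pydrake   # all pydrake imports first
#   import torch     # only afterwards
# Importing torch before pydrake is what triggered free(): invalid pointer.

# Demonstration with stdlib modules standing in for pydrake and torch.
# Popping them first guarantees a fresh, ordered re-insertion even if the
# interpreter already imported them.
for name in ("colorsys", "fileinput"):
    sys.modules.pop(name, None)
import colorsys   # stands in for pydrake (imported first)
import fileinput  # stands in for torch (imported second)
```

Checking import_order("pydrake", "torch") early in a real script would confirm whether the safe ordering actually held.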
Hm... In Anzu, we typically import On Monday (or mebbe a bit after), I'll see if I can repro this in a Docker container. Just to check, were you using a binary package?
That's correct, this is with the binary release. I included the version info I was using when I saw this behavior in the first post. I have since tried the latest nightly (VERSION.txt: 20190920080020 52e0c75), and the behavior persists, again across the three python interpreters.
I'm able to reproduce this, both in Docker (repro commit) and in Anzu as well. Going to dig a bit deeper...
Hm... Doing a stack trace from Anzu, it looks like it's dying in Perhaps
Maybe the libstdc++ dual ABI? https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
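For reference, the dual ABI is visible directly in mangled symbol names: string symbols built with -D_GLIBCXX_USE_CXX11_ABI=1 live in the std::__cxx11 inline namespace. A small sketch of that heuristic (the example symbols are illustrative std::string::append manglings, not taken from drake or torch):

```python
def uses_cxx11_abi(mangled_symbol: str) -> bool:
    """Rough heuristic: symbols compiled against the new libstdc++ ABI carry
    the std::__cxx11 inline namespace (or the [abi:cxx11] tag, mangled as
    B5cxx11) in their mangled names."""
    return "__cxx11" in mangled_symbol or "B5cxx11" in mangled_symbol

# std::string::append(const char*) mangled under each ABI (illustrative):
new_abi = "_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6appendEPKc"
old_abi = "_ZNSs6appendEPKc"
```

Running `nm -D` on a shared object and grepping for `__cxx11` is the usual way to apply this check to a real binary.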
Yeah, that smells more like the issue. However, it seems like they offer the downloads just for LibTorch, so it's unclear if they have it in the Python pip binaries... From here, it looks like it's just C++:
Actually, I'm not entirely sure what's going on. I've got my repro branch with the following:
To try it out, run
It fails when (a)
For (a), if I check the diffs between
For (b), I can reproduce by swapping out

@jwnimmer-tri Am I overcomplicating this? Even though we're not doing any linking with LibTorch, should we expect GCC ABI differences to produce runtime (not link-time) errors?
It does appear that when I compile just your examples with -D_GLIBCXX_USE_CXX11_ABI=1 I see the error when going through the "torch_first" code path, but when compiling with -D_GLIBCXX_USE_CXX11_ABI=0 there are no problems. That's both with clang 6.0.0 and g++ 7.4.0. That leads me to think this is in fact an ABI incompatibility problem, but I'm as confused as you are (actually, very likely more so) as to why this is causing problems at run time. I'm happy to share everything needed to reproduce that, but it might be better for your own sanity to just try and add that flag to your build yourself (if you haven't tried that already). I have no idea how to use bazel, and found it faster to compile your code by hacking together a makefile and doing some manual copies.

Also not sure if this helps, but in some more or less unrelated work today I noticed I can import an older TensorFlow (1.11.0, which as far as I can tell was compiled with -D_GLIBCXX_USE_CXX11_ABI=0) and then use pydrake as in that original example with no problems.

Edit: Just found this issue in pytorch: pytorch/pytorch#19739 and pytorch/pytorch#21018. Possibly this is a problem solely in pytorch. Possibly there are other libraries out there that will cause the same issue though.
Thanks for pointing out those issues! It's bittersweet to see that other projects using
And huh... I tried out both
Just for grins, I also tried out using the same version of
Still scratching my head, 'cause I'm confused as to why static vs. shared presents any issue...

UPDATE: The use of
I did some more digging. I'm finding, when I examine all currently linked objects at runtime (via dl_iterate_phdr() in cc_regex.cc), that the version of libstdc++ changes when torch is imported first, making me suspect that the RTLD_GLOBAL flag they set might be our issue. So I'm not 100% sure how this is all working, but my current hypothesis is that our code (drake / regex) is being compiled such that it is using std::string (or whatever symbol that becomes) directly. However, when it goes to load that symbol it is doing so from an old standard library (since torch clobbered our symbols?), getting a pre-C++11-ABI string instead, which causes the segfault. Not sure if that's possible / plausible, but I feel like I'm on the right track. In about half an hour I can share the code/makefiles I have to see if you can reproduce the behavior I'm seeing (including the -D_GLIBCXX_USE_CXX11_ABI=0 "fix").
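A rough Python-side version of that check (the dl_iterate_phdr() instrumentation above is the more direct route) is to scan the process's memory map for which libstdc++ is actually loaded. This helper is hypothetical and parses Linux /proc/&lt;pid&gt;/maps-format text:

```python
def mapped_libraries(maps_text, name_fragment="libstdc++"):
    """Return the unique file paths in /proc/<pid>/maps-format text whose
    pathname contains name_fragment, preserving first-seen order."""
    seen = []
    for line in maps_text.splitlines():
        # The pathname, when present, is the sixth whitespace-separated field.
        parts = line.split()
        if len(parts) >= 6 and name_fragment in parts[-1] and parts[-1] not in seen:
            seen.append(parts[-1])
    return seen

# On Linux one would call:
#   mapped_libraries(open("/proc/self/maps").read())
# Sample maps-format text (addresses and inodes are made up):
sample = """\
7f00 r-xp 00000000 08:01 123 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
7f01 r--p 00000000 08:01 456 /opt/conda/lib/libstdc++.so.6.0.19
7f02 rw-p 00000000 00:00 0
"""
```

If a conda-provided libstdc++ appears ahead of the system one after importing torch, that matches the symbol-clobbering hypothesis above.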
Just for grins, tested that out here; it doesn't work in this hacked
But yeah, looking forward to seeing how you're seeing the symbols resolve to different versions, as I'm not familiar with that instrumentation - thanks!

UPDATE: Seems like the use of
It kinda explains the usage of
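As background on what RTLD_GLOBAL means here: CPython exposes the dlopen flags it uses when loading extension modules via sys.getdlopenflags()/sys.setdlopenflags() (Unix only). This sketch only demonstrates the mechanism the thread suspects torch of using; it does not import torch:

```python
import os
import sys

# Default CPython behavior: extension modules are loaded RTLD_LOCAL, so
# their symbols do not clobber symbols in other shared objects.
original = sys.getdlopenflags()

# A library that wants its C++ symbols visible process-wide (as the thread
# suspects torch does) can request RTLD_GLOBAL before importing itself:
sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)
assert sys.getdlopenflags() & os.RTLD_GLOBAL

# Restore the default so later imports keep the safer RTLD_LOCAL behavior.
sys.setdlopenflags(original)
```

With RTLD_GLOBAL in effect, a subsequently loaded extension's libstdc++ symbols can win symbol resolution over the ones drake's bindings were compiled against, which is consistent with the runtime (not link-time) failure seen here.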
... accidental close - sorry!
Looks like that may have been a false alarm. I do actually see libstdc++ change depending on import order when using that conda environment, but not with the virtual environment (see conda_first.txt and conda_second.txt in the tmp/ if you're interested). Maybe related to whatever issue is making conda break for you guys? (FYI on Linux I was able to use conda with drake with no problems so far, though not the case on my mac. Though after seeing the warning you guys have on the "Using drake from python" page I stopped using conda.) Either way, using the virtual_env I see the same libraries, but the order is different. Could still be a similar issue: our binary tries to load a symbol and takes the first one it sees, which is the wrong version. Still not sure why we are seeing different behavior wrt -D_GLIBCXX_USE_CXX11_ABI=0, because I'm still seeing that. Do you have a list of flags that bazel is sending to the compiler?
I can try out your branch + Makefile and see if that reproduces it on my machine.
If I do
As a follow-up, seems like someone has already complained about the use of
Some suggestions from @sammy-tri:
Another thing to try is recompiling
@sgillen I'll take a closer gander at what you had as well and see if that can give us hints as to what the symbol conflict is (if any). Thanks!
I attempted to reproduce using a version of @EricCousineau-TRI's repro branch above, with Running
Partially using Running
Ah, good, we're still doing the work inside our own compiled version of the regex code. We probably shouldn't mix compilations of
\cc @jamiesnape
Given that we have a (noisy) workaround and pinged upstream, lowering priority.
I'm not sure what more we can do in Drake about this? Could we just close the issue now?
SGTM. Closing for now.
FTR, seems like there will be an upstream fix! Given that we have the more direct check on
Hello, I'm encountering a bug when trying to use pydrake with pytorch. Here's a minimal example:
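(The original code block did not survive scraping; what follows is a hypothetical reconstruction from the surrounding description. The AcrobotPlant import path is an assumption, and the try/except exists only to keep the sketch runnable where the libraries are absent.)

```python
# Hypothetical reconstruction -- module paths are assumptions, not the
# reporter's original code.
try:
    import torch  # importing torch first is what triggers the crash
    from pydrake.examples.acrobot import AcrobotPlant  # assumed import path
    plant = AcrobotPlant()  # reported to die here with free(): invalid pointer
    outcome = "constructed"
except ImportError:
    outcome = "libraries not installed"
```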
This gives the error: free(): invalid pointer
The same code without the import torch runs with no issues.
Furthermore, if I import torch after the call to AcrobotPlant() I can use torch and drake with no apparent problems (I have not tested that extensively, but the code I was working on at the time still worked fine after the late import). The error is not specific to the AcrobotPlant; a RigidBodyPlant gives me the same error.
Also worth noting that in the full version of the code where I discovered this bug, the actual error printed was:
It's not clear to me yet why the error message changed when I brought this down to a minimal example, but the behavior and my workaround are the same.
drake version: 20190813040020 bc77fda
OS: Ubuntu 18.04
Python: 3.6.9
Any hints or ideas for how to resolve this would be appreciated. Thank you!