
free(): invalid pointer error in pydrake after importing pytorch. #12073

Closed
sgillen opened this issue Sep 19, 2019 · 24 comments

@sgillen commented Sep 19, 2019

Hello, I'm encountering a bug when trying to use pydrake with pytorch. Here's a minimal example:

import torch
from pydrake.examples.acrobot import (AcrobotPlant)

acrobot = AcrobotPlant() # Error here

This gives the error:

free(): invalid pointer
Aborted (core dumped)

The same code without the import torch runs with no issues.

Furthermore, if I import torch after the call to AcrobotPlant(), I can use torch and drake with no apparent problems (I haven't tested that extensively, but the code I was working on at the time still worked fine after the late import). The error is not specific to AcrobotPlant; a RigidBodyPlant gives me the same error.
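
For example, this ordering runs cleanly for me (the same snippet as above, with the torch import moved to the end):

from pydrake.examples.acrobot import AcrobotPlant

acrobot = AcrobotPlant()  # constructed before torch is loaded -- no error

import torch  # importing torch afterwards works with no apparent problems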

Also worth noting that in the full version of the code where I discovered this bug, the actual error printed was:

LLVM ERROR: out of memory
Aborted (core dumped)

It's not clear to me yet why the error message changed when I brought this down to a minimal example, but the behavior and my workaround are the same.

drake version: 20190813040020 bc77fda
OS: Ubuntu 18.04
Python: 3.6.9

Any hints or ideas for how to resolve this would be appreciated. Thank you!

@sgillen changed the title from "Memory error in pydrake after importing pytorch." to "free(): invalid pointer error in pydrake after importing pytorch." Sep 19, 2019
@EricCousineau-TRI (Contributor) commented Sep 19, 2019

Python: 3.6.9

@gizatt ran into some issues with imports on his system, and he had this version of Python on Anaconda, whereas system Python 3 was 3.6.8. Are you using Anaconda?
(In your program, can you do import sys; print(sys.executable) and put that information here?)

If so, would it be possible for you to try pydrake from the Ubuntu system Python (/usr/bin/python3), install pytorch from pip inside a virtualenv, and see if this issue still occurs? Here's an example of using virtualenv with pre-built pydrake:
https://drake.mit.edu/python_bindings.html#inside-virtualenv
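
For reference, a quick check like the following in each environment would show which interpreter and which torch / pydrake installs are actually in play (the __file__ prints are extra information beyond the sys.executable check above):

import sys
print(sys.executable)    # which interpreter is running

import pydrake
import torch             # imported after pydrake, so this check itself won't crash
print(pydrake.__file__)  # which pydrake install is in use
print(torch.__file__)    # which torch install is in use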

Slack discussion: https://drakedevelopers.slack.com/archives/C3YB3EV5W/p1568745853008800

EDIT: If it's not Anaconda, it might be that we're hitting up against #7856, which should be fixable!

@EricCousineau-TRI self-assigned this Sep 19, 2019
@EricCousineau-TRI added the configuration: python and component: distribution labels Sep 19, 2019
@gizatt (Contributor) commented Sep 19, 2019

BTW I recall experiencing this error across different Python versions (iirc it was in both my 2.7 and 3.6 environments) when using Pytorch + Python from conda -- I found that importing pydrake first, and then torch immediately afterwards, was enough to work around the problem.

@sgillen (Author) commented Sep 19, 2019

Thanks for the help.

You are correct, I was using conda. However, I tried the script above with the system Python and a virtualenv, with the same results. So the three paths (via print(sys.executable)) were:

/home/sgillen/miniconda3/envs/drake/bin/python
/home/sgillen/work/third_party/drake_venv/bin/python3
/usr/bin/python3

Exactly the same error as before in each case.
Conda env is still 3.6.9, system and virtualenv are 3.6.8.

Per @gizatt's suggestion though, simply importing torch after all the pydrake imports seems to work just fine, and that's an easy enough workaround.

Also I don't think I have access to that slack channel.

@EricCousineau-TRI (Contributor) commented Sep 20, 2019

Hm... In Anzu, we typically import torch first, then import pydrake, and haven't encountered an issue (yet).

On Monday (or mebbe a bit after), I'll see if I can repro this in a Docker container.

Just to check, were you using a binary package?

@sgillen (Author) commented Sep 21, 2019

That's correct, this is with a binary release; the version info in the first post is from when I saw this behavior.

I have since tried the latest nightly (VERSION.txt: 20190920080020 52e0c75), and the behavior persists, again across the three python interpreters.

@EricCousineau-TRI (Contributor):

I'm able to reproduce this, both in Docker (repro commit) and in Anzu. Going to dig a bit deeper...

@EricCousineau-TRI (Contributor) commented Sep 23, 2019

Hm... Doing a stack trace from Anzu, it looks like it's dying in drake::NiceTypeName::Canonicalize. From coredumpctl gdb:
https://gist.github.com/EricCousineau-TRI/a42bc9658a363230d02ff3005d221fd7

Perhaps pytorch is compiled with a different libstdc++ / libc++?

@jwnimmer-tri (Collaborator) commented Sep 23, 2019

Maybe the -D_GLIBCXX_USE_CXX11_ABI=0 vs -D_GLIBCXX_USE_CXX11_ABI=1 thing.

pytorch/pytorch#17492

https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
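
One rough way to check which ABI a given shared library exposes (a heuristic sketch, not authoritative, and the helper name is mine: with _GLIBCXX_USE_CXX11_ABI=1, the affected std::string / std::list symbols are mangled under the std::__cxx11 namespace):

import subprocess

def exposes_cxx11_abi(lib_path):
    # Dump the dynamic symbol table and look for the std::__cxx11 mangling
    # used by the new-ABI string/list types.
    syms = subprocess.check_output(["nm", "-D", lib_path], universal_newlines=True)
    return "__cxx11" in syms

# e.g. exposes_cxx11_abi("/usr/lib/x86_64-linux-gnu/libstdc++.so.6")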

@EricCousineau-TRI (Contributor):

Yeah, that smells more like the issue. However, it seems like they offer the downloads just for LibTorch, so it's unclear if they have it in the Python PIP binaries... From here, it looks like it's just C++:
pytorch/pytorch#17492 (comment)

@EricCousineau-TRI (Contributor) commented Sep 24, 2019

Actually, I'm not entirely sure what's going on.
I'm trying to use strace to see when a different library gets loaded, but I don't see anything significant... I would expect pytorch's CXX ABI settings to matter at compile time if we were directly linking against LibTorch; however, we're just importing pytorch itself, and thus shouldn't hit linker errors from ABI mismatches -- the two should be usable in parallel, as long as only one implementation of the STL is in play.

I've got my repro branch with the following:

  • a minimal reproduction of the issue; Drake isn't necessary, I just need to construct a std::regex object (and I guess let it get destructed)
  • it's using pybind11 upstream master, not our fork
  • it outputs a filtered version of strace, namely just looking at when libraries are loaded successfully

To try it out, run ./tmp/repro.sh from: branch | commit

It fails when (a) torch is imported first, per @sgillen's repro and (b) std::regex is constructed / destructed in a module that is not directly contained by the Python extension module.

For (a), if I check the diffs between strace_first.raw.txt and strace_last.raw.txt, the only difference is ordering; if I sort the lines and compare, they're identical. The implementation of STL is still just /usr/lib/x86_64-linux-gnu/libstdc++.so.6...

For (b), I can reproduce by swapping out cc_shared_library (custom macro) with native.cc_library: commit for static stuff

@jwnimmer-tri Am I overcomplicating this? Even though we're not doing any linking against LibTorch, should we expect GCC ABI differences to produce runtime (not link-time) errors?

@sgillen (Author) commented Sep 24, 2019

It does appear that when I compile just your examples with -D_GLIBCXX_USE_CXX11_ABI=1, I see the error when going through the "torch_first" code path, but when compiling with -D_GLIBCXX_USE_CXX11_ABI=0 there are no problems. That's with both clang 6.0.0 and g++ 7.4.0. That leads me to think this is in fact an ABI incompatibility problem, but I'm as confused as you are (actually, very likely more so) as to why it causes problems at run time.

I'm happy to share everything needed to reproduce that, but it might be better for your own sanity to just try adding that flag to your build yourself (if you haven't tried that already). I have no idea how to use bazel, and found it faster to compile your code by hacking together a makefile and doing some manual copies.

Also, not sure if this helps, but in some more-or-less unrelated work today I noticed that I can import an older TensorFlow (1.11.0, which as far as I can tell was compiled with -D_GLIBCXX_USE_CXX11_ABI=0) and then use pydrake as in the original example with no problems.
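
(For what it's worth, one way to check that is TensorFlow's own sysconfig module -- a sketch; the exact flags reported depend on the wheel:)

import tensorflow as tf
# TF reports the flags its wheel was built with; the 1.11.0 wheel mentioned
# above shows -D_GLIBCXX_USE_CXX11_ABI=0 in this list.
print(tf.sysconfig.get_compile_flags())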

Edit: Just found these issues in pytorch: pytorch/pytorch#19739 and pytorch/pytorch#21018. Possibly this is a problem solely in pytorch, though there may be other libraries out there that cause the same issue.

@EricCousineau-TRI (Contributor) commented Sep 24, 2019

Thanks for pointing out those issues! It's bittersweet to see other projects using pybind11 + C++>=11 encountering the same issue, with the same workaround (importing pytorch after other stuff).

And huh... I tried out both -D_GLIBCXX_USE_CXX11_ABI=1 (implicitly, since it's the default, as well as explicitly) and -D_GLIBCXX_USE_CXX11_ABI=0, and both yield the same failure.
See commits 309fb5f and 434858a (note that the segfault still happens, and isn't really affected by the diff aside from spacing).
That seems to match what @ThibaultGROUEIX encountered:
pytorch/pytorch#19739 (comment)

Just for grins, I also tried out using the same version of pybind11 that pytorch used in dfec863. No dice.

Still scratching my head, 'cause I'm confused as to why static vs. shared presents any issue...
I'm going to try to bisect where in pytorch things go south.

UPDATE: pytorch's use of RTLD_GLOBAL | RTLD_LAZY may explain the oddity that static and shared builds behave differently... Commit: 4935353
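
Roughly, the pattern in question looks like this (a simplified sketch, not torch's verbatim code): set the process-wide dlopen flags to RTLD_GLOBAL | RTLD_LAZY before importing the C extension, which makes that extension's (and its dependencies') symbols visible to everything loaded afterwards.

import os
import sys

old_flags = sys.getdlopenflags()
# RTLD_GLOBAL: symbols from the loaded extension and its dependencies (e.g.
# libtorch) become available for resolving symbols in libraries loaded later.
# RTLD_LAZY: defer symbol resolution until first use.
sys.setdlopenflags(os.RTLD_GLOBAL | os.RTLD_LAZY)
try:
    import _ctypes  # stand-in for the real C extension (torch._C in pytorch's case)
finally:
    sys.setdlopenflags(old_flags)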

@sgillen (Author) commented Sep 24, 2019

I did some more digging. When I examine all currently loaded objects at runtime (via dl_iterate_phdr() in cc_regex.cc), I'm finding that the version of libstdc++ changes when torch is imported first. That makes me suspect that the RTLD_GLOBAL flag they set might be our issue.

I'm not 100% sure how this all works, but my current hypothesis is that our code (drake / regex) is compiled such that it uses std::string (or whatever symbol that becomes) directly. However, when it goes to resolve that symbol, it does so from an old standard library (since torch clobbered our symbols?), getting a pre-C++11-ABI string instead, which causes the segfault. Not sure if that's possible / plausible, but I feel like I'm on the right track.

In about half an hour I can share the code/makefiles I have, to see if you can reproduce the behavior I'm seeing (including the -D_GLIBCXX_USE_CXX11_ABI=0 "fix").
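
In the meantime, here's a rough pure-Python stand-in for that dl_iterate_phdr() check (Linux-only, and only an approximation, since it just reads the process's memory maps):

def mapped_libstdcxx():
    # List every distinct libstdc++ file currently mapped into this process.
    with open("/proc/self/maps") as maps:
        return sorted({line.split()[-1] for line in maps if "libstdc++" in line})

print(mapped_libstdcxx())
import torch  # or import pydrake first, to compare the two orders
print(mapped_libstdcxx())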

@EricCousineau-TRI (Contributor) commented Sep 24, 2019

That makes me suspect that the RTLD_GLOBAL flag they set might be our issue.

Just for grins, I tested that out here; it doesn't help either with this hacked torch/__init__.py or with modifying the original to not use RTLD_GLOBAL.
Repro commit: 9bd22c6
Modified this line in the original: https://github.com/pytorch/pytorch/blob/v1.0.0/torch/__init__.py#L75

But yeah, looking forward to seeing how the symbols are resolving to different versions on your end, as I'm not familiar with that instrumentation - thanks!

UPDATE: Seems like the use of RTLD_GLOBAL was introduced here:
pytorch/pytorch#45 - https://github.com/pytorch/pytorch/pull/45/files#diff-14258fce7c17ccb97b488e64373b0803

It kinda explains the usage of RTLD_LAZY, but not so much the RTLD_GLOBAL... I played with RTLD_GLOBAL in #6760 (comment), but that was a hack to try to work around squirrelly static + shared globals.

@EricCousineau-TRI (Contributor):

... accidental close - sorry!

@sgillen (Author) commented Sep 24, 2019

Looks like that may have been a false alarm. I do actually see libstdc++ change depending on import order when using the conda environment, but not with the virtual environment (see conda_first.txt and conda_second.txt in tmp/ if you're interested). Maybe that's related to whatever issue is making conda break for you guys? (FYI, on Linux I was able to use conda with drake with no problems so far, though that's not the case on my Mac. After seeing the warning you have on the "Using Drake from Python" page, I stopped using conda.)

Either way, using the virtualenv I see the same libraries, but in a different order. It could still be a similar issue: our binary tries to load a symbol and takes the first one it sees, which is the wrong version.
You can try out what I was doing here: https://github.com/sgillen/drake/tree/issue-12073.

Still not sure why we are seeing different behavior wrt -D_GLIBCXX_USE_CXX11_ABI=0, because I'm still seeing that. Do you have a list of flags that bazel is sending to the compiler?

@EricCousineau-TRI (Contributor):

Still not sure why we are seeing different behavior wrt -D_GLIBCXX_USE_CXX11_ABI=0, because I'm still seeing that.

I can try out your branch + Makefile and see if that reproduces it on my machine.

Do you have a list of flags that bazel is sending to the compiler?

If I do bazel build -s <args>, here's what I get (uber verbose):

/usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-std=c++0x' -MD -MF bazel-out/k8-py2-fastbuild/bin/tmp/_objs/libcc_regex.so/cc_regex.pic.d '-frandom-seed=bazel-out/k8-py2-fastbuild/bin/tmp/_objs/libcc_regex.so/cc_regex.pic.o' -fPIC -iquote . -iquote bazel-out/k8-py2-fastbuild/bin -iquote external/bazel_tools -iquote bazel-out/k8-py2-fastbuild/bin/external/bazel_tools '-D_GLIBCXX_USE_CXX11_ABI=0' -fdiagnostics-color -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c tmp/cc_regex.cc -o bazel-out/k8-py2-fastbuild/bin/tmp/_objs/libcc_regex.so/cc_regex.pic.o

/usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-std=c++0x' -MD -MF bazel-out/k8-py2-fastbuild/bin/tmp/_objs/cc_regex.so/cc_regex_py.pic.d '-frandom-seed=bazel-out/k8-py2-fastbuild/bin/tmp/_objs/cc_regex.so/cc_regex_py.pic.o' -fPIC -DEIGEN_MPL2_ONLY -iquote . -iquote bazel-out/k8-py2-fastbuild/bin -iquote external/pybind11 -iquote bazel-out/k8-py2-fastbuild/bin/external/pybind11 -iquote external/eigen -iquote bazel-out/k8-py2-fastbuild/bin/external/eigen -iquote external/numpy -iquote bazel-out/k8-py2-fastbuild/bin/external/numpy -iquote external/python -iquote bazel-out/k8-py2-fastbuild/bin/external/python -iquote external/bazel_tools -iquote bazel-out/k8-py2-fastbuild/bin/external/bazel_tools -isystem external/pybind11/include -isystem bazel-out/k8-py2-fastbuild/bin/external/pybind11/include -isystem external/eigen -isystem bazel-out/k8-py2-fastbuild/bin/external/eigen -isystem external/numpy/include -isystem bazel-out/k8-py2-fastbuild/bin/external/numpy/include -isystem external/python/include/_home_eacousineau_proj_tri_repo_branches_drake_tmp_drake_tmp_venv_include_python3.6m -isystem bazel-out/k8-py2-fastbuild/bin/external/python/include/_home_eacousineau_proj_tri_repo_branches_drake_tmp_drake_tmp_venv_include_python3.6m '-D_GLIBCXX_USE_CXX11_ABI=0' -fdiagnostics-color -Wno-unused-lambda-capture -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c tmp/cc_regex_py.cc -o bazel-out/k8-py2-fastbuild/bin/tmp/_objs/cc_regex.so/cc_regex_py.pic.o

# Dunno where the linking info is...

As a follow-up, seems like someone has already complained about the use of RTLD_GLOBAL a ways back:
pytorch/pytorch#3059
Will post there to see if there are mitigations or something as well.

@EricCousineau-TRI (Contributor):

Some suggestions from @sammy-tri:

  • Try out objdump -T -R <lib> | c++filt to see what symbols may have different versions (which could maybe be affected by RTLD_GLOBAL?) -- see the sketch after this list
  • Use ltrace for tracing library calls
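
Expanding on the first suggestion, here is a small helper along those lines (a sketch; the helper name is mine) that lists which GLIBCXX symbol versions a library references or defines:

import subprocess

def glibcxx_versions(lib_path):
    # objdump -T prints the dynamic symbol table; the version column contains
    # entries like GLIBCXX_3.4.21 for versioned libstdc++ symbols.
    out = subprocess.check_output(["objdump", "-T", lib_path],
                                  universal_newlines=True)
    return sorted({tok.strip("()") for line in out.splitlines()
                   for tok in line.split() if "GLIBCXX" in tok})

# e.g. compare glibcxx_versions() for libcc_regex.so vs. the torch libraries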

Another thing to try is recompiling pytorch from source, attempting to reproduce the same linker flags; if it works, then that may point to symbol version differences?

@sgillen I'll take a closer gander at what you had as well and see if that can give us hints as to what the symbol conflict is (if any). Thanks!

@sammy-tri (Contributor):

I attempted to reproduce using a version of @EricCousineau-TRI's repro branch above, with RTLD_GLOBAL re-enabled, and looked around in ltrace.

Running ltrace -C -l \*cc_regex\* -o tmp/ltrace.out.first.txt python3 bazel-bin/tmp/repro_issue12073 --torch_first produced ltrace.out.first.txt, which contains the following suspicious bit:

libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(16, 32, 0x7f1a18053656, 0x7ffc40dd9928) = 48
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(48, 64, 0x7f1a18053656, 0x7ffc40dd9928) = 112
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(112, 256, 0x7f1a18053656, 0x7ffc40dd9928) = 368
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(368, 512, 0x7f1a18053656, 0x7ffc40dd9928) = 880
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(880, 128, 0x7f1a18053656, 0x7ffc40dd9928) = 1008
libcc_regex.so->std::regex_constants::operator&(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(16, 1008, 1008, 0x7ffc40dd9928) = 16
libcc_regex.so->std::__detail::_Scanner<char>::_Scanner(char const*, char const*, std::regex_constants::syntax_option_type, std::locale)(0x7ffc40dd96b8, 0x7f1a1805363a, 0x7f1a18053656, 16 <unfinished ...>
libtorch.so.1->std::__detail::_Scanner<char>::_M_advance()(0x7ffc40dd96b8, 0x7ffc40dd9560, 0x7f1a19cb10c0, 6 <unfinished ...>
libtorch.so.1->std::__detail::_Scanner<char>::_M_scan_normal()(0x7ffc40dd96b8, 0x7ffc40dd9560, 0x7f1a19cb10c0, 6) = 0x7ffc40dd9780
<... std::__detail::_Scanner<char>::_M_advance() resumed> ) = 0x7ffc40dd9780
<... std::__detail::_Scanner<char>::_Scanner(char const*, char const*, std::regex_constants::syntax_option_type, std::locale) resumed> ) = 0x7ffc40dd9780

Partially using libtorch.so.1's version of some of libstdc++'s header-only library code seems alarming.

Running ltrace -C -l \*cc_regex\* -o tmp/ltrace.out.second.txt python3 bazel-bin/tmp/repro_issue12073 produced ltrace.out.second.txt, which seems more like what I'd expect:

libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(16, 32, 0x7fac704d0656, 0x7fff44211c18) = 48
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(48, 64, 0x7fac704d0656, 0x7fff44211c18) = 112
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(112, 256, 0x7fac704d0656, 0x7fff44211c18) = 368
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(368, 512, 0x7fac704d0656, 0x7fff44211c18) = 880
libcc_regex.so->std::regex_constants::operator|(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(880, 128, 0x7fac704d0656, 0x7fff44211c18) = 1008
libcc_regex.so->std::regex_constants::operator&(std::regex_constants::syntax_option_type, std::regex_constants::syntax_option_type)(16, 1008, 1008, 0x7fff44211c18) = 16
libcc_regex.so->std::__detail::_Scanner<char>::_Scanner(char const*, char const*, std::regex_constants::syntax_option_type, std::locale)(0x7fff442119a8, 0x7fac704d063a, 0x7fac704d0656, 16 <unfinished ...>
libcc_regex.so->std::__detail::_ScannerBase::_ScannerBase(std::regex_constants::syntax_option_type)(0x7fff442119a8, 16, 16, 16 <unfinished ...>
libcc_regex.so->std::__detail::_ScannerBase::_M_is_ecma() const(0x7fff442119a8, 16, 16, 16 <unfinished ...>

Ah, good -- here we're still doing the work inside our own compiled version of the regex code. We probably shouldn't mix compilations of regex_scanner.

@EricCousineau-TRI (Contributor):

\cc @jamiesnape

@EricCousineau-TRI (Contributor) commented Oct 10, 2019

Given that we have a (noisy) workaround and have pinged upstream, I'm lowering the priority.

@jwnimmer-tri (Collaborator):

I'm not sure what more we can do about this in Drake. Could we just close the issue now?

@EricCousineau-TRI (Contributor):

SGTM. Closing for now.

@EricCousineau-TRI (Contributor):

FTR, seems like there will be an upstream fix!
pytorch/pytorch#28536

Given that we have the more direct check on RTLD_GLOBAL (link) thanks to Jeremy's review, the warning should go away when the patch merges.
