Crash when non-ascii characters are used in symbols or sonames #173

Closed
rhelmot opened this issue Jan 4, 2018 · 2 comments

rhelmot commented Jan 4, 2018

At the bottom of every read from memory is a conversion from bytes to string with .decode('ascii'). This fails extremely loudly (with a UnicodeDecodeError) whenever there's a byte > 0x7f. Strings like this can occur naturally, e.g. in ELF files found on Android systems. To reproduce, just copy libc.so.6 and replace the libc.so.6 soname text with libc\xffso.6 or whatever.

Two possible solutions:

  • return bytes instead of string
  • replace s.decode('ascii') with ''.join(chr(c) for c in s)
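
To make the failure and the two options concrete, here is a minimal sketch that works on a raw byte string rather than on pyelftools itself. The variable names are made up for illustration, and it assumes Python 3, where iterating over bytes yields integers.

```python
# Soname bytes containing a non-ASCII byte, as in the reproduction above.
raw = b"libc\xffso.6"

# Current behaviour: strict ASCII decoding raises on any byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Option 1: return the bytes untouched and let the caller decide.
as_bytes = raw

# Option 2: map each byte to the code point with the same value.
# On Python 3 this is equivalent to decoding as latin-1.
as_text = "".join(chr(c) for c in raw)
same_as_latin1 = as_text == raw.decode("latin-1")
```

Option 2 never raises, but it only round-trips the byte values; it does not recover the intended characters when a name is really encoded in a multi-byte encoding.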

eliben commented Jan 4, 2018

Using bytes makes sense to me, but there are a couple of gotchas to consider. One is Python 2 vs. 3 compatibility (pyelftools supports both from the same codebase); another is readelf compatibility (how does readelf show these names when it prints them?).

Patches welcome :)


rhelmot commented Feb 22, 2018

Here's a better, non-artificial test case: clang accepts valid UTF-8 files as input and accepts Unicode characters as part of symbol names, encoding those names in the ELF as UTF-8. The attached archive contains the source file, the compiled file, and a pyelftools script that crashes while trying to read the symbols: utf_elf.zip

readelf itself will not crash, but it will be extremely unhappy about the situation. The version on one machine printed <CE> (only half-correct; the full UTF-8 sequence is CE 94, IIRC), another version printed , and another seems to have printed a line feed but no carriage return. That might be due to terminal issues, though.

Probably the best thing to do is to just decode as UTF-8, since it won't break anything that wasn't already broken, and there's no better standard for interpreting a stream of bytes that comes without an encoding...
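
A rough sketch of what UTF-8 decoding would look like on the two test cases from this thread. The helper name decode_name is hypothetical and not part of pyelftools. Using errors="replace" for byte sequences that are not valid UTF-8 is an assumption added here, since strict UTF-8 decoding would still raise on the libc\xffso.6 case.

```python
def decode_name(raw):
    """Hypothetical helper: decode a raw ELF string as UTF-8.

    Assumption: bytes that are not valid UTF-8 are replaced with
    U+FFFD instead of raising, so reading never crashes.
    """
    return raw.decode("utf-8", errors="replace")


# The symbol character from the clang test case: CE 94 is U+0394 in UTF-8.
delta = decode_name(b"\xce\x94")

# Plain ASCII names are unchanged, since ASCII is a subset of UTF-8.
plain = decode_name(b"libc.so.6")

# The artificial soname from the first comment is not valid UTF-8,
# so the offending byte becomes the replacement character.
broken = decode_name(b"libc\xffso.6")
```

The replacement is lossy, so a caller that needs the exact on-disk name would still need access to the raw bytes.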
