Non-ascii data in string tags causes a UnicodeDecodeError on Python 3 #16

Open
jmuhlich opened this issue Sep 24, 2019 · 4 comments

@jmuhlich
Contributor

Although TIFF string tags are only supposed to contain 7-bit ASCII characters, many tools write values in UTF-8 or other encodings that aren't 7-bit clean. OME-TIFF/BioFormats is one such tool, where the XML stored in the ImageDescription tag is explicitly encoded as UTF-8. On Python 3, pytiff.Tiff._read_ascii raises a UnicodeDecodeError upon reading such values. On Python 2, where the treatment of string encoding/decoding isn't as rigorous, the problem is effectively ignored.

Would you accept a patch to fix this? I think the simplest approach is to always decode strings as UTF-8, perhaps only under Python 3. This actually mirrors the way _set_tag already performs UTF-8 string encoding, only under Python 3. I would also be willing to implement a more flexible approach with a user-controlled encoding if you think that's a better option.
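For illustration, the failure and the proposed fix can be reproduced with plain Python 3 and no TIFF file at all. This is only a sketch: the two helper functions are hypothetical stand-ins, and it assumes the failing read path amounts to a strict ASCII decode of the raw tag bytes. The sample XML fragment is made up.

```python
# Bytes as a UTF-8 writer might store them in ImageDescription.
# The XML fragment is invented for this example.
raw = '<Channel Name="µm scale"/>'.encode("utf-8")


def decode_strict_ascii(value: bytes) -> str:
    """Hypothetical stand-in for a reader that insists on 7-bit ASCII."""
    return value.decode("ascii")


def decode_utf8(value: bytes) -> str:
    """Hypothetical stand-in for the proposed 'always decode as UTF-8' fix."""
    return value.decode("utf-8")


try:
    decode_strict_ascii(raw)
    ascii_failed = False
except UnicodeDecodeError:
    # The two bytes that encode 'µ' are both outside the 7-bit range.
    ascii_failed = True

text = decode_utf8(raw)
```

Because ASCII is a subset of UTF-8, values that really are 7-bit clean decode to the same string either way, so the UTF-8 variant only changes the outcome for the inputs that currently fail.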

@pglock
Contributor

pglock commented Sep 25, 2019

Hi @jmuhlich

Feel free to open a PR to fix this. How would you implement the flexible approach? _set_tag should already accept a byte string, so a user could do the encoding and decoding themselves.

@jmuhlich
Contributor Author

A flexible approach might be to let a user set an encoding per Tiff that would be used to decode and encode all ascii tags. I don't think the current API provides a good way to support per-tag encoding, however. In discussion with my colleague @thejohnhoffer, we decided to modify _read_ascii to return bytes instead of str and leave decoding up to the user. This does change the API for Python 3 callers, though. Certain str-str operations that used to be OK will now fail or behave differently when ascii tag values become bytes.
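To make the compatibility concern concrete, here is a small sketch of how ordinary Python 3 string operations behave once a value is bytes rather than str. The tag value is a made-up placeholder and nothing here calls pytiff.

```python
old_value = "OME-XML"   # str, as a Python 3 caller would have received before
new_value = b"OME-XML"  # bytes, as the caller would receive after the change

# Comparison does not raise, it is just silently False.
still_equal = (new_value == old_value)

# Concatenating bytes with str raises TypeError.
try:
    new_value + " suffix"
    concat_failed = False
except TypeError:
    concat_failed = True

# So does passing a str prefix to a bytes method.
try:
    new_value.startswith("OME")
    startswith_failed = False
except TypeError:
    startswith_failed = True

# Formatting embeds the repr, b-prefix and quotes included.
formatted = "{}".format(new_value)

# Callers would need an explicit decode to get the old behaviour back.
restored = new_value.decode("utf-8")
```

The silent cases (equality and formatting) are probably the more dangerous ones, since they change results without raising anything.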

@pglock
Contributor

pglock commented Jan 9, 2020

How about setting an attribute, e.g. encode_strings_as_unicode, to stay backwards compatible?

@jmuhlich
Contributor Author

I think @thejohnhoffer and I have come up with a reasonable backwards-compatible solution in #17 . The only API change is a new optional encoding arg in the constructor. If it's not specified, everything still works in terms of bytes as it did before. If it is specified, string tag values become unicode in read_tags and unicode is properly handled when setting tags. Setting tags with bytes is accepted in either case -- those values go through untouched.
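The behaviour described above could be modelled roughly as follows. This is a sketch, not the code from #17: the `TagCodec` class and its `on_read` and `on_write` methods are hypothetical names invented for illustration, and rejecting str values when no encoding is set is an assumption, since the thread does not say what happens in that case.

```python
from typing import Optional, Union


class TagCodec:
    """Hypothetical model of an optional per-file string encoding."""

    def __init__(self, encoding: Optional[str] = None):
        # None means "work purely in bytes", i.e. the old behaviour.
        self.encoding = encoding

    def on_read(self, raw: bytes) -> Union[bytes, str]:
        """Convert a raw string-tag value for the caller."""
        if self.encoding is None:
            return raw
        return raw.decode(self.encoding)

    def on_write(self, value: Union[bytes, str]) -> bytes:
        """Convert a caller-supplied value to the bytes to be stored."""
        if isinstance(value, bytes):
            # Bytes go through untouched whether or not an encoding is set.
            return value
        if self.encoding is None:
            # Assumption: the thread does not specify this case.
            raise TypeError("str tag values need an encoding; pass bytes instead")
        return value.encode(self.encoding)


stored = "µm".encode("utf-8")

with_encoding = TagCodec("utf-8")
as_text = with_encoding.on_read(stored)       # decoded to str
round_trip = with_encoding.on_write(as_text)  # encoded back to the same bytes

without_encoding = TagCodec()
as_bytes = without_encoding.on_read(stored)   # returned unchanged
```

Keeping the default at `None` is what makes the change opt-in: existing callers that never pass an encoding keep getting bytes.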
