Non-ascii data in string tags causes a UnicodeDecodeError on Python 3 #16

jmuhlich · 2019-09-24T18:45:15Z

Although TIFF string tags are only supposed to contain 7-bit ASCII characters, many tools write values in UTF-8 or other encodings that aren't 7-bit clean. OME-TIFF/BioFormats is one such tool, where the XML stored in the ImageDescription tag is explicitly encoded as UTF-8. On Python 3, pytiff.Tiff._read_ascii raises a UnicodeDecodeError upon reading such values. On Python 2, where the treatment of string encoding/decoding isn't as rigorous, the problem is effectively ignored.
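For illustration, the failure can be reproduced without pytiff at all. This is a minimal sketch that assumes the read path amounts to a strict ASCII decode of the raw tag bytes (an assumption about pytiff's internals, not something confirmed here); the sample XML fragment and the `try_ascii` helper are made up for the example.

```python
# Bytes as a tool might store them in a string tag: UTF-8 encoded XML
# containing one non-ASCII character (the micro sign, U+00B5).
# The XML fragment is a made-up sample, not real OME metadata.
xml_text = '<Pixels PhysicalSizeXUnit="\u00b5m"/>'
raw = xml_text.encode("utf-8")


def try_ascii(data):
    """Return the decoded text, or None if data is not 7-bit ASCII.

    Hypothetical helper: it stands in for a strict ASCII decode,
    which is assumed (not confirmed) to be what the reader does.
    """
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


ascii_result = try_ascii(raw)      # strict ASCII decode fails on the micro sign
utf8_result = raw.decode("utf-8")  # decoding as UTF-8 recovers the original text
```

Values that really are 7-bit clean decode identically under both codecs, so switching to UTF-8 would only change behavior for data that currently raises.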

Would you accept a patch to fix this? I think the simplest approach is to always decode strings as UTF-8, perhaps only under Python 3. This mirrors how _set_tag already encodes strings as UTF-8, which it does only under Python 3. I would also be willing to implement a more flexible approach with a user-controlled encoding if you think that's a better option.


pglock · 2019-09-25T08:12:40Z

Hi @jmuhlich

Feel free to open a PR to fix this. How would you implement the flexible approach? _set_tag should already accept a byte string, so users could do the encoding and decoding themselves.

jmuhlich · 2019-12-13T20:27:03Z

A flexible approach might be to allow a user to set an encoding per Tiff that would be used to decode and encode all ascii tags. I don't think the current API offers a good way to support per-tag encoding, however. In discussion with my colleague @thejohnhoffer we decided to modify _read_ascii to return bytes instead of str and leave decoding up to the user. This does change the API for Python 3 callers, though: certain str-str operations that used to be OK will now fail or behave differently when ascii tag values become bytes.
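To make the compatibility concern concrete, here is a small sketch of how plain Python 3 treats the same operations on str and bytes. It does not use pytiff; the tag value and the `raises_type_error` helper are made up for the example.

```python
value_str = "OME-TIFF"     # what a caller received while tag values were str
value_bytes = b"OME-TIFF"  # what the same caller would receive as bytes


def raises_type_error(func):
    """Return True if calling func() raises TypeError (hypothetical helper)."""
    try:
        func()
    except TypeError:
        return True
    return False


# Comparison silently changes meaning: str and bytes never compare equal.
still_equal = (value_bytes == value_str)

# Mixed concatenation and substring tests fail outright.
concat_fails = raises_type_error(lambda: value_bytes + " image")
contains_fails = raises_type_error(lambda: "OME" in value_bytes)

# The same operations keep working once the caller decodes explicitly.
decoded = value_bytes.decode("utf-8")
concat_ok = decoded + " image"
```

The silent case is the more dangerous one: an equality check against a str literal keeps running but always evaluates to False.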

pglock · 2020-01-09T08:35:02Z

How about setting an attribute, e.g. encode_strings_as_unicode, to be backwards compatible?

jmuhlich · 2020-01-28T20:48:06Z

I think @thejohnhoffer and I have come up with a reasonable backwards-compatible solution in #17. The only API change is a new optional encoding arg in the constructor. If it's not specified, everything still works in terms of bytes as it did before. If it is specified, string tag values become unicode in read_tags and unicode is properly handled when setting tags. Setting tags with bytes is accepted in either case -- those values go through untouched.
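The behavior described above can be sketched roughly as follows. This is not the code from #17: `TagCodec` and its method names are hypothetical, and raising TypeError for a str value when no encoding is set is an assumption, since that case isn't spelled out here.

```python
class TagCodec:
    """Hypothetical stand-in for the optional-encoding behavior described above."""

    def __init__(self, encoding=None):
        # None means "work in bytes", i.e. no decoding or encoding at all.
        self.encoding = encoding

    def decode(self, raw):
        """Convert raw tag bytes into the value handed back to the caller."""
        if self.encoding is None:
            return raw
        return raw.decode(self.encoding)

    def encode(self, value):
        """Convert a caller-supplied tag value into bytes for writing."""
        if isinstance(value, bytes):
            return value  # bytes go through untouched in either mode
        if self.encoding is None:
            # Assumption: without an encoding, str values are rejected.
            raise TypeError("str tag value given but no encoding was set")
        return value.encode(self.encoding)


raw_tag = "5 \u00b5m".encode("utf-8")

bytes_mode = TagCodec()
text_mode = TagCodec(encoding="utf-8")

as_bytes = bytes_mode.decode(raw_tag)  # unchanged bytes
as_text = text_mode.decode(raw_tag)    # decoded str
round_trip = text_mode.encode(as_text)
```

Keeping bytes as the default means existing callers see no change unless they opt in by passing an encoding.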

thejohnhoffer mentioned this issue Dec 13, 2019

Encode and decode tags with encoding parameter #17

Merged
