Fix unicode issues on Windows. #156

jpd236 · 2021-07-04T23:28:48Z

Windows uses 2-byte wchars and UTF-16, so emojis and other characters
which require more than 2 bytes were being encoded/decoded incorrectly
by trying to fit them into one wchar. Migrate to pugixml's UTF8<->wchar
conversion methods which work on all platforms.

Fixes #137

Windows uses 2-byte wchars and UTF-16, so emojis and other characters which require more than 2 bytes were being encoded/decoded incorrectly by trying to fit them into one wchar. Migrate to pugixml's UTF8<->wchar conversion methods which work on all platforms.

mrichards42 · 2021-07-08T01:04:36Z

Looks great! Two things:

Looks like utf8_to_unicode and unicode_to_utf8 were only used in this file, and are now unused
utf8_to_unicode throws InvalidEncoding when it encounters an invalid byte sequence, and I'm not sure what pugixml's decoder does (I assume it ignores it?), which is probably fine, but if there are any obvious potential issues that you can think of, just wanted to flag that. unicode_to_utf8 also throws InvalidEncoding, but only for values outside the unicode range, so that one is probably safe to drop.

jpd236 · 2021-07-08T02:54:52Z

Looks great! Two things:

Looks like utf8_to_unicode and unicode_to_utf8 were only used in this file, and are now unused

Maybe GitHub search was being deceiving, but these are used in a few other places (which I think are safe / not easy to replace):

xword/puz/puzstring.cpp

Line 219 in bc7f258

if (!can_encode_puz(utf8_to_unicode(it, end)))

xword/puz/puzstring.cpp

Line 247 in bc7f258

ret.push_back(unicode_to_puz(utf8_to_unicode(it, end)));

xword/puz/puzstring.cpp

Line 266 in bc7f258

unicode_to_utf8(puz_to_unicode(static_cast<unsigned char>(*it)), ret);

utf8_to_unicode throws InvalidEncoding when it encounters an invalid byte sequence, and I'm not sure what pugixml's decoder does (I assume it ignores it?), which is probably fine, but if there are any obvious potential issues that you can think of, just wanted to flag that. unicode_to_utf8 also throws InvalidEncoding, but only for values outside the unicode range, so that one is probably safe to drop.

https://pugixml.org/docs/manual.html says, "Invalid UTF sequences are silently discarded upon conversion", so your assumption seems correct. That feels better to me in any case than preventing the file from being opened at all, if that's the only problem.

mrichards42 · 2021-08-14T16:48:28Z

Maybe GitHub search was being deceiving, but these are used in a few other places (which I think are safe / not easy to replace) [...]

Right, ok can_encode_puz is new and I was looking at a different branch :) Unrelated to this PR, but it looks like that first if (cp > 255) return false condition is incorrect since all of the characters in that windowsTable are > 255.

Theoretically encode_puz works incorrectly with two UTF-16 surrogate pairs since it's going wchar by wchar, but practically speaking either surrogate is out of cp1252 so it'll throw InvalidEncoding. And I suppose decode_puz should be safe as well since I don't believe cp1252 or latin-1 have any characters that would be 4 bytes in UTF-16.

https://pugixml.org/docs/manual.html says, "Invalid UTF sequences are silently discarded upon conversion", so your assumption seems correct. That feels better to me in any case than preventing the file from being opened at all, if that's the only problem.

Works for me!

As pointed out in mrichards42#156, this returns false incorrectly for the Unicode characters which are in the 128-160 range of the Cp1252 charset. This means that XWord will unnecessarily use the less-supported 2.0 version when saving .puz files if one of these characters are used.

mrichards42 merged commit a042c9a into mrichards42:master Aug 14, 2021

jpd236 mentioned this pull request Aug 16, 2021

Fix bug in can_encode_puz. #163

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unicode issues on Windows. #156

Fix unicode issues on Windows. #156

jpd236 commented Jul 4, 2021

mrichards42 commented Jul 8, 2021

jpd236 commented Jul 8, 2021

mrichards42 commented Aug 14, 2021

Fix unicode issues on Windows. #156

Fix unicode issues on Windows. #156

Conversation

jpd236 commented Jul 4, 2021

mrichards42 commented Jul 8, 2021

jpd236 commented Jul 8, 2021

mrichards42 commented Aug 14, 2021