Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unicode issues on Windows. #156

Merged
merged 1 commit into from
Aug 14, 2021
Merged

Conversation

jpd236
Copy link
Contributor

@jpd236 jpd236 commented Jul 4, 2021

Windows uses 2-byte wchars and UTF-16, so emojis and other characters
which require more than 2 bytes were being encoded/decoded incorrectly
by trying to fit them into one wchar. Migrate to pugixml's UTF8<->wchar
conversion methods which work on all platforms.

Fixes #137

Windows uses 2-byte wchars and UTF-16, so emojis and other characters
which require more than 2 bytes were being encoded/decoded incorrectly
by trying to fit them into one wchar. Migrate to pugixml's UTF8<->wchar
conversion methods which work on all platforms.
@mrichards42
Copy link
Owner

Looks great! Two things:

  1. Looks like utf8_to_unicode and unicode_to_utf8 were only used in this file, and are now unused
  2. utf8_to_unicode throws InvalidEncoding when it encounters an invalid byte sequence, and I'm not sure what pugixml's decoder does (I assume it ignores it?), which is probably fine, but if there are any obvious potential issues that you can think of, just wanted to flag that. unicode_to_utf8 also throws InvalidEncoding, but only for values outside the unicode range, so that one is probably safe to drop.

@jpd236
Copy link
Contributor Author

jpd236 commented Jul 8, 2021

Looks great! Two things:

  1. Looks like utf8_to_unicode and unicode_to_utf8 were only used in this file, and are now unused

Maybe GitHub search was being deceiving, but these are used in a few other places (which I think are safe / not easy to replace):

if (!can_encode_puz(utf8_to_unicode(it, end)))

ret.push_back(unicode_to_puz(utf8_to_unicode(it, end)));

unicode_to_utf8(puz_to_unicode(static_cast<unsigned char>(*it)), ret);

  1. utf8_to_unicode throws InvalidEncoding when it encounters an invalid byte sequence, and I'm not sure what pugixml's decoder does (I assume it ignores it?), which is probably fine, but if there are any obvious potential issues that you can think of, just wanted to flag that. unicode_to_utf8 also throws InvalidEncoding, but only for values outside the unicode range, so that one is probably safe to drop.

https://pugixml.org/docs/manual.html says, "Invalid UTF sequences are silently discarded upon conversion", so your assumption seems correct. That feels better to me in any case than preventing the file from being opened at all, if that's the only problem.

@mrichards42
Copy link
Owner

Maybe GitHub search was being deceiving, but these are used in a few other places (which I think are safe / not easy to replace) [...]

Right, ok can_encode_puz is new and I was looking at a different branch :) Unrelated to this PR, but it looks like that first if (cp > 255) return false condition is incorrect since all of the characters in that windowsTable are > 255.

Theoretically encode_puz works incorrectly with two UTF-16 surrogate pairs since it's going wchar by wchar, but practically speaking either surrogate is out of cp1252 so it'll throw InvalidEncoding. And I suppose decode_puz should be safe as well since I don't believe cp1252 or latin-1 have any characters that would be 4 bytes in UTF-16.

https://pugixml.org/docs/manual.html says, "Invalid UTF sequences are silently discarded upon conversion", so your assumption seems correct. That feels better to me in any case than preventing the file from being opened at all, if that's the only problem.

Works for me!

@mrichards42 mrichards42 merged commit a042c9a into mrichards42:master Aug 14, 2021
jpd236 added a commit to jpd236/xword that referenced this pull request Aug 16, 2021
As pointed out in mrichards42#156, this returns false incorrectly for the Unicode
characters which are in the 128-160 range of the Cp1252 charset. This
means that XWord will unnecessarily use the less-supported 2.0 version
when saving .puz files if one of these characters are used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emoji clues are rendered incorrectly on Windows
2 participants