Code does not correctly handle UTF-16 surrogate pair codepoints #15
Comments
Nice catch. Mind opening a PR? |
Unfortunately, it appears that harfbuzz doesn't support codepoints > 0xffff, or there's some additional API call needed to make it work. The change above alone doesn't include these glyphs. I'm continuing to investigate. |
@ebraminio, thoughts? 😬 |
I've added two branches to my fork:
|
Ok, so playing around, it seems that passing only the lowest 16 bits (eg., |
Further integration testing shows that passing the lowest 16 bits does include the glyph, but it now has the wrong codepoint (eg., |
Interesting findings above! Unfortunately I'm out of sync with upstream myself. Hopefully this doesn't turn out to be an upstream issue, since the harfbuzzjs builds can't currently be updated due to upstream changes (they now need libc++ headers), so you may need an emscripten port :/ One exists at https://github.com/emscripten-core/emscripten/blob/main/tools/ports/harfbuzz.py, but it doesn't include the subset part, and a slightly different interface would be needed after that. Sorry that I probably can't be more helpful here. |
@MLoughry, hmm, I'm not convinced that there's actually anything wrong. The current code seems to do the right thing:
> for (const ch of "\u{1074d}") {console.log(ch.codePointAt(0).toString(16));}
1074d
> for (const ch of "\ud801\udf4d") {console.log(ch.codePointAt(0).toString(16));}
1074d
I think the font you're testing with just doesn't include that code point in the first place? I've played around with the
> require('fontkit').openSync('FluentSystemIcons-Filled.ttf').characterSet.slice(-10)
[
65526, 65527, 65528,
65529, 65530, 65531,
65532, 65533, 65534,
65535
]
I added a test here that shows that passing a string that uses a surrogate pair representation results in the correct character being included in the subset (when it exists in the original font): a0cca1e |
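For anyone following along, here is a minimal sketch of why the `for...of` loop behaves correctly: string iteration in JavaScript walks code points, while index-based access walks UTF-16 code units. The helper names `codePointsByIteration` and `codeUnitsByIndex` are made up for illustration and are not part of subset-font.

```javascript
// Hypothetical helpers for illustration only; not part of subset-font.

// String iteration (for...of) yields one string per code point,
// so a surrogate pair arrives as a single two-unit string.
function codePointsByIteration(text) {
  const result = [];
  for (const ch of text) {
    result.push(ch.codePointAt(0));
  }
  return result;
}

// Index-based access reads individual UTF-16 code units,
// which splits a surrogate pair into its two halves.
function codeUnitsByIndex(text) {
  const result = [];
  for (let i = 0; i < text.length; i++) {
    result.push(text.charCodeAt(i));
  }
  return result;
}

const sample = "\ud801\udf4d"; // surrogate pair for U+1074D
console.log(codePointsByIteration(sample).map((n) => n.toString(16))); // [ '1074d' ]
console.log(codeUnitsByIndex(sample).map((n) => n.toString(16))); // [ 'd801', 'df4d' ]
```

Only the index-based variant would see the pair as two distinct values, which is the behavior the original report describes.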
You're right. I dug into the hex for the base font, and it looks like the icons that the codepoint JSON claims were > 0xFFFF were actually truncated to 16 bits. Digging a bit more, it seems the tool used to generate those icons and codepoint files has a bug where it uses The original bug reported at the top is still applicable, though. |
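As a rough illustration of the 16-bit truncation described above (not the icon tool's actual code, which isn't shown in this thread), masking a supplementary-plane code point to its low 16 bits gives a different code point in the Basic Multilingual Plane. `truncateTo16Bits` is a hypothetical helper, and U+1074D is reused from the earlier example.

```javascript
// Hypothetical helper for illustration: keep only the low 16 bits.
function truncateTo16Bits(codePoint) {
  return codePoint & 0xffff;
}

const original = 0x1074d; // a code point above 0xFFFF
const truncated = truncateTo16Bits(original);

console.log(truncated.toString(16)); // 74d

// The truncated value maps to a different character entirely.
console.log(String.fromCodePoint(truncated) === String.fromCodePoint(original)); // false
```

A font built this way would contain the glyph at U+074D rather than U+1074D, so a subset request for U+1074D would find nothing to include.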
Could you spell out for me exactly what that bug is? |
Hmmm. I could have sworn that the Sorry for the confusion. I closed the PR and will close this issue. |
No worries! Thanks for taking the time to engage :) |
If you pass a UTF-16 surrogate pair in the text, the code incorrectly interprets the text as two distinct codepoints.
Example:
The problem is the for...of loop here: subset-font/index.js, line 72 in fb8e2f0
The fix is fairly simple, if harfbuzz supports such codepoints (I'm verifying with the fix locally)