Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code does not correctly handle UTF-16 surrogate pair codepoints #15

Closed
MLoughry opened this issue Apr 14, 2022 · 12 comments
Closed

Code does not correctly handle UTF-16 surrogate pair codepoints #15

MLoughry opened this issue Apr 14, 2022 · 12 comments

Comments

@MLoughry
Copy link

MLoughry commented Apr 14, 2022

If you pass a UTF-16 surrogate pair in the text, the code incorrectly interprets the text as two distinct codepoints.

Example:

subsetFont(mySfntFontBuffer, '\u{10077}', {
  targetFormat: 'woff2',
});

The problem is the for...of loop here:

for (const c of text) {

The fix is fairly simple, if harfbuzz supports such codepoints (I'm verifying with the fix locally)

  for (let i = 0; i < text.length; i++) {
    let codepoint = text.codePointAt(i);
    exports.hb_set_add(inputUnicodes, codepoint);
    if (codepoint > 0xffff) {
      // We're dealing with a UTF-16 surrogate pair
      i++;
    }
  }
@papandreou
Copy link
Owner

Nice catch. Mind opening a PR?

@MLoughry
Copy link
Author

Unfortunately, it appears that harfbuzz doesn't support codepoints > 0xfff, or there's some additional API call needed to make it work. The change above alone doesn't include these glyphs. I'm continuing to investigate.

@papandreou
Copy link
Owner

@ebraminio, thoughts? 😬

@MLoughry
Copy link
Author

I've added two branches to my fork:

@MLoughry
Copy link
Author

Ok, so playing around, it seems that passing only the lowest 16 bits (eg., 0x1074d -> 0x074d) seems to work, so I updated the PR with that change. However, I don't know enough about fonts to know whether that's the right fix or a complete hack.

@MLoughry
Copy link
Author

Further integration testing shows that passing the lowest 16 bits does include the glyph, but it now has the wrong codepoint (eg., 0x074d rather than 0x1074d), so trying to render '\u{1074d}' using the font fails.

@ebraminio
Copy link
Contributor

ebraminio commented Apr 15, 2022

@ebraminio, thoughts? 😬

Interesting findings I see above! Unfortunately I'm myself out of sync with the upstream though, hopefully it won't reach to an issue with upstream as harfbuzzjs builds are themselves aren't able to be updated due to changes (need of libc++ headers) and thus you may now need an emscripten port :/ which exists https://github.com/emscripten-core/emscripten/blob/main/tools/ports/harfbuzz.py but doesn't include subset part and after that an slightly different interface would be needed. Sorry that probably I can't be more helpful here.

papandreou added a commit that referenced this issue Apr 16, 2022
@papandreou
Copy link
Owner

papandreou commented Apr 16, 2022

@MLoughry, hmm, I'm not convinced that there's actually anything wrong. The current code seems to do the right thing:

> for (const ch of "\u{1074d}") {console.log(ch.codePointAt(0).toString(16));}
1074d
> for (const ch of "\ud801\udf4d") {console.log(ch.codePointAt(0).toString(16));}
1074d

I think the font you're testing with just doesn't include that code point in the first place? I've played around with the FluentSystemIcons-Filled.ttf from your branch on a webpage, with ttx and fontkit. It doesn't seem to contain any characters > 65535:

> require('fontkit').openSync('FluentSystemIcons-Filled.ttf').characterSet.slice(-10)
[
  65526, 65527, 65528,
  65529, 65530, 65531,
  65532, 65533, 65534,
  65535
]

I added a test here that shows that passing a string that uses a surrogate pair representation results in the correct character being included in the subset (when it exists in the original font): a0cca1e

@MLoughry
Copy link
Author

You're right. I dug into the hex for the base font, and it looks like the icons that the codepoint JSON claims were > 0xFFFF we actually truncated to 16 bits. Digging a bit more, it seems the tool used to generate those icons and codepoint files has a bug where it uses String.fromCharCode(), rather than String.fromCodePoint().

The original bug reported at the top is still applicable, though.

@papandreou
Copy link
Owner

The original bug reported at the top is still applicable, though.

Could you spell out for me exactly what that bug is?

@MLoughry
Copy link
Author

Hmmm. I could have sworn that the for (const c of text) was not properly iterating over UTF-16 surrogate pairs; but testing it locally shows that I was wrong.

Sorry for the confusion. I closed the PR and will close this issue.

@papandreou
Copy link
Owner

No worries! Thanks for taking the time to engage :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants