Code does not correctly handle UTF-16 surrogate pair codepoints #15

MLoughry · 2022-04-14T21:31:31Z

If you pass a UTF-16 surrogate pair in the text, the code incorrectly interprets the text as two distinct codepoints.

Example:

subsetFont(mySfntFontBuffer, '\u{10077}', {
  targetFormat: 'woff2',
});

The problem is the for...of loop here:

subset-font/index.js

Line 72 in fb8e2f0

for (const c of text) {

The fix is fairly simple, if harfbuzz supports such codepoints (I'm verifying with the fix locally)

  for (let i = 0; i < text.length; i++) {
    let codepoint = text.codePointAt(i);
    exports.hb_set_add(inputUnicodes, codepoint);
    if (codepoint > 0xffff) {
      // We're dealing with a UTF-16 surrogate pair
      i++;
    }
  }

papandreou · 2022-04-15T09:38:13Z

Nice catch. Mind opening a PR?

MLoughry · 2022-04-15T15:52:54Z

Unfortunately, it appears that harfbuzz doesn't support codepoints > 0xffff, or there's some additional API call needed to make it work. The change above alone doesn't include these glyphs. I'm continuing to investigate.

papandreou · 2022-04-15T16:38:58Z

@ebraminio, thoughts? 😬

MLoughry · 2022-04-15T17:06:45Z

I've added two branches to my fork:

https://github.com/MLoughry/subset-font/tree/handle-surrogate-pairs, which has just the fix proposed above
https://github.com/MLoughry/subset-font/tree/test-handle-surrogate-pairs, which has a test script demonstrating the problem and a README breaking down the resulting subset font.

MLoughry · 2022-04-15T17:50:36Z

OK, so after playing around, it seems that passing only the lowest 16 bits (e.g., 0x1074d -> 0x074d) works, so I updated the PR with that change. However, I don't know enough about fonts to know whether that's the right fix or a complete hack.

MLoughry · 2022-04-15T19:54:54Z

Further integration testing shows that passing the lowest 16 bits does include the glyph, but it now has the wrong codepoint (e.g., 0x074d rather than 0x1074d), so trying to render '\u{1074d}' using the font fails.

ebraminio · 2022-04-15T21:47:41Z

> @ebraminio, thoughts? 😬

Interesting findings above! Unfortunately, I'm out of sync with upstream myself. Hopefully this won't turn out to be an upstream issue, because harfbuzzjs builds can't currently be updated due to upstream changes (they now need libc++ headers), so you may need an emscripten port :/ One exists at https://github.com/emscripten-core/emscripten/blob/main/tools/ports/harfbuzz.py, but it doesn't include the subset part, and a slightly different interface would be needed after that. Sorry that I probably can't be more helpful here.

papandreou · 2022-04-16T09:22:43Z

@MLoughry, hmm, I'm not convinced that there's actually anything wrong. The current code seems to do the right thing:

> for (const ch of "\u{1074d}") {console.log(ch.codePointAt(0).toString(16));}
1074d
> for (const ch of "\ud801\udf4d") {console.log(ch.codePointAt(0).toString(16));}
1074d
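To make the difference concrete, here is a minimal sketch (not code from subset-font; both function names are hypothetical) contrasting for...of iteration, which walks the string by code point, with index-based access via charCodeAt, which walks it by UTF-16 code unit:

```javascript
// Hypothetical helper: collect code points using for...of.
// String iteration is code point aware, so a surrogate pair yields one item.
function codePointsByIteration(text) {
  const result = [];
  for (const ch of text) {
    result.push(ch.codePointAt(0));
  }
  return result;
}

// Hypothetical helper: collect UTF-16 code units by index.
// This splits a surrogate pair into its two halves.
function codeUnitsByIndex(text) {
  const result = [];
  for (let i = 0; i < text.length; i++) {
    result.push(text.charCodeAt(i));
  }
  return result;
}

const sample = 'a\u{1074d}';
console.log(codePointsByIteration(sample).map((n) => n.toString(16)));
// [ '61', '1074d' ]
console.log(codeUnitsByIndex(sample).map((n) => n.toString(16)));
// [ '61', 'd801', 'df4d' ]
```

Only the index-based variant sees the pair as two values, which is why the existing for...of loop already passes the full code point on.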

I think the font you're testing with just doesn't include that code point in the first place? I've played around with the FluentSystemIcons-Filled.ttf from your branch on a webpage, with ttx and fontkit. It doesn't seem to contain any characters > 65535:

> require('fontkit').openSync('FluentSystemIcons-Filled.ttf').characterSet.slice(-10)
[
  65526, 65527, 65528,
  65529, 65530, 65531,
  65532, 65533, 65534,
  65535
]

I added a test here that shows that passing a string that uses a surrogate pair representation results in the correct character being included in the subset (when it exists in the original font): a0cca1e

MLoughry · 2022-04-18T18:01:11Z

You're right. I dug into the hex for the base font, and it looks like the icons that the codepoint JSON claims were > 0xFFFF were actually truncated to 16 bits. Digging a bit more, it seems the tool used to generate those icons and codepoint files has a bug where it uses String.fromCharCode(), rather than String.fromCodePoint().
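As an illustration of that kind of truncation (a sketch only, not the icon tool's actual code; the helper name is hypothetical), String.fromCharCode() keeps just the low 16 bits of its argument, while String.fromCodePoint() produces the full surrogate pair:

```javascript
// Hypothetical helper: return the first code point of a string as hex.
function firstCodePointHex(str) {
  return str.codePointAt(0).toString(16);
}

const codepoint = 0x1074d;

// fromCharCode converts its argument to an unsigned 16-bit value,
// so 0x1074d silently becomes 0x074d.
const truncated = String.fromCharCode(codepoint);

// fromCodePoint encodes the value as a surrogate pair when it is > 0xffff.
const full = String.fromCodePoint(codepoint);

console.log(firstCodePointHex(truncated), truncated.length); // 74d 1
console.log(firstCodePointHex(full), full.length); // 1074d 2
```

This matches the 0x1074d -> 0x074d mismatch seen earlier in the thread.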

The original bug reported at the top is still applicable, though.

papandreou · 2022-04-18T18:10:30Z

> The original bug reported at the top is still applicable, though.

Could you spell out for me exactly what that bug is?

MLoughry · 2022-04-18T18:15:18Z

Hmmm. I could have sworn that the for (const c of text) loop was not properly iterating over UTF-16 surrogate pairs, but testing it locally shows that I was wrong.

Sorry for the confusion. I closed the PR and will close this issue.

papandreou · 2022-04-18T18:16:03Z

No worries! Thanks for taking the time to engage :)

MLoughry mentioned this issue Apr 15, 2022

Handle strings with UTF-16 surrogate pairs #16

Closed

papandreou added a commit that referenced this issue Apr 16, 2022

Add test for #15

a0cca1e

MLoughry closed this as completed Apr 18, 2022
