Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Render non-BMP CJKV characters locally #4550

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

1ec5
Copy link
Contributor

@1ec5 1ec5 commented Aug 13, 2024

Generalized glyph management and shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane (SIP) and Tertiary Ideographic Plane (TIP) for improved Chinese, Japanese, Korean, and Vietnamese (CJKV) character coverage.

Enabled local glyph rendering in some additional CJKV code blocks in the SIP and TIP, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.

This is a step towards #2307, which tracks support for characters outside the Basic Multilingual Plane more generally, not just for CJKV. This PR depends on #4560.

Demonstration

Before After
那市镇 (Na Hang)
那🌫️市镇
那𧯄市镇 (Na Hang)
那𧯄市镇
丐市镇
丐🌫️市镇
丐𦨭市镇
丐𦨭市镇
拜社
拜🌫️社
拜𩡋社
拜𩡋社

Launch Checklist

  • Confirm your changes do not include backports from Mapbox projects (unless with compliant license) - if you are not sure about this, please ask!
  • Briefly describe the changes in this PR.
  • Link to related issues.
  • Include before/after visuals or gifs if this PR includes visual changes.
  • Write tests for all new functionality.
  • Document any changes to public APIs.
  • Post benchmark scores.
  • Add an entry to CHANGELOG.md under the ## main section.

@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 56.81818% with 76 lines in your changes missing coverage. Please review.

Project coverage is 88.00%. Comparing base (42d6847) to head (acdb77d).
Report is 11 commits behind head on main.

Files Patch % Lines
src/util/script_detection.ts 4.44% 2 Missing and 41 partials ⚠️
src/util/is_char_in_unicode_block.ts 70.00% 0 Missing and 15 partials ⚠️
src/symbol/shaping.ts 75.86% 12 Missing and 2 partials ⚠️
src/data/bucket/symbol_bucket.ts 50.00% 2 Missing ⚠️
src/util/verticalize_punctuation.ts 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4550      +/-   ##
==========================================
+ Coverage   87.74%   88.00%   +0.26%     
==========================================
  Files         247      247              
  Lines       33494    33600     +106     
  Branches     2414     2430      +16     
==========================================
+ Hits        29390    29571     +181     
+ Misses       3118     3026      -92     
- Partials      986     1003      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/render/glyph_manager.ts Outdated Show resolved Hide resolved
@@ -21,16 +21,18 @@
"properties": {
"name_en": "abcde",
"name_ja": "あいうえお",
"name_zh": "阿衣乌唉哦",
"name_ko": "아이우"
"name_zh": "阿衣乌唉哦𪚥",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only render test that cover these changes? Do we need other tests?

}

prevChar = {value: char.value, premature: false};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the permutate flag here.
This seems like a change of logic from the original method, so I would consider adding some unit tests here to make sure we get the right results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prevChar is analogous to the old prevCharCode, but it parallels what the iterators return. This is using premature to flag a not-yet-set state, similar to done in the iterators’ return values. I agree that some more vertical layout tests are needed to exercise these changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unit tests can cover the logic in this method, and a render test or two can help here as well to keep parity with native and make sure this is working as expected.

@HarelM
Copy link
Collaborator

HarelM commented Aug 13, 2024

You can download the html report from the CI and commit the image that you download from the CI so that the CI will pass, assuming the image is correct.

@HarelM
Copy link
Collaborator

HarelM commented Aug 13, 2024

CC: @louwers there are expected changes in parity here maybe, just FYI.

src/render/glyph_manager.ts Outdated Show resolved Hide resolved
@@ -75,7 +75,7 @@ export type TextJustify = 'left' | 'center' | 'right';
const PUAbegin = 0xE000;
const PUAend = 0xF8FF;

class SectionOptions {
export class SectionOptions {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to export those classes? Can we test the content of this file without exporting those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests need to refer to these classes. Would you suggest a helper function that encapsulates the creation of a TaggedString, but only for testing purposes?

const tagged = new TaggedString();
tagged.text = ' \t \v ';
tagged.sections = [new SectionOptions()];

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, as far as I understand (which is not a lot), these classes were used inside this file by other exported classes or methods.
If this is the case, I think it might be better to test the exported classes/methods in the unit test in a flow that reaches these non exported classes.
If this is not possible, I would then prefer to have these classes moved to their own file and have unit test to these new files separately. In general, I like it better when I have a single file for a single class, but that might just be me...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we’ll have to split the classes into separate files. The other exported functions in this file do so much that it would be difficult to isolate the most important behavior that these tests are intended to exercise.

@HarelM
Copy link
Collaborator

HarelM commented Aug 13, 2024

BTW, is this a breaking change in terms of behavior of the map/text/fonts?

@wipfli
Copy link
Contributor

wipfli commented Aug 13, 2024

BTW, is this a breaking change in terms of behavior of the map/text/fonts?

No

Comment on lines +343 to +344
'Supplementary Private Use Area-A': (char) => char >= 0xF0000 && char <= 0xFFFFF,
'Supplementary Private Use Area-B': (char) => char >= 0x100000 && char <= 0x10FFFF
Copy link
Contributor Author

@1ec5 1ec5 Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wipfli, if I understand this article correctly, your solution for Devanagari text shaping uses Private Use Area codepoints but has to steer clear of image placeholders that are also assigned to this block. With the changes in this PR, you could probably use the much more spacious Supplementary Private Use Areas instead. Most CJKV fonts informally use these blocks for characters that haven’t been officially encoded yet, but I wouldn’t expect vector tiles or GeoJSON to take advantage of these characters due to interoperability issues. You’d still be constrained by SymbolBucket.MAX_GLYPHS, but at least you could pick a different plane without worrying about overlapping with unrelated usage before you reach that limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, your pull request would be very useful for positioned glyph fonts!

@1ec5

This comment was marked as resolved.

@HarelM
Copy link
Collaborator

HarelM commented Aug 15, 2024

I've fixed it in the global branch, it's not that the size had tripled but only that you added 2k, which is fine, just increase the expected size so that the test will pass...

src/render/glyph_manager.ts Outdated Show resolved Hide resolved
@wipfli
Copy link
Contributor

wipfli commented Aug 15, 2024

Just to double-check, the MVT spec does support UTF-16?

@1ec5
Copy link
Contributor Author

1ec5 commented Aug 15, 2024

Just to double-check, the MVT spec does support UTF-16?

Depends what you mean by “support”, but what we’re doing here is compatible with the MVT specification. As of MVT v2.1, a feature property value is a string, which the protocol buffer format defines as UTF-8–encoded text. Both UTF-8 and UTF-16 are variable-width encodings. The Chinese character 𪚥 (U+2A6A5) is encoded in UTF-8 as a sequence of four code units: F0 AA 9A A5. JavaScript decodes this sequence as a surrogate pair of two UTF-16 code units: D869 DEA5. Either way, it’s still a single character. Previously, we were incorrectly splitting the character in half, effectively assuming the obsolete UCS-2 encoding. Now we group the code units together by Unicode character, as fonts and users would expect. None of this is incompatible with UTF-8.

Although this is an improvement, it isn’t quite perfect: some characters and emoji require variation selectors and other control characters, so that what the user perceives as a grapheme consists of several UTF-16 characters. This PR doesn’t address that issue, because GL JS still pervasively assumes that one codepoint equals one glyph. If someday we eliminate that assumption, we can write custom code or use a Unicode set regular expression to match those sequences. (Unicode sets are very new, introduced to Firefox about a year ago.)

entry.ranges[range] = true;
return {stack, id, glyph: null};
} else {
throw new Error('glyphs > 65535 not supported');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this error at @wipfli’s request in #4550 (comment), but unless we solve #4563, this will be a breaking change and a difficult one for developers to cope with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky situation indeed.

Copy link
Contributor Author

@1ec5 1ec5 Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4564 provides a fallback to local glyph rendering, so it won’t be a breaking change after all. The behavior is essentially what I floated in #4550 (comment), except that server-side glyphs take precedence over client-side glyphs for non-CJKV codepoints.

if (isChar['CJK Compatibility'](char)) return true;
if (isChar['CJK Compatibility Forms'](char)) return true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still looks like we are repeating the process of uncommenting a line in unicodeBlockLookup and then adding it as another "if" row here.
Is it possible to change this in one place instead of two?

Copy link
Contributor Author

@1ec5 1ec5 Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that we should inline the lookup functions wherever we use them instead of keeping them in a separate lookup table? We might as well eliminate that file in favor of inline regular expressions like this:

if (/* CJK Compatibility Forms */ /[\uFE30-\uFE4F]/.test(String.fromCodePoint(char))) return true;

I think the separate lookup table was created originally out of concern that these character classes would be less readable or more error-prone. I don’t have a strong preference, but then again, I have a higher tolerance for regular expressions than most developers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about regular expression (I don't like them much, they are not readable when things get complicated), but more about using the same hard coded string in two places.
I don't mind keeping the original lookup table, but then why do we pass the same hardcoded string here again? Wht not use something like Object.keys(unicodeBlockLookup)...?

Copy link
Contributor Author

@1ec5 1ec5 Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the functions that use unicodeBlockLookup should ever iterate over all the keys. Each function only selectively accesses the lookup functions relevant to the task at hand. As the table in #4560 (comment) illustrates, it’s a big Venn diagram; every function cares about a slightly different set of Unicode blocks. I tried to unify the functions somewhat, but inherently we need to list specific blocks in each function in order to have the correct behavior.

These strings are the official Unicode block names, not something we invented here, and the ISO standard basically guarantees that they won’t change. If the problem is that these strings add to the bundle size, we could CamelCase them (as in maplibre-native) and use the property syntax so that maybe the property names will get minified at compile time:

if (isChar.CJKCompatibilityForms(char)) return true;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't care about bundle size in this respect.
What I would like to achieve is the need to change in multiple places.
i.e. add all the needed information to the lookup table so that uncommenting a line and setting the right properties would be the only change need to support another script.

Copy link
Contributor Author

@1ec5 1ec5 Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, the only reason we’d ever modify this file is that the Unicode standard itself has changed (to add another script). If this PR works as designed, then we’d have comprehensive support for the latest Unicode standard, and we shouldn’t have to manually add new scripts in general, though that is sort of what’s going on in charInSupportedScript, which I haven’t touched.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the above suggestion is better in terms that in encapsulates all the relevant properties to a single place and later on you won't need to worry about them, hopefully, or you could create another method to have a different "query" on this object.
You can also define the type of every element so that it adheres to the relevant model.
I find it more maintainable, but it might just be me, IDK...

Copy link
Contributor Author

@1ec5 1ec5 Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t abandoned this PR. 🙂 In 1ec5#1, I’m prototyping, among other things, code generation for anything we’re doing based on the Unicode Character Database that isn’t directly supported by ECMAScript regular expressions. It has the potential to reduce maintenance overhead for script detection, though I suspect we’ll still need to roll a little bit of our own code for some remaining typographical functionality that isn’t expressed in the Unicode standard. Once the branch is in better shape (testing, performance), I’ll pick it apart and post formal PRs to this repository.

Copy link
Contributor Author

@1ec5 1ec5 Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bespoke build script I wrote in 1ec5/maplibre-gl-js@be6d7fe is simple and effective for this purpose, but I wonder if you’d prefer to pull in @unicode/unicode-15.1.0 via regenerate as a devDependency instead. It’s a big package to pull in at install time, but unlike my script, it won’t require fetching anything from the network. Either way, it’ll have negligible impact on the build, just a big regular expression that we fortunately wouldn’t have to maintain by hand.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dev dependency is better I think.
Make sure to add the generated file to git ignore and add a .g.ts file extension to the generated file to indicate that it's generated.

@1ec5
Copy link
Contributor Author

1ec5 commented Aug 16, 2024

If this PR lands before maplibre/maplibre-style-spec#778 is fixed, there will be regressions in some styles even for some features that only involve BMP characters.

@HarelM
Copy link
Collaborator

HarelM commented Aug 17, 2024

Style spec has a version of its own, so fixing it there first will be the right approach, after that we can merge this PR only with the upgrade of style-spec to the required version. Bottom line, there's no problem syncing it properly, we just need to make sure it is solved and applied correctly.

1ec5 added 4 commits August 21, 2024 10:47
Generalized glyph management, shaping code to operate on UTF-16 characters instead of code units, no longer splitting surrogate pairs. Enabled local glyph rendering for additional ideographic code blocks in the Supplementary Ideographic Plane and Tertiary Ideographic Plane for improved Chinese, Japanese, Korean, and Vietnamese character coverage.
Enabled local glyph rendering in some additional CJKV code blocks, as well as in Tangut, which is thrash-prone like CJKV. Disabled letter spacing for more cursive scripts.
Removed the error that artificially prevented the glyph manager from requesting glyph PBFs for ranges beyond U+FFFF.
@1ec5 1ec5 force-pushed the astral-cjk branch 2 times, most recently from acdb77d to 40ee424 Compare August 23, 2024 12:20
@HarelM
Copy link
Collaborator

HarelM commented Sep 6, 2024

@1ec5 version 20.3.1 of the style spec was merged to this repo.
Does it require any changes in this repo or this is more of a bug-fix or small improvement?
Is it related to this PR?
I would like to release a new version today-tomorrow but I don't know if I should revert the spec package update or not.
Please let me know.

@1ec5
Copy link
Contributor Author

1ec5 commented Sep 6, 2024

The risk I mentioned before was only if we landed this PR before maplibre/maplibre-style-spec#779, but the other way around should be safe. I don’t think anything will break if you release GL JS with just the upgrade to the style-spec dependency, so there’s no need to revert. This PR needs some additional work on render tests, plus the dependency change in #4550 (comment) that I’ll be staging shortly in a separate PR.

@HarelM
Copy link
Collaborator

HarelM commented Sep 6, 2024

Thanks for the info! I've published a new version - 4.7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants