Count surrogate pair as single character #779

1ec5 · 2024-08-17T19:31:24Z

The string expression operators index-of, length, and slice now count UTF-16 surrogate pairs as single characters instead of splitting them up into individual surrogates. This PR also adds unit tests for these expression operators.
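
For readers unfamiliar with the distinction, here is a minimal sketch of what counting surrogate pairs as single characters means. The helper names (`toCodePoints`, `codePointLength`, `codePointSlice`, `codePointIndexOf`) are hypothetical and chosen for illustration; this is not necessarily how the operators are implemented in this PR.

```typescript
// Hypothetical helper: split a string into code points, keeping each
// UTF-16 surrogate pair together as one element.
function toCodePoints(s: string): string[] {
    const out: string[] = [];
    for (let i = 0; i < s.length; i++) {
        const code = s.charCodeAt(i);
        const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
        const isHigh = code >= 0xd800 && code <= 0xdbff;
        const isLowNext = next >= 0xdc00 && next <= 0xdfff;
        if (isHigh && isLowNext) {
            out.push(s.substring(i, i + 2)); // surrogate pair: one character
            i++; // skip the low surrogate
        } else {
            out.push(s.charAt(i)); // BMP character or lone surrogate
        }
    }
    return out;
}

// Length in code points rather than UTF-16 code units.
function codePointLength(s: string): number {
    return toCodePoints(s).length;
}

// Slice by code point positions.
function codePointSlice(s: string, start: number, end?: number): string {
    return toCodePoints(s).slice(start, end).join('');
}

// Index of a substring, measured in code points; -1 if not found.
function codePointIndexOf(haystack: string, needle: string, fromIndex = 0): number {
    const chars = toCodePoints(haystack);
    const target = toCodePoints(needle);
    for (let i = Math.max(fromIndex, 0); i + target.length <= chars.length; i++) {
        if (target.every((c, j) => chars[i + j] === c)) return i;
    }
    return -1;
}

const sample = 'a\uD83D\uDE00b'; // "a", U+1F600 (one surrogate pair), "b"
console.log(sample.length); // 4 (UTF-16 code units)
console.log(codePointLength(sample)); // 3
console.log(sample.indexOf('b')); // 3
console.log(codePointIndexOf(sample, 'b')); // 2
```

The manual surrogate check is used here only to make the mechanism explicit; iterating a string with `Array.from` yields the same code point split.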

Fixes #778.

Launch Checklist

Confirm your changes do not include backports from Mapbox projects (unless with compliant license) - if you are not sure about this, please ask!
Briefly describe the changes in this PR.
Link to related issues.
Include before/after visuals or gifs if this PR includes visual changes.
Write tests for all new functionality.
Document any changes to public APIs.
Post benchmark scores.
Add an entry to CHANGELOG.md under the ## main section.

String expression operators now count UTF-16 surrogate pairs as single characters instead of splitting them up into individual surrogates.

codecov-commenter · 2024-08-17T19:33:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.70%. Comparing base (765e52c) to head (0a465a8).
Report is 48 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #779      +/-   ##
==========================================
+ Coverage   92.60%   92.70%   +0.09%     
==========================================
  Files         105      105              
  Lines        4638     4646       +8     
  Branches     1306     1312       +6     
==========================================
+ Hits         4295     4307      +12     
+ Misses        343      339       -4


HarelM

Kudos on all the tests you wrote!

1ec5 · 2024-08-17T19:44:35Z

src/reference/v8.json

@@ -2826,7 +2826,7 @@
        }
      },
      "index-of": {
-        "doc": "Returns the first position at which an item can be found in an array or a substring can be found in a string, or `-1` if the input cannot be found. Accepts an optional index from where to begin the search.",
+        "doc": "Returns the first position at which an item can be found in an array or a substring can be found in a string, or `-1` if the input cannot be found. Accepts an optional index from where to begin the search. In a string, a UTF-16 surrogate pair counts as a single position.",


I worded this a bit vaguely to leave open the possibility of adding support for grapheme clusters in the future. There are several potential real-world use cases for supporting grapheme clusters on maps, for example:

The Esperanto letter ĝ has a precomposed character (U+011D), but unnormalized text may represent it as a base letter and a combining diacritic (U+0067 U+0302).

The Hangul syllable 각 may be decomposed as U+1100 U+1161 U+11A8. The best practice is to normalize it into a single precomposed character, U+AC01, but neither vector-tile-js nor maplibre-gl-js normalizes strings, and I’m unsure if any vector tile generator does either.

Some emoji sequences like 🇺🇳 appear as a single character but are composed of multiple underlying characters (U+1F1FA U+1F1F3).
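
To illustrate the difference between the Hangul and emoji cases above, here is a small sketch using the standard `String.prototype.normalize` method. The `composeNFC` wrapper is a hypothetical name used only for this example; nothing here reflects code in this PR.

```typescript
// Hypothetical wrapper around the standard NFC normalization.
function composeNFC(s: string): string {
    return s.normalize('NFC');
}

// Decomposed Hangul syllable: three conjoining jamo.
const decomposedHangul = '\u1100\u1161\u11A8';
const composedHangul = composeNFC(decomposedHangul);
console.log(decomposedHangul.length); // 3
console.log(composedHangul.length); // 1
console.log(composedHangul === '\uAC01'); // true

// Regional indicator pair (U+1F1FA U+1F1F3): no precomposed form exists,
// so normalization leaves it as two code points (four UTF-16 code units).
const flag = '\u{1F1FA}\u{1F1F3}';
console.log(composeNFC(flag) === flag); // true
console.log(flag.length); // 4
```

Normalization helps only where a precomposed character exists; sequences such as the flag still span multiple code points after NFC.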

In TypeScript, we could support grapheme clusters using the Intl.Segmenter API, but Firefox only added support for it a few months ago, and I don’t know if it performs well enough for more common cases. On the native platforms, ICU has a similar API that might end up being the easiest solution for maplibre/maplibre-native#2730. I didn’t investigate it further, because we don’t support rendering grapheme clusters directly yet. However, @wipfli’s work on Indic text may create a need for it in the future.
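
As a rough sketch of the `Intl.Segmenter` approach mentioned above, the following counts grapheme clusters. The function name `graphemeCount` is hypothetical, and the example assumes a runtime that ships `Intl.Segmenter`; it says nothing about performance.

```typescript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function graphemeCount(text: string, locale?: string): number {
    // Cast because typings for Intl.Segmenter require the es2022.intl lib;
    // with that lib enabled, `new Intl.Segmenter(...)` can be used directly.
    const Segmenter = (Intl as any).Segmenter;
    const segmenter = new Segmenter(locale, {granularity: 'grapheme'});
    return Array.from(segmenter.segment(text)).length;
}

console.log(graphemeCount('g\u0302')); // 1 (base letter + combining circumflex)
console.log(graphemeCount('\u1100\u1161\u11A8')); // 1 (decomposed Hangul syllable)
console.log(graphemeCount('\u{1F1FA}\u{1F1F3}')); // 1 (regional indicator pair)
```

Each of these strings has more than one code point, so counting code points alone would still report a length greater than 1 for them.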

src/expression/definitions/index_of.ts

HarelM · 2024-08-17T19:53:55Z

CC: @louwers - I don't think this is a dramatic change; it's more in the realm of a bug fix, but we need to make sure this is OK with native.
I'll make sure this ends up as one or more render tests so that we can verify that parity is achieved.

Count surrogate pair as single character

694d441

String expression operators now count UTF-16 surrogate pairs as single characters instead of splitting them up into individual surrogates.

HarelM approved these changes Aug 17, 2024


1ec5 commented Aug 17, 2024


HarelM reviewed Aug 17, 2024


Removed extraneous empty string case

0a465a8

HarelM merged commit a59e2b3 into maplibre:main Aug 17, 2024
6 checks passed

1ec5 deleted the expression-string-unicode-778 branch August 17, 2024 19:54

1ec5 mentioned this pull request Sep 6, 2024

Render non-BMP CJKV characters locally maplibre/maplibre-gl-js#4550

Open

8 tasks
