CodePointInversionList JSON serialization cannot represent all code points #3892

skius · 2023-08-18T19:16:16Z

CodePointInversionList (CPIL) cannot JSON-serialize a set such as [0-\uDFFF], because the end of the range is a surrogate and therefore not a valid Rust char. While serializing, we could check whether such code points exist and, if so, fall back to the OldStyle serialization for human-readable formats. Alternatively, we could add escaping support to NewStyle.
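To illustrate the limitation, here is a minimal sketch using only the Rust standard library. The helper name `is_unrepresentable_as_char` is hypothetical and not part of ICU4X; it only shows that `char::from_u32` rejects surrogates, which is why a range ending at U+DFFF has no `char` endpoint.

```rust
/// Returns true if `cp` cannot be stored in a Rust `char`.
/// Hypothetical helper for illustration only, not an ICU4X API.
fn is_unrepresentable_as_char(cp: u32) -> bool {
    // `char::from_u32` returns `None` for surrogates (U+D800..=U+DFFF)
    // and for values above U+10FFFF.
    char::from_u32(cp).is_none()
}
```

With this helper, `is_unrepresentable_as_char(0xDFFF)` is true while `is_unrepresentable_as_char(0xE000)` is false, so any serializer that writes range endpoints as `char` has to special-case the former.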

This is an issue for transform rules such as InterIndic-Arabic, which use negated sets that contain the surrogates, for example $nonword = [^\uE000-\uE0FF];
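The following sketch shows why a negated set like this one ends up containing surrogates. It assumes the conventional inversion-list layout of sorted boundaries `[start, exclusive_end, ...]` over plain `u32` values; the functions `complement` and `contains` are hypothetical illustrations, not ICU4X's implementation.

```rust
/// One past the last code point (U+10FFFF).
const LIMIT: u32 = 0x110000;

/// Complements an inversion list of the assumed form [start, exclusive_end, ...].
/// Hypothetical sketch, not ICU4X code.
fn complement(inv: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(inv.len() + 2);
    let mut rest = inv;
    // Toggle the boundary at 0: drop it if present, add it otherwise.
    if rest.first() == Some(&0) {
        rest = &rest[1..];
    } else {
        out.push(0);
    }
    // Toggle the boundary at LIMIT in the same way.
    if rest.last() == Some(&LIMIT) {
        out.extend_from_slice(&rest[..rest.len() - 1]);
    } else {
        out.extend_from_slice(rest);
        out.push(LIMIT);
    }
    out
}

/// A code point is in the set if an odd number of boundaries are <= it.
fn contains(inv: &[u32], cp: u32) -> bool {
    inv.partition_point(|&b| b <= cp) % 2 == 1
}
```

Complementing `[0xE000, 0xE100]` (that is, \uE000-\uE0FF) yields `[0, 0xE000, 0xE100, 0x110000]`, whose first range covers every surrogate from U+D800 to U+DFFF.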


robertbastian · 2023-08-21T08:28:38Z

This brings up an interesting question. Even though it's called a code point inversion list, its API uses char, which represents Unicode scalar values rather than code points; the difference is that a scalar value cannot be a surrogate. So do we actually want or need to store surrogates in CPIL? If not, we can remove them during construction. Otherwise I think escaping surrogates in JSON is best, as we want to remove OldStyle at some point.
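As a rough sketch of the "remove them during construction" option, the function below splits a single range around the surrogate block. It assumes ranges are given as `(start, exclusive_end)` pairs of `u32`; the name `strip_surrogates` is hypothetical and this is not how ICU4X necessarily implements it.

```rust
const SURROGATE_START: u32 = 0xD800;
const SURROGATE_END: u32 = 0xE000; // exclusive

/// Splits the range `start..end` (end exclusive) into the parts that
/// contain no surrogates. Hypothetical sketch, not an ICU4X API.
fn strip_surrogates(start: u32, end: u32) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    // Portion below the surrogate block, if any.
    let low_end = end.min(SURROGATE_START);
    if start < low_end {
        out.push((start, low_end));
    }
    // Portion above the surrogate block, if any.
    let high_start = start.max(SURROGATE_END);
    if high_start < end {
        out.push((high_start, end));
    }
    out
}
```

For the set [0-\uDFFF] from the issue description, `strip_surrogates(0, 0xE000)` leaves only `(0, 0xD800)`, so every remaining endpoint fits in a `char`. The trade-off is that the stored set no longer matches the source set exactly, which matters if UnicodeSets are meant to cover all code points.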

skius · 2023-08-21T10:07:18Z

I was thinking about that as well, especially in connection with UTS#35 UnicodeSets (linking #3893). That spec talks about code points, but it might be worth checking upstream whether it actually means scalar values. If UnicodeSets do support all code points, I think that's a good case for CPIL surrogate support.

skius added the T-bug (Type: Bad behavior, security, privacy), C-unicode (Component: Props, sets, tries), and help wanted (Issue needs an assignee) labels Aug 18, 2023

robertbastian self-assigned this Aug 18, 2023

robertbastian mentioned this issue Aug 21, 2023

Handling surrogates in CodePointInversionList JSON #3899

Merged

robertbastian closed this as completed in #3899 Aug 21, 2023
