Improve Latin char normalization #60

alexrutar · 2024-11-18T15:18:33Z

This improves the normalization for Latin characters, mainly to address the concerns in #51 . This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the normalize function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.

If someone knows more about proper transliteration, I would be happy if they could take a peek through the transformations; I only applied the transliteration in cases where I was confident and hopefully did not make any controversial normalizations.

Two questions for discussion:

Is chars::normalize a reasonable name? Maybe it would be more precise to call it chars::normalize_latin. But I guess this is quite an annoying breaking change. But the signature is the same so it would be easy enough to include an alias and mark it is #[deprecated].
Is committing to const fn reasonable?

alexrutar · 2024-11-18T16:20:05Z

I realize I removed exactly one normalization that was present before: \u{2184} which was part of an entirely different block Number Forms.

Since the previous block ends at \u{209f} this made table 3 in the previous implementation unnecessarily large.

If \u{2184} is badly missed it could be re-added (and I guess along with the rest of the 'Number Forms' block)?

matcher/src/chars/normalize.rs

alexrutar · 2024-12-07T17:26:51Z

Converted const tables to static; and I guess on 1.65 this means that normalize cannot be a const fn so now it's back to what it was originally.

CI is failing because of the new clippy lint which was fixed in #65...

This improves the normalization for Latin characters by adding a number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.

alexrutar force-pushed the fix-latin-normalization branch 5 times, most recently from ba47630 to 2eb7ec3 Compare November 18, 2024 16:09

alexrutar force-pushed the fix-latin-normalization branch 2 times, most recently from 61781e8 to 1ca8753 Compare November 19, 2024 07:21

pascalkuthe reviewed Dec 7, 2024

View reviewed changes

matcher/src/chars/normalize.rs Outdated Show resolved Hide resolved

pascalkuthe reviewed Dec 7, 2024

View reviewed changes

matcher/src/chars/normalize.rs Show resolved Hide resolved

alexrutar force-pushed the fix-latin-normalization branch 2 times, most recently from f48445e to a02b941 Compare December 7, 2024 17:25

pascalkuthe approved these changes Dec 7, 2024

View reviewed changes

alexrutar force-pushed the fix-latin-normalization branch from a02b941 to 2200109 Compare December 7, 2024 17:35

pascalkuthe merged commit 7ef5c43 into helix-editor:master Dec 7, 2024
5 checks passed

alexrutar mentioned this pull request Dec 7, 2024

nucleo-matcher: Normalization is incomplete. #51

Closed

alexrutar deleted the fix-latin-normalization branch December 9, 2024 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Latin char normalization #60

Improve Latin char normalization #60

alexrutar commented Nov 18, 2024 •

edited

Loading

alexrutar commented Nov 18, 2024

alexrutar commented Dec 7, 2024

Improve Latin char normalization #60

Improve Latin char normalization #60

Conversation

alexrutar commented Nov 18, 2024 • edited Loading

alexrutar commented Nov 18, 2024

alexrutar commented Dec 7, 2024

alexrutar commented Nov 18, 2024 •

edited

Loading