Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Latin char normalization #60

Merged

Conversation

alexrutar
Copy link
Contributor

@alexrutar alexrutar commented Nov 18, 2024

This improves the normalization for Latin characters, mainly to address the concerns in #51 . This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the normalize function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.

If someone knows more about proper transliteration, I would be happy if they could take a peek through the transformations; I only applied the transliteration in cases where I was confident and hopefully did not make any controversial normalizations.

Two questions for discussion:

  1. Is chars::normalize a reasonable name? Maybe it would be more precise to call it chars::normalize_latin. But I guess this is quite an annoying breaking change. But the signature is the same so it would be easy enough to include an alias and mark it is #[deprecated].
  2. Is committing to const fn reasonable?

@alexrutar alexrutar force-pushed the fix-latin-normalization branch 5 times, most recently from ba47630 to 2eb7ec3 Compare November 18, 2024 16:09
@alexrutar
Copy link
Contributor Author

I realize I removed exactly one normalization that was present before: \u{2184} which was part of an entirely different block Number Forms.

Since the previous block ends at \u{209f} this made table 3 in the previous implementation unnecessarily large.

If \u{2184} is badly missed it could be re-added (and I guess along with the rest of the 'Number Forms' block)?

@alexrutar alexrutar force-pushed the fix-latin-normalization branch 2 times, most recently from 61781e8 to 1ca8753 Compare November 19, 2024 07:21
@alexrutar alexrutar force-pushed the fix-latin-normalization branch 2 times, most recently from f48445e to a02b941 Compare December 7, 2024 17:25
@alexrutar
Copy link
Contributor Author

Converted const tables to static; and I guess on 1.65 this means that normalize cannot be a const fn so now it's back to what it was originally.

CI is failing because of the new clippy lint which was fixed in #65...

This improves the normalization for Latin characters by adding a
number of new normalizations, especially in the 'Latin Extended
Additional' block which for some reason was missing every capital
letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
@alexrutar alexrutar force-pushed the fix-latin-normalization branch from a02b941 to 2200109 Compare December 7, 2024 17:35
@pascalkuthe pascalkuthe merged commit 7ef5c43 into helix-editor:master Dec 7, 2024
5 checks passed
@alexrutar alexrutar deleted the fix-latin-normalization branch December 9, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants