This improves the normalization for Latin characters, mainly to address the concerns in #51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block, which for some reason was missing every capital letter.
I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block, which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of inputs (all ASCII, sparse Unicode, heavy Unicode, all outside the normalization ranges) and there was no observable difference, though the benchmarks were definitely not super rigorous.
Finally, I inlined all of the char blocks, rather than relying on the 'sparse table' static generation which was implemented earlier. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
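To illustrate what "inlined char blocks" can look like, here is a minimal sketch. It is not the code in this PR: the function name `normalize_sketch`, the `char -> char` signature, and the handful of mappings are assumptions chosen for illustration, and the real check order and tables may differ.

```rust
/// Hypothetical sketch of an inlined, block-by-block normalization.
/// The name, signature, and mappings are illustrative assumptions.
pub const fn normalize_sketch(c: char) -> char {
    // Cheap check first: ASCII input is returned unchanged.
    if c.is_ascii() {
        return c;
    }
    match c {
        // Latin-1 Supplement (small illustrative subset)
        '\u{00C0}' | '\u{00C1}' | '\u{00C2}' => 'A', // À Á Â
        '\u{00E0}' | '\u{00E1}' | '\u{00E2}' => 'a', // à á â
        '\u{00C7}' => 'C',                           // Ç
        '\u{00E7}' => 'c',                           // ç

        // Latin Extended Additional: capitals and lowercase side by side,
        // so a missing capital is easy to spot when reading the arms.
        '\u{1E02}' => 'B', // Ḃ
        '\u{1E03}' => 'b', // ḃ
        '\u{1EA0}' => 'A', // Ạ
        '\u{1EA1}' => 'a', // ạ

        // Anything outside the handled ranges passes through untouched.
        _ => c,
    }
}
```

With the arms written out per block, a gap such as a lowercase entry without its capital counterpart shows up directly in review, which is harder to see in a generated table.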
If someone knows more about proper transliteration, I would be happy if they could take a peek through the transformations; I only applied the transliteration in cases where I was confident and hopefully did not make any controversial normalizations.
Two questions for discussion:
- Is `chars::normalize` a reasonable name? Maybe it would be more precise to call it `chars::normalize_latin`, but I guess that is quite an annoying breaking change. The signature is the same, though, so it would be easy enough to include an alias and mark it as `#[deprecated]`.
- Is `const fn` reasonable?
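For the sake of discussion, the rename-plus-alias option could look roughly like the sketch below. The `char -> char` signature, the two sample mappings, and the `E_ACUTE` constant are assumptions made for illustration; only the names `normalize` and `normalize_latin` come from the questions above.

```rust
/// Hypothetical new name; the body here is a stand-in with two sample mappings.
pub const fn normalize_latin(c: char) -> char {
    match c {
        '\u{00C9}' => 'E', // É
        '\u{00E9}' => 'e', // é
        _ => c,
    }
}

/// Old name kept as a thin alias so existing callers keep compiling,
/// but receive a deprecation warning pointing at the new name.
#[deprecated(note = "renamed to `normalize_latin`")]
pub const fn normalize(c: char) -> char {
    normalize_latin(c)
}

/// Because the function is a `const fn`, it can be evaluated at compile time.
pub const E_ACUTE: char = normalize_latin('\u{00C9}');
```

Callers of the old name would see a warning rather than a build failure, so the rename would not have to be a hard break. The `const` item shows the one thing `const fn` buys here: use in constant contexts.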