-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add example for DecomposingNormalizer source cursor #4900
Conversation
/// }; | ||
/// | ||
/// assert_eq!(get_next(), ('S', 'Š', 0)); | ||
/// assert_eq!(get_next(), ('\u{30C}', 'Š', 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Is the reason why the offset only jumps at most by 2 in this example because all of the characters are in a precomposed form in the original input string in the range U+0080 <= ch < U+0800 ? If so, then optional: it might be interesting to append to the input string something in the upper half of the BMP, and maybe something beyond the BMP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I can add non-BMP code points to this example.
I was also hoping maybe you could shed some light on this behavior. Is it always guaranteed that the iterator peeks one code point ahead, as stated in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes lookahead of at most one, but the normalizer can do unbounded lookahead, since the number of reorderable combining characters is potentially unbounded (since nothing guarantees that the input has the "stream-safe" property from UAX 15).
@@ -1864,6 +1864,92 @@ impl DecomposingNormalizer { | |||
|
|||
/// Wraps a delegate iterator into a decomposing iterator | |||
/// adapter by using the data already held by this normalizer. | |||
/// | |||
/// The [`Decomposition`] iterator will peek exactly one character | |||
/// ahead of the character being decomposed, allowing the caller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the character being decomposed is followed by characters whose canonical combining class is not zero, the normalizer will buffer up all of those in order to be able to reorder them in case they aren't already in the right order.
It would be good to have a description of the use case to see if a) the use case can be addressed at all and b) how to best address it. To the extent the purpose is to correlate pieces of input |
Thanks; I thought the invariant upon which my code was based maybe wasn't right, but I couldn't identify or articulate how. Reordering characters makes total sense. The use case is being able to map characters between input and output string with a machine learning use case. My understanding is that it is desirable to identify ranges of source text that were used to make inferences from the model. CC @j-luo93 who can maybe share more. |
Sorry for the long-delayed reply. I don't think there's anything from my use case that would guarantee a lookahead <=1. Just to provide a bit more context: I was looking at this unicode-normalization-alignments crate that is part of the dependencies for tokenizers, which is itself a dependency of the popular In light of this, do you think if it's even possible to achieve a similar goal without modifying the |
I created #5577 for further discussion. I will close this PR since it doesn't work. |
normalize_iter
gives us the ability to keep track of the source string indices while generating the output string. I wrote the example using aRefCell
since theDecomposingNormalizer
takes ownership over the source iterator. There may be a way to avoidRefCell
by adding a function to the library.