Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example for DecomposingNormalizer source cursor #4900

Closed
wants to merge 2 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented May 14, 2024

normalize_iter gives us the ability to keep track of the source string indices while generating the output string. I wrote the example using a RefCell since the DecomposingNormalizer takes ownership over the source iterator. There may be a way to avoid RefCell by adding a function to the library.

/// };
///
/// assert_eq!(get_next(), ('S', 'Š', 0));
/// assert_eq!(get_next(), ('\u{30C}', 'Š', 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is the reason why the offset only jumps at most by 2 in this example because all of the characters are in a precomposed form in the original input string in the range U+0080 <= ch < U+0800 ? If so, then optional: it might be interesting to append to the input string something in the upper half of the BMP, and maybe something beyond the BMP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I can add non-BMP code points to this example.

I was also hoping maybe you could shed some light on this behavior. Is it always guaranteed that the iterator peeks one code point ahead, as stated in this PR?

Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes lookahead of at most one, but the normalizer can do unbounded lookahead, since the number of reorderable combining characters is potentially unbounded (since nothing guarantees that the input has the "stream-safe" property from UAX 15).

@@ -1864,6 +1864,92 @@ impl DecomposingNormalizer {

/// Wraps a delegate iterator into a decomposing iterator
/// adapter by using the data already held by this normalizer.
///
/// The [`Decomposition`] iterator will peek exactly one character
/// ahead of the character being decomposed, allowing the caller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the character being decomposed is followed by characters whose canonical combining class is not zero, the normalizer will buffer up all of those in order to be able to reorder them in case they aren't already in the right order.

@hsivonen
Copy link
Member

It would be good to have a description of the use case to see if a) the use case can be addressed at all and b) how to best address it.

To the extent the purpose is to correlate pieces of input &str with pieces of output &str, it's probably useful to make use of the same implementation detail that IsNormalizedSinkStr makes use of: when the normalizer passes a &str to Write, it for sure is a passthrough that can be correlated back to the input slice by looking at the pointer in the slice. When the normalizer passes a char it may be either a passthrough or a non-passthrough, but every time there is a &str, the &str can be used to resynchronize char passthrough tracking after a non-passthrough char has caused a divergence.

@sffc
Copy link
Member Author

sffc commented May 20, 2024

Thanks; I thought the invariant upon which my code was based maybe wasn't right, but I couldn't identify or articulate how. Reordering characters makes total sense.

The use case is being able to map characters between input and output string with a machine learning use case. My understanding is that it is desirable to identify ranges of source text that were used to make inferences from the model. CC @j-luo93 who can maybe share more.

@robertbastian robertbastian added the waiting-on-author PRs waiting for action from the author for >7 days label Aug 13, 2024
@j-luo93
Copy link

j-luo93 commented Aug 21, 2024

Sorry for the long-delayed reply. I don't think there's anything from my use case that would guarantee a lookahead <=1. Just to provide a bit more context: I was looking at this unicode-normalization-alignments crate that is part of the dependencies for tokenizers, which is itself a dependency of the popular transformers crate from Huggingface. unicode-normalization-alignments is forked from unicode-normalization , with the main change adding alignment information. This change was done in a quite intrusive fashion, but it did so without making further assumptions -- given that Huggingface has to deal with all kinds of texts, I would be surprised if they have a restrictive use case.

In light of this, do you think if it's even possible to achieve a similar goal without modifying the .iter implementation?

@sffc
Copy link
Member Author

sffc commented Sep 23, 2024

I created #5577 for further discussion. I will close this PR since it doesn't work.

@sffc sffc closed this Sep 23, 2024
@sffc sffc deleted the normalizer-cursor branch September 23, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-on-author PRs waiting for action from the author for >7 days
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants