Add example for DecomposingNormalizer source cursor #4900

sffc · 2024-05-14T04:23:39Z

normalize_iter lets us keep track of source string indices while generating the output string. I wrote the example using a RefCell because the DecomposingNormalizer takes ownership of the source iterator. It may be possible to avoid the RefCell by adding a function to the library.
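For readers without the diff at hand, here is a minimal, std-only sketch of the cursor idea described above. It does not use icu_normalizer: `Peekable` is a stand-in (an assumption for illustration) for an adapter that owns the source iterator and reads one character ahead, and `cursor_positions` is a hypothetical helper name.

```rust
use std::cell::RefCell;

/// Returns each yielded char together with the source byte offset the
/// cursor had reached at the moment the char was yielded.
fn cursor_positions(input: &str) -> Vec<(char, usize)> {
    // Shared cursor: updated by the source iterator, read by the caller.
    let cursor = RefCell::new(0usize);
    let source = input.char_indices().map(|(i, c)| {
        // Record the byte offset just past the char handed to the adapter.
        *cursor.borrow_mut() = i + c.len_utf8();
        c
    });
    // Stand-in for an adapter that takes ownership of `source`.
    let mut adapter = source.peekable();
    let mut seen = Vec::new();
    while let Some(c) = adapter.next() {
        // Force a one-char lookahead, as the stand-in adapter would do.
        let _ = adapter.peek();
        seen.push((c, *cursor.borrow()));
    }
    seen
}
```

With the input "aŠb", the cursor reads 3 when 'a' is yielded, because the adapter has already pulled the two-byte 'Š'. The offsets are only meaningful if the amount of lookahead is known, which is the invariant questioned later in this thread.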

echeran · 2024-05-17T21:36:41Z

components/normalizer/src/lib.rs

+    /// };
+    ///
+    /// assert_eq!(get_next(), ('S', 'Š', 0));
+    /// assert_eq!(get_next(), ('\u{30C}', 'Š', 0));


question: Does the offset jump by at most 2 in this example because every character in the original input string is precomposed and in the range U+0080 <= ch < U+0800? If so, then optional: it might be interesting to append to the input string something from the upper half of the BMP, and maybe something beyond the BMP.

sffc · 2024-05-17

Interesting, I can add non-BMP code points to this example.

I was also hoping maybe you could shed some light on this behavior. Is it always guaranteed that the iterator peeks one code point ahead, as stated in this PR?

hsivonen · 2024-05-20

This assumes lookahead of at most one, but the normalizer can do unbounded lookahead, since the number of reorderable combining characters is potentially unbounded (since nothing guarantees that the input has the "stream-safe" property from UAX 15).

hsivonen · 2024-05-20T15:15:08Z

components/normalizer/src/lib.rs

@@ -1864,6 +1864,92 @@ impl DecomposingNormalizer {

    /// Wraps a delegate iterator into a decomposing iterator
    /// adapter by using the data already held by this normalizer.
+    ///
+    /// The [`Decomposition`] iterator will peek exactly one character
+    /// ahead of the character being decomposed, allowing the caller


If the character being decomposed is followed by characters whose canonical combining class is not zero, the normalizer will buffer up all of those in order to be able to reorder them in case they aren't already in the right order.
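A rough, std-only sketch of why the lookahead is unbounded. It performs only the canonical reordering step on a run of non-starters and does no decomposition. `toy_ccc` is a hand-written table covering four marks (an assumption for illustration; real code would look the class up in Unicode data), and `reorder_marks` is a hypothetical helper, not part of icu_normalizer.

```rust
/// Canonical combining class for a tiny, hand-picked set of marks.
/// Illustrative subset only; everything else is treated as a starter.
fn toy_ccc(c: char) -> u8 {
    match c {
        '\u{0327}' => 202,              // COMBINING CEDILLA
        '\u{0323}' => 220,              // COMBINING DOT BELOW
        '\u{0301}' | '\u{030C}' => 230, // COMBINING ACUTE ACCENT, COMBINING CARON
        _ => 0,
    }
}

/// Reorders each run of non-starters by combining class and reports the
/// longest run that had to be buffered before anything in it was emitted.
fn reorder_marks(input: &str) -> (String, usize) {
    let mut out = String::new();
    let mut max_buffered = 0;
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if toy_ccc(c) == 0 {
            out.push(c);
            continue;
        }
        // `c` starts a run of non-starters: none of them can be emitted
        // until the whole run has been read, since a later mark may sort first.
        let mut run = vec![c];
        while let Some(&next) = chars.peek() {
            if toy_ccc(next) == 0 {
                break;
            }
            run.push(next);
            chars.next();
        }
        max_buffered = max_buffered.max(run.len());
        // Stable sort: marks with equal class keep their relative order.
        run.sort_by_key(|&m| toy_ccc(m));
        out.extend(run);
    }
    (out, max_buffered)
}
```

For the input "a\u{0301}\u{0323}\u{0327}b", all three marks are read before the first one is written, so a source cursor is three characters ahead of the output at that point. Nothing limits the run length unless the input is known to be stream-safe.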

hsivonen · 2024-05-20T15:27:44Z

It would be good to have a description of the use case to see if a) the use case can be addressed at all and b) how to best address it.

To the extent the purpose is to correlate pieces of the input &str with pieces of the output &str, it's probably useful to rely on the same implementation detail that IsNormalizedSinkStr relies on: when the normalizer passes a &str to Write, it is always a passthrough that can be correlated back to the input slice by looking at the pointer in the slice. When the normalizer passes a char, it may be either a passthrough or a non-passthrough, but every &str can be used to resynchronize char passthrough tracking after a non-passthrough char has caused a divergence.
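The pointer-based correlation could look roughly like the sketch below. It is std-only and does not call the normalizer. `AlignSink` and `offset_in` are hypothetical names, and the sketch takes as given the implementation detail described above, namely that a &str handed to the sink is a subslice of the input.

```rust
use std::fmt::{self, Write};

/// Byte offset of `piece` within `input`, if `piece` points into `input`'s buffer.
fn offset_in(input: &str, piece: &str) -> Option<usize> {
    let start = input.as_ptr() as usize;
    let p = piece.as_ptr() as usize;
    if p >= start && p + piece.len() <= start + input.len() {
        Some(p - start)
    } else {
        None
    }
}

/// Hypothetical sink that records where passthrough slices came from.
struct AlignSink<'a> {
    input: &'a str,
    output: String,
    /// (output byte offset, input byte offset, length in bytes)
    spans: Vec<(usize, usize, usize)>,
}

impl<'a> Write for AlignSink<'a> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // A slice that lies inside the input is a passthrough, so it gives
        // an exact input/output correspondence to resynchronize on.
        if let Some(src) = offset_in(self.input, s) {
            self.spans.push((self.output.len(), src, s.len()));
        }
        self.output.push_str(s);
        Ok(())
    }

    fn write_char(&mut self, c: char) -> fmt::Result {
        // A char may or may not be a passthrough, so nothing is recorded here.
        self.output.push(c);
        Ok(())
    }
}
```

Feeding the sink `&input[0..2]`, then 'S' and '\u{30C}' as chars, then `&input[4..]` for the input "abŠcd" records the spans (0, 0, 2) and (5, 4, 2). The gap between them brackets the region where the non-passthrough chars were written.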

sffc · 2024-05-20T16:17:47Z

Thanks; I thought the invariant upon which my code was based maybe wasn't right, but I couldn't identify or articulate how. Reordering characters makes total sense.

The use case is mapping characters between the input and output strings for a machine learning application. My understanding is that it is desirable to identify the ranges of source text that the model used to make inferences. CC @j-luo93 who can maybe share more.

j-luo93 · 2024-08-21T18:33:51Z

Sorry for the long-delayed reply. I don't think anything in my use case would guarantee a lookahead of at most 1. Just to provide a bit more context: I was looking at the unicode-normalization-alignments crate, which is one of the dependencies of tokenizers, which is itself a dependency of the popular transformers library from Hugging Face. unicode-normalization-alignments is forked from unicode-normalization, with the main change being the addition of alignment information. This change was done in a quite intrusive fashion, but it did so without making further assumptions -- given that Hugging Face has to deal with all kinds of texts, I would be surprised if they have a restrictive use case.

In light of this, do you think it's even possible to achieve a similar goal without modifying the .iter implementation?

sffc · 2024-09-23T18:45:05Z

I created #5577 for further discussion. I will close this PR since it doesn't work.

Add example for DecomposingNormalizer source cursor

97b93c1

sffc requested review from hsivonen and echeran as code owners May 14, 2024 04:23

Add comment for the invariant

b6d73e5

echeran approved these changes May 17, 2024

hsivonen requested changes May 20, 2024

robertbastian added the waiting-on-author label (PRs waiting for action from the author for >7 days) Aug 13, 2024

sffc mentioned this pull request Sep 23, 2024

Add API to calculate alignment between input and output normalizer strings #5577

Open

sffc closed this Sep 23, 2024

sffc deleted the normalizer-cursor branch September 23, 2024 18:45
