add UAX#29 word bounds algorithm to libunicode #24340
(rust_highfive has picked a reviewer for you, use r? to override) |
This patch does the following:
1. Adds three new structs in libunicode/str.rs:
   a. UnicodeWords: a filter on the UWordBounds iterator that yields only the "words" of a string as defined in Section 4 of Unicode Standard Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries
   b. UWordBounds: an iterator that segments a string on its word boundaries as defined in UAX#29. Note that this *only* segments the string, and does *not* drop whitespace and other non-word pieces of the text (that's what UnicodeWords does). Note that UWordBounds has both a forward and backward iterator that have total running time (that is, to segment the entire string) linear in the size of the string. It should be noted that with pathological inputs the reverse iterator could be about 2x less efficient than the forward iterator, but on reasonable inputs their costs are similar.
   c. UWordBoundIndices: the above iterator, but returning tuples of (offset, &str).
2. Adds three new functions in the `UnicodeStr` trait:
   a. words_unicode(): returns a UnicodeWords iterator.
   b. split_words_uax29(): returns a UWordBounds iterator.
   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.
3. Updates the `src/etc/unicode.py` script to generate tables necessary for running the UWordBounds iterators.
4. Adds a new script, `src/etc/unicode_gen_breaktests.py`, which processes the grapheme and word break tests published by the Unicode consortium into a format for inclusion in libcollectionstest.
5. Adds new impls in libcollections's `str` corresponding to the `UnicodeStr` functions of (2). Note that this new functionality is gated with `feature(unicode)`.
6. Adds tests in libcollectionstest to exercise this new functionality. In addition, updates the test data for the graphemes test to correspond to the output from the script of (4). (Note that at the moment this change is primarily cosmetic.)
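To make the relationship between the three iterators concrete, here is a minimal sketch of how they layer on one another. It is not the code from this patch and it does not implement the UAX#29 rules: the boundary rule is a toy stand-in (a boundary wherever the "is alphanumeric" class changes), and the names `ToyWordBounds`, `toy_bound_indices`, `toy_bounds`, and `toy_words` are hypothetical. The filter used for the "words" iterator (keep segments containing at least one alphanumeric character) is likewise an assumption made for illustration.

```rust
/// Toy stand-in for a UWordBoundIndices-style iterator.
/// ASSUMPTION: this is NOT the UAX#29 algorithm. A boundary is placed
/// wherever the "is alphanumeric" class of adjacent chars changes.
struct ToyWordBounds<'a> {
    rest: &'a str, // the part of the input not yet segmented
    offset: usize, // byte offset of `rest` within the original string
}

impl<'a> Iterator for ToyWordBounds<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<(usize, &'a str)> {
        // An empty remainder means the whole string has been segmented.
        let first = self.rest.chars().next()?;
        let class = first.is_alphanumeric();
        // Byte index of the first char whose class differs from `first`.
        let end = self
            .rest
            .char_indices()
            .find(|&(_, c)| c.is_alphanumeric() != class)
            .map(|(i, _)| i)
            .unwrap_or(self.rest.len());
        let (segment, tail) = self.rest.split_at(end);
        let start = self.offset;
        self.rest = tail;
        self.offset += end;
        Some((start, segment))
    }
}

/// Analogue of split_words_uax29_indices(): yields (offset, &str) tuples.
fn toy_bound_indices<'a>(s: &'a str) -> ToyWordBounds<'a> {
    ToyWordBounds { rest: s, offset: 0 }
}

/// Analogue of split_words_uax29(): yields every segment, dropping nothing.
fn toy_bounds<'a>(s: &'a str) -> impl Iterator<Item = &'a str> {
    toy_bound_indices(s).map(|(_, segment)| segment)
}

/// Analogue of words_unicode(): a filter over the segmenting iterator.
/// ASSUMPTION: a segment counts as a "word" if it contains at least one
/// alphanumeric character.
fn toy_words<'a>(s: &'a str) -> impl Iterator<Item = &'a str> {
    toy_bounds(s).filter(|segment| segment.chars().any(char::is_alphanumeric))
}
```

With the toy rule, `toy_bounds("Hello, world!")` yields `"Hello"`, `", "`, `"world"`, `"!"`, and concatenating the segments reproduces the input, while `toy_words` keeps only `"Hello"` and `"world"`. The real UAX#29 rules are considerably more involved; for example, they keep `can't` and `32.3` together as single segments, which this sketch would split.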
This patch does not settle the question raised by @huonw in rust-lang#15628; rather, it introduces a new function alongside `words()` that follows UAX#29. In addition, it does not address the concerns that @SimonSapin raises in rust-lang/rfcs#1054 since it leaves `words()` alone.
...I think... |
Some additional notes: (1) As the discussions in #15628 and rust-lang/rfcs#1054 imply, it's not at all clear that this really belongs in the standard library, since it's a sensible default but not an all-encompassing standard for word breaking. So this PR should really be treated as a trial balloon to answer one implicit question in that conversation, namely, what does the corresponding code actually look like? (2) I realize that My reasoning is this: the name I don't like the name One possibility is to just get rid of |
As I’ve mentioned in #15628 and rust-lang/rfcs#1054, I don’t think this belongs in the standard library; it is better served by crates.io. I’d like to see use cases (#15628 (comment)) that justify including either this or the “split on whitespace” variation. That aside, this looks like great work! Thanks @kwantam for doing it. |
Just curious, but would it make sense to implement this as a
|
@Kimundi Interesting observation: another idea would be to "invert" it (I literally have no idea which will work best), offering something like |
@Kimundi, @huonw, either To me it seems pretty reasonable to make use of the infrastructure that |
Nice work @kwantam, this is quite impressive! I, like @huonw, really like @Kimundi's idea, and it would be quite interesting to see how it plays out. Also, like @SimonSapin, I agree that the standard library is probably not the best place for this right now. Would you be interested in a crate on crates.io which implemented this infrastructure? For now it could start out with the methods as implemented and implementations of the |
Sure, moving this to crates.io seems very reasonable. I can imagine two possible approaches. One is to make this new crate a general The other possibility is to call the new crate Also: I suppose this serves to answer #15628 (if this doesn't go in |
We very often want a different, more appropriate location for libunicode-like code, but we haven't thought it through too hard just yet. In the past we've assumed that a general I also agree that the best solution to #15628 is likely rust-lang/rfcs#1054 |
It’s a kind of a recurring theme: which is preferable between a large number of single-purpose crates, or fewer, larger crates containing multiple loosely-related features? |
I'd be fine either way! |
OK, I'm going to make three separate crates for this. The reason is that the three pieces of functionality (UAX#29, charwidth, and de/recompositions) have quite different use cases, and de/recompositions ends up doing allocations, while the other two do not and thus can be built with First one is now up, and the corresponding PR to remove the functionality ( |
Thanks for taking the initiative on this! FWIW, I’ve stopped trying to make libraries use |