
add UAX#29 word bounds algorithm to libunicode #24340

Closed
wants to merge 1 commit

Conversation

@kwantam (Contributor) commented Apr 12, 2015

This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and a backward iterator,
      each with total running time (that is, to segment the entire
      string) linear in the size of the string. With pathological
      inputs the reverse iterator can be about 2x less efficient than
      the forward iterator, but on reasonable inputs their costs are
      similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait (a usage sketch
   follows this description):

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate the tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in #15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054, since it leaves `words()` alone.
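As a usage sketch of the three proposed methods: the following assumes a nightly compiler with this patch applied and `#![feature(unicode)]` enabled; the input string and the listed output are illustrative only.

```rust
#![feature(unicode)]

fn main() {
    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

    // UnicodeWords: only the UAX#29 "words"; whitespace and punctuation dropped.
    let words: Vec<&str> = s.words_unicode().collect();
    // e.g. ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]

    // UWordBounds: every word-bounded segment, including spaces and punctuation.
    let segments: Vec<&str> = s.split_words_uax29().collect();

    // UWordBoundIndices: the same segments, paired with their byte offsets.
    for (offset, segment) in s.split_words_uax29_indices() {
        println!("{}: {:?}", offset, segment);
    }

    println!("{:?}\n{:?}", words, segments);
}
```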

@rust-highfive (Collaborator) commented:

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@kwantam (Contributor, Author) commented Apr 12, 2015

r? @alexcrichton

...I think...

@kwantam (Contributor, Author) commented Apr 12, 2015

Some additional notes:

(1) As the discussions in #15628 and rust-lang/rfcs#1054 imply, it's not at all clear that this really belongs in the standard library, since it's a sensible default but not an all-encompassing standard for word breaking. So this PR should really be treated as a trial balloon to answer one implicit question in that conversation, namely, what does the corresponding code actually look like?

(2) I realize that split_words_uax29() and words_unicode() seem to be inconsistent in their naming.

My reasoning is this: the name split_words_uax29() refers to the precise concept presented in UAX#29. On the other hand, the name words_unicode() is intended to be a more general description of what the function is doing.

I don't like the name words_uax29() because it exposes too much nerdy detail to the user; on the other hand, split_words_unicode() is a bit vague.

One possibility is to just get rid of words_unicode() entirely. Equivalent functionality is obtainable with s.split_words_uax29().filter(|w| w.chars().any(|c| c.is_alphanumeric())). But it's nice to have short names for things, provided those things are used frequently enough. I don't have a good sense of whether words_unicode() hits the bar for "frequently enough," though.
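For concreteness, a small sketch of that equivalence; the helper function names here are hypothetical, while the iterator methods are the ones proposed in this PR.

```rust
// words_unicode() is just the filtered form of split_words_uax29():
// keep only segments containing at least one alphanumeric scalar value.
fn words_via_filter(s: &str) -> Vec<&str> {
    s.split_words_uax29()
        .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
        .collect()
}

fn words_direct(s: &str) -> Vec<&str> {
    s.words_unicode().collect()
}

// For any &str input, the two should produce the same sequence.
```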

@SimonSapin (Contributor) commented:

As I’ve mentioned in #15628 and rust-lang/rfcs#1054, I don’t think this belongs in the standard library; it is better served by crates.io. I’d like to see use cases (#15628 (comment)) that justify including either this or the “split on whitespace” variation.

That aside, this looks like great work! Thanks @kwantam for doing it.

@Kimundi (Member) commented Apr 13, 2015

Just curious, but would it make sense to implement this as a str::pattern::Pattern?

foo.matches(UnicodeWord) // equivalent to words_unicode()
foo.matches(Uax29Word) // equivalent to split_words_uax29()
foo.match_indices(Uax29Word) // equivalent to split_words_uax29_indices()
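To make the idea concrete, here is a rough, untested skeleton of what a `Uax29Word` pattern might look like. It assumes the unstable `std::str::pattern` API (`Pattern`, `Searcher`, `SearchStep`) in roughly its current shape, plus the `UWordBoundIndices` iterator from this PR; the marker and searcher type names are hypothetical.

```rust
use std::str::pattern::{Pattern, SearchStep, Searcher};

/// Hypothetical zero-sized marker: "match every UAX#29 word-bounded segment".
pub struct Uax29Word;

pub struct Uax29WordSearcher<'a> {
    haystack: &'a str,
    // UWordBoundIndices is the (byte offset, &str) iterator proposed in this PR.
    bounds: UWordBoundIndices<'a>,
}

impl<'a> Pattern<'a> for Uax29Word {
    type Searcher = Uax29WordSearcher<'a>;

    fn into_searcher(self, haystack: &'a str) -> Uax29WordSearcher<'a> {
        Uax29WordSearcher {
            haystack: haystack,
            bounds: haystack.split_words_uax29_indices(),
        }
    }
}

unsafe impl<'a> Searcher<'a> for Uax29WordSearcher<'a> {
    fn haystack(&self) -> &'a str {
        self.haystack
    }

    fn next(&mut self) -> SearchStep {
        // Word-bounded segments tile the haystack, so reporting each one as a
        // Match satisfies the Searcher contract; foo.matches(Uax29Word) then
        // yields the same pieces as foo.split_words_uax29().
        match self.bounds.next() {
            Some((offset, segment)) => SearchStep::Match(offset, offset + segment.len()),
            None => SearchStep::Done,
        }
    }
}
```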

@huonw (Member) commented Apr 13, 2015

@Kimundi Interesting observation: another idea would be to "invert" it (I literally have no idea which will work best), offering something like foo.split(WhitespaceSequence) (the same as the current .words) and foo.split(Uax29WordBoundary) etc.

@kwantam (Contributor, Author) commented Apr 13, 2015

@Kimundi, @huonw, either split or matches is workable and not much extra work beyond what's currently implemented. @SimonSapin suggests this as one of the alternatives in rust-lang/rfcs#1054.

To me it seems pretty reasonable to make use of the infrastructure that split and matches offer, so I'll experiment with that implementation in a separate branch and see what comes of it. I'll also plan to pull in the functionality of the current words() function.

@alexcrichton (Member) commented:

Nice work @kwantam, this is quite impressive! I, like @huonw, really like @Kimundi's idea, and it would be quite interesting to see how it plays out. Also, like @SimonSapin, I agree that the standard library is probably not the best place for this right now.

Would you be interested in a crate on crates.io which implemented this infrastructure? It could start out with the methods as implemented here, and the Pattern trait implementations could go behind a Cargo feature so that the crate builds on stable Rust for now.
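As a sketch of how that gating might look (the crate layout, module names, and feature name are hypothetical, not anything decided in this thread): the crate would declare a feature, say `unstable = []` under `[features]` in its Cargo.toml, and only reach for nightly-only items when it is enabled. The `pattern` feature-gate name used below is the one later nightlies use; at the time the gate name may differ.

```rust
// lib.rs of a hypothetical segmentation crate.

// Only request the nightly-only `pattern` gate when the Cargo feature is on,
// so the default build works on stable Rust.
#![cfg_attr(feature = "unstable", feature(pattern))]

// The word-bounds iterators and the method-style API build on stable.
pub mod segmentation {
    // UWordBounds, UWordBoundIndices, UnicodeWords, extension trait, ...
}

// std::str::pattern::Pattern impls compile only with
// `cargo build --features unstable`.
#[cfg(feature = "unstable")]
pub mod pattern_impls {
    // impl<'a> Pattern<'a> for Uax29Word { ... }
}
```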

@kwantam (Contributor, Author) commented Apr 13, 2015

Sure, moving this to crates.io seems very reasonable.

I can imagine two possible approaches. One is to make this new crate a general unicode library, with the aim of eventually providing a subset of libicu's functionality (not already covered by other crates, e.g., time, regex, encoding, etc.). In this case, we would relocate as much of the unstable functionality of the current libunicode as possible (leaving behind only what the compiler needs and renaming the library, say, libunicode_minimal). The removed portion would probably include character width, normalizations, etc.

The other possibility is to call the new crate uax29 (or unicode_uax29) and only implement the UAX#29-related features there. In this case, we should still consider moving the Graphemes iterators as well (and implementing corresponding Patterns there, one assumes).

Also: I suppose this serves to answer #15628 (if this doesn't go in std then words() certainly can't be UAX#29).
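For the second option (a standalone uax29-style crate), the usual shape would be an extension trait on `str`, so downstream code keeps the same method-call syntax without anything landing in std. A minimal sketch follows, with the UAX#29 state machines elided; the field layouts are placeholders, not the PR's actual implementation.

```rust
// Iterator types named as in this PR; the real break-table state is omitted.
pub struct UWordBounds<'a> {
    rest: &'a str, // unsegmented remainder of the input
}

pub struct UWordBoundIndices<'a> {
    inner: UWordBounds<'a>,
    offset: usize, // byte offset of the next segment
}

/// Extension trait an external crate would provide instead of UnicodeStr.
pub trait Uax29Ext {
    fn split_words_uax29(&self) -> UWordBounds;
    fn split_words_uax29_indices(&self) -> UWordBoundIndices;
}

impl Uax29Ext for str {
    fn split_words_uax29(&self) -> UWordBounds {
        UWordBounds { rest: self }
    }
    fn split_words_uax29_indices(&self) -> UWordBoundIndices {
        UWordBoundIndices { inner: self.split_words_uax29(), offset: 0 }
    }
}

// After `use the_crate::Uax29Ext;`, usage looks identical to the std version:
//     for seg in "brown fox".split_words_uax29() { /* ... */ }
```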

@alexcrichton (Member) commented:

We've often wanted an appropriate location outside the standard library for libunicode-like code, but we haven't thought it through too hard just yet. In the past we've assumed that a general unicode library would exist, but having more targeted crates would certainly be a possibility! I think that if grapheme/width management moves out, it may be good to start out with a general libunicode, though?

I also agree that the best solution to #15628 is likely rust-lang/rfcs#1054

@SimonSapin (Contributor) commented:

It’s kind of a recurring theme: which is preferable, a large number of single-purpose crates, or fewer, larger crates containing multiple loosely related features?

@alexcrichton (Member) commented:

I'd be fine either way!

@kwantam (Contributor, Author) commented Apr 14, 2015

OK, I'm going to make three separate crates for this. The reason is that the three pieces of functionality (UAX#29, charwidth, and de/recompositions) have quite different use cases, and de/recompositions ends up doing allocations, while the other two do not and thus can be built with no_std.

The first one is now up, and the corresponding PR to remove the functionality (width()) is #24402. Comments appreciated, both there and at https://github.com/unicode-rs/unicode_width

@kwantam closed this Apr 14, 2015
@SimonSapin (Contributor) commented:

Thanks for taking the initiative on this!

FWIW, I’ve stopped trying to make libraries use no_std just because someone might want it. no_std has much less use now that the runtime is gone. But I tend to prefer separate crates anyway.
