
add UAX#29 word bounds algorithm to libunicode #24340

Closed
wants to merge 1 commit

Conversation

@kwantam (Contributor) commented Apr 12, 2015

This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and a backward iterator,
      each with total running time (that is, to segment the entire
      string) linear in the size of the string. With pathological
      inputs the reverse iterator can be about 2x less efficient than
      the forward iterator, but on reasonable inputs their costs are
      similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait (a usage sketch
   follows this description):

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate the tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in #15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054, since it leaves `words()` alone.
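As a usage sketch of the three proposed methods: the following assumes a nightly compiler with this patch applied and `#![feature(unicode)]` enabled; the input string and the listed output are illustrative only.

```rust
#![feature(unicode)]

fn main() {
    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

    // UnicodeWords: only the UAX#29 "words"; whitespace and punctuation dropped.
    let words: Vec<&str> = s.words_unicode().collect();
    // e.g. ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]

    // UWordBounds: every word-bounded segment, including spaces and punctuation.
    let segments: Vec<&str> = s.split_words_uax29().collect();

    // UWordBoundIndices: the same segments, paired with their byte offsets.
    for (offset, segment) in s.split_words_uax29_indices() {
        println!("{}: {:?}", offset, segment);
    }

    println!("{:?}\n{:?}", words, segments);
}
```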

@rust-highfive (Collaborator) commented:

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@kwantam (Contributor, Author) commented Apr 12, 2015

r? @alexcrichton

...I think...

@kwantam (Contributor, Author) commented Apr 12, 2015

Some additional notes:

(1) As the discussions in #15628 and rust-lang/rfcs#1054 imply, it's not at all clear that this really belongs in the standard library, since it's a sensible default but not an all-encompassing standard for word breaking. So this PR should really be treated as a trial balloon to answer one implicit question in that conversation, namely, what does the corresponding code actually look like?

(2) I realize that split_words_uax29() and words_unicode() seem to be inconsistent in their naming.

My reasoning is this: the name split_words_uax29() refers to the precise concept presented in UAX#29. On the other hand, the name words_unicode() is intended to be a more general description of what the function is doing.

I don't like the name words_uax29() because it exposes too much nerdy detail to the user; on the other hand, split_words_unicode() is a bit vague.

One possibility is to just get rid of words_unicode() entirely. Equivalent functionality is obtainable with s.split_words_uax29().filter(|w| w.chars().any(|c| c.is_alphanumeric())). But it's nice to have short names for things, provided those things are used frequently enough. I don't have a good sense of whether words_unicode() hits the bar for "frequently enough," though.
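For concreteness, a small sketch of that equivalence; the helper function names here are hypothetical, while the iterator methods are the ones proposed in this PR.

```rust
// words_unicode() is just the filtered form of split_words_uax29():
// keep only segments containing at least one alphanumeric scalar value.
fn words_via_filter(s: &str) -> Vec<&str> {
    s.split_words_uax29()
        .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
        .collect()
}

fn words_direct(s: &str) -> Vec<&str> {
    s.words_unicode().collect()
}

// For any &str input, the two should produce the same sequence.
```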

@SimonSapin (Contributor) commented:

As I’ve mentioned in #15628 and rust-lang/rfcs#1054, I don’t think this belongs in the standard library; it is better served by crates.io. I’d like to see use cases (#15628 (comment)) that justify including either this or the “split on whitespace” variation.

That aside, this looks like great work! Thanks @kwantam for doing it.

@Kimundi (Member) commented Apr 13, 2015

Just curious, but would it make sense to implement this as a str::pattern::Pattern?

foo.matches(UnicodeWord) // equivalent to words_unicode()
foo.matches(Uax29Word) // equivalent to split_words_uax29()
foo.match_indices(Uax29Word) // equivalent to split_words_uax29_indices()
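To make the idea concrete, here is a rough, untested skeleton of what a `Uax29Word` pattern might look like. It assumes the unstable `std::str::pattern` API (`Pattern`, `Searcher`, `SearchStep`) in roughly its current shape, plus the `UWordBoundIndices` iterator from this PR; the marker and searcher type names are hypothetical.

```rust
use std::str::pattern::{Pattern, SearchStep, Searcher};

/// Hypothetical zero-sized marker: "match every UAX#29 word-bounded segment".
pub struct Uax29Word;

pub struct Uax29WordSearcher<'a> {
    haystack: &'a str,
    // UWordBoundIndices is the (byte offset, &str) iterator proposed in this PR.
    bounds: UWordBoundIndices<'a>,
}

impl<'a> Pattern<'a> for Uax29Word {
    type Searcher = Uax29WordSearcher<'a>;

    fn into_searcher(self, haystack: &'a str) -> Uax29WordSearcher<'a> {
        Uax29WordSearcher {
            haystack: haystack,
            bounds: haystack.split_words_uax29_indices(),
        }
    }
}

unsafe impl<'a> Searcher<'a> for Uax29WordSearcher<'a> {
    fn haystack(&self) -> &'a str {
        self.haystack
    }

    fn next(&mut self) -> SearchStep {
        // Word-bounded segments tile the haystack, so reporting each one as a
        // Match satisfies the Searcher contract; foo.matches(Uax29Word) then
        // yields the same pieces as foo.split_words_uax29().
        match self.bounds.next() {
            Some((offset, segment)) => SearchStep::Match(offset, offset + segment.len()),
            None => SearchStep::Done,
        }
    }
}
```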

@huonw (Member) commented Apr 13, 2015

@Kimundi Interesting observation: another idea would be to "invert" it (I literally have no idea which will work best), offering something like foo.split(WhitespaceSequence) (the same as the current .words) and foo.split(Uax29WordBoundary) etc.

@kwantam (Contributor, Author) commented Apr 13, 2015

@Kimundi, @huonw, either split or matches is workable and not much extra work beyond what's currently implemented. @SimonSapin suggests this as one of the alternatives in rust-lang/rfcs#1054.

To me it seems pretty reasonable to make use of the infrastructure that split and matches offer, so I'll experiment with that implementation in a separate branch and see what comes of it. I'll also plan to pull in the functionality of the current words() function.

@alexcrichton (Member) commented:

Nice work @kwantam, this is quite impressive! I, like @huonw, really like @Kimundi's idea, and it would be quite interesting to see how it plays out. Also, like @SimonSapin, I agree that the standard library is probably not the best place for this right now.

Would you be interested in a crate on crates.io which implemented this infrastructure? It could start out with the methods as implemented here, and the Pattern trait implementations could go behind a Cargo feature so that the crate builds on stable Rust for now.
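As a sketch of how that gating might look (the crate layout, module names, and feature name are hypothetical, not anything decided in this thread): the crate would declare a feature, say `unstable = []` under `[features]` in its Cargo.toml, and only reach for nightly-only items when it is enabled. The `pattern` feature-gate name used below is the one later nightlies use; at the time the gate name may differ.

```rust
// lib.rs of a hypothetical segmentation crate.

// Only request the nightly-only `pattern` gate when the Cargo feature is on,
// so the default build works on stable Rust.
#![cfg_attr(feature = "unstable", feature(pattern))]

// The word-bounds iterators and the method-style API build on stable.
pub mod segmentation {
    // UWordBounds, UWordBoundIndices, UnicodeWords, extension trait, ...
}

// std::str::pattern::Pattern impls compile only with
// `cargo build --features unstable`.
#[cfg(feature = "unstable")]
pub mod pattern_impls {
    // impl<'a> Pattern<'a> for Uax29Word { ... }
}
```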

@kwantam (Contributor, Author) commented Apr 13, 2015

Sure, moving this to crates.io seems very reasonable.

I can imagine two possible approaches. One is to make this new crate a general unicode library, with the aim of eventually providing a subset of libicu's functionality (not already covered by other crates, e.g., time, regex, encoding, etc.). In this case, we would relocate as much of the unstable functionality of the current libunicode as possible (leaving behind only what the compiler needs and renaming the library, say, libunicode_minimal). The removed portion would probably include character width, normalizations, etc.

The other possibility is to call the new crate uax29 (or unicode_uax29) and only implement the UAX#29-related features there. In this case, we should still consider moving the Graphemes iterators as well (and implementing corresponding Patterns there, one assumes).

Also: I suppose this serves to answer #15628 (if this doesn't go in std then words() certainly can't be UAX#29).
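For the second option (a standalone uax29-style crate), the usual shape would be an extension trait on `str`, so downstream code keeps the same method-call syntax without anything landing in std. A minimal sketch follows, with the UAX#29 state machines elided; the field layouts are placeholders, not the PR's actual implementation.

```rust
// Iterator types named as in this PR; the real break-table state is omitted.
pub struct UWordBounds<'a> {
    rest: &'a str, // unsegmented remainder of the input
}

pub struct UWordBoundIndices<'a> {
    inner: UWordBounds<'a>,
    offset: usize, // byte offset of the next segment
}

/// Extension trait an external crate would provide instead of UnicodeStr.
pub trait Uax29Ext {
    fn split_words_uax29(&self) -> UWordBounds;
    fn split_words_uax29_indices(&self) -> UWordBoundIndices;
}

impl Uax29Ext for str {
    fn split_words_uax29(&self) -> UWordBounds {
        UWordBounds { rest: self }
    }
    fn split_words_uax29_indices(&self) -> UWordBoundIndices {
        UWordBoundIndices { inner: self.split_words_uax29(), offset: 0 }
    }
}

// After `use the_crate::Uax29Ext;`, usage looks identical to the std version:
//     for seg in "brown fox".split_words_uax29() { /* ... */ }
```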

@alexcrichton (Member) commented:

We've often wanted an appropriate location outside the standard library for libunicode-like code, but we haven't thought it through too hard just yet. In the past we've assumed that a general unicode library would exist, but having more targeted crates would certainly be a possibility! I think that if grapheme/width management moves out, it may be good to start out with a general libunicode, though?

I also agree that the best solution to #15628 is likely rust-lang/rfcs#1054

@SimonSapin (Contributor) commented:

It’s kind of a recurring theme: which is preferable, a large number of single-purpose crates, or fewer, larger crates containing multiple loosely related features?

@alexcrichton (Member) commented:

I'd be fine either way!

@kwantam (Contributor, Author) commented Apr 14, 2015

OK, I'm going to make three separate crates for this. The reason is that the three pieces of functionality (UAX#29, charwidth, and de/recompositions) have quite different use cases, and de/recompositions ends up doing allocations, while the other two do not and thus can be built with no_std.

The first one is now up, and the corresponding PR to remove the functionality (width()) is #24402. Comments appreciated, both there and at https://github.com/unicode-rs/unicode_width

@kwantam closed this Apr 14, 2015
@SimonSapin (Contributor) commented:

Thanks for taking the initiative on this!

FWIW, I’ve stopped trying to make libraries use no_std just because someone might want it. no_std has much less use now that the runtime is gone. But I tend to prefer separate crates anyway.
