add UAX#29 word bounds algorithm to libunicode #24340

Closed
wants to merge 1 commit into from

Commits on Apr 12, 2015

  1. add impl of UAX#29 word bounds algorithm in libunicode

    This patch does the following:
    
    1. Adds three new structs in libunicode/str.rs:
    
       a. UnicodeWords: a filter on the UWordBounds iterator that yields only
          the "words" of a string as defined in Section 4 of Unicode Standard
          Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries
    
       b. UWordBounds: an iterator that segments a string on its word
          boundaries as defined in UAX#29. Note that this *only* segments
          the string, and does *not* drop whitespace and other non-word
          pieces of the text (that's what UnicodeWords does).
    
          UWordBounds can be iterated both forward and backward. In either
          direction, segmenting the entire string takes time linear in the
          length of the string. On pathological inputs the reverse iterator
          could be about 2x slower than the forward iterator, but on
          reasonable inputs their costs are similar.
    
       c. UWordBoundIndices: the above iterator, but returning tuples of
          (offset, &str).
    
    2. Adds three new functions in the `UnicodeStr` trait:
    
       a. words_unicode(): returns a UnicodeWords iterator.
    
       b. split_words_uax29(): returns a UWordBounds iterator.
    
       c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.
    
    3. Updates the `src/etc/unicode.py` script to generate tables necessary
       for running the UWordBounds iterators.
    
    4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
       which processes the grapheme and word break tests published
       by the Unicode consortium into a format for inclusion in
       libcollectionstest.
    
    5. Adds new impls in libcollections's `str` corresponding to the
       `UnicodeStr` functions of (2).
    
       Note that this new functionality is gated with `feature(unicode)`.
    
    6. Adds tests in libcollectionstest to exercise this new functionality.
    
       In addition, updates the test data for the graphemes test to
       correspond to the output from the script of (4). (Note that at the
       moment this change is primarily cosmetic.)
    
    This patch does not settle the question raised by @huonw in rust-lang#15628;
    rather, it introduces a new function alongside `words()` that follows
    UAX#29.
    
    In addition, it does not address the concerns that @SimonSapin raises in
    rust-lang/rfcs#1054 since it leaves `words()`
    alone.
    kwantam committed Apr 12, 2015
    043aca3
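The relationship between the three iterators described in the commit message (a segmenter, the same segmenter with offsets, and a word filter on top of it) can be illustrated with a minimal, self-contained sketch. This is not the patch's code and it does not implement the UAX#29 rules: the `toy_*` function names are hypothetical, the boundary rule is a deliberate simplification (a new segment starts whenever adjacent characters differ in whether they are alphanumeric), and the filter criterion (keep segments containing at least one alphanumeric character) is an assumption about how "words" might be selected.

```rust
/// Toy stand-in for an indices-style segmenter: yields (byte offset, segment).
/// NOTE: this is NOT UAX#29. A boundary is placed wherever two adjacent
/// characters differ in `char::is_alphanumeric`.
fn toy_bound_indices(s: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let cur = c.is_alphanumeric();
        if let Some(p) = prev {
            if p != cur {
                // Character class changed: close the current segment.
                out.push((start, &s[start..i]));
                start = i;
            }
        }
        prev = Some(cur);
    }
    if start < s.len() {
        // Emit the trailing segment.
        out.push((start, &s[start..]));
    }
    out
}

/// Toy stand-in for a bounds-style segmenter: same segments, offsets dropped.
/// Nothing is discarded, so the segments concatenate back to the input.
fn toy_bounds(s: &str) -> Vec<&str> {
    toy_bound_indices(s).into_iter().map(|(_, seg)| seg).collect()
}

/// Toy stand-in for a words-style filter: keeps only segments that contain
/// at least one alphanumeric character (an assumed criterion).
fn toy_words(s: &str) -> Vec<&str> {
    toy_bounds(s)
        .into_iter()
        .filter(|seg| seg.chars().any(|c| c.is_alphanumeric()))
        .collect()
}
```

For the input `"Hi, you!"`, `toy_bounds` keeps the punctuation and whitespace segments while `toy_words` drops them. Real UAX#29 segmentation is finer-grained than this toy rule; for example, it keeps an apostrophe inside a word such as "can't", which the sketch above would split.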