Rename or replace str::words, to side-step the ambiguity of “a word” #1054
Conversation
👍 from me
This patch does the following:

1. Adds three new structs in libunicode/str.rs:
   a. `UnicodeWords`: a filter on the `UWordBounds` iterator that yields only the "words" of a string as defined in Section 4 of Unicode Standard Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries
   b. `UWordBounds`: an iterator that segments a string on its word boundaries as defined in UAX#29. Note that this *only* segments the string, and does *not* drop whitespace and other non-word pieces of the text (that's what `UnicodeWords` does). `UWordBounds` has both a forward and a backward iterator whose total running time (that is, to segment the entire string) is linear in the size of the string. With pathological inputs the reverse iterator can be about 2x less efficient than the forward iterator, but on reasonable inputs their costs are similar.
   c. `UWordBoundIndices`: the above iterator, but returning tuples of (offset, &str).
2. Adds three new functions in the `UnicodeStr` trait:
   a. `words_unicode()`: returns a `UnicodeWords` iterator.
   b. `split_words_uax29()`: returns a `UWordBounds` iterator.
   c. `split_words_uax29_indices()`: returns a `UWordBoundIndices` iterator.
3. Updates the `src/etc/unicode.py` script to generate the tables needed by the `UWordBounds` iterators.
4. Adds a new script, `src/etc/unicode_gen_breaktests.py`, which processes the grapheme and word break tests published by the Unicode consortium into a format for inclusion in libcollectionstest.
5. Adds new impls in libcollections's `str` corresponding to the `UnicodeStr` functions of (2). Note that this new functionality is gated with `feature(unicode)`.
6. Adds tests in libcollectionstest to exercise this new functionality. In addition, updates the test data for the graphemes test to correspond to the output from the script of (4). (At the moment this change is primarily cosmetic.)
This patch does not settle the question raised by @huonw in rust-lang#15628; rather, it introduces a new function alongside `words()` that follows UAX#29. In addition, it does not address the concerns that @SimonSapin raises in rust-lang/rfcs#1054 since it leaves `words()` alone.
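To give a feel for the segment-then-filter design described above, here is a heavily simplified sketch in plain Rust. It is *not* the patch's implementation and does not follow the UAX#29 rules (which handle apostrophes, numeric separators, scripts without spaces, and much more); `toy_word_bounds` and `toy_words` are hypothetical names standing in for `UWordBounds` and `UnicodeWords`:

```rust
// Toy analogue of UWordBounds: segment a string into maximal runs of
// alphanumeric vs. non-alphanumeric characters. Every byte of the input
// appears in exactly one segment (nothing is dropped).
fn toy_word_bounds(s: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None; // was the previous char alphanumeric?
    for (i, c) in s.char_indices() {
        let kind = c.is_alphanumeric();
        if let Some(p) = prev {
            if p != kind {
                segments.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(kind);
    }
    if start < s.len() {
        segments.push(&s[start..]);
    }
    segments
}

// Toy analogue of UnicodeWords: filter the bounds iterator, keeping only
// the segments that contain a word character.
fn toy_words(s: &str) -> Vec<&str> {
    toy_word_bounds(s)
        .into_iter()
        .filter(|seg| seg.chars().any(char::is_alphanumeric))
        .collect()
}

fn main() {
    // The bounds pass keeps the ", " separator; the words pass drops it.
    assert_eq!(toy_word_bounds("Hi, world"), ["Hi", ", ", "world"]);
    assert_eq!(toy_words("Hi, world"), ["Hi", "world"]);
    println!("ok");
}
```

The point of the two-layer design is visible even in the toy: boundary segmentation is lossless (useful for editors, cursors, highlighting), while "words" is a lossy view layered on top.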
Some relevant discussion starting here: rust-lang/rust#15628 (comment)
I like the proposed approach. This provides a low-cost convenience wrapper for something many people will otherwise get wrong. How wrong, you ask? Splitting on the ASCII space only. Python does not provide any proper convenience wrapper in its standard library, and in all code I’ve seen, literally all word splitting is done over the ASCII space.
“Wrong” depends on what you’re doing. Sometimes you might be e.g. parsing a space-separated file format, and splitting on non-breaking spaces or other non-ASCII whitespace is not what you want. That said, I’m becoming less and less convinced that this should be included at all. I’d like to see use cases. (rust-lang/rust#15628 (comment))
And if you’re in a situation where considering only ASCII is wrong, isn’t considering only whitespace also wrong and shouldn’t you be looking for Unicode word boundaries? rust-lang/rust#15628 |
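To make the ASCII-space-vs-Unicode-whitespace distinction concrete, a small sketch using today's stable `str` API (`split_whitespace` is the name `words()` eventually received as a result of this RFC):

```rust
fn main() {
    // U+00A0 NO-BREAK SPACE separates the first two words, not an ASCII space.
    let s = "foo\u{00A0}bar baz";

    // Splitting on the ASCII space misses the non-breaking space entirely.
    let ascii: Vec<&str> = s.split(' ').collect();
    assert_eq!(ascii, ["foo\u{00A0}bar", "baz"]);

    // Splitting on Unicode whitespace (what `words()` did) handles it,
    // because U+00A0 has the White_Space property.
    let unicode: Vec<&str> = s.split_whitespace().collect();
    assert_eq!(unicode, ["foo", "bar", "baz"]);
    println!("ok");
}
```

Whether the second behavior is "right" is exactly the point under debate: for prose it usually is, for a space-separated file format it may not be.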
> you shouldn’t use

Indeed. Since I’ve previously been doing web development almost exclusively, I found that people think words in prose will always have an ASCII space between them. Assuming they will be separated by some Unicode whitespace instead is just as wrong. On the other hand, nothing in
I think the standard library should only implement things that have an obvious "best" implementation. If there are too many complexities and questions about what IS best, it should be sorted out in the crates.io ecosystem. |
👍 on renaming. Just 2 days ago I wrote some code that needed to split a line on whitespace. FWIW, I think I forgot about it at the time.
More generally, I think people expect that splitting on whitespace should be an easy thing to do. And I agree with them, it really should. It's not always the correct thing to do, but it's a common enough thing to want.
I agree that splitting on whitespace is a common enough thing that providing a convenience function for it is a good idea, and that calling said function
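A sketch of why a dedicated method beats the "obvious" one-liner (using the current std API; the method names are real, the example strings are made up):

```rust
fn main() {
    let line = "  alpha\tbeta  gamma ";

    // `split(' ')` yields an empty string for every extra separator,
    // and doesn't handle tabs at all.
    let naive: Vec<&str> = line.split(' ').collect();
    assert!(naive.contains(&""));
    assert!(naive.contains(&"alpha\tbeta"));

    // The hand-rolled fix is a predicate split plus a filter...
    let by_hand: Vec<&str> = line
        .split(|c: char| c.is_whitespace())
        .filter(|s| !s.is_empty())
        .collect();
    assert_eq!(by_hand, ["alpha", "beta", "gamma"]);

    // ...which is exactly what the convenience method packages up.
    assert_eq!(line.split_whitespace().collect::<Vec<_>>(), by_hand);
    println!("ok");
}
```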
Why move it out of libunicode?
No particularly strong reason, but it seems to me that it doesn't directly rely on any Unicode tables (though it does indirectly, through the
The problem with moving it into libcollections is that crates which only use libunicode/libcore (but not liballoc) would lose access to this function. Note that most of libcollections's
Good point. 👍
I personally think using
A regex is waaaay overkill for this problem, and
You won't tell them to use split "only"; you'll let them google it and then find it. I think std should
But I am clearly not the best adviser; I just wanted to give my opinion.
@kballard
👍 Ship it!
Slight sticking point: while
I wonder if this wouldn't be better tackled with a pattern API as mentioned in the alternatives, e.g. One could probably even use an associated constant/type on the
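For the simple cases, the pattern idea needs no new machinery today: `str::split` already accepts any `FnMut(char) -> bool` as a pattern, so a named "whitespace pattern" can be approximated with a function value. A sketch (the idea of pre-baked pattern constants is from the comment above; nothing here is a proposed API):

```rust
fn main() {
    // A closure implementing FnMut(char) -> bool is already a valid
    // Pattern, so this approximates a reusable `Whitespace` pattern value:
    let whitespace = |c: char| c.is_whitespace();
    let parts: Vec<&str> = "a b\tc".split(whitespace).collect();
    assert_eq!(parts, ["a", "b", "c"]);

    // The method reference `char::is_whitespace` works as a pattern too:
    let parts2: Vec<&str> = "a b\tc".split(char::is_whitespace).collect();
    assert_eq!(parts2, parts);
    println!("ok");
}
```

Note that this still yields empty strings between adjacent separators, which is one reason a dedicated whitespace-splitting method remains attractive.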
Thanks again for the RFC @SimonSapin! The feedback on this has been quite positive, so I'm going to merge this. The fact that
Just to note, while I like @huonw's general idea of providing some pre-baked patterns for common things, I don't think that is actually a suitable replacement for
It should be possible to provide a pattern for
@kwantam Ah hmm, you're right. I actually didn't realize we had a
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them sometimes yield more than one `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation, which we decided not to include in `std`: rust-lang/rfcs#1054. I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase`, which is already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` signature on `str` seems straightforward enough, and the only relevant issue I’ve found is #24536, about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
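Both the "more than one `char`" point and the final-Sigma special case are easy to observe with the now-stable APIs this work produced (a small demonstration, not part of the patch):

```rust
fn main() {
    // "Complex" mapping: 'ß' upper-cases to two chars ("SS"), which is why
    // char::to_uppercase yields an iterator of chars rather than one char.
    let upper: String = 'ß'.to_uppercase().collect();
    assert_eq!(upper, "SS");

    // Conditional mapping: capital Sigma lower-cases to ς (U+03C2) in
    // word-final position, and to σ (U+03C3) elsewhere.
    assert_eq!("ΟΣ".to_lowercase(), "ο\u{03C2}");   // final sigma
    assert_eq!("ΟΣΟ".to_lowercase(), "ο\u{03C3}ο"); // medial sigma
    println!("ok");
}
```

The Sigma rule is why `str::to_lowercase` cannot simply map `char::to_lowercase` over the string: it needs context on both sides of each character.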