Rename or replace str::words, to side-step the ambiguity of “a word” #1054
Conversation
👍 from me
This patch does the following:

1. Adds three new structs in libunicode/str.rs:
   a. `UnicodeWords`: a filter on the `UWordBounds` iterator that yields only the "words" of a string as defined in Section 4 of Unicode Standard Annex #29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries
   b. `UWordBounds`: an iterator that segments a string on its word boundaries as defined in UAX#29. Note that this *only* segments the string, and does *not* drop whitespace and other non-word pieces of the text (that's what `UnicodeWords` does). `UWordBounds` has both a forward and a backward iterator whose total running time (that is, to segment the entire string) is linear in the size of the string. With pathological inputs the reverse iterator can be about 2x less efficient than the forward iterator, but on reasonable inputs their costs are similar.
   c. `UWordBoundIndices`: the above iterator, but returning tuples of (offset, &str).
2. Adds three new functions in the `UnicodeStr` trait:
   a. `words_unicode()`: returns a `UnicodeWords` iterator.
   b. `split_words_uax29()`: returns a `UWordBounds` iterator.
   c. `split_words_uax29_indices()`: returns a `UWordBoundIndices` iterator.
3. Updates the `src/etc/unicode.py` script to generate the tables needed by the `UWordBounds` iterators.
4. Adds a new script, `src/etc/unicode_gen_breaktests.py`, which processes the grapheme and word break tests published by the Unicode consortium into a format for inclusion in libcollectionstest.
5. Adds new impls in libcollections's `str` corresponding to the `UnicodeStr` functions of (2). Note that this new functionality is gated with `feature(unicode)`.
6. Adds tests in libcollectionstest to exercise this new functionality. In addition, updates the test data for the graphemes test to correspond to the output from the script of (4). (At the moment this change is primarily cosmetic.)
This patch does not settle the question raised by @huonw in rust-lang#15628; rather, it introduces a new function alongside `words()` that follows UAX#29. In addition, it does not address the concerns that @SimonSapin raises in rust-lang/rfcs#1054 since it leaves `words()` alone.
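To give a feel for the segment-then-filter design described above, here is a heavily simplified sketch in plain Rust. It is *not* the patch's implementation and does not follow the UAX#29 rules (which handle apostrophes, numeric separators, scripts without spaces, and much more); `toy_word_bounds` and `toy_words` are hypothetical names standing in for `UWordBounds` and `UnicodeWords`:

```rust
// Toy analogue of UWordBounds: segment a string into maximal runs of
// alphanumeric vs. non-alphanumeric characters. Every byte of the input
// appears in exactly one segment (nothing is dropped).
fn toy_word_bounds(s: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None; // was the previous char alphanumeric?
    for (i, c) in s.char_indices() {
        let kind = c.is_alphanumeric();
        if let Some(p) = prev {
            if p != kind {
                segments.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(kind);
    }
    if start < s.len() {
        segments.push(&s[start..]);
    }
    segments
}

// Toy analogue of UnicodeWords: filter the bounds iterator, keeping only
// the segments that contain a word character.
fn toy_words(s: &str) -> Vec<&str> {
    toy_word_bounds(s)
        .into_iter()
        .filter(|seg| seg.chars().any(char::is_alphanumeric))
        .collect()
}

fn main() {
    // The bounds pass keeps the ", " separator; the words pass drops it.
    assert_eq!(toy_word_bounds("Hi, world"), ["Hi", ", ", "world"]);
    assert_eq!(toy_words("Hi, world"), ["Hi", "world"]);
    println!("ok");
}
```

The point of the two-layer design is visible even in the toy: boundary segmentation is lossless (useful for editors, cursors, highlighting), while "words" is a lossy view layered on top.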
Some relevant discussion starting here: rust-lang/rust#15628 (comment)
I like the proposed approach. This provides a low-cost convenience wrapper for something many people will otherwise get wrong. How wrong, you ask? Splitting on the ASCII space only. Python does not provide any proper convenience wrapper in its standard library, and in all code I’ve seen, literally all word splitting is done over the ASCII space.
“Wrong” depends on what you’re doing. Sometimes you might be e.g. parsing a space-separated file format, and splitting on non-breaking spaces or other non-ASCII whitespace is not what you want. That said, I’m becoming less and less convinced that this should be included at all. I’d like to see use cases. (rust-lang/rust#15628 (comment))
And if you’re in a situation where considering only ASCII is wrong, isn’t considering only whitespace also wrong and shouldn’t you be looking for Unicode word boundaries? rust-lang/rust#15628 |
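To make the ASCII-space-vs-Unicode-whitespace distinction concrete, a small sketch using today's stable `str` API (`split_whitespace` is the name `words()` eventually received as a result of this RFC):

```rust
fn main() {
    // U+00A0 NO-BREAK SPACE separates the first two words, not an ASCII space.
    let s = "foo\u{00A0}bar baz";

    // Splitting on the ASCII space misses the non-breaking space entirely.
    let ascii: Vec<&str> = s.split(' ').collect();
    assert_eq!(ascii, ["foo\u{00A0}bar", "baz"]);

    // Splitting on Unicode whitespace (what `words()` did) handles it,
    // because U+00A0 has the White_Space property.
    let unicode: Vec<&str> = s.split_whitespace().collect();
    assert_eq!(unicode, ["foo", "bar", "baz"]);
    println!("ok");
}
```

Whether the second behavior is "right" is exactly the point under debate: for prose it usually is, for a space-separated file format it may not be.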
> you shouldn’t use

Indeed. Since I’ve previously been doing web development almost exclusively, I found that people think words in prose will always have an ASCII space between them. Assuming they will be separated by some Unicode whitespace instead is just as wrong. On the other hand, nothing in
I think the standard library should only implement things that have an obvious "best" implementation. If there are too many complexities and questions about what IS best, it should be sorted out in the crates.io ecosystem. |
👍 on renaming. Just 2 days ago I wrote some code that needed to split a line on whitespace. FWIW, I think I forgot about it at the time.
More generally, I think people expect that splitting on whitespace should be an easy thing to do. And I agree with them, it really should. It's not always the correct thing to do, but it's a common enough thing to want.
I agree that splitting on whitespace is a common enough thing that providing a convenience function for it is a good idea, and that calling said function
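A sketch of why a dedicated method beats the "obvious" one-liner (using the current std API; the method names are real, the example strings are made up):

```rust
fn main() {
    let line = "  alpha\tbeta  gamma ";

    // `split(' ')` yields an empty string for every extra separator,
    // and doesn't handle tabs at all.
    let naive: Vec<&str> = line.split(' ').collect();
    assert!(naive.contains(&""));
    assert!(naive.contains(&"alpha\tbeta"));

    // The hand-rolled fix is a predicate split plus a filter...
    let by_hand: Vec<&str> = line
        .split(|c: char| c.is_whitespace())
        .filter(|s| !s.is_empty())
        .collect();
    assert_eq!(by_hand, ["alpha", "beta", "gamma"]);

    // ...which is exactly what the convenience method packages up.
    assert_eq!(line.split_whitespace().collect::<Vec<_>>(), by_hand);
    println!("ok");
}
```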
Why move it out of libunicode?
No particularly strong reason, but it seems to me that it doesn't directly rely on any Unicode tables (though it does indirectly, through the
The problem with moving it into libcollections is that crates which only use libunicode/libcore (but not liballoc) would lose access to this function. Note that most of libcollections's
Good point. 👍
I personally think using
A regex is waaaay overkill for this problem, and
You won't tell them to use split "only"; you'll let them google it and then find it. I think std should
But I am clearly not the best adviser; I just wanted to give my opinion.
@kballard
👍 Ship it!
Slight sticking point: while
I wonder if this wouldn't be better tackled with a pattern API as mentioned in the alternatives, e.g. One could probably even use an associated constant/type on the
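For the simple cases, the pattern idea needs no new machinery today: `str::split` already accepts any `FnMut(char) -> bool` as a pattern, so a named "whitespace pattern" can be approximated with a function value. A sketch (the idea of pre-baked pattern constants is from the comment above; nothing here is a proposed API):

```rust
fn main() {
    // A closure implementing FnMut(char) -> bool is already a valid
    // Pattern, so this approximates a reusable `Whitespace` pattern value:
    let whitespace = |c: char| c.is_whitespace();
    let parts: Vec<&str> = "a b\tc".split(whitespace).collect();
    assert_eq!(parts, ["a", "b", "c"]);

    // The method reference `char::is_whitespace` works as a pattern too:
    let parts2: Vec<&str> = "a b\tc".split(char::is_whitespace).collect();
    assert_eq!(parts2, parts);
    println!("ok");
}
```

Note that this still yields empty strings between adjacent separators, which is one reason a dedicated whitespace-splitting method remains attractive.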
Thanks again for the RFC @SimonSapin! The feedback on this has been quite positive, so I'm going to merge this. The fact that
Just to note, while I like @huonw's general idea of providing some pre-baked patterns for common things, I don't think that is actually a suitable replacement for
It should be possible to provide a pattern for
@kwantam Ah hmm, you're right. I actually didn't realize we had a
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them sometimes yield more than one `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation, which we decided not to include in `std`: rust-lang/rfcs#1054. I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase`, which is already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` signature on `str` seems straightforward enough, and the only relevant issue I’ve found is #24536, about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
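Both the "more than one `char`" point and the final-Sigma special case are easy to observe with the now-stable APIs this work produced (a small demonstration, not part of the patch):

```rust
fn main() {
    // "Complex" mapping: 'ß' upper-cases to two chars ("SS"), which is why
    // char::to_uppercase yields an iterator of chars rather than one char.
    let upper: String = 'ß'.to_uppercase().collect();
    assert_eq!(upper, "SS");

    // Conditional mapping: capital Sigma lower-cases to ς (U+03C2) in
    // word-final position, and to σ (U+03C3) elsewhere.
    assert_eq!("ΟΣ".to_lowercase(), "ο\u{03C2}");   // final sigma
    assert_eq!("ΟΣΟ".to_lowercase(), "ο\u{03C3}ο"); // medial sigma
    println!("ok");
}
```

The Sigma rule is why `str::to_lowercase` cannot simply map `char::to_lowercase` over the string: it needs context on both sides of each character.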