From 2973b9976c3c9b1726e5755ba999235ae5bfda1e Mon Sep 17 00:00:00 2001
From: Simon Sapin
Date: Fri, 10 Apr 2015 16:50:51 +0200
Subject: [PATCH 1/2] =?UTF-8?q?Rename=20or=20replace=20`str::words`=20to?=
 =?UTF-8?q?=20side-step=20the=20ambiguity=20of=20=E2=80=9Ca=20word?=
 =?UTF-8?q?=E2=80=9D.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 text/0000-str-words.md | 67 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 text/0000-str-words.md

diff --git a/text/0000-str-words.md b/text/0000-str-words.md
new file mode 100644
index 00000000000..f91843d1fc7
--- /dev/null
+++ b/text/0000-str-words.md
@@ -0,0 +1,67 @@
+- Feature Name: str-words
+- Start Date: 2015-04-10
+- RFC PR:
+- Rust Issue:
+
+# Summary
+
+Rename or replace `str::words` to side-step the ambiguity of “a word”.
+
+
+# Motivation
+
+The [`str::words`](http://doc.rust-lang.org/std/primitive.str.html#method.words) method
+is currently marked `#[unstable(reason = "the precise algorithm to use is unclear")]`.
+Indeed, the concept of “a word” is not easy to define in precense of punctuation
+or languages with various conventions, including not using spaces at all to separate words.
+
+[Issue #15628](https://github.com/rust-lang/rust/issues/15628) suggests
+changing the algorithm to be based on [the *Word Boundaries* section of
+*Unicode Standard Annex #29: Unicode Text Segmentation*](http://www.unicode.org/reports/tr29/#Word_Boundaries).
+
+While a Rust implemention of UAX#29 would be useful, it belong on crates.io more than in `std`:
+
+* It carries significant complexity that may be surprising from something that looks as simple
+  as a parameter-less “words” method in the standard library.
+  Users may not be aware of how subtle defining “a word” can be.
+* It is not a definitive answer. The standard itself notes:
+
+  > It is not possible to provide a uniform set of rules that resolves all issues across languages
+  > or that handles all ambiguous situations within a given language.
+  > The goal for the specification presented in this annex is to provide a workable default;
+  > tailored implementations can be more sophisticated.
+
+  and gives many examples of such ambiguous situations.
+
+Therefore, `std` would be better off avoiding the question of defining word boundaries entirely.
+
+
+# Detailed design
+
+Rename the `words` method to `split_whitespace`, and keep the current behavior unchanged.
+(That is, return an iterator equivalent to `s.split(char::is_whitespace).filter(|s| !s.is_empty())`.)
+
+Rename the return type `std::str::Words` to `std::str::SplitWhitespace`.
+
+Optionally, keep a `words` wrapper method for a while, both `#[deprecated]` and `#[unstable]`,
+with an error message that suggests `split_whitespace` or the chosen alternative.
+
+
+# Drawbacks
+
+`split_whitespace` is very similar to the existing `str::split(&self, P)` method,
+and having a separate method seems like weak API design. (But see below.)
+
+
+# Alternatives
+
+* Replace `str::words` with `struct Whitespace;` with a custom `Pattern` implementation,
+  which can be used in `str::split`.
+  However this requires the `Whitespace` symbol to be imported separately.
+* Remove `str::words` entirely and tell users to use
+  `s.split(char::is_whitespace).filter(|s| !s.is_empty())` instead.
+
+
+# Unresolved questions
+
+Is there a better alternative?
From 885fbdaed0c485f8df44182ceb0bbea2e07e6883 Mon Sep 17 00:00:00 2001
From: Simon Sapin
Date: Fri, 10 Apr 2015 17:10:27 +0200
Subject: [PATCH 2/2] Spelling

---
 text/0000-str-words.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0000-str-words.md b/text/0000-str-words.md
index f91843d1fc7..04bc7875220 100644
--- a/text/0000-str-words.md
+++ b/text/0000-str-words.md
@@ -12,14 +12,14 @@ Rename or replace `str::words` to side-step the ambiguity of “a word”.
 
 The [`str::words`](http://doc.rust-lang.org/std/primitive.str.html#method.words) method
 is currently marked `#[unstable(reason = "the precise algorithm to use is unclear")]`.
-Indeed, the concept of “a word” is not easy to define in precense of punctuation
+Indeed, the concept of “a word” is not easy to define in presence of punctuation
 or languages with various conventions, including not using spaces at all to separate words.
 
 [Issue #15628](https://github.com/rust-lang/rust/issues/15628) suggests
 changing the algorithm to be based on [the *Word Boundaries* section of
 *Unicode Standard Annex #29: Unicode Text Segmentation*](http://www.unicode.org/reports/tr29/#Word_Boundaries).
 
-While a Rust implemention of UAX#29 would be useful, it belong on crates.io more than in `std`:
+While a Rust implementation of UAX#29 would be useful, it belong on crates.io more than in `std`:
 
 * It carries significant complexity that may be surprising from something that looks as simple
   as a parameter-less “words” method in the standard library.
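
The equivalence stated in the *Detailed design* section of the RFC text (patch 1/2) can be illustrated with a small sketch. It assumes a toolchain in which the proposed `str::split_whitespace` name is available. The helper names `split_long_hand` and `split_with_method` are hypothetical and exist only for this illustration; they are not part of the RFC.

```rust
/// Hypothetical helper: the long-hand expression quoted in the RFC text.
/// Splits on every whitespace character, then drops the empty pieces that
/// appear between consecutive whitespace characters and at either end.
fn split_long_hand(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

/// Hypothetical helper: the same split through the renamed method
/// (assumes `str::split_whitespace` exists on the toolchain in use).
fn split_with_method(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}
```

For an input such as `"Hello, world!  It's\t2015.\n"`, both helpers should yield `["Hello,", "world!", "It's", "2015."]`. Punctuation stays attached to the neighbouring text, which is the reason the name `split_whitespace` describes the behavior more accurately than `words`.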