Separate to_lowercase() into correct Unicode and simple implementations #26244

kornelski · 2015-06-12T13:42:29Z

I think there are two distinct use cases for string lowercasing:

to display a lowercased string to a user
to manipulate strings in string algorithms (e.g. building a "case-insensitive" trie or other kind of index). Having only a Unicode-aware case-insensitive comparison function is often not enough.

Currently, the locale-unaware to_lowercase tries to serve both, but does neither quite right. It isn't fully correct for the first case (it handles Greek, see #26035, but doesn't handle Turkish), and its quirks make it difficult to use safely in the second case.
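To make the two quirks concrete, here is a minimal sketch of what the current str::to_lowercase does for the Greek and Turkish cases mentioned above. The sample strings are illustrative choices, not taken from the issue, and are written as escapes so the exact code points are unambiguous.

```rust
fn main() {
    // Greek: a word-final capital sigma (U+03A3) lowercases to the final
    // form 'ς' (U+03C2) rather than 'σ' (U+03C3), so the result depends on
    // the surrounding characters.
    let greek = "\u{391}\u{3A3}"; // "ΑΣ"
    assert_eq!(greek.to_lowercase(), "\u{3B1}\u{3C2}"); // "ας"

    // Turkish: with no locale parameter, 'I' always becomes the dotted 'i'.
    // A Turkish reader would expect the dotless 'ı' (U+0131) here.
    assert_eq!("I".to_lowercase(), "i");

    // 'İ' (U+0130) becomes 'i' followed by a combining dot above (U+0307),
    // whereas Turkish lowercasing would give a plain 'i'.
    assert_eq!("\u{130}".to_lowercase(), "i\u{307}");

    println!("all assertions passed");
}
```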

Therefore I suggest splitting this function into two, e.g. to_locale_lowercase(locale) and to_partial_lowercase(). The first would fully implement Unicode case mapping (it requires a locale and is good for displaying strings to people). The second would be incorrect in many cases and shouldn't be displayed to users, but would preserve the simple invariants of ASCII lowercasing that make it useful and safe for algorithms that need code-point-wise lowercasing.

The partial implementation should satisfy the following invariants for all valid strings a and b:

lower(a) == lower(upper(a)) // No ß/SS
lower(a) == lower(lower(a))
lower(a) == lower(b) <=> upper(a) == upper(b)
lower(a + b) == lower(a) + lower(b) // No Σ/σ/ς
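As a rough check of why these invariants matter, the sketch below tests three of them against the existing standard-library functions. The lower and upper wrappers are hypothetical helper names standing in for the notation above; they simply call str::to_lowercase and str::to_uppercase.

```rust
// Hypothetical wrappers matching the notation used in the invariants.
fn lower(s: &str) -> String {
    s.to_lowercase()
}

fn upper(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    // lower(a) == lower(upper(a)) fails for 'ß': upper("ß") is "SS",
    // and lower("SS") is "ss", not "ß".
    let a = "ß";
    assert_eq!(upper(a), "SS");
    assert_ne!(lower(a), lower(&upper(a)));

    // lower(a) == lower(b) <=> upper(a) == upper(b) fails for "ß" and "ss":
    // the uppercase forms agree while the lowercase forms differ.
    let b = "ss";
    assert_eq!(upper(a), upper(b));
    assert_ne!(lower(a), lower(b));

    // lower(a + b) == lower(a) + lower(b) fails for Greek sigma: a trailing
    // 'Σ' becomes 'ς' after a letter, but 'σ' when lowercased on its own.
    let (x, y) = ("\u{391}", "\u{3A3}"); // "Α", "Σ"
    let joined = format!("{}{}", x, y);
    assert_ne!(lower(&joined), lower(x) + &lower(y));

    println!("all assertions passed");
}
```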


petrochenkov · 2015-06-12T22:40:59Z

+1
I usually need fast (code-point-wise) but possibly imperfect transformations for computational linguistics tasks, and ASCII-only (byte-wise) transformations for controlled ASCII text, but rarely the full, precise, context-sensitive and locale-aware Unicode machinery.

petrochenkov · 2017-02-19T21:06:16Z

cc #39659

SimonSapin · 2017-02-22T15:24:17Z

to_ascii_lowercase already exists. Or do you mean something simpler than the current to_lowercase, but that still handles some non-ASCII cases? Is there an exact algorithm specified somewhere that this should implement?

As for locale-aware case mapping, I think this is starting to be outside of standard library territory and should be on crates.io instead. Such a library might want not just the language-dependent entries of SpecialCasing.txt (only a few are included) but also a full CLDR database. There’s a lot of complexity there, starting with “What is even a locale?”

dtolnay · 2017-11-15T08:50:59Z

I don't believe we need three different variations of to_lowercase in std.

There are four tiers that have been raised in this issue:

Dead simple, such as you might use for converting an md5 hash from uppercase hex to lowercase hex. This is provided by the standard library in str::to_ascii_lowercase.
Best-effort Unicode aware. This is provided by the standard library in str::to_lowercase and is probably what people who don't know what they want, want. Hence the straightforward name.
Somewhere in between the previous two: non-ASCII, code-point-wise lowercase. I see how this could be useful, but adding a third variant of to_lowercase is going to be a hard sell. This feels like a niche use case to me and would be served sufficiently well by a crate.
Locale aware case mapping. I agree with Simon that this involves sophistication beyond what we would be comfortable placing in std. This would be better suited for a crate.
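The first three tiers can be contrasted in a short sketch. The function lower_codepointwise is a hypothetical name, and mapping each char through char::to_lowercase independently is only one possible reading of "code-point-wise lowercase"; as noted in this thread, no exact algorithm has been specified.

```rust
// Hypothetical tier-3 helper: lowercase each code point on its own,
// ignoring context such as the Greek word-final sigma rule.
fn lower_codepointwise(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}

fn main() {
    // Tier 1: ASCII only, e.g. normalizing an uppercase hex digest.
    assert_eq!("D41D8CD9".to_ascii_lowercase(), "d41d8cd9");

    let word = "\u{391}\u{3A3}"; // "ΑΣ"

    // Tier 1 leaves non-ASCII text untouched.
    assert_eq!(word.to_ascii_lowercase(), word);

    // Tier 2: best-effort Unicode, context-sensitive ("ας", final sigma).
    assert_eq!(word.to_lowercase(), "\u{3B1}\u{3C2}");

    // Tier 3 (assumed definition): context-free mapping ("ασ").
    assert_eq!(lower_codepointwise(word), "\u{3B1}\u{3C3}");

    println!("all assertions passed");
}
```

Because this helper maps each code point independently, it distributes over concatenation. It still does not satisfy the ß/SS round-trip invariant proposed earlier, and a single code point such as 'İ' can expand to two, so a real implementation would need a more careful definition.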

I am closing this issue because I would like to see this explored in a crate instead. Once there is an implementation and more clarity around what algorithm we mean by codepointwise lowercase, if people still believe this needs to be in std, I would be open to reconsidering.

steveklabnik added the A-libs label Jun 12, 2015

steveklabnik added the C-enhancement and T-libs-api labels Nov 15, 2016

steveklabnik removed the A-libs label Mar 24, 2017

Mark-Simulacrum added the C-feature-request label and removed the C-enhancement label Sep 10, 2017

dtolnay closed this as completed Nov 15, 2017

rth mentioned this issue May 8, 2019

Existing work: Text normalization rust-ml/nlp-discussion#2

rth mentioned this issue Nov 19, 2019

Make to_ascii_lowercase optional rth/vtext#63
