Separate to_lowercase() into correct Unicode and simple implementations #26244

Closed
kornelski opened this issue Jun 12, 2015 · 4 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@kornelski
Contributor

I think there are two distinct use cases for string lowercasing:

  1. to display a lowercased string to a user
  2. to manipulate strings in string algorithms (e.g. building a "case-insensitive" trie or other kind of index. Having only a Unicode-aware case-insensitive comparison function is often not enough.)

Currently the locale-unaware to_lowercase tries to do both, but does neither quite right. It isn't quite correct for the first case (it handles Greek, see #26035, but doesn't handle Turkish), and it's quirky, which makes it difficult to use safely in the second case.
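
To make the Greek/Turkish point concrete, here is a small sketch of what `str::to_lowercase` does with a few inputs. The function name `check_locale_unaware_lowercase` is made up for illustration; only the std methods are real.

```rust
/// Hypothetical helper: asserts what the locale-unaware `str::to_lowercase`
/// returns for a few Greek and Turkish-relevant inputs.
fn check_locale_unaware_lowercase() {
    // Greek: a word-final capital sigma becomes the final form 'ς'.
    assert_eq!("ΟΔΟΣ".to_lowercase(), "οδος");
    // Turkish text would expect dotless 'ı' (U+0131) here, but std takes no
    // locale, so the result is always a plain 'i'.
    assert_eq!("I".to_lowercase(), "i");
    // U+0130 (capital I with dot above) maps to 'i' followed by
    // U+0307 COMBINING DOT ABOVE, i.e. two code points.
    assert_eq!("\u{130}".to_lowercase(), "i\u{307}");
}
```

So the Greek context rule is applied, while the Turkish one cannot be, because there is no way to tell the function which language the text is in.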

Therefore I suggest splitting this function into two, e.g., to_locale_lowercase(locale) and to_partial_lowercase(): one that fully implements Unicode (requires a locale to be specified and is good for displaying strings to people), and another that is incorrect in many cases and shouldn't be displayed to users, but preserves the simple invariants of ASCII lowercasing that make it useful and safe for algorithms that need code-point-wise lowercasing.

The partial implementation should satisfy the following invariants for all valid strings a and b:

lower(a) == lower(upper(a)) // No ß/SS
lower(a) == lower(lower(a))
lower(a) == lower(b) <=> upper(a) == upper(b)
lower(a + b) == lower(a) + lower(b) // No Σ/σ/ς
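
As a rough illustration of why these invariants matter, the sketch below checks three of them against the existing `str::to_lowercase` / `str::to_uppercase`. The helper names (`lower`, `upper`, `round_trip_holds`, `equivalence_holds`, `concat_holds`) are invented for this example and are not std APIs.

```rust
fn lower(s: &str) -> String {
    s.to_lowercase()
}

fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// lower(a) == lower(upper(a))
fn round_trip_holds(a: &str) -> bool {
    lower(a) == lower(&upper(a))
}

/// lower(a) == lower(b) <=> upper(a) == upper(b)
fn equivalence_holds(a: &str, b: &str) -> bool {
    (lower(a) == lower(b)) == (upper(a) == upper(b))
}

/// lower(a + b) == lower(a) + lower(b)
fn concat_holds(a: &str, b: &str) -> bool {
    lower(&format!("{}{}", a, b)) == format!("{}{}", lower(a), lower(b))
}
```

With the current std behaviour, `round_trip_holds("ß")` is false (ß uppercases to "SS", which lowercases to "ss"), `equivalence_holds("ß", "ss")` is false for the same reason, and `concat_holds("ΟΔΟΣ", "ΟΔΟΣ")` is false because the first Σ is word-final on its own but not after concatenation.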
@petrochenkov
Contributor

+1
I usually need fast (code-point-wise) but possibly imperfect transformations for computational linguistics tasks, and ASCII-only (byte-wise) transformations for controlled ASCII text, but rarely the full, precise, context-sensitive and locale-aware Unicode machinery.
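
For the ASCII-only (byte-wise) case, std already has the pieces. A minimal sketch, with wrapper names (`ascii_lower_copy`, `ascii_lower_in_place`) that are invented here purely for illustration:

```rust
/// Returns a lowercased copy; only the bytes b'A'..=b'Z' are changed.
fn ascii_lower_copy(s: &str) -> String {
    s.to_ascii_lowercase()
}

/// Lowercases in place without reallocating; non-ASCII bytes are untouched,
/// so the string stays valid UTF-8 and keeps its byte length.
fn ascii_lower_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}
```

For example, `ascii_lower_copy("ÀBC D41D")` gives `"Àbc d41d"`: the non-ASCII `À` is left alone.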

@steveklabnik steveklabnik added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 15, 2016
@petrochenkov
Contributor

cc #39659

@SimonSapin
Contributor

to_ascii_lowercase already exists. Or do you mean something simpler than the current to_lowercase, but that still handles some non-ASCII cases? Is there an exact algorithm specified somewhere that this should implement?

As for locale-aware case mapping, I think this is starting to fall outside standard-library territory and should be on crates.io instead. Such a library might want not just the language-dependent entries of SpecialCasing.txt (only a few are included) but also a full CLDR database. There’s a lot of complexity there, starting with “What is even a locale?”

@Mark-Simulacrum Mark-Simulacrum added C-feature-request Category: A feature request, i.e: not implemented / a PR. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Sep 10, 2017
@dtolnay
Member

dtolnay commented Nov 15, 2017

I don't believe we need three different variations of to_lowercase in std.

There are four tiers that have been raised in this issue:

  • Dead simple, such as you might use for converting an md5 hash from uppercase hex to lowercase hex. This is provided by the standard library in str::to_ascii_lowercase.
  • Best-effort Unicode aware. This is provided by the standard library in str::to_lowercase and is probably what people who don't know what they want, want. Hence the straightforward name.
  • Somewhere in between the previous two: non-ASCII, code-point-wise lowercasing. I see how this could be useful, but adding a third variant of to_lowercase is going to be a hard sell. This feels like a niche use case to me and would be served sufficiently well by a crate.
  • Locale aware case mapping. I agree with Simon that this involves sophistication beyond what we would be comfortable placing in std. This would be better suited for a crate.
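
To give a feel for the third tier, here is one possible sketch of a code-point-wise lowercase built on `char::to_lowercase`. The function `codepointwise_lowercase` is hypothetical, and the rule it uses (keep a character unchanged when its lowercase mapping is more than one code point) is an assumption made for this example, not an algorithm agreed on in this thread.

```rust
/// Hypothetical helper, not a std API: lowercases each code point on its own,
/// with no context rules, so every input `char` yields exactly one output `char`.
fn codepointwise_lowercase(s: &str) -> String {
    s.chars()
        .map(|c| {
            let mut mapped = c.to_lowercase();
            match (mapped.next(), mapped.next()) {
                // The mapping is a single code point: use it.
                (Some(l), None) => l,
                // Assumption: multi-code-point mappings (e.g. U+0130) are
                // skipped and the original character is kept.
                _ => c,
            }
        })
        .collect()
}
```

Because each character is mapped independently, lowercasing a concatenation equals concatenating the lowercased parts, and `"ΟΔΟΣ"` becomes `"οδοσ"` with a non-final σ, which is wrong for display but predictable for indexing.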

I am closing this issue because I would like to see this explored in a crate instead. Once there is an implementation and more clarity around what algorithm we mean by codepointwise lowercase, if people still believe this needs to be in std, I would be open to reconsidering.


6 participants