Implement full to_{upper,lower}case algorithms #25800

alexcrichton · 2015-05-26T16:59:19Z

Right now `to_uppercase` and `to_lowercase` always return an iterator over a single character; they return an iterator so that one day they can yield several characters. None of that is implemented yet, and this issue tracks the implementation.

andrewclarkson · 2015-05-27T02:25:46Z

A couple friends and I dug into this issue as part of a mini-sprint. We'd love a mentor or some guidance to work on this.

As we understand it right now:

* Unicode case conversions are done using raw text tables from unicode.org.
* The raw tables are exported as Rust slices in `tables.rs` by a Python script, `unicode.py`.
* The current `to_uppercase` methods use a binary search over the single-character conversion tables, ignoring the multiple-character `compatibility_table`.

So what needs to happen then is:

* Either join the two tables or search both of them.
* `to_upper` and `to_lower` in `tables.rs` then need to be changed to return a slice rather than a single `char`.
* `to_uppercase` and `to_lowercase` need to use that slice for the iterators they return.

Is that an accurate summary?
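To make the plan above concrete, here is a minimal sketch of a slice-returning lookup feeding an iterator. Everything in it is an assumption for illustration: the names `TO_UPPER_TABLE`, `lookup_upper`, and `to_uppercase_sketch` are hypothetical, the table holds only a handful of entries, and it is not the layout actually used in `tables.rs`.

```rust
// Hypothetical, heavily truncated table; the real data would be generated
// from the unicode.org files. Entries must stay sorted by the source char.
static TO_UPPER_TABLE: &[(char, &[char])] = &[
    ('a', &['A']),
    ('b', &['B']),
    ('\u{df}', &['S', 'S']),   // ß -> SS
    ('\u{fb01}', &['F', 'I']), // ﬁ -> FI
];

/// Binary-search the table; `None` means the char has no mapping.
fn lookup_upper(c: char) -> Option<&'static [char]> {
    TO_UPPER_TABLE
        .binary_search_by_key(&c, |&(key, _)| key)
        .ok()
        .map(|i| TO_UPPER_TABLE[i].1)
}

/// Yield the mapped chars, or the char itself when there is no mapping.
fn to_uppercase_sketch(c: char) -> impl Iterator<Item = char> {
    let mapped = lookup_upper(c);
    let fallback = if mapped.is_none() { Some(c) } else { None };
    mapped.unwrap_or(&[]).iter().copied().chain(fallback)
}
```

With this shape, a caller collecting `to_uppercase_sketch('\u{df}')` into a `String` gets two characters, while an unmapped character such as `'!'` comes back unchanged.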

Also, we noted there are several external but related crates for Unicode, but couldn't find any indication of whether those crates are moving into the standard library or whether that functionality is moving out of it.

alexcrichton · 2015-05-27T16:38:14Z

@bitborn that all sounds pretty good! I think we may not want to encode the exact return value of each character (that's a lot of space). It'll be a balancing act to figure out how to encode the data on unicode.org in as compact a form as possible but still having a fast lookup for case conversions.
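One possible way to strike that balance, sketched under assumptions: store each mapping in a fixed-size `[char; 3]` padded with `'\0'`, so every entry has the same size and no per-entry slice pointer is needed. This relies on full case mappings never expanding to more than three code points. The names `UPPER_COMPACT`, `upper_compact`, and `expand` are hypothetical, and the table is truncated to three entries.

```rust
// Hypothetical compact layout: every entry has the same size, and '\0'
// pads mappings shorter than three chars. Sorted by source char.
static UPPER_COMPACT: &[(char, [char; 3])] = &[
    ('a', ['A', '\0', '\0']),
    ('\u{df}', ['S', 'S', '\0']),                   // ß -> SS
    ('\u{390}', ['\u{399}', '\u{308}', '\u{301}']), // ΐ -> three code points
];

/// Binary-search the table; unmapped chars map to themselves.
fn upper_compact(c: char) -> [char; 3] {
    match UPPER_COMPACT.binary_search_by_key(&c, |&(key, _)| key) {
        Ok(i) => UPPER_COMPACT[i].1,
        Err(_) => [c, '\0', '\0'],
    }
}

/// Turn a padded mapping into a String: the first char is always kept,
/// the remaining slots are read until the first '\0' padding value.
fn expand(mapping: [char; 3]) -> String {
    let mut out = String::new();
    out.push(mapping[0]);
    out.extend(mapping[1..].iter().copied().take_while(|&c| c != '\0'));
    out
}
```

The trade-off is that single-character mappings carry two unused slots each. A real table would probably keep them in a separate, narrower table and use the padded form only for the multi-character cases.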

@alexcrichton

* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them sometimes yield more than one `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation, which we decided not to include in `std`: rust-lang/rfcs#1054. I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase`, which is already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` signature on `str` seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
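As a small illustration of the behaviour described above, the following sketch calls only the standard `char::to_uppercase` and `str::to_lowercase` methods. The wrapper names `upper_of` and `lower_of` are hypothetical helpers added for the example.

```rust
/// Collect the (possibly multi-char) uppercase mapping of one char.
fn upper_of(c: char) -> String {
    c.to_uppercase().collect()
}

/// Lowercase a whole string, including the word-final Sigma rule.
fn lower_of(s: &str) -> String {
    s.to_lowercase()
}

/// Two sample conversions, returned so a caller can inspect them.
fn examples() -> (String, String) {
    // U+00DF (ß) has a "complex" mapping to two chars: "SS".
    let sharp_s = upper_of('\u{df}');
    // Capital Sigma, Alpha, Sigma: the first Sigma becomes U+03C3 (σ),
    // the word-final one becomes U+03C2 (ς).
    let sigma = lower_of("\u{3a3}\u{391}\u{3a3}");
    (sharp_s, sigma)
}
```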

alexcrichton added the A-libs label May 26, 2015

alexcrichton mentioned this issue May 26, 2015

char::{to_uppercase, to_lowercase} are broken and should not be marked stable #25729

Closed

SimonSapin mentioned this issue Jun 5, 2015

Add complex case mapping and title case mapping. #26039

Merged

bors closed this as completed in addaa5b Jun 9, 2015

squelart mentioned this issue Jun 5, 2018

to_lowercase only uses unconditional parts of unicode.org's special-casing #51362

Open
