Implement full to_{upper,lower}case algorithms #25800

Closed
alexcrichton opened this issue May 26, 2015 · 2 comments

Comments

@alexcrichton
Member

Right now we always return an iterator over a single character; the iterator is returned so that one day we can yield many characters. None of that has been implemented yet, and this issue tracks the implementation.

@andrewclarkson

A couple friends and I dug into this issue as part of a mini-sprint. We'd love a mentor or some guidance to work on this.

As we understand it right now:

  • Unicode case conversions are done using raw text tables from unicode.org
  • The raw tables are exported as rust slices in tables.rs using a python script unicode.py
  • The current to_uppercase methods use a binary search over the single-character conversion tables, ignoring the multiple-character compatibility_table

So what needs to happen then is:

  • Either join the two tables or search both of them.
  • to_upper and to_lower in tables.rs then need to be changed to return a slice rather than a single char
  • to_uppercase and to_lowercase need to use that slice for the iterators they return

Is that an accurate summary?
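
The plan above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table names, their contents, and `to_upper_full` are hypothetical and not taken from tables.rs, and a `Vec<char>` is returned instead of a slice to keep the example short.

```rust
// Hypothetical, heavily truncated tables. The real ones would be generated
// from the unicode.org data. Both are sorted by source character so that
// binary search works.
const SINGLE_UPPER: &[(char, char)] = &[('a', 'A'), ('b', 'B'), ('é', 'É')];
const MULTI_UPPER: &[(char, &[char])] = &[('ß', &['S', 'S']), ('ﬁ', &['F', 'I'])];

/// Search the multi-character table first, then the single-character one.
/// A character with no entry in either table maps to itself.
fn to_upper_full(c: char) -> Vec<char> {
    if let Ok(i) = MULTI_UPPER.binary_search_by_key(&c, |&(k, _)| k) {
        return MULTI_UPPER[i].1.to_vec();
    }
    match SINGLE_UPPER.binary_search_by_key(&c, |&(k, _)| k) {
        Ok(i) => vec![SINGLE_UPPER[i].1],
        Err(_) => vec![c],
    }
}
```

Searching both tables keeps the existing single-character data untouched; joining them would save one search per lookup at the cost of a wider entry for every character.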

Also, we noted there are several external but related crates for unicode but couldn't find any indication on whether these crates were moving into the standard lib or whether the standard lib was moving out.

@alexcrichton
Member Author

@bitborn that all sounds pretty good! I think we may not want to encode the exact return value of each character (that's a lot of space). It'll be a balancing act to figure out how to encode the data on unicode.org in as compact a form as possible while still allowing a fast lookup for case conversions.
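
One possible way to strike that balance, sketched here purely as an assumption and not as the encoding that was eventually chosen: store each mapping in a fixed-size `[char; 3]` padded with `'\0'`, so an entry needs no separate slice pointer and length. The names `UPPER_TABLE` and `to_upper_compact` are hypothetical, and the table is a tiny excerpt.

```rust
// Hypothetical excerpt, sorted by source character.
// '\0' marks unused trailing slots in a mapping.
const UPPER_TABLE: &[(char, [char; 3])] = &[
    ('a', ['A', '\0', '\0']),
    ('ß', ['S', 'S', '\0']),
    // U+0390 uppercases to three code points per SpecialCasing.txt.
    ('\u{390}', ['\u{399}', '\u{308}', '\u{301}']),
];

/// Binary-search the table and yield the mapped characters.
/// Characters without an entry yield themselves.
fn to_upper_compact(c: char) -> impl Iterator<Item = char> {
    let chars = match UPPER_TABLE.binary_search_by_key(&c, |&(k, _)| k) {
        Ok(i) => UPPER_TABLE[i].1,
        Err(_) => [c, '\0', '\0'],
    };
    // Always keep the first slot (so '\0' itself round-trips), then stop
    // at the first padding character.
    IntoIterator::into_iter(chars)
        .enumerate()
        .take_while(|&(i, ch)| i == 0 || ch != '\0')
        .map(|(_, ch)| ch)
}
```

This trades a little padding per entry for a single table and a single binary search; whether that is compact enough would depend on how many characters end up in the table.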

bors added a commit that referenced this issue Jun 9, 2015
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them sometimes yield more than one `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation, which we decided not to include in `std`: rust-lang/rfcs#1054. I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase`, which is already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` signature on `str` seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
@bors bors closed this as completed in addaa5b Jun 9, 2015
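
For illustration, the behavior described in the commit message can be observed with a short check like the one below. It uses only `char::to_uppercase`, `str::to_uppercase`, and `str::to_lowercase` from the standard library; the function name `check_full_case_mappings` is made up for this example, and it assumes a current stable toolchain.

```rust
/// Exercises the multi-character and word-final Sigma mappings.
fn check_full_case_mappings() {
    // A "complex" mapping: one char uppercases to two chars.
    let upper: String = 'ß'.to_uppercase().collect();
    assert_eq!(upper, "SS");

    // The str methods are built on the same data.
    assert_eq!("straße".to_uppercase(), "STRASSE");

    // Upper-case Sigma becomes final sigma (ς) at the end of a word,
    // and regular sigma (σ) elsewhere.
    assert_eq!("ΣΑΣ".to_lowercase(), "σας");
}
```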