Corrected very minor documentation detail about Unicode and Japanese #40499

Merged
merged 2 commits into from Mar 17, 2017

Conversation

@ghost commented Mar 14, 2017

Japanese half-width and full-width romaji characters do have upper and lowercase according to Unicode (but other Japanese characters do not). For example,
assert_eq!('\u{FF21}'.to_lowercase().collect::<String>(), "\u{FF41}");

r? @steveklabnik
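
For readers following along, here is a minimal sketch of the claim above, assuming a reasonably recent Rust toolchain (character ranges need Rust 1.45 or later). The helper name `lower` is made up for illustration. It checks that every full-width Latin capital (U+FF21 through U+FF3A) lowercases to the code point 0x20 above it, while a kanji maps to itself.

```rust
/// Lowercases a single `char` into a `String` (hypothetical helper).
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // Full-width Latin capitals U+FF21..=U+FF3A map to U+FF41..=U+FF5A.
    for upper in '\u{FF21}'..='\u{FF3A}' {
        let expected = char::from_u32(upper as u32 + 0x20).unwrap();
        assert_eq!(lower(upper), expected.to_string());
    }
    // A kanji has no case mapping, so it converts into itself.
    assert_eq!(lower('山'), "山");
}
```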

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @steveklabnik (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@steveklabnik
Member

Interesting, I've never heard of this.

I'm gonna look into it tomorrow but if anyone else wants to r+ before then, feel free.

@ghost
Author

ghost commented Mar 14, 2017

The wiki page Unicode Equivalence under the subtitle 'Typographic Conventions' has some more details.

@nagisa
Member

nagisa commented Mar 14, 2017

FULL WIDTH LATIN {SMALL,CAPITAL} LETTER A is still a Latin letter from the Latin script. One can attribute exactly 2 scripts to the Japanese writing system: kanji and kana. Neither of those has case, and therefore the previous statement is just fine.

Now, I'm totally fine with making a change like this, but attributing logographs used in the whole CJK to Japanese seems... Unfair I guess?

How about we just use a kana (これ) instead of the current kanji for the example?
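
A small sketch of the point about kana and kanji, for anyone who wants to check it locally. The helper `is_case_invariant` is a hypothetical name, not part of the standard library; it only compares the output of `to_lowercase` and `to_uppercase` with the input.

```rust
/// Returns true when neither case conversion changes `c`
/// (hypothetical helper, not a std API).
fn is_case_invariant(c: char) -> bool {
    c.to_lowercase().eq(std::iter::once(c)) && c.to_uppercase().eq(std::iter::once(c))
}

fn main() {
    // Two hiragana and one kanji: none of them has case.
    for c in "これ山".chars() {
        assert!(is_case_invariant(c));
        assert!(!c.is_lowercase() && !c.is_uppercase());
    }
    // FULLWIDTH LATIN CAPITAL LETTER A belongs to the Latin script and does have case.
    assert!(!is_case_invariant('\u{FF21}'));
}
```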

@steveklabnik
Member

How about we just use a kana (これ) instead of the current kanji for the example?

Sounds good to me.

@ghost
Author

ghost commented Mar 14, 2017

One can attribute exactly 2 scripts to the Japanese writing system: kanji and kana.

It's not that cut and dried. Unicode is hard because we are dealing with human languages in all their complexity. By changing the documentation from 'Japanese' to 'Japanese kanji' we can avoid that complexity.

How about we just use a kana (これ) instead of the current kanji for the example?

I can't see the value in changing from kanji to hiragana; it doesn't change anything. Anyway, 山 is a nice character.

@mzji

mzji commented Mar 14, 2017

My little advice: how about using "CJK characters" (or CJKV characters?) instead of "Japanese kanji characters"? These characters are used widely in Chinese, Japanese, and Korean (and Vietnamese), not only Japanese.

@ghost
Author

ghost commented Mar 14, 2017

How about

/// // Characters that do not have both uppercase and lowercase
/// // convert into themselves.
/// assert_eq!('山'.to_lowercase().to_string(), "山");

?

@mzji

mzji commented Mar 15, 2017

How about

/// // Characters that do not have both uppercase and lowercase
/// // convert into themselves.
/// assert_eq!('山'.to_lowercase().to_string(), "山");

?

Looks good.

@steveklabnik
Member

@bors: r+ rollup

Thanks!

@bors
Contributor

bors commented Mar 15, 2017

📌 Commit 18a8494 has been approved by steveklabnik

frewsxcv added a commit to frewsxcv/rust that referenced this pull request Mar 17, 2017
Corrected very minor documentation detail about Unicode and Japanese

Japanese half-width and full-width romaji characters do have upper and lowercase according to Unicode (but other Japanese characters do not). For example,
assert_eq!('\u{FF21}'.to_lowercase().collect::<String>(), "\u{FF41}");

r? @steveklabnik
bors added a commit that referenced this pull request Mar 17, 2017
@bors bors merged commit 18a8494 into rust-lang:master Mar 17, 2017
@nodakai
Contributor

nodakai commented May 12, 2017

Late to the party, but is this a valid explanation of 'ﬀ'.to_uppercase() (U+FB00) yielding "FF"?

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

That is, there's no uppercase ligature FF in Unicode (to be clear, I'm concerned about the wording "do not have both uppercase and lowercase".)

Almost the same applies to 'ß'.to_uppercase() yielding "SS"

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
...
1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

(Note the asymmetry here --- the uppercase eszett ẞ is non-orthographic in modern German)
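
The behaviour described above can be reproduced with a short sketch. The helpers `upper` and `lower` are hypothetical names introduced only for this example; the assertions rely on `char::to_uppercase` applying the unconditional multi-character mappings from the data lines quoted above.

```rust
/// Collects the uppercase mapping of `c` into a `String` (hypothetical helper).
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

/// Collects the lowercase mapping of `c` into a `String` (hypothetical helper).
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // U+FB00 LATIN SMALL LIGATURE FF expands to two capital letters.
    assert_eq!(upper('\u{FB00}'), "FF");
    // U+00DF LATIN SMALL LETTER SHARP S also expands to two letters.
    assert_eq!(upper('ß'), "SS");
    // The asymmetry: U+1E9E lowercases to ß, but ß does not uppercase to U+1E9E.
    assert_eq!(lower('\u{1E9E}'), "ß");
    assert_ne!(upper('ß'), "\u{1E9E}");
}
```

So a single input `char` can produce more than one output `char`, which is why these methods return iterators rather than a `char`.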

@nagisa
Member

nagisa commented May 12, 2017

Well, this explanation is discussing case-less characters. Both of these ligatures are caseful; it is just that Unicode has no assigned code point for the uppercase variant of the ligatures you’ve given as an example.

@nodakai
Contributor

nodakai commented May 14, 2017

@nagisa

Well, this explanation is discussing the case-less characters.

First, that assumption isn't evident from the text. Second, it isn't a good idea to focus on the "caseful/caseless" dichotomy, because the input being caseful is only a necessary condition for any of the casing conversions to be defined. E.g. ȷ is a Lowercase Letter (Ll) without an uppercase version:

0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;;;

I think all we can say is

When uppercase conversion isn't defined for the input character in Unicode, it is returned as-is.

Wdyt?
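
To make the dotless j example concrete, here is a minimal sketch. The helper `has_uppercase_mapping` is a hypothetical name; it treats "conversion is defined" as "the result of `to_uppercase` differs from the input", which is an assumption of this example rather than wording from the Unicode standard.

```rust
/// True when `to_uppercase` yields something other than `c` itself
/// (hypothetical helper, not a std API).
fn has_uppercase_mapping(c: char) -> bool {
    !c.to_uppercase().eq(std::iter::once(c))
}

fn main() {
    let dotless_j = '\u{0237}'; // LATIN SMALL LETTER DOTLESS J

    // It is a lowercase letter...
    assert!(dotless_j.is_lowercase());
    // ...yet no uppercase conversion is defined, so it is returned as-is.
    assert!(!has_uppercase_mapping(dotless_j));
    assert_eq!(dotless_j.to_uppercase().to_string(), "\u{0237}");

    // An ordinary 'j' does have an uppercase mapping.
    assert!(has_uppercase_mapping('j'));
}
```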

it is just that Unicode has no assigned code point for the uppercase variant of the ligatures you’ve given as an example.

So... you're actually supporting my claim, right? They "do not have both uppercase and lowercase" and yet don't "convert into themselves."

@nagisa
Member

nagisa commented May 14, 2017

So... you're actually supporting my claim, right? They "do not have both uppercase and lowercase" and yet don't "convert into themselves."

In my comment I’ve very purposefully used “character”(1) to mean a real character used in a language out there somewhere and “code point”(2) to mean an assigned code point in Unicode.

That is, what I’m really saying is that this text should be (and, I think, currently is, due to its use of the word “character”) discussing real-world characters. I’m very open to improving the wording and/or making it more obvious.


As per unicode glossary:

(1): The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
(2): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. Not all code points are assigned to encoded characters.
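
As a rough illustration of definition (2) from the Rust side: a Rust `char` is a Unicode scalar value, which is any code point except the surrogates. The sketch below uses a hypothetical helper `is_scalar_value` and shows that a code point can be valid without being assigned to any character.

```rust
/// True when `cp` can be represented as a Rust `char`, i.e. it is a
/// Unicode scalar value (hypothetical helper name).
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

fn main() {
    // The codespace ends at U+10FFFF.
    assert_eq!(char::MAX as u32, 0x10FFFF);
    assert!(!is_scalar_value(0x11_0000));

    // Surrogates are code points but not scalar values, so `char` rejects them.
    assert!(!is_scalar_value(0xD800));

    // U+FFFF is a noncharacter: a valid code point that is never assigned
    // to an encoded character. Case conversion leaves it unchanged.
    let nonchar = char::from_u32(0xFFFF).unwrap();
    assert_eq!(nonchar.to_uppercase().to_string(), "\u{FFFF}");
}
```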

6 participants