Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Unicode 14.0 #89614

Merged
merged 3 commits into from
Oct 9, 2021
Merged

Update to Unicode 14.0 #89614

merged 3 commits into from
Oct 9, 2021

Conversation

cuviper
Copy link
Member

@cuviper cuviper commented Oct 7, 2021

The Unicode Standard announced Version 14.0 on September 14, 2021, and this pull request updates the generated tables in core accordingly.

This did require a little prep-work in unicode-table-generator. First, #81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.

The "Alphabetic" property in Unicode 14 grew too big for the bitset
representation, panicking "cannot pack 264 into 8 bits". However, we
were already choosing the skiplist for that anyway, so this doesn't need
to be a hard failure. That panic is now a returned `Err`, and then in
`emit_codepoints` we automatically defer to skiplist.
@rust-highfive
Copy link
Collaborator

r? @joshtriplett

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 7, 2021
@cuviper
Copy link
Member Author

cuviper commented Oct 7, 2021

For comparison, here's the tool output with version 13 data:

Alphabetic     : 1599 bytes, 132875 codepoints in 695 ranges (65 - 201547) using skiplist
Case_Ignorable : 949 bytes, 2413 codepoints in 410 ranges (39 - 918000) using skiplist
Cased          : 359 bytes, 4286 codepoints in 141 ranges (65 - 127370) using skiplist
Cc             : 9 bytes, 65 codepoints in 2 ranges (0 - 160) using skiplist
Grapheme_Extend: 813 bytes, 1979 codepoints in 344 ranges (768 - 918000) using skiplist
Lowercase      : 867 bytes, 2344 codepoints in 652 ranges (97 - 125252) using bitset
N              : 419 bytes, 1781 codepoints in 133 ranges (48 - 130042) using skiplist
Uppercase      : 777 bytes, 1911 codepoints in 643 ranges (65 - 127370) using bitset
White_Space    : 37 bytes, 25 codepoints in 10 ranges (9 - 12289) using skiplist
Total table sizes: 5829 bytes

And here's with version 14 data:

Alphabetic     : 1649 bytes, 133396 codepoints in 722 ranges (65 - 201547) using skiplist
Case_Ignorable : 995 bytes, 2602 codepoints in 427 ranges (39 - 918000) using skiplist
Cased          : 395 bytes, 4453 codepoints in 155 ranges (65 - 127370) using skiplist
Cc             : 9 bytes, 65 codepoints in 2 ranges (0 - 160) using skiplist
Grapheme_Extend: 835 bytes, 2090 codepoints in 353 ranges (768 - 918000) using skiplist
Lowercase      : 907 bytes, 2471 codepoints in 668 ranges (97 - 125252) using bitset
N              : 421 bytes, 1791 codepoints in 134 ranges (48 - 130042) using skiplist
Uppercase      : 791 bytes, 1951 codepoints in 651 ranges (65 - 127370) using bitset
White_Space    : 37 bytes, 25 codepoints in 10 ranges (9 - 12289) using skiplist
Total table sizes: 6039 bytes

@gilescope
Copy link
Contributor

Sorry! I know for next time. Thank you for all the work you're doing upgrading unicode!

@klensy
Copy link
Contributor

klensy commented Oct 7, 2021

There some unicode crates around, should they be updated too?

@cuviper
Copy link
Member Author

cuviper commented Oct 7, 2021

@klensy sure, other crates will have to be updated individually. e.g. unicode-xid might be good to get rustc to recognize new characters in identifiers.

@joshtriplett
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Oct 9, 2021

📌 Commit 459a7e3 has been approved by joshtriplett

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 9, 2021
@joshtriplett joshtriplett added the relnotes Marks issues that should be documented in the release notes of the next release. label Oct 9, 2021
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Oct 9, 2021
Update to Unicode 14.0

The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly.

This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 9, 2021
…laumeGomez

Rollup of 6 pull requests

Successful merges:

 - rust-lang#75644 (Add 'core::array::from_fn' and 'core::array::try_from_fn')
 - rust-lang#87528 (stack overflow handler specific openbsd change.)
 - rust-lang#88436 (std: Stabilize command_access)
 - rust-lang#89614 (Update to Unicode 14.0)
 - rust-lang#89664 (Add documentation to boxed conversions)
 - rust-lang#89700 (Fix invalid HTML generation for higher bounds)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@camelid
Copy link
Member

camelid commented Oct 9, 2021

First, #81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well.

Maybe a tidy check should be added that compares the output of the tool to the checked-in file to make sure they're the same?

@bors bors merged commit 21a5101 into rust-lang:master Oct 9, 2021
@rustbot rustbot added this to the 1.57.0 milestone Oct 9, 2021
@gilescope
Copy link
Contributor

gilescope commented Oct 10, 2021 via email

@cuviper
Copy link
Member Author

cuviper commented Oct 12, 2021

@Mark-Simulacrum would mingw-check be a good place to check the generated code? Or maybe a rust-highfive mention?

@camelid
Copy link
Member

camelid commented Oct 13, 2021

Or maybe a rust-highfive mention?

That's a good idea; a rust-highfive ping should be quick to set up and sufficient since the file is not changed often.

@Mark-Simulacrum
Copy link
Member

Highfive definitely seems like a good idea, someone can file a PR adding that rule somewhere here - https://github.com/rust-lang/highfive/blob/master/highfive/configs/rust-lang/rust.json#L130

I think running the script on each CI to check it's still in sync might be a bit much, though, seems not worth it to me. If we want something more than highfive, I would just fail CI if the file doesn't match a hardcoded hash (e.g., in tidy). Easy to update when needed and fast, as well as potentially more broadly useful (we have several such files, I think).

wip-sync pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Dec 6, 2021
Pkgsrc changes:
 * Adapt a couple of patches

Upstream changes:

Version 1.57.0 (2021-12-02)
==========================

Language
--------

- [Macro attributes may follow `#[derive]` and will see the original
  (pre-`cfg`) input.][87220]
- [Accept curly-brace macros in expressions, like `m!{ .. }.method()`
  and `m!{ .. }?`.][88690]
- [Allow panicking in constant evaluation.][89508]

Compiler
--------

- [Create more accurate debuginfo for vtables.][89597]
- [Add `armv6k-nintendo-3ds` at Tier 3\*.][88529]
- [Add `armv7-unknown-linux-uclibceabihf` at Tier 3\*.][88952]
- [Add `m68k-unknown-linux-gnu` at Tier 3\*.][88321]
- [Add SOLID targets at Tier 3\*:][86191] `aarch64-kmc-solid_asp3`,
  `armv7a-kmc-solid_asp3-eabi`, `armv7a-kmc-solid_asp3-eabihf`

\* Refer to Rust's [platform support page][platform-support-doc] for more
   information on Rust's tiered platform support.

Libraries
---------

- [Avoid allocations and copying in `Vec::leak`][89337]
- [Add `#[repr(i8)]` to `Ordering`][89507]
- [Optimize `File::read_to_end` and `read_to_string`][89582]
- [Update to Unicode 14.0][89614]
- [Many more functions are marked `#[must_use]`][89692], producing a warning
  when ignoring their return value. This helps catch mistakes such as expecting
  a function to mutate a value in place rather than return a new value.

Stabilised APIs
---------------

- [`[T; N]::as_mut_slice`][`array::as_mut_slice`]
- [`[T; N]::as_slice`][`array::as_slice`]
- [`collections::TryReserveError`]
- [`HashMap::try_reserve`]
- [`HashSet::try_reserve`]
- [`String::try_reserve`]
- [`String::try_reserve_exact`]
- [`Vec::try_reserve`]
- [`Vec::try_reserve_exact`]
- [`VecDeque::try_reserve`]
- [`VecDeque::try_reserve_exact`]
- [`Iterator::map_while`]
- [`iter::MapWhile`]
- [`proc_macro::is_available`]
- [`Command::get_program`]
- [`Command::get_args`]
- [`Command::get_envs`]
- [`Command::get_current_dir`]
- [`CommandArgs`]
- [`CommandEnvs`]

These APIs are now usable in const contexts:

- [`hint::unreachable_unchecked`]

Cargo
-----

- [Stabilize custom profiles][cargo/9943]

Compatibility notes
-------------------

Internal changes
----------------
These changes provide no direct user facing benefits, but represent significant
improvements to the internals and overall performance of rustc
and related tools.

- [Added an experimental backend for codegen with `libgccjit`.][87260]

[86191]: rust-lang/rust#86191
[87220]: rust-lang/rust#87220
[87260]: rust-lang/rust#87260
[88243]: rust-lang/rust#88243
[88321]: rust-lang/rust#88321
[88529]: rust-lang/rust#88529
[88690]: rust-lang/rust#88690
[88952]: rust-lang/rust#88952
[89337]: rust-lang/rust#89337
[89507]: rust-lang/rust#89507
[89508]: rust-lang/rust#89508
[89582]: rust-lang/rust#89582
[89597]: rust-lang/rust#89597
[89614]: rust-lang/rust#89614
[89692]: rust-lang/rust#89692
[cargo/9943]: rust-lang/cargo#9943
[`array::as_mut_slice`]: https://doc.rust-lang.org/std/primitive.array.html#method.as_mut_slice
[`array::as_slice`]: https://doc.rust-lang.org/std/primitive.array.html#method.as_slice
[`collections::TryReserveError`]: https://doc.rust-lang.org/std/collections/struct.TryReserveError.html
[`HashMap::try_reserve`]: https://doc.rust-lang.org/std/collections/hash_map/struct.HashMap.html#method.try_reserve
[`HashSet::try_reserve`]: https://doc.rust-lang.org/std/collections/hash_set/struct.HashSet.html#method.try_reserve
[`String::try_reserve`]: https://doc.rust-lang.org/alloc/string/struct.String.html#method.try_reserve
[`String::try_reserve_exact`]: https://doc.rust-lang.org/alloc/string/struct.String.html#method.try_reserve_exact
[`Vec::try_reserve`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve
[`Vec::try_reserve_exact`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve_exact
[`VecDeque::try_reserve`]: https://doc.rust-lang.org/std/collections/struct.VecDeque.html#method.try_reserve
[`VecDeque::try_reserve_exact`]: https://doc.rust-lang.org/std/collections/struct.VecDeque.html#method.try_reserve_exact
[`Iterator::map_while`]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.map_while
[`iter::MapWhile`]: https://doc.rust-lang.org/std/iter/struct.MapWhile.html
[`proc_macro::is_available`]: https://doc.rust-lang.org/proc_macro/fn.is_available.html
[`Command::get_program`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_program
[`Command::get_args`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_args
[`Command::get_envs`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_envs
[`Command::get_current_dir`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_current_dir
[`CommandArgs`]: https://doc.rust-lang.org/std/process/struct.CommandArgs.html
[`CommandEnvs`]: https://doc.rust-lang.org/std/process/struct.CommandEnvs.html
@cuviper cuviper deleted the unicode-14 branch December 8, 2021 07:30
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Jan 22, 2022
Pkgsrc changes:
 * Adjust line numbers in a number of patches
 * remove the --disable-dist-src option, so that we produce
   the rust-src rust component, which we upload to LOCALSRC
   to allow the rust-src package to build, which is needed
   for rust-analyzer.
 * Cargo checksum for vendor/cc no longer needs patching;
   checksum for vendor/libc updated

Upstream changes:

Version 1.57.0 (2021-12-02)
==========================

Language
--------

- [Macro attributes may follow `#[derive]` and will see the original
  (pre-`cfg`) input.][87220]
- [Accept curly-brace macros in expressions, like `m!{ .. }.method()`
  and `m!{ .. }?`.][88690]
- [Allow panicking in constant evaluation.][89508]

Compiler
--------

- [Create more accurate debuginfo for vtables.][89597]
- [Add `armv6k-nintendo-3ds` at Tier 3\*.][88529]
- [Add `armv7-unknown-linux-uclibceabihf` at Tier 3\*.][88952]
- [Add `m68k-unknown-linux-gnu` at Tier 3\*.][88321]
- [Add SOLID targets at Tier 3\*:][86191] `aarch64-kmc-solid_asp3`,
  `armv7a-kmc-solid_asp3-eabi`, `armv7a-kmc-solid_asp3-eabihf`

\* Refer to Rust's [platform support page][platform-support-doc] for more
   information on Rust's tiered platform support.

Libraries
---------

- [Avoid allocations and copying in `Vec::leak`][89337]
- [Add `#[repr(i8)]` to `Ordering`][89507]
- [Optimize `File::read_to_end` and `read_to_string`][89582]
- [Update to Unicode 14.0][89614]
- [Many more functions are marked `#[must_use]`][89692], producing a warning
  when ignoring their return value. This helps catch mistakes such as expecting
  a function to mutate a value in place rather than return a new value.

Stabilised APIs
---------------

- [`[T; N]::as_mut_slice`][`array::as_mut_slice`]
- [`[T; N]::as_slice`][`array::as_slice`]
- [`collections::TryReserveError`]
- [`HashMap::try_reserve`]
- [`HashSet::try_reserve`]
- [`String::try_reserve`]
- [`String::try_reserve_exact`]
- [`Vec::try_reserve`]
- [`Vec::try_reserve_exact`]
- [`VecDeque::try_reserve`]
- [`VecDeque::try_reserve_exact`]
- [`Iterator::map_while`]
- [`iter::MapWhile`]
- [`proc_macro::is_available`]
- [`Command::get_program`]
- [`Command::get_args`]
- [`Command::get_envs`]
- [`Command::get_current_dir`]
- [`CommandArgs`]
- [`CommandEnvs`]

These APIs are now usable in const contexts:

- [`hint::unreachable_unchecked`]

Cargo
-----

- [Stabilize custom profiles][cargo/9943]

Compatibility notes
-------------------

Internal changes
----------------
These changes provide no direct user facing benefits, but represent significant
improvements to the internals and overall performance of rustc
and related tools.

- [Added an experimental backend for codegen with `libgccjit`.][87260]

[86191]: rust-lang/rust#86191
[87220]: rust-lang/rust#87220
[87260]: rust-lang/rust#87260
[88243]: rust-lang/rust#88243
[88321]: rust-lang/rust#88321
[88529]: rust-lang/rust#88529
[88690]: rust-lang/rust#88690
[88952]: rust-lang/rust#88952
[89337]: rust-lang/rust#89337
[89507]: rust-lang/rust#89507
[89508]: rust-lang/rust#89508
[89582]: rust-lang/rust#89582
[89597]: rust-lang/rust#89597
[89614]: rust-lang/rust#89614
[89692]: rust-lang/rust#89692
[cargo/9943]: rust-lang/cargo#9943
[`array::as_mut_slice`]: https://doc.rust-lang.org/std/primitive.array.html#method.as_mut_slice
[`array::as_slice`]: https://doc.rust-lang.org/std/primitive.array.html#method.as_slice
[`collections::TryReserveError`]: https://doc.rust-lang.org/std/collections/struct.TryReserveError.html
[`HashMap::try_reserve`]: https://doc.rust-lang.org/std/collections/hash_map/struct.HashMap.html#method.try_reserve
[`HashSet::try_reserve`]: https://doc.rust-lang.org/std/collections/hash_set/struct.HashSet.html#method.try_reserve
[`String::try_reserve`]: https://doc.rust-lang.org/alloc/string/struct.String.html#method.try_reserve
[`String::try_reserve_exact`]: https://doc.rust-lang.org/alloc/string/struct.String.html#method.try_reserve_exact
[`Vec::try_reserve`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve
[`Vec::try_reserve_exact`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve_exact
[`VecDeque::try_reserve`]: https://doc.rust-lang.org/std/collections/struct.VecDeque.html#method.try_reserve
[`VecDeque::try_reserve_exact`]: https://doc.rust-lang.org/std/collections/struct.VecDeque.html#method.try_reserve_exact
[`Iterator::map_while`]: https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.map_while
[`iter::MapWhile`]: https://doc.rust-lang.org/std/iter/struct.MapWhile.html
[`proc_macro::is_available`]: https://doc.rust-lang.org/proc_macro/fn.is_available.html
[`Command::get_program`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_program
[`Command::get_args`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_args
[`Command::get_envs`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_envs
[`Command::get_current_dir`]: https://doc.rust-lang.org/std/process/struct.Command.html#method.get_current_dir
[`CommandArgs`]: https://doc.rust-lang.org/std/process/struct.CommandArgs.html
[`CommandEnvs`]: https://doc.rust-lang.org/std/process/struct.CommandEnvs.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants