Reimplement idna on top of ICU4X (servo#923)
* Reimplement idna on top of ICU4X

* Add an even faster lower-case ASCII letter path to avoid regressing performance

* Comments and verify_dns_length tweak

* Parametrize internal vs. external Punycode caller; restore external API behavior

* Add bench for to_ascii on an already-Punycode name

* Avoid re-encoding Punycode when possible

* Pass through the input slice in many more cases

* Add testing for the simultaneous mode

* Omit the invalid domain character check on the url side

* Document that Punycode labels must result in non-ASCII

* Rename files called uts46.rs to deprecated.rs

* Rename uts46bis to uts46

* Tweak docs

* Avoid useless copying and useless UTF-8 decode

* Use inline(never) to optimize binary size

* Split CheckHyphens into a separate concern from the ASCII deny list

* Make the ASCII deny list customizable

* Better docs and top-level functions

* Parameter for VerifyDNSLength

* Restore support for transitional processing to minimize breakage

* In the deprecated API, use empty deny list with use_std3_ascii_rules=false

* Tweak docs

* Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions

* Use idna crate top-level function in the url crate to dogfood the top-level function

* Add a Usage section to the README

* Add an early return to map_transitional for readability

* Document internal vs. external Punycode caller differences

* Per discussion with Valentin, revert deprecated API to the old behavior that does not check hyphens in positions 3 and 4

* Add comments about not fixing deprecated API

* Add a comment explaining FailFast in deprecated.rs

* For future-proofing, add compiled_data cargo feature (currently always required)

Other changes in this changeset already require a semver break, so this
change takes a semver break for the `default-features = false` case now.
This avoids a later semver break if a need arises to add a
bring-your-own-data constructor (using `icu_provider`) for `Uts46`.

* Remove remark about spec violation by making root dot permissibility configurable

* Clarify README about IDNA 2003/2008

* Add a historical remark to the README

* Fix typo

* Depend on crates.io versions of icu_normalizer and icu_properties

* Address clippy lints

* Update versions

* Increment dependency versions
hsivonen authored Jun 3, 2024
1 parent de947ab commit 3d6dbbb
Showing 20 changed files with 8,442 additions and 30,441 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/main.yml
@@ -15,12 +15,12 @@ jobs:
 strategy:
 matrix:
 os: [ubuntu-latest, macos-latest, windows-latest]
-rust: [1.56.0, stable, beta, nightly]
+rust: [1.67.0, stable, beta, nightly]
 exclude:
 - os: macos-latest
-rust: 1.56.0
+rust: 1.67.0
 - os: windows-latest
-rust: 1.56.0
+rust: 1.67.0
 - os: macos-latest
 rust: beta
 - os: windows-latest
@@ -47,7 +47,7 @@ jobs:
 - name: Run debugger_visualizer tests
 if: |
 matrix.os == 'windows-latest' &&
-matrix.rust != '1.56.0'
+matrix.rust != '1.67.0'
 run: cargo test --test debugger_visualizer --features "url/debugger_visualizer,url_debug_tests/debugger_visualizer" -- --test-threads=1
 - name: Test `no_std` support
 run: cargo test --no-default-features --features=alloc
18 changes: 12 additions & 6 deletions idna/Cargo.toml
@@ -1,22 +1,23 @@
 [package]
 name = "idna"
-version = "0.5.0"
+version = "1.0.0"
 authors = ["The rust-url developers"]
 description = "IDNA (Internationalizing Domain Names in Applications) and Punycode."
 categories = ["no_std"]
 repository = "https://github.com/servo/rust-url/"
 license = "MIT OR Apache-2.0"
 autotests = false
 edition = "2018"
-rust-version = "1.51"
+rust-version = "1.67"

 [lib]
 doctest = false

 [features]
-default = ["std"]
-std = ["alloc", "unicode-bidi/std", "unicode-normalization/std"]
+default = ["std", "compiled_data"]
+std = ["alloc"]
 alloc = []
+compiled_data = ["icu_normalizer/compiled_data", "icu_properties/compiled_data"]

 [[test]]
 name = "tests"
@@ -25,15 +26,20 @@ harness = false
 [[test]]
 name = "unit"

+[[test]]
+name = "unitbis"
+
 [dev-dependencies]
 assert_matches = "1.3"
 bencher = "0.1"
 tester = "0.9"
 serde_json = "1.0"

 [dependencies]
-unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] }
-unicode-normalization = { version = "0.1.22", default-features = false }
+icu_normalizer = "1.4.3"
+icu_properties = "1.4.2"
+utf8_iter = "1.0.4"
+smallvec = { version = "1.13.1", features = ["const_generics"]}

 [[bench]]
 name = "all"
38 changes: 38 additions & 0 deletions idna/README.md
@@ -0,0 +1,38 @@
# `idna`

IDNA library for Rust implementing [UTS 46: Unicode IDNA Compatibility Processing](https://www.unicode.org/reports/tr46/) as parametrized by the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna).

## What it does

* An implementation of UTS 46 is provided, with a configurable ASCII deny list (e.g. STD3 or WHATWG rules).
* A callback mechanism is provided for pluggable logic for deciding if a label is deemed potentially too misleading to render as Unicode in a user interface.
* Errors are marked as U+FFFD REPLACEMENT CHARACTERs in Unicode output so that locations of errors may be illustrated to the user.
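
The last point above can be illustrated with a minimal, standard-library-only sketch. It assumes only that error locations show up as U+FFFD in the Unicode output, as stated above; the helper name `error_offsets` is hypothetical and is not part of the crate's API.

```rust
/// Hypothetical helper (not part of the `idna` crate): returns the byte
/// offset of every U+FFFD REPLACEMENT CHARACTER in `output`, so a user
/// interface could highlight where errors occurred.
fn error_offsets(output: &str) -> Vec<usize> {
    output
        .char_indices()
        .filter(|&(_, c)| c == char::REPLACEMENT_CHARACTER)
        .map(|(offset, _)| offset)
        .collect()
}

fn main() {
    // Made-up output string with one error marker after "exa".
    let output = "exa\u{FFFD}mple.com";
    println!("error offsets: {:?}", error_offsets(output));
}
```

A caller could use these offsets to underline or color the offending positions when rendering the name.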

## What it does not do

* There is no default/sample policy provided for the callback mechanism mentioned above.
* Only UTS 46 is implemented: There is no API to request strictly IDNA 2008 only or strictly IDNA 2003 only.
* There is no API for categorizing errors beyond there being an error.
* Checks that are configurable in UTS 46 but that the WHATWG URL Standard always sets a particular way (regardless of the _beStrict_ flag in the URL Standard) cannot be configured (with the exception of the old deprecated API supporting transitional processing).

## Usage

Apps that need to prepare a hostname for usage in protocols are likely to only need the top-level function `domain_to_ascii_cow` with `AsciiDenyList::URL` as the second argument. Note that this rejects IPv6 addresses, so before this, you need to check if the first byte of the input is `b'['` and, if it is, treat the input as an IPv6 address instead.

Apps that need to display host names to the user should use `uts46::Uts46::to_user_interface`. The _ToUnicode_ operation is rarely appropriate for direct application usage.
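
The IPv6 pre-check described in the first paragraph of this section might look like the following sketch. It uses only the standard library; the `Host` enum and `classify_host` function are hypothetical names, and parsing with `std::net::Ipv6Addr` is an assumption that may differ in details from what a URL parser accepts. The `Domain` case is where the input would then be handed to `domain_to_ascii_cow` with `AsciiDenyList::URL`.

```rust
use std::net::Ipv6Addr;

/// Hypothetical result of inspecting a host before IDNA processing.
#[derive(Debug, PartialEq)]
enum Host<'a> {
    /// Input started with `[` and parsed as a bracketed IPv6 literal.
    Ipv6(Ipv6Addr),
    /// Anything else: a candidate for `domain_to_ascii_cow`.
    Domain(&'a [u8]),
    /// Input started with `[` but was not a valid bracketed IPv6 literal.
    Invalid,
}

/// Checks the first byte for `b'['` as the README suggests, and only
/// treats the input as a domain name when that byte is absent.
fn classify_host(input: &[u8]) -> Host<'_> {
    if input.first() != Some(&b'[') {
        return Host::Domain(input);
    }
    // Assumption: std's IPv6 parser is good enough for this sketch.
    let parsed = std::str::from_utf8(input)
        .ok()
        .and_then(|s| s.strip_prefix('['))
        .and_then(|s| s.strip_suffix(']'))
        .and_then(|s| s.parse::<Ipv6Addr>().ok());
    match parsed {
        Some(addr) => Host::Ipv6(addr),
        None => Host::Invalid,
    }
}

fn main() {
    println!("{:?}", classify_host(b"[::1]"));
    println!("{:?}", classify_host(b"example.com"));
}
```

Keeping the bracket check separate from the IDNA call means the domain path never sees address literals, which the deny list would otherwise reject.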

## Cargo features

* `alloc` - For future proofing. Currently always required. Currently, the crate's internals may allocate on the heap, but for typical inputs they do not (apart from the output `String` when applicable).
* `compiled_data` - For future proofing. Currently always required. (Passed through to ICU4X.)
* `std` - Adds `impl std::error::Error for Errors {}` (and implies `alloc`).
* By default, all of the above are enabled.

## Breaking changes since 0.5.0

* Stricter IDNA 2008 restrictions are no longer supported. Attempting to enable them panics immediately. UTS 46 allows all the names that IDNA 2008 allows, and when transitional processing is disabled, they resolve the same way. There are additional names that IDNA 2008 disallows but UTS 46 maps to names that IDNA 2008 allows (notably, input is mapped to fold-case output). UTS 46 also allows symbols that were allowed in IDNA 2003 as well as newer symbols that are allowed according to the same principle. (Earlier versions of this crate allowed rejecting such symbols. Rejecting characters that UTS 46 maps to IDNA 2008-permitted characters wasn't supported in earlier versions, either.)
* `domain_to_ascii_strict` now performs the _CheckHyphens_ check (matching previous documentation).
* The ContextJ rules are now implemented and always enabled, even when using the old deprecated API, so input that fails those rules is rejected.
* The `Idna::to_ascii_inner` method has been removed. It didn't make sense as a public method, since callers were unable to figure out if there were errors. (A GitHub search found no callers for this method.)
* Punycode labels whose decoding does not yield any non-ASCII characters are now treated as being in error.
* When turning off default cargo features, the cargo feature `compiled_data` needs to be explicitly enabled.
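
For readers unfamiliar with the _CheckHyphens_ check mentioned in the list above, here is a minimal sketch of the rule as UTS 46 states it: a label must not begin or end with a hyphen, and must not have hyphens in both the third and fourth positions. The function name `passes_check_hyphens` is hypothetical, and this is not the crate's implementation, only an illustration of the rule applied to a single label.

```rust
/// Hypothetical illustration of the UTS 46 CheckHyphens rule for one label.
/// Note: in real UTS 46 processing an `xn--` label is Punycode-decoded
/// first and the rule is applied to the decoded label; this sketch only
/// looks at the characters it is given.
fn passes_check_hyphens(label: &str) -> bool {
    // The label must neither begin nor end with U+002D HYPHEN-MINUS.
    if label.starts_with('-') || label.ends_with('-') {
        return false;
    }
    // The label must not contain hyphens in both the third and fourth
    // positions (counted in code points).
    let mut rest = label.chars().skip(2);
    let third = rest.next();
    let fourth = rest.next();
    !(third == Some('-') && fourth == Some('-'))
}

fn main() {
    for label in ["example", "-abc", "abc-", "ab--cd", "ab-c"] {
        println!("{label}: {}", passes_check_hyphens(label));
    }
}
```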
2 changes: 2 additions & 0 deletions idna/benches/all.rs
@@ -1,3 +1,5 @@
+#![allow(deprecated)]
+
 #[macro_use]
 extern crate bencher;
 extern crate idna;