Skip to content

Commit

Permalink
url: update WHATWG URL parser to latest spec
Browse files Browse the repository at this point in the history
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
  • Loading branch information
TimothyGu committed Apr 23, 2017
1 parent cf68280 commit e0530f1
Show file tree
Hide file tree
Showing 5 changed files with 830 additions and 728 deletions.
28 changes: 15 additions & 13 deletions doc/api/url.md
Original file line number Diff line number Diff line change
Expand Up @@ -1053,23 +1053,25 @@ located within the structure of the URL. The WHATWG URL Standard uses a more
selective and fine grained approach to selecting encoded characters than that
used by the older [`url.parse()`][] and [`url.format()`][] methods.

The WHATWG algorithm defines three "encoding sets" that describe ranges of
characters that must be percent-encoded:
The WHATWG algorithm defines three "percent-encode sets" that describe ranges
of characters that must be percent-encoded:

* The *simple encode set* includes code points in range U+0000 to U+001F
(inclusive) and all code points greater than U+007E.
* The *C0 control percent-encode set* includes code points in range U+0000 to
U+001F (inclusive) and all code points greater than U+007E.

* The *default encode set* includes the *simple encode set* and code points
U+0020, U+0022, U+0023, U+003C, U+003E, U+003F, U+0060, U+007B, and U+007D.
* The *path percent-encode set* includes the *C0 control percent-encode set*
and code points U+0020, U+0022, U+0023, U+003C, U+003E, U+003F, U+0060,
U+007B, and U+007D.

* The *userinfo encode set* includes the *default encode set* and code points
U+002F, U+003A, U+003B, U+003D, U+0040, U+005B, U+005C, U+005D, U+005E, and
U+007C.
* The *userinfo encode set* includes the *path percent-encode set* and code
points U+002F, U+003A, U+003B, U+003D, U+0040, U+005B, U+005C, U+005D,
U+005E, and U+007C.

The *simple encode set* is used primary for URL fragments and certain specific
conditions for the path. The *userinfo encode set* is used specifically for
username and passwords encoded within the URL. The *default encode set* is used
for all other cases.
The *userinfo percent-encode set* is used exclusively for username and
passwords encoded within the URL. The *path percent-encode set* is used for the
path of most URLs. The *C0 control percent-encode set* is used for all
other cases, including URL fragments in particular, but also host and path
under certain specific conditions.

When non-ASCII characters appear within a hostname, the hostname is encoded
using the [Punycode][] algorithm. Note, however, that a hostname *may* contain
Expand Down
Loading

0 comments on commit e0530f1

Please sign in to comment.