bpo-36216: Add check for characters in netloc that normalize to separators #12201

Merged: 2 commits, Mar 7, 2019
18 changes: 18 additions & 0 deletions Doc/library/urllib.parse.rst
@@ -124,6 +124,11 @@ or on combining URL components into a URL string.
   Unmatched square brackets in the :attr:`netloc` attribute will raise a
   :exc:`ValueError`.

   Characters in the :attr:`netloc` attribute that decompose under NFKC
   normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
   ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
   decomposed before parsing, no error will be raised.

   .. versionchanged:: 3.2
      Added IPv6 URL parsing capabilities.

@@ -136,6 +141,10 @@ or on combining URL components into a URL string.
      Out-of-range port numbers now raise :exc:`ValueError`, instead of
      returning :const:`None`.

   .. versionchanged:: 3.8
      Characters that affect netloc parsing under NFKC normalization will
      now raise :exc:`ValueError`.


.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)

@@ -259,10 +268,19 @@ or on combining URL components into a URL string.
   Unmatched square brackets in the :attr:`netloc` attribute will raise a
   :exc:`ValueError`.

   Characters in the :attr:`netloc` attribute that decompose under NFKC
   normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
   ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
   decomposed before parsing, no error will be raised.

   .. versionchanged:: 3.6
      Out-of-range port numbers now raise :exc:`ValueError`, instead of
      returning :const:`None`.

   .. versionchanged:: 3.8
      Characters that affect netloc parsing under NFKC normalization will
      now raise :exc:`ValueError`.


.. function:: urlunsplit(parts)

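The documented behavior can be illustrated with a small sketch, assuming a Python version that includes this change (3.8 or later). The helper name `raises_value_error` and the sample URL are illustrative only; the URL mirrors the pattern used in the PR's test. U+2100 is one character that NFKC expands to `a/c`, so the raw URL should be rejected, while the already-normalized URL parses without error (its netloc simply ends at the `/`).

```python
import unicodedata
from urllib.parse import urlsplit

# U+2100 (ACCOUNT OF) is a single character that NFKC expands to "a/c".
raw_url = "http://netloc\u2100false.netloc/path"
normalized_url = unicodedata.normalize("NFKC", raw_url)


def raises_value_error(url):
    """Hypothetical helper: True if urlsplit() rejects the URL."""
    try:
        urlsplit(url)
    except ValueError:
        return True
    return False


# The raw netloc contains a character that normalizes to '/'.
raw_rejected = raises_value_error(raw_url)

# After normalization the '/' is an ordinary path separator,
# so parsing succeeds and the netloc stops before it.
normalized_netloc = urlsplit(normalized_url).netloc
```

Normalizing first does not make the URL "safe"; it only means the parser and any later IDNA step see the same string.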
23 changes: 23 additions & 0 deletions Lib/test/test_urlparse.py
@@ -1,3 +1,5 @@
import sys
import unicodedata
import unittest
import urllib.parse

@@ -994,6 +996,27 @@ def test_all(self):
                expected.append(name)
        self.assertCountEqual(urllib.parse.__all__, expected)

    def test_urlsplit_normalization(self):
        # Certain characters should never occur in the netloc,
        # including under normalization.
        # Ensure that ALL of them are detected and cause an error
        illegal_chars = '/:#?@'
        hex_chars = {'{:04X}'.format(ord(c)) for c in illegal_chars}
        denorm_chars = [
            c for c in map(chr, range(128, sys.maxunicode))
            if (hex_chars & set(unicodedata.decomposition(c).split()))
            and c not in illegal_chars
        ]
        # Sanity check that we found at least one such character
        self.assertIn('\u2100', denorm_chars)
        self.assertIn('\uFF03', denorm_chars)

        for scheme in ["http", "https", "ftp"]:
            for c in denorm_chars:
                url = "{}://netloc{}false.netloc/path".format(scheme, c)
                with self.subTest(url=url, char='{:04X}'.format(ord(c))):
                    with self.assertRaises(ValueError):
                        urllib.parse.urlsplit(url)

class Utility_Tests(unittest.TestCase):
    """Testcase to test the various utility functions in the urllib."""
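The test above selects its candidate characters by looking at each character's Unicode decomposition mapping. A minimal sketch of that lookup for three sample characters follows; the function name `decomposes_to_delimiter` is an assumption made for illustration and does not appear in the PR.

```python
import unicodedata

delimiters = "/:#?@"
# unicodedata.decomposition() lists code points as uppercase hex, e.g. "002F".
delimiter_hex = {"{:04X}".format(ord(c)) for c in delimiters}


def decomposes_to_delimiter(char):
    """Hypothetical helper: True if the decomposition mentions a delimiter."""
    fields = unicodedata.decomposition(char).split()
    return bool(delimiter_hex & set(fields))


# U+2100 ACCOUNT OF decomposes to "a/c" and so contains 002F ('/').
account_of = decomposes_to_delimiter("\u2100")
# U+FF03 FULLWIDTH NUMBER SIGN decomposes to 0023 ('#').
fullwidth_hash = decomposes_to_delimiter("\uFF03")
# U+00E9 decomposes to "e" plus a combining accent: no delimiter involved.
plain_letter = decomposes_to_delimiter("\u00E9")
```

The first two are the same characters the test uses for its sanity check.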
17 changes: 17 additions & 0 deletions Lib/urllib/parse.py
@@ -396,6 +396,21 @@ def _splitnetloc(url, start=0):
            delim = min(delim, wdelim)     # use earliest delim position
    return url[start:delim], url[delim:]   # return (domain, rest)

def _checknetloc(netloc):
    if not netloc or netloc.isascii():
        return
    # looking for characters like \u2100 that expand to 'a/c'
    # IDNA uses NFKC equivalence, so normalize for this check
    import unicodedata
    netloc2 = unicodedata.normalize('NFKC', netloc)
    if netloc == netloc2:
        return
    _, _, netloc = netloc.rpartition('@') # anything to the left of '@' is okay
@mcepl mcepl May 28, 2019


@zooba @tiran Could you explain this line (it is now at https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L405)? It seems to me that this is exactly what makes the first example from https://bugs.python.org/issue36216 fail just as before:

>>> from urllib.parse import urlsplit
>>> u = "https://example.com\uFF03@bing.com"
>>> urlsplit(u).netloc.rpartition("@")[2]
'bing.com'
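The concern raised in this comment can be sketched without relying on how any particular Python version treats the un-normalized URL (whether `urlsplit()` accepts it is version-dependent). The variable names below are illustrative. The idea: before normalization, everything after the last `@` looks like the host; after NFKC normalization, U+FF03 becomes an ASCII `#`, and the host a parser sees is different.

```python
import unicodedata
from urllib.parse import urlsplit

netloc = "example.com\uFF03@bing.com"

# Host as seen before normalization: everything after the last '@'.
host_before = netloc.rpartition("@")[2]

# NFKC turns U+FF03 (FULLWIDTH NUMBER SIGN) into an ASCII '#'.
normalized = unicodedata.normalize("NFKC", netloc)

# Parsing the normalized form: '#' now starts the fragment,
# so the netloc ends before it.
parts_after = urlsplit("https://" + normalized)
host_after = parts_after.netloc
```

Here `host_before` is `bing.com` while `host_after` is `example.com`, which is the mismatch the comment points at: a check that ignores the text to the left of `@` would not see the character responsible.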

    for c in '/?#@:':
        if c in netloc2:
            raise ValueError("netloc '" + netloc2 + "' contains invalid " +
                             "characters under NFKC normalization")

def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
@@ -424,6 +439,7 @@ def urlsplit(url, scheme='', allow_fragments=True):
                url, fragment = url.split('#', 1)
            if '?' in url:
                url, query = url.split('?', 1)
            _checknetloc(netloc)
            v = SplitResult('http', netloc, url, query, fragment)
            _parse_cache[key] = v
            return _coerce_result(v)
@@ -447,6 +463,7 @@ def urlsplit(url, scheme='', allow_fragments=True):
        url, fragment = url.split('#', 1)
    if '?' in url:
        url, query = url.split('?', 1)
    _checknetloc(netloc)
    v = SplitResult(scheme, netloc, url, query, fragment)
    _parse_cache[key] = v
    return _coerce_result(v)
@@ -0,0 +1,3 @@
Changes urlsplit() to raise ValueError when the URL contains characters that
decompose under IDNA encoding (NFKC-normalization) into characters that
affect how the URL is parsed.
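For callers, the practical effect of this entry is that `urlsplit()` can now raise `ValueError` on input it previously accepted. One possible way to handle that, assuming a Python version that includes this change, is a small wrapper; `try_urlsplit` is a hypothetical name and is not part of `urllib.parse`.

```python
from urllib.parse import urlsplit


def try_urlsplit(url):
    """Hypothetical helper: SplitResult on success, None on ValueError."""
    try:
        return urlsplit(url)
    except ValueError:
        # Includes netlocs with characters that NFKC-normalize to
        # '/', '?', '#', '@' or ':'.
        return None


ok = try_urlsplit("https://example.com/path")
rejected = try_urlsplit("https://netloc\u2100false.netloc/path")
```

Returning `None` is only one option; code that needs to report the reason may prefer to let the exception propagate.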