urllib.parse doesn't round-trip file URI's with multiple leading slashes #78457
Comments
urllib.parse doesn't seem to round-trip file URIs containing multiple leading slashes. For example, this--

    import urllib.parse

    def round_trip(url):
        parsed = urllib.parse.urlsplit(url)
        new_url = urllib.parse.urlunsplit(parsed)
        print(f'{url} [{parsed}]\n{new_url}')
        print('ROUNDTRIP: {}\n'.format(url == new_url))

    for i in range(4):
        round_trip('file://{}root/a'.format(i * '/'))

results in--
URIs of the form file:////<host>/<share>/<path> occur, for example, when one wants to git-clone a UNC path on Windows.
Here is where CPython defines urlunsplit(): Lines 465 to 482 in 4e11c46
(The '//' special-casing seems to occur in this line here: Line 473 in 4e11c46)
And here is where the round-tripping is tested: cpython/Lib/test/test_urlparse.py Line 156 in 4e11c46
(Three initial leading slashes are tested, but not the problem case of four or more.)
This is an issue with Python 2 as well, which I hope can also be fixed. The original logic in the code was committed around 16 years back (bbc0568), and the tests are around 10 years old too.

    ➜ cpython git:(2bea771) ✗ ./python.exe
    file:///root/a [SplitResult(scheme='file', netloc='', path='/root/a', query='', fragment='')]
    file:////root/a [SplitResult(scheme='file', netloc='', path='//root/a', query='', fragment='')]
    file://///root/a [SplitResult(scheme='file', netloc='', path='///root/a', query='', fragment='')]

Thanks
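To make concrete why losing two of the leading slashes matters, here is a small sketch (not taken from the thread) that parses both the four-slash URL and the two-slash URL it would collapse to. It uses only urllib.parse.urlsplit; the variable names are illustrative.

```python
from urllib.parse import urlsplit

# The URL the reporter starts with: empty authority, path begins with '//'.
intended = urlsplit('file:////root/a')

# The string produced if the two extra slashes are dropped on rebuild.
collapsed = urlsplit('file://root/a')

# In the first case 'root' is part of the path; in the second it has
# become the authority (netloc), so the two URLs mean different things.
print(repr(intended.netloc), intended.path)
print(repr(collapsed.netloc), collapsed.path)
```

So a lossy round trip here is more than a cosmetic difference: the first path segment is reinterpreted as a host name.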
I just checked the behavior of Perl's URI module (https://github.com/libwww-perl/URI/). It seems to handle this case along with some additional ones. Maybe some of its tests could be adopted for better coverage too (https://github.com/libwww-perl/URI/blob/master/t/split.t).

    $ cat bpo34276.pl
    use URI::Split qw(uri_split uri_join);
    sub print_url{
    print_url("file://root/a");
    $ perl bpo34276.pl
    original uri file://root/a
    returned uri file://root/a
    original uri file:///root/a
    returned uri file:///root/a
    original uri file:////root/a
    returned uri file:////root/a
    original uri file://///root/a
    returned uri file://///root/a

Thanks
This may be a very old regression (from 2002) caused by bpo-591713 and Mercurial rev. 554f975073a0. The original check for the double slash, added in 0d6bd391acd8, “escapes” a path beginning with a double slash by prefixing it with two more slashes (empty “netloc”). This should round-trip Chris’s problem URLs.

I think the logic in “urlunsplit” should always add the extra double slash for the netloc, regardless of path, at least if a scheme is present and it is registered in “uses_netloc”. This should fix Chris’s instance of the bug, since “file:” is registered. There is already a patch in bpo-1722348 which should do this (although it includes other changes as well).

The double slash should also be escaped if no scheme is present. (The empty scheme string is already in “uses_netloc”.) This might satisfy bpo-23505.

IMO it would be better to do the escaping by default, for all schemes unknown to “urllib”, and to blacklist specific schemes like “mailto:” instead. But that would be out of scope for a bug fix.
Thanks for all the extra info. A couple more comments:
However, I don't think that means Python shouldn't try to roundtrip it successfully. Also, git-clone is apparently okay with URLs of this form, and does the right thing with them.
I think your URLs are valid by RFC 3986. "When authority is not present" refers to URLs without the double-slash prefix, like "urn:example:animal:ferret:nose". The RFC treats empty authority and no authority as different cases.

If authority is present, the format for hier-part has to be

    "//" authority path-abempty

Authority may be an empty string:

    authority = [userinfo "@"] host [":" port]
    host = IP-literal / IPv4address / reg-name
    reg-name = *(unreserved / pct-encoded / sub-delims) ; May be empty

Path-abempty may begin with two slashes if the first two segments are empty strings:

    path-abempty = *("/" segment)
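The difference between an absent and an empty authority can be checked with the generic regular expression given in RFC 3986, Appendix B. The sketch below is only an illustration: `split_rfc3986` is a hypothetical helper, not part of urllib, and it returns `None` for the authority when the "//" prefix is missing.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_rfc3986(uri):
    """Hypothetical helper: return (scheme, authority, path).

    The authority is None when no '//' prefix is present and '' when
    the prefix is present but nothing follows it.
    """
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)

for uri in ('urn:example:animal:ferret:nose',
            'file:/foo',
            'file:///foo',
            'file:////root/a'):
    print(uri, split_rfc3986(uri))
```

With this split, 'file:////root/a' has an empty (but present) authority and a path of '//root/a', which matches the path-abempty reading above.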
I'm not well-versed on this. But I guess this means urllib.parse doesn't support this distinction. For example:

    >>> urllib.parse.urlsplit('file:/foo')
    SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
    >>> urllib.parse.urlsplit('file:///foo')
    SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
    >>> urllib.parse.urlsplit('file:/foo') == \
    ... urllib.parse.urlsplit('file:///foo')
    True

Both have authority / netloc equal to the empty string, even though in the first example the authority isn't present per your comment.
Yes, urllib doesn’t distinguish a missing authority/netloc from an empty string. The same goes for the ?query and #fragment parts. There is bpo-22852 open about that.
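A short sketch of the same effect for the query and fragment parts, assuming the default behaviour of urlsplit() and urlunsplit(); the example URL is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Same URL with and without the empty '?' and '#' delimiters.
with_delims = urlsplit('http://example.com/path?#')
without_delims = urlsplit('http://example.com/path')

# Both parse to empty strings for query and fragment, so the results
# compare equal and the delimiters cannot be restored on rebuild.
print(with_delims == without_delims)
print(urlunsplit(with_delims))
```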
The file URI scheme is covered by RFC 8089, specifically https://tools.ietf.org/html/rfc8089#appendix-E.3.2.
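As a rough illustration of how a UNC path ends up as a four-slash file URI of the kind discussed in this issue, here is a minimal sketch. `unc_to_file_uri` is a hypothetical helper, not a standard-library function, and it assumes that simply swapping backslashes for slashes and percent-encoding the result is acceptable.

```python
from urllib.parse import quote, urlsplit

def unc_to_file_uri(unc_path):
    # Hypothetical helper (not part of the standard library).
    if not unc_path.startswith('\\\\'):
        raise ValueError('expected a UNC path starting with two backslashes')
    # Swap backslashes for forward slashes, then percent-encode
    # everything except '/' (the default for quote()).
    path = quote(unc_path.replace('\\', '/'))
    # 'file://' (empty authority) plus a path that itself begins with
    # '//' gives four slashes in total.
    return 'file://' + path

uri = unc_to_file_uri(r'\\host\share\my repo')
print(uri)
print(urlsplit(uri))
```

Parsing the result gives an empty netloc and a path starting with '//', which is exactly the shape that needs to survive urlunsplit().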
Initially the condition was added in f3963b1, specially to handle f3963b1#diff-c272b33cfc076b56637b73ae6fecd13ca0b999883a0a48b16a5c564b24bdb4c3R115

    if netloc or (scheme in uses_netloc and url[:2] == '//'):

But it was wrong for 7dfb6e2#diff-c272b33cfc076b56637b73ae6fecd13ca0b999883a0a48b16a5c564b24bdb4c3R131

    if netloc or (scheme in uses_netloc and url[:2] != '//'):

Since the first change was without tests, its initial purpose was lost. I believe that the right condition should be

    if netloc or (scheme and scheme in uses_netloc) or url[:2] == '//':

Even if
After seeing that the “mailto:” RFC 6068 says slashes (/) have to be encoded in e.g. mailto:%2F%2Flocal@domain, I agree with adding the extra double-slash (//) regardless of scheme. (Previously I believed mailto://local@domain to be valid for an address of //local@domain, but I take that back.)
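To try out the condition proposed above, here is a sketch of a stand-alone unsplit function. `unsplit_proposed` is a hypothetical name; its body is modelled on the general shape of urlunsplit() but is an assumption for illustration, not the actual CPython code or the final fix.

```python
from urllib.parse import urlsplit, uses_netloc

def unsplit_proposed(parts):
    """Hypothetical urlunsplit() variant using the proposed condition."""
    scheme, netloc, url, query, fragment = parts
    # Proposed condition: emit '//' whenever there is a netloc, the
    # scheme is registered in uses_netloc, or the path starts with '//'.
    if netloc or (scheme and scheme in uses_netloc) or url[:2] == '//':
        if url and url[:1] != '/':
            url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

# Re-run the reporter's four test URLs through the sketch.
results = {}
for i in range(4):
    original = 'file://{}root/a'.format(i * '/')
    results[original] = unsplit_proposed(urlsplit(original))
    print(original, results[original] == original)
```

One consequence of this condition is that 'file:/foo' is rebuilt as 'file:///foo', because urlsplit() reports an empty netloc for both forms.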