-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-87389: avoid treating path as URI with netloc #93894
base: main
Are you sure you want to change the base?
Conversation
still needs test for urlunsplit() change
When you're done making the requested changes, leave the comment: |
I have made the requested changes; please review again. I added |
Thanks for making the requested changes! @gpshead: please review the changes made to this pull request. |
The pathsplit() function will work correctly on relative paths too so don't say "absolute paths". Improve comment for _get_redirect_url().
Lib/http/server.py
Outdated
# reported in gh-87389, a path starting with a double slash should not be | ||
# treated as a relative URI. Also, a path with a colon in the first | ||
# component could also be parsed wrongly. | ||
parts = urllib.parse.pathsplit(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately BaseHTTPRequestHandler self.path
isn't guaranteed to be only a pathname by anything.
GET http://netloc/path/to/thing HTTP/1.0
Is a valid HTTP query to many http servers. Our existing http server code ignores the scheme and netloc on that and serves up /path/to/thing properly. This code might break things doing that? ie: if http://netloc winds up in the returned path, the it seems like that'd move through pathsplit and urlunsplit properly only by chance rather than design just given the name and purpose of the pathsplit API vs what was just handed to it as input.
Check out my updated test in #87389.
I have no idea if anything depends on this as it should be unusual for any client not pretending to be talking to an HTTP proxy to make requests that way as that isn't normal per the http/1.0 and 1.1 RFCs but as it worked before and other servers like Apache continue to support it, I don't think we can break that - at least not in a bugfix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my reading of the RFC, GET http://netloc/path/to/thing HTTP/1.0
should technically be allowed. However, http.server
doesn't handle it, before or after this PR. Instead, it tries to use the second word as an absolute file path. I feel like we should not attempt to fix http.server
to handle the above. For HTTP 1.0, using only a file path was required. For curl
I notice that even with the --http1.1
flag, it passes a path and not the URL with the scheme+netloc.
The bug causing #87389, IMHO, is that http.server
was not consistent in how it was treating self.path
. The translate_path()
method was treating it as a path and not a URL. But, the redirect to add a trailing slash (done only if a folder with that name was found), parsed it as allowing a full URL. It needs to do it one way or the other consistently. The _request_path_split()
method I added does this. It doesn't make sense to add this to urllib.parse
since what http.server
is doing is not modern HTTP.
Misc/NEWS.d/next/Security/2022-06-16-12-13-55.gh-issue-87389.MS9wAR.rst
Outdated
Show resolved
Hide resolved
This avoids "manual parsing" of the Request-URI part of the request and matches what _get_redirect_url() does.
Possible that someone could override this so a method is nicer.
Since pathsplit() doesn't seem like a generally useful public API, remove it. Instead, add a _request_path_split() method. This ensures that the redirect logic and the translate_path() method use the same path parsing.
@gpshead what do you think about landing this change for 3.12 only as an enhancement? |
Yeah I think this landing in 3.12 as a feature makes sense. this PR branch needs syncing now that the other PR has gone in. |
A few thoughts about this. I can rework the PR so that changes to urlib.parse are separated from the httpserver change. The fix made is 3.11 is sufficient for fixing the security bug so the change in this PR isn't strictly required. The change to urllib.parse.urlunsplit needs careful consideration. Not merging it into 3.11 was the right move. Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of
We could try changing it an alpha and see what breaks. Or, we can try to analyze publicly available code and see how Ideally we should analyze all the arguments of |
Agreed, it's a challenge. I'm not personally motivated to try and push for this change to the level of doing code analysis of code-at-large's urlunsplit API calls. I like your thinking that further analysis of all the safety checks required being good. There are probably useful things to learn from what other libraries (in any language) do for URL construction from their parts libraries. Adding this is always easier as a I suggest this get tracked as its own Enhancement request Issue. This PR can presumably be tied to that and closed for now given more not yet fully defined work is needed if it is going to be done at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be three behaviour changes proposed:
- Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'
Another option is to prefix with double slash, representing an empty host e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer that four-slash option, because it also fixes some urlsplit → urlunsplit round-trip cases.
- Changing urlunsplit(('', '', 'colon:path', '', '')) → './colon:path'
This seems a reasonable change, and it is kind of suggested in RFC 3986. (Another option might be to encode the first colon, and return 'colon%3Apath'.)
- SimpleHTTPRequestHandler’s handling of
GET https://example.net/dir
This is a legal HTTP 1.0 and 1.1 request, but is mainly for proxy servers, which is not what SimpleHTTPRequestHandler does. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in spirit of HTTP), and decides to redirect with a trailing slash.
Currently it looks like the code sends Location: https://example.net/dir/
. I don’t think there is anything really wrong with that.
The proposed changes would send Location: ./https://example.net/dir/
. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir
, so the redirect would resolve to https://example.net/https://example.net/dir/
, which is not intended.
If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.
If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, and either add an opt-in way to get the new behaviour, or change the warning to an exception in the future?
@@ -491,14 +491,32 @@ def urlunparse(components): | |||
url = "%s;%s" % (url, params) | |||
return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment))) | |||
|
|||
# Returns true if path can confused with a scheme. I.e. a relative path | |||
# without leading dot that includes a colon in the first component. | |||
_is_scheme_like = re.compile(r'[^/.][^/]*:').match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the special allowance for a leading dot? Is there a test case for it? Yes, a scheme cannot start with a dot, but a path-noscheme component of .:
is no more legal than https:
according to RFC 3986.
# expected result is a relative URL without netloc and scheme | ||
(('', 'a', '', '', ''), '//a'), | ||
# extra leading slashes need to be stripped to avoid confusion | ||
# with a relative URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confusion with a protocol-relative URL? [as opposed to a host-relative URL]
I believe this is a more correct fix for gh-87389 and may fix security bugs in other applications using
urlunsplit()
. We should not be confusing thepath
argument as a netloc, IMHO. I fixedhttp.server
by not usingurllib.parse
on thepath
. Forhttp.server
, that part of the HTTP request is not treated as a full URL but instead a path + optional query + optional fragment. So just parsing it "by hand" and not usingurllib.parse
avoids the bug.