Normalising percent-encoded characters in paths can cause issues with non-compliant server implementations. #1646
-
Hi everyone, i'm maintaining a project using httpx (great lib i love your work 😍) and i noticed that some requests are failing because they reach the maxRedirects counter, i took the time to track this thing and long story short it turned out that it's a problem with the url encoding (the issue is also present in aiohttp and in requests). Basically some webservers try to force redirect you to a non rfc3986 compliant url and when the redirect url is processed by the httpx Url class, it gets rewritten into a compliant one by: Line 573 in 9b17671 Because the relative_uri gets normalized by the rfc3986 library here: https://github.com/python-hyper/rfc3986/blob/d8af43866e0c0a843df0f9429b9e84dc150c3fbf/src/rfc3986/_mixin.py#L302-L307 And gets rewritten by this method: https://github.com/python-hyper/rfc3986/blob/d8af43866e0c0a843df0f9429b9e84dc150c3fbf/src/rfc3986/normalizers.py#L99-L108 And since the webserver doesn't accept that, it keeps infinitely redirecting. I created a reproducible test case so you can check it out. I definitely thought that the behavior is correct at first until i tried the same request in curl/browsers/postman/node libraries and noticed that it worked correctly, so i'm not sure if it's considered a httpx issue, an rfc3986 one (probably not) or if it's an issue at all. Thanks for your time and sorry if there is already an issue related to this because i searched and didn't find any. |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments
-
So, yeah you've kind identified what's going on here... The server is redirecting any path with uppercased percent encoded characters to the same path but with lowercased percent encoded characters. Make the request using $ curl -v https://www.reezocar.com/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm
# 301 to: https://www.reezocar.com/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm But 200 in the lowercased variant: $ curl -v https://www.reezocar.com/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm
# 200 Now, actually, the lowercase variant isn't rfc3986 compliant, so when you're using You can see this for example, just by instantiating a URL, and inspecting the path... >>> u = httpx.URL('https://www.reezocar.com/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm')
>>> u
URL('https://www.reezocar.com/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm')
>>> u.raw_path
b'/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm'
>>> u.raw
(b'https', b'www.reezocar.com', None, b'/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm') If we work down to the low level Transport API we can actually confirm that if we can send the network request with either the lowercase variant, or the uppercase variant, and the server will respond differently in each case... * >>> t = httpx.HTTPTransport()
>>> url_upper = (b'https', b'www.reezocar.com', None, b'/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm')
>>> r = t.handle_request(b'GET', url_upper, headers=[(b'Host', 'www.reezocar.com')], stream=httpx.ByteStream(b''), extensions={})
>>> r[0]
301
>>> url_lower = (b'https', b'www.reezocar.com', None, b'/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm')
>>> r = t.handle_request(b'GET', url_lower, headers=[(b'Host', 'www.reezocar.com')], stream=httpx.ByteStream(b''), extensions={})
>>> r[0]
200 Tho when we're using a client we only ever get the (RFC3986 spec compliant) case. So, what do we want to do about this (if anything)? The server is exhibiting incorrect behaviour, but we might feasibly want to preserve the casing of percent encoded characters on the network request rather than normalising them. Would be worth digging through the * Actually kinda a fantastic example in favour of the design decision for our Transport API to use a pre-parsed URL format. |
Beta Was this translation helpful? Give feedback.
-
Thank you for the fast response. Exactly it's definitely the server's problem but noticing that other implementations are correctly working made me open this discussion. I'll look into requests issues and check if there is a similar problem. |
Beta Was this translation helpful? Give feedback.
So, yeah you've kind identified what's going on here...
The server is redirecting any path with uppercased percent encoded characters to the same path but with lowercased percent encoded characters.
Make the request using
curl
, and it'll redirect in the uppercased variant:$ curl -v https://www.reezocar.com/occasion/annonce-willys%2Foverland-overland-RZCCSTFR187849.htm # 301 to: https://www.reezocar.com/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm
But 200 in the lowercased variant:
$ curl -v https://www.reezocar.com/occasion/annonce-willys%2foverland-overland-RZCCSTFR187849.htm # 200
Now, actually, the lowercase variant isn't rfc3986 compliant, so when you're using
requests