-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
url.parse seems to change depending on the characters used in the domain name #5832
Comments
I think some characters are disallowed by spec or potentially dangerous etc. cc @nodejs/http probably. |
For what its worth, chrome also parses |
@Fishrock123 there are multiple specs, and we reference two behavioural sources (I hestitate to call them specs), browsers (and we don't do it like Chrome, for example), and the other I link above, where we also follow a fairly random subset of it. If you look at the code, you can see the code seems to be written as to work as I expected it would... and then near the end there is a call to a fairly incomplete validate function that then rearranges what we just parsed :-( |
I would expect that the parser either following the liberal-ness of the WhatWG spec, or completely reject the URL, rather than be in the weird in-between state it is currently. |
At this point, for better or worse, the WhatWG spec is likely the most authoritative source. Specifically, this: https://url.spec.whatwg.org/ A fairly comprehensive corpus of tests have been put together here: https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json We should definitely be working to validate against that suite. |
/cc @nodejs/testing |
Ref: #5885 |
Unfortunately, the tests in #5858 don't includ a host with From my reading of https://url.spec.whatwg.org/#host-parsing:
|
which was not intended to suggest importing them isn't a great idea, @jasnell |
The current answer to this issue is: use the new WHATWG URL parser as an alternative as this is not likely to be fixed. Closing. We can reopen if necessary. |
First, the behaviour:
I would expect that in all of the above, that the path would be
/path
. From my point of view, random non-URL syntax characters are being pushed into the path, and its pretty surprising.There are some statements in the test code that make this appear
to be deliberate, but they don't justify the behaviour.
While it is true that
*
is not a valid domain, according to the host parsing rules quoted, neither is-
, or0
or.
. I would expect.
to be treated as.
, returned in the host string.I would not expect the url parser to validate that domain names are well formed, though I would expect characters that are defined as part of the URL syntax to of course not be valid.
I can implement my own url parser that allows
*
, and I will for backwards compat, but I think this is a bit odd. URL is generally very lax in its parsing, it gives you the syntactic bits, and you get to validate whether they are correct for your use-case, this is the first time its failed my expectations.The text was updated successfully, but these errors were encountered: