-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
url: ignore IDN errors when domainname have two hyphens #12966
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This seems to be how the other URL parsers work.
@@ -487,6 +487,16 @@ const parseTests = { | |||
path: '/bar' | |||
}, | |||
|
|||
'https://r4---sn-a5mlrn7s.gevideo.com/bar': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a test here, can you add an entry to test/fixtures/url-idna.js instead? Those cases are tested through test-whatwg-url-domainto.js. Setting mode
to 'ascii'
should be enough in this case.
Also, please add test cases with leading and trailing hyphens (and both) as well.
src/node_i18n.cc
Outdated
@@ -516,6 +516,7 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf, | |||
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL; | |||
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG; | |||
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG; | |||
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the following code, which filters out several more conditions mandated by the UTS#46 draft, and adds an explicit comment explaining why this is:
// These error conditions are mandated unconditionally by UTS #46 version
// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these
// checks are made optional depending on the CheckHyphens flag, which will be
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46
// draft becomes standard.
// Refs:
// - https://github.com/whatwg/url/issues/53
// - http://www.unicode.org/review/pri317/
// - http://www.unicode.org/reports/tr46/tr46-18.html
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;
I also suspect that these lines of code need to be duplicated for the ToUnicode method in the file, since there is a set of identical changes to ToUnicode in the UTS#46 draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Will do.
I checked those too. But was not able to find any real world example of those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyGu : Made changes as suggested. Filtering errors for trailing and leading hyphens.
What is the policy for landing changes that aren't spec-compliant? Or is the intention to move the spec first? |
c8622a2
to
e97bc58
Compare
src/node_i18n.cc
Outdated
@@ -516,6 +531,22 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf, | |||
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL; | |||
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG; | |||
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG; | |||
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this line now that you've added my suggestion :)
// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain | ||
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these | ||
// checks are made optional depending on the CheckHyphens flag, which will be | ||
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
domain to Unicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. done all changes.
@domenic I don't think we have encountered such situations before, but generally yes I think the intention is to change spec first. In this case, however, there is a disparity between the spec and what browsers do, so I think it is justified to make this change, given that upstream is aware of the problem. |
Yeah, I guess that makes sense, although with luck we can use this to prod @annevk into action on a spec fix to at least match what Node.js intends to do here. Note that you're not matching browsers with this patch, as per whatwg/url#267 they seem to do something quite different. |
e97bc58
to
928c796
Compare
Hmm... I would definitely like to see this fixed but we need to make sure that (a) the behavior matches the browsers and (b) the spec issues gets resolved. I think I'd rather hold off on landing until we get confirmation on (a) and acknowledgement that (b) will happen. |
@jasnell : I just created a jsbin with all above style urls and https://jsbin.com/negusoweyu/edit?html,js,console,output . Browser (I tested on chrome, Safari, Firefox). They all parse these domains and resolves them and fires request. I added entries in /etc/hosts to get them resolved. Do you thing thats enough or any other test would be needed?. Let me know. |
fdf73f6
to
69283aa
Compare
To be clear, from reading browser source code and running more exhaustive tests than those simple ones, you'll find that browsers do not apply ToASCII at all for domains with only-ASCII characters. That is not what this PR does. (Nor is it what the spec does.) See more detail at the issue whatwg/url#267 I linked to above. |
It's unfortunate that what browsers do doesn't make any sense. It's the kind Latin-alphabet-favoritism we try to get rid of, although maybe it's minor enough not to matter... But yeah, if you want to match what browsers do, only run ToASCII if you found a code point > U+007F. (I think that's a per domain determination, not per domain label.) I don't really want to change the URL standard until I heard from implementers what we can actually do here, although I should probably add a warning. |
@domenic , @annevk , @TimothyGu : Will look at the browser code, tests and submit a change. Shall I file tickets to browser vendors for other cases?, or you are already having that discussion?. |
We're already having that discussion... again, please look at the issue I've pointed to twice now :) |
whatwg/url#309 has some additional information. |
Looks like we will need to wait for the spec for the time being. We have a "Waiting on Spec" column in the Github project for WHATWG URL...for now I'll move this one there. |
FWIW, if it's possible for you to iterate quickly I would recommend implementing what browsers do now ("fast path" for ASCII) and later align with any standard changes if they end up being different. It doesn't seem worth it to inconvenience end users over this. |
@annevk : Thats what I was planning to do. Will do it over weekend. |
@annevk : I tried to get the fast toASCII for all ascii domain. But corresponding ToUnicode will still fail for domain names with hyphens at 3&4 places or with trailing or leading hyphens. We can special case this by not running ToUnicode on domain labels not having "xn--" prefix. Will send a different PR |
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
ToUnicode can never fail. It can only signal errors, but it never fails. See http://www.unicode.org/reports/tr46/#ToUnicode for details. |
@annevk, hmm I missed that line when reading (step 2). Will note in the future. |
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
To match browser behavior fast path ascii only domains and do not run ToASCII on them. Fixes: nodejs#12965 Refs: nodejs#12966 Refs: whatwg/url#309
Reopening this based on this comment. |
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4; | ||
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN; | ||
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I was mistaken in my understanding of ToUnicode, which means that the entire block (ll. 464-485) shouldn't be necessary. While at it, can you delete this block, the erroneous || (!lenient && info.errors != 0)
condition below, as well as the now-unnecessary lenient
parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR LGTM except the ToUnicode issue above, but we can fix that at a later date if this PR should get in ASAP.
8e11070
to
5187a90
Compare
There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Fixes: nodejs#12965
There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Also filters IDN errors when domain label start or end with hyphen. Fixes: nodejs#12965 Refs: https://www.icann.org/news/announcement-2000-01-07-en
There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Also filters IDN errors when domain label start or end with hyphen. Fixes: nodejs#12965 Refs: https://www.icann.org/news/announcement-2000-01-07-en
There are valid domain names with hyphens at 3 and 4th position, new node WHATWG URL parser was failing for it assume its an invalid IDN. Also filters IDN errors when domain label start or end with hyphen. Also fix error in ToUnicode Fixes: nodejs#12965 Refs: https://www.icann.org/news/announcement-2000-01-07-en Refs: whatwg/url#309 (comment)
5187a90
to
5e0f58c
Compare
We had to do hacks to get around this problems with try/catch and using older url module. |
@TimothyGu : I made the above changes. Also based on commit to #309 shall we add checkHyphen param, will do that as separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better! LGTM.
@TimothyGu, @watilde : Can one you start a CI run for this please?. |
@TimothyGu : Anything more to get this landed?. |
Nope, landed in 06a617a. Note that for future contributions, the first line of the commit message should be under 50 characters per our commit message guidelines. I've amended the message this time, however. Thanks again for staying through on this PR! |
This commit contains three separate changes: - Always return a string from ToUnicode no matter if an error occurred. - Disable CheckHyphens boolean flag. This flag will soon be enabled in the URL Standard, but is implemented manually by selectively ignoring certain errors. - Enable CheckBidi boolean flag per URL Standard update. This allows domain names with hyphens at 3 and 4th position, as well as those with leading and trailing hyphens. They are technically invalid, but seen in the wild. Tests are updated and simplified accordingly. PR-URL: #12966 Fixes: #12965 Refs: whatwg/url#309 Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Timothy Gu <timothygu99@gmail.com>
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Fixes: #12965
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)