Regression in unicode handling in url.parse #1149
Comments
Yeah, this happened because of a fix to prevent delimiters and other characters from getting in there. This is one area where JavaScript's lack of proper unicode support on RegExps is really painful. It seems like we can maybe just blacklist known-disallowed characters in the 0x00-0x80 range, since even punctuation outside of ASCII seems to be allowed in a domain name.
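The blacklist idea above can be sketched roughly as follows. This is only an illustration, not the code in url.parse: the helper name `hostnameEnd` and the exact list of disallowed characters are assumptions made for the example.

```javascript
// Hypothetical sketch: find where the hostname ends by rejecting only
// known-disallowed ASCII characters. Anything outside ASCII is let through.
// The character list is illustrative, not the list url.parse uses.
const DISALLOWED_ASCII = new Set('<>"` \r\n\t{}|\\^\'/?#@:'.split(''));

function hostnameEnd(input) {
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    // Only ASCII code units (below 0x80) are checked against the blacklist.
    if (code < 0x80 && DISALLOWED_ASCII.has(input[i])) {
      return i;
    }
  }
  return input.length;
}

const rest = '➡.ws/path?x=1';
console.log(rest.slice(0, hostnameEnd(rest)));
```

Because the check never looks at code units at or above 0x80, it avoids needing unicode-aware RegExps at all.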
Or maybe a solution is to implement IDNA encoding, process the encoding first, and then apply the current blacklist to the result? We have another issue regarding unicode handling in http. Even with Node 0.4.6, when I request http://➡.ws, it ends up requesting http://â�¡.ws, which of course fails because the domain should have been encoded. It would be nice if url.parse IDNA-encoded the hostname before applying validation. But IDNA encoding is a big deal. Not sure if Node can leverage this from somewhere else (Python or something).
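The garbled host above is consistent with the UTF-8 bytes of ➡ (U+27A1) being read back as Latin-1. The snippet below reproduces that pattern; it assumes a modern Node.js with `Buffer.from` and does not claim to show what the 0.4.x http client did internally.

```javascript
// Sketch: the UTF-8 bytes of U+27A1, reinterpreted one byte per character.
const utf8Bytes = Buffer.from('➡', 'utf8');
console.log(utf8Bytes); // three bytes: e2 9e a1

// Decoding those bytes as Latin-1 yields "â", a C1 control character
// (0x9E, typically rendered as a replacement glyph), and "¡".
const misread = utf8Bytes.toString('latin1');
console.log(JSON.stringify(misread));
```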
Yeah, I think if we're going to do IDNA (which seems necessary for http to properly request utf hostnames), we'll need to do it at the url.parse layer. I'm thinking that url.parse should automatically convert to punycode, and thus, url.format will never output a non-ascii hostname. Since browsers treat punycode and utf hostnames the same, it should be the safest approach, if not the prettiest output.
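For illustration, here is what converting a hostname to punycode at the parsing layer looks like. This sketch assumes a much later Node.js than the versions discussed in this thread, one where `url.domainToASCII`, `url.domainToUnicode` and the WHATWG `URL` class are available. It is not the patch that was applied here.

```javascript
// Sketch, assuming a modern Node.js (these APIs postdate this issue).
const { domainToASCII, domainToUnicode } = require('url');

// Convert the unicode hostname to its ASCII (punycode) form.
const asciiHost = domainToASCII('➡.ws');
console.log(asciiHost);

// The conversion can be reversed for display purposes.
console.log(domainToUnicode(asciiHost));

// The WHATWG URL parser applies the same conversion while parsing,
// so the serialized hostname is always ASCII.
const parsed = new URL('http://➡.ws/path');
console.log(parsed.hostname);
```

Keeping the ASCII form as the canonical one means anything that formats the URL afterwards never has to deal with non-ASCII hostnames.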
I don't know how you plan to do IDNA, but there is this interesting lib by Google around URL parsing: http://code.google.com/p/google-url/ (it's used in Chromium). But I'm no expert, and I don't know if it can be used within Node.
Yeah, I don't know if we want to pull the url parsing/formatting logic into the C layer. That's a pretty big library, and does a lot of stuff that node doesn't really need.
See #1174
Using @bnoordhuis's punycode lib. Close nodejs#1174 also
Hi there,
I've detected a regression since Node 0.4.6 in the handling of unicode characters when parsing URLs. A test case is worth a thousand words, so here we go:
Here's the gist: https://gist.github.com/1006680
The code:
With Node 0.4.6:
With Node 0.4.8:
I also think node does not deal very well with IDNA (see http://tools.ietf.org/html/rfc3490.html), but I will file that separately with a test case.