Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Regression in unicode handling in url.parse #1149

Closed
jeremys opened this issue Jun 3, 2011 · 6 comments
Closed

Regression in unicode handling in url.parse #1149

jeremys opened this issue Jun 3, 2011 · 6 comments

Comments

@jeremys
Copy link

jeremys commented Jun 3, 2011

Hi there,

I've detected a regression since Node 0.4.6 regarding the handling of unicode characters when parsing url. A test case is worth a thousand words so here we go:

Here's the gist: https://gist.github.com/1006680
The code:

var url = require('url'),
    assert = require('assert');
var test = 'http://➡.ws/pageloads';
var parsed = url.parse(test);
assert.equal(parsed.host, '➡.ws');
console.log('URL parsed correctly.');

With Node 0.4.6:

j:Code jeremy$ node -v
v0.4.6
j:Code jeremy$ node hostunicode.js 
URL parsed correctly.

With Node 0.4.8:

j:Code jeremy$ node -v
v0.4.8
j:Code jeremy$ node hostunicode.js 

node.js:134
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
AssertionError: "➡.ws" == ""
    at Object.<anonymous> (/Users/jeremy/Code/hostunicode.js:5:8)
    at Module._compile (module.js:407:26)
    at Object..js (module.js:413:10)
    at Module.load (module.js:339:31)
    at Function._load (module.js:298:12)
    at Array.<anonymous> (module.js:426:10)
    at EventEmitter._tickCallback (node.js:126:26)

And I also think node does not deal very well with IDNA (see http://tools.ietf.org/html/rfc3490.html) but I will fill that seperatly with a test case.

@isaacs
Copy link

isaacs commented Jun 3, 2011

Yeah, this happened because of a fix to prevent delimiters and other characters from getting in there.

This is one area where JavaScript's lack of proper unicode support on RegExps is really painful. It seems like we can maybe just blacklist known-disallowed characters in the 0x00-0x80 range, since even punctuation outside of ASCII seems to be allowed in a domain name.

@jeremys
Copy link
Author

jeremys commented Jun 6, 2011

Or maybe a solution is to implement IDNA encoding, and process the encoding first, then apply the current blacklist on the result?

Because we have another issue regarding unicode handling in http. Even with node 0.4.6 when I request http://➡.ws, it end up requesting http://â�¡.ws which of course fail because we should encode the domain.

What would be nice is that url.parse idna-encode the hostname before applying validation. But idna encoding is a big deal. Not sure if Node can leverage this from somewhere else (Python or something).

@isaacs
Copy link

isaacs commented Jun 6, 2011

Yeah, I think if we're going to do IDNA (which seems necessary for http to properly request utf hostnames), we'll need to do it at the url.parse layer. I'm thinking that url.parse should automatically convert to punycode, and thus, url.format will never output a non-ascii hostname. Since browsers treat punycode and utf hostnames the same, it should be the safest approach, if not the prettiest output.

@jeremys
Copy link
Author

jeremys commented Jun 6, 2011

I don't know how do you plan to do IDNA but there is this interesting lib by Google around URL parsing: http://code.google.com/p/google-url/ — It's used in Chromium. But, I'm no expert, I don't know if it can be used within Node.

@isaacs
Copy link

isaacs commented Jun 6, 2011

Yeah, I don't know if we want to pull the url parsing/formatting logic into the C layer. That's a pretty big library, and does a lot of stuff that node doesn't really need.

@ry
Copy link

ry commented Jun 13, 2011

See #1174

@isaacs isaacs closed this as completed in 2a848fa Jul 6, 2011
isaacs pushed a commit to isaacs/node-v0.x-archive that referenced this issue Jul 19, 2011
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants