Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: enforce valid UTF-8 in WHATWG parser #11436

Closed
wants to merge 4 commits into from

Conversation

TimothyGu
Copy link
Member

This commit implements the Web IDL USVString conversion, which mandates all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT CHARACTER. It also disallows Symbols to be used as USVString, per spec.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

url

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels Feb 17, 2017
@TimothyGu
Copy link
Member Author

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 17, 2017

Some background on why I chose to implement it in C++. I first implemented a fairly optimized JS version, but found out that there is a significant disparity between the best-case performance (validsurr) and the worst-case performance (allinvalid). This, to me, is unacceptable. This C++ implementation is a bit slower for the best case, but a lot faster for the worst case.

JS implementation
function toUSVString(V) {
  if (typeof V === 'symbol')
    throw new TypeError();

  const S = String(V);
  const n = S.length;
  var U = '';
  var lastPos = 0;
  for (var i = 0; i < n; ++i) {
    const c = S.charCodeAt(i);

    if (c < 0xD800 || c > 0xDFFF) {
      continue;
    } else if (0xDC00 <= c && c <= 0xDFFF || i === n - 1) {
      if (lastPos < i)
        U += S.slice(lastPos, i);
      lastPos = i + 1;
      U += '\ufffd';
    } else {
      const d = S.charCodeAt(i + 1);
      if (0xDC00 <= d && d <= 0xDFFF) {
        ++i;
        continue;
      } else {
        if (lastPos < i)
          U += S.slice(lastPos, i);
        lastPos = i + 1;
        U += '\ufffd';
      }
    }
  }
  if (lastPos === 0)
    return V;
  if (lastPos < n)
    return U + S.slice(lastPos, n);
  return U;
}
Benchmark results
h/usvstring.js n=50000000 input="valid" type="js": 15,533,604.407488512
h/usvstring.js n=50000000 input="validsurr" type="js": 17,423,172.615309086
h/usvstring.js n=50000000 input="someinvalid" type="js": 3,411,650.9908266817
h/usvstring.js n=50000000 input="allinvalid" type="js": 2,352,265.169112214
h/usvstring.js n=50000000 input="valid" type="cpp": 16,506,065.368628675
h/usvstring.js n=50000000 input="validsurr" type="cpp": 15,627,668.561148917
h/usvstring.js n=50000000 input="someinvalid" type="cpp": 8,808,868.726046104
h/usvstring.js n=50000000 input="allinvalid" type="cpp": 8,750,259.202803183

@targos
Copy link
Member

targos commented Feb 17, 2017

I think we already have somehow a valid implementation for toUSVString in the project. It is used when you call buffer.toString('utf8').

@TimothyGu
Copy link
Member Author

@targos, it is a lot slower (10x slower for validsurr) because of string-to-Buffer conversion

@targos
Copy link
Member

targos commented Feb 17, 2017

Of course, but can't we reuse it directly on the string, without the conversion?

@TimothyGu
Copy link
Member Author

but can't we reuse it directly on the string, without the conversion?

No, unfortunately.

> Buffer.prototype.toString.call('a')
TypeError: this.utf8Slice is not a function
> Buffer.prototype.utf8Slice.call('a')
TypeError: argument should be a Buffer

That notwithstanding, the actual conversion is done under-the-hood at buffer creation-time in StringBytes::Write, which uses the same mechanism as the Utf8Value class, which would work as well, except it is less efficient than the implementation in this PR.

@targos
Copy link
Member

targos commented Feb 17, 2017

I didn't mean use the JS function, rather the underlying V8 C++ implementation. I'm sorry for the confusion, I didn't see your comment in node_url.cc that explains why you are not doing that.

@@ -48,8 +48,7 @@ TwoByteValue::TwoByteValue(Isolate* isolate, Local<Value> value) {
const size_t storage = string->Length() + 1;
AllocateSufficientStorage(storage);

const int flags =
String::NO_NULL_TERMINATION | String::REPLACE_INVALID_UTF8;
const int flags = String::NO_NULL_TERMINATION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is right, since this would affect more than just the WHATWG URL implementation?

Copy link
Member

@bnoordhuis bnoordhuis Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is not an acceptable change. You could turn the flags into an optional argument that defaults to what is now or make it a template trait.

EDIT: Objection withdrawn.

Copy link
Member Author

@TimothyGu TimothyGu Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I've explained in 424722af4541cd0eec1bf299aa8afcdff6284f52, this flag is actually a no-op. TwoByteValue::TwoByteValue uses v8::String::Write as opposed to v8::String::WriteUtf8. The flag is only respected by WriteUtf8, and hence misleading being applied here in TwoByteValue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right, I missed that it's the UTF-16 version. Objection withdrawn.

src/node_url.cc Outdated
uint16_t c = value[i];
if (c < 0xD800 || c > 0xDFFF) {
continue;
} else if (0xDC00 <= c || i == n - 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid mixing normal and Yoda-style comparisons in the same block of code, it makes it harder to understand.

src/node_url.cc Outdated
value[i] = 0xFFFD;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably get a bit of speedup if you split this loop in a simple linear scan for invalid characters, and a second one that fixes them up.

That optimizes for the common case that of valid UTF-8 and because you can do the scan in JS land with a regex (/[\uD800-\uDBFF]?[\uDC00-\uDFFF]/ if I read the logic correctly), you can avoid a call into C++ land.

Doing the scan in JS land penalizes the uncommon case because C++ lands needs to scan again. Whether the first scan should be in done in JS or C++ land depends on how common the uncommon case is. Not very, I expect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis, I'll benchmark this, but is a regex faster than character search these days? Also, the regex is more like /([^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF]([^\uDC00-\uDFFF]|$)/, though if there's a way to simplify that will probably help performance-wise as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macros like this might help to make it a bit easier to read in the C++ land...

#define IS_HIGH_SURROGATE(c) (((c) & 0xFC00) == 0xD800)
#define IS_LOW_SURROGATE(c)  (((c) & 0xFC00) == 0xDC00)

I think the i++ part is a bit implicit(if I understand correctly this would make the second branch "unexpected" so "changed = true"). Testing for an unexpected low surrogate first, then for high surrogates checks the following character(or EOL) immediately instead of waiting until the next cycle could be easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joyeecheung, I've taken up on your suggestion of using macros. However, after doing so, the branching looks like:

      uint16_t d = buf[i + 1];
      if (IS_SURROGATE_TRAIL(d)) {
        i++;
      } else {
        changed = true;
        buf[i] = 0xFFFD;
      }

I personally consider it a better style to always use the positive expression (without !) when an else clause is involved, though I'll see if it is necessary to perhaps add some likely/unlikely annotations if it helps performance.

Copy link
Member

@joyeecheung joyeecheung Feb 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I had in mine is something like (sorry for the code dump):

for (size_t i = 0; i < n; ++i) {
  uint16_t c = value[i];
  // paired low surrogate would have been consumed in the previous cycle
  // so this must be unpaired.
  if (IS_LOW_SURROGATE(c)) {
    changed = true;
    value[i] = UTF_REPLACEMENT;  // or break if we optimize for the common case
  } else if (IS_HIGH_SURROGATE(c)) {
    ++i;  // check the next character immediately
    if (i == n) {  // ends with high surrogate, unpaired
      changed = true;
      value[i - 1] = UTF_REPLACEMENT;  // or break
    } else {
      uint16_t d = value[i];
      if (IS_LOW_SURROGATE(d)) {  // paired
        continue;
      } else {  // high surrogate without a following low surrogate, unpaired
         changed = true;
         value[i - 1] = UTF_REPLACEMENT; // or break
      }
    }
  }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: d should be value[i]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and..no need to replace the value[i] if value[i-1] is unpaired high surrogate

Copy link
Member

@bnoordhuis bnoordhuis Feb 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll benchmark this, but is a regex faster than character search these days?

I expect the answer is 'yes' for short strings. Calling into C++ land and copying out the string has quite a bit of overhead associated with it. It's amortized for long strings but it's probably quite substantial for short strings (and I assume most inputs will be relatively short.)

Also, the regex is more like /([^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF]([^\uDC00-\uDFFF]|$)/, though if there's a way to simplify that will probably help performance-wise as well.

A simple scan for /[\uD800-\uDFFF]/ followed by a call into C++ land when it tests true may already be good enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joyeecheung, I don't think your code snippet would work for a "high high low" sequence, since in the process of checking for "high" i is advanced twice, so it would go directly to "low" for the next iteration.

@bnoordhuis, indeed a regex is a lot faster than iterating through the string in JS. The seemingly long regex doesn't actually have much adverse performance effect if the last ([^\uDC00-\uDFFF]|$) is turned into a negative lookahead. PTAL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, that is just the general idea typed in the comment box...though I think this can be solved by adding a nested branch for high surrogate and decrease the pointer?

(Just to be clear, I don't mean to block this, just think avoiding hex inside the code could make it easier to understand so threw my 2cents. It's kinda weird to do this kind of review inside comments :P)

@@ -48,8 +48,7 @@ TwoByteValue::TwoByteValue(Isolate* isolate, Local<Value> value) {
const size_t storage = string->Length() + 1;
AllocateSufficientStorage(storage);

const int flags =
String::NO_NULL_TERMINATION | String::REPLACE_INVALID_UTF8;
const int flags = String::NO_NULL_TERMINATION;
Copy link
Member

@bnoordhuis bnoordhuis Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is not an acceptable change. You could turn the flags into an optional argument that defaults to what is now or make it a template trait.

EDIT: Objection withdrawn.

@@ -598,8 +598,7 @@ exports.WPT = {
try {
fn();
} catch (err) {
if (err instanceof Error)
err.message = `In ${desc}:\n ${err.message}`;
console.error(`In ${desc}:`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left-over debug code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was intentional, since it seems like after the Error is constructed changing the message doesn't change stack. (See 56f6b8dbad36a46f07668b2b35574dc367e6d146.)

Copy link
Member

@joyeecheung joyeecheung Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, to patch the error message we would have to extend the Error class, something like:

class WPTError extends Error {
  constructor(desc, err) {
    super(`In ${desc}, ${err.name}: ${err.message}`);
    Error.captureStackTrace(this, WPTError);
  }
}

throw new WPTError(desc, err);

But then console.error() works too, though we would see two error traces this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. it appears to:

> var m = new Error('test')
undefined
> m.message = 'foo'
'foo'
> m
Error: foo
    at repl:1:9
    at ContextifyScript.Script.runInThisContext (vm.js:23:33)
    at REPLServer.defaultEval (repl.js:340:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.onLine (repl.js:537:10)
    at emitOne (events.js:101:20)
    at REPLServer.emit (events.js:189:7)
    at REPLServer.Interface._onLine (readline.js:238:10)
    at REPLServer.Interface._line (readline.js:582:8)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, looks like it's the Error.captureStackTrace call in the AssertionError constructor "freezes" the stack. To recapture the stack, we need to:

diff --git a/test/common.js b/test/common.js
index 5f7dc25..f74d5f0 100644
--- a/test/common.js
+++ b/test/common.js
@@ -598,8 +598,10 @@ exports.WPT = {
     try {
       fn();
     } catch (err) {
-      if (err instanceof Error)
+      if (err instanceof Error) {
         err.message = `In ${desc}:\n  ${err.message}`;
+        Error.captureStackTrace(err);
+      }
       throw err;
     }
   },

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joyeecheung, calling Error.captureStackTrace() there will only capture the stack of WPT.test, not the actual stack from err.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, this probably deserves another PR to sort out. I am fine with a console.error work around.

@jasnell
Copy link
Member

jasnell commented Feb 17, 2017

I'm wondering if a more efficient solution would be to make a USVStringValue alternative to UTF8Value (see https://github.com/nodejs/node/blob/master/src/node_url.cc#L1321). When the input is provided to the Parse() function within node_url.cc, it is currently interpreted as a UTF-8 string using UTF8Value. This ends up writing the string bytes out which means if we do the toUSVString() and then pass it off to Parse(), we'll be doing to write twice. By doing the conversion at that point, we can avoid the extra trip across the js/c++ boundary, we cover all of the setters, and there would be no need to export the toUSVString() function.

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 17, 2017

I'm wondering if a more efficient solution would be to make a USVStringValue alternative to UTF8Value (see https://github.com/nodejs/node/blob/master/src/node_url.cc#L1321).

Certainly. I'll see what I can do for this.

there would be no need to export the toUSVString() function.

I think it'll still be needed for the URLSearchParams interface, which is implemented entirely in JS.

@jasnell
Copy link
Member

jasnell commented Feb 17, 2017

I think it'll still be needed for the URLSearchParams interface

True. That said, I'm working on a C/C++ querystring parser implementation and evaluating the performance. It will likely make the most sense to keep that impl in JS land but we'll see what kind of numbers I can get.

@TimothyGu
Copy link
Member Author

@jasnell, after investigating, I realized that a TwoByteValue-based class that USVString is not helpful, at least in context of this PR. The current C++ Parse() method used by the setters already uses Utf8Value, which does the same thing as ToUSVString(), so nothing to change there. On the other hand, certain setters operate on the stringified parameter in JS before calling the binding's Parse(), so an efficient ToUSVString() method is needed for those methods anyway.

While the double conversion cannot be gotten rid of in the setters, the domainTo*() JS methods are a different case. I've simply gotten rid of the JS wrappers to handle String conversion fully in C++.

@TimothyGu TimothyGu force-pushed the url-usvstring branch 2 times, most recently from 0098309 to 032f311 Compare February 20, 2017 06:23
@TimothyGu
Copy link
Member Author

Tried to address comments as far as possible. Changes:

PTAL.

CI: https://ci.nodejs.org/job/node-test-pull-request/6504/

src/node_url.cc Outdated
value[i] = UNICODE_REPLACEMENT_CHARACTER;
} else {
uint16_t d = value[i + 1];
if (U16_IS_TRAIL(d)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there two flavors of macros? If I understand correctly, U16_IS_SURROGATE_TRAIL means "we already know it is a surrogate, now we are testing if it is the trail surrogate", and U16_IS_TRAIL means "we are testing if it is the trail surrogate". If so, the first one can just be replaced by the second one?

Copy link
Member Author

@TimothyGu TimothyGu Feb 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably micro-optimization, but U16_IS_SURROGATE_LEAD/TRAIL can be reduced to two instructions (on x86 at least, and + j(n)z), while U16_IS_LEAD/TRAIL each requires three (and + cmp + je). In most cases, the macros come for free anyway from ICU4C, which is why I took the liberty to use them.

throw new TypeError('Cannot convert a Symbol value to a string');
const str = String(val);
// As of V8 5.5, str.search (and usvRe[@@search]) are slower than usvRe.exec.
const match = usvRe.exec(str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just .test because we don't need the match to do anything further

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, missed the match.index part below, never mind.

@TimothyGu TimothyGu dismissed bnoordhuis’s stale review February 22, 2017 00:39

Review of an older version

@TimothyGu
Copy link
Member Author

function toUSVString(val) {
if (typeof val === 'symbol')
throw new TypeError('Cannot convert a Symbol value to a string');
const str = String(val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just val = '' + val if you want to coerce, that way you don't have to duplicate the 'is symbol?' check.

src/node_url.cc Outdated
#define U16_IS_SURROGATE(c) (((c) & 0xF800) == 0xD800)
#define U16_IS_SURROGATE_LEAD(c) (((c) & 0x400) == 0)
#define U16_IS_SURROGATE_TRAIL(c) (((c) & 0x400) != 0)
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd turn these into inline functions. It adds an extra bit of type checking and avoids possible divergence between intl and non-intl builds.

src/node_url.cc Outdated
Local<String> str;
if (!args[0]->ToString(env->context()).ToLocal(&str))
return;
Utf8Value value(env->isolate(), str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an unrelated change...

src/node_url.cc Outdated
Local<String> str;
if (!args[0]->ToString(env->context()).ToLocal(&str))
return;
Utf8Value value(env->isolate(), str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...as does this.

@TimothyGu
Copy link
Member Author

@bnoordhuis, comments addressed. PTAL.

CI: https://ci.nodejs.org/job/node-test-pull-request/6561/

@TimothyGu
Copy link
Member Author

@bnoordhuis, ping?

@@ -23,6 +23,17 @@ const IteratorPrototype = Object.getPrototypeOf(
Object.getPrototypeOf([][Symbol.iterator]())
);

const usvRe =
/([^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a comment here explaining this regexp is testing if the string contains unpaired surrogates, or rename it to something self-explanatory..(for a name like usvRe I would expect it is testing all the characters are in the range of unicode scalar values)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Changed it to unpairedSurrogateRe.

@@ -1351,6 +1368,41 @@ namespace url {
v8::NewStringType::kNormal).ToLocalChecked());
}

static void ToUSVString(const FunctionCallbackInfo<Value>& args) {
Environment* env = Environment::GetCurrent(args);
CHECK_GE(args.Length(), 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK_EQ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the existing functions use CHECK_GE, and I don't see a reason to be stricter than what this function actually uses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah there are functions in other files doing EQ...not sure if we have a convention or not, just think GE is implying there could be more args, which doesn't seem to be the case for this one, though I don't feel very strongly about this.

const size_t n = value.length();

const int64_t start = args[1]->IntegerValue(env->context()).FromJust();
CHECK_GE(start, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe another CHECK_LT(start, n)? Doesn't do any harm even if start is larger though.

Copy link
Member Author

@TimothyGu TimothyGu Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only checking start >= 0 because I'm converting start to a size_t, which is an unsigned type. In C++, signed-to-unsigned conversion, though a defined operation, acts oddly when the signed value is negative. On the other hand, n >= start is a more benign case, and I don't think requires the full effects of a runtime assertion (i.e. crashing).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yep, not worth a full on abort ;)

@@ -25,6 +25,9 @@ const expected = ['toString',
'username', 'password', 'host', 'hostname', 'port',
'pathname', 'search', 'searchParams', 'hash', 'toJSON'];

const obj = { toString() { throw new Error('toString'); } };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests can be in test-whatwg-url-setters.js? This file is testing the property attributes are defined properly, the added tests are testing the setters are implemented properly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test-whatwg-url-setters is a WPT-derived file, and therefore doesn't pertain to Node.js-specific things like error messages, for which these tests check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have already added non-WPT tests to that file? Most error message validation tests are placed this way(after WPT) though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I overlooked those other tests that have Node.js-specific ones following WPT. Should now be addressed in 786aead7de3898012512915a10f4685a13f478d0

Copy link
Member

@bnoordhuis bnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a style nit.

@@ -206,8 +217,10 @@ function onParseHashComplete(flags, protocol, username, password,

class URL {
constructor(input, base) {
// toUSVString unneeded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you punctuate these comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extremely minor nit: s/unneeded/is not needed. here and throughout

@@ -206,8 +217,10 @@ function onParseHashComplete(flags, protocol, username, password,

class URL {
constructor(input, base) {
// toUSVString unneeded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extremely minor nit: s/unneeded/is not needed. here and throughout

Changing err.message after the construction of Error doesn't seem to
change err.stack.
String::REPLACE_INVALID_UTF8 is only applied in V8's
String::WriteUtf8() (i.e. Utf8Value).
This commit implements the Web IDL USVString conversion, which mandates
all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT
CHARACTER. It also disallows Symbols to be used as USVString per spec.

Certain functions call into C++ methods in the binding that use the
Utf8Value class to access string arguments. Utf8Value already does the
normalization using V8's String::Write, so in those cases, instead of
doing the full USVString normalization, only a symbol check is done
(`'' + val`, which uses ES's ToString, versus `String()` which has
special provisions for symbols).
@TimothyGu
Copy link
Member Author

@bnoordhuis, @jasnell, comments addressed. Will land tomorrow if nothing comes up.

CI: https://ci.nodejs.org/job/node-test-commit/8160/

@TimothyGu
Copy link
Member Author

Landed in 7ceea2a...6123ed5.

@TimothyGu TimothyGu closed this Mar 1, 2017
@TimothyGu TimothyGu deleted the url-usvstring branch March 1, 2017 02:40
TimothyGu added a commit that referenced this pull request Mar 1, 2017
Changing err.message after the construction of Error doesn't seem to
change err.stack.

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
String::REPLACE_INVALID_UTF8 is only applied in V8's
String::WriteUtf8() (i.e. Utf8Value).

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
This commit implements the Web IDL USVString conversion, which mandates
all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT
CHARACTER. It also disallows Symbols to be used as USVString per spec.

Certain functions call into C++ methods in the binding that use the
Utf8Value class to access string arguments. Utf8Value already does the
normalization using V8's String::Write, so in those cases, instead of
doing the full USVString normalization, only a symbol check is done
(`'' + val`, which uses ES's ToString, versus `String()` which has
special provisions for symbols).

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
@evanlucas
Copy link
Contributor

This is not landing cleanly on v7.x-staging. Want to submit a backport PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants