-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer: implement WHATWG Encoding Standard API #13644
Conversation
What's special about iso-8859-16? |
It's not implemented by ICU at all, even with full-icu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some preliminary comments. I'll need some time to digest the intricacies of BOM and the various flags related to it.
lib/internal/encoding.js
Outdated
|
||
function normalizeEncoding(encoding) { | ||
if (encoding === undefined) | ||
return 'utf-8'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't have to handle defaults -- nor should it. The two places that use this function both ensure encoding !== undefined
. I also somewhat prefer the signature function getEncodingFromLabel(label)
to align better with the spec.
lib/internal/encoding.js
Outdated
return 'utf-8'; | ||
let enc = encodings.get(encoding); | ||
if (enc !== undefined) return enc; | ||
enc = encodings.get(encoding.toLowerCase()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASCII whitespace (and ASCII whitespace only, not Unicode) needs to be trimmed as well per step 1 of the spec.
lib/internal/encoding.js
Outdated
let enc = encodings.get(encoding); | ||
if (enc !== undefined) return enc; | ||
enc = encodings.get(encoding.toLowerCase()); | ||
if (enc !== undefined) return enc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just return enc
w/o if
should be fine and simpler.
src/node_i18n.cc
Outdated
|
||
UErrorCode status = U_ZERO_ERROR; | ||
UConverter* conv = ucnv_open(*label, &status); | ||
args.GetReturnValue().Set(U_SUCCESS(status) == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just !!U_SUCCESS(status)
? The == 1
looks kinda fragile.
lib/internal/encoding.js
Outdated
if (handle === undefined) | ||
throw new errors.Error('ERR_ENCODING_NOT_SUPPORTED', encoding); | ||
|
||
Object.defineProperties(this, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Object.defineProperties
is pretty slow to be used on per object construction. I don't see any harm in making kHandle
configurable, enumerable, or writable, as it's a symbol property already (and it's what we do for URL). encoding
, fatal
, and ignoreBOM
should all be defined on the prototype as getter functions per spec.
src/node_i18n.cc
Outdated
ConverterObject* converter; | ||
ASSIGN_OR_RETURN_UNWRAP(&converter, args[0].As<Object>()); | ||
SPREAD_BUFFER_ARG(args[1], input_obj); | ||
int flags = args[2]->Uint32Value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the non-deprecated Uint32Value(Local<Context>)
signature of the function.
} | ||
|
||
Local<ObjectTemplate> t = ObjectTemplate::New(env->isolate()); | ||
t->SetInternalFieldCount(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independent of this PR, we should find a way to cache this ObjectTemplate.
Potentially dumb question: why not use ObjectWrap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping this question...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For no particular reason other than ObjectWrap
really isn't used anywhere else in core. I've actually been contemplating whether or not it is something we could eliminate at some point.
src/node_i18n.cc
Outdated
NULL, flush, &status); | ||
|
||
if (U_SUCCESS(status)) { | ||
result.SetLength(target - &result[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SetLength
, too, takes number of entries as argument, not number of bytes.
tools/icu/icu-generic.gyp
Outdated
@@ -35,7 +35,7 @@ | |||
'defines': [ | |||
# ICU cannot swap the initial data without this. | |||
# http://bugs.icu-project.org/trac/ticket/11046 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the comment still apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Was planning to remove the whole section.
doc/api/buffer.md
Outdated
Per the [WHATWG Encoding Standard][], the encodings supported by the | ||
`TextDecoder` API are outlined in the table below. For each encoding, | ||
one or more aliases may be used. Support for some encodings is enabled | ||
only when Node.js is using the full ICU data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it documented anywhere how to get full ICU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/nodejs/node/blob/master/BUILDING.md#intl-ecma-402-support
+
https://github.com/nodejs/node/wiki/Intl (there is a link to it in the first fragment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be better documented.
lib/internal/encoding.js
Outdated
|
||
encode(input) { | ||
const buf = lazyBuffer().from(String(input)); | ||
return new Uint8Array(buf.buffer, buf.offset, buf.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per spec, the returned Uint8Array must be unpooled and have an entire block of ArrayBuffer dedicated to it.
doc/api/buffer.md
Outdated
`http.get()`, if the returned charset is one of those listed in the WHATWG spec | ||
it's possible that the server actually returned win-1252-encoded data, and | ||
using `'latin1'` encoding may incorrectly decode the characters. | ||
*Note*: Today's browsers follow the [WHATWG Encoding Standard] which aliases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[WHATWG Encoding Standard][]
(+ second pair of brackets) for consistency?
doc/api/buffer.md
Outdated
const decoder = new TextDecoder('shift_jis'); | ||
let string = ''; | ||
let buffer; | ||
while(buffer = getNextChunkSomehow()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while(
-> while (
?
doc/api/buffer.md
Outdated
let string = ''; | ||
let buffer; | ||
while(buffer = getNextChunkSomehow()) { | ||
string += decoder.decode(buffer, { stream:true }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stream:true
-> stream: true
?
doc/api/buffer.md
Outdated
`false`. | ||
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte | ||
order mark in the decoded result. This option is only used when `encoding` | ||
is `'utf-8'`, `'utf-16be'` or `'utf-16le'`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note about defaults (false
= strip the BOM)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(
false
= strip the BOM)
That's correct.
var dec = new TextDecoder('utf-16be', { ignoreBOM: true });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
// ["feff", "20"]
var dec = new TextDecoder('utf-16be', { ignoreBOM: false });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
// ["20"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean maybe it is worth to add a note about the default value.
|
||
const common = require('../common'); | ||
const assert = require('assert'); | ||
const { Buffer, TextDecoder, TextEncoder } = require('buffer'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to require Buffer
if we state in the doc:
The
Buffer
class is a global within Node.js, making it unlikely that one would need to ever userequire('buffer').Buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in a lint rule we use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyGu Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we use it only in the lib/.eslintrc.yaml and not in the test/.eslintrc.yaml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing here: tests are supposed to be self-contained scripts, while everything in lib
is user-visible and -contaminatable, like a seemingly innocent var Buffer = {}
line in REPL. See nodejs/node-convergence-archive#21.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyGu Thank you.
Updated |
doc/api/buffer.md
Outdated
one or more aliases may be used. Support for some encodings is enabled | ||
only when Node.js is using the full ICU data. | ||
|
||
<table> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If https://help.github.com/articles/organizing-information-with-tables/ works as advertised, github markdown tables would be easier to read in-source and to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand (and I do not see anything in the referenced link) .. markdown tables do not support rowspan.
doc/api/buffer.md
Outdated
* `options` {object} | ||
* `fatal` {boolean} `true` if decoding failures are fatal. Defaults to | ||
`false`. | ||
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if withBOM
(EDIT: or leaveBOM
) would make more sense for this, if its not a fixed name due to the spec? I thought the docs might have had a typo and reversed the sense, because if the BOM is ignored it wouldn't be in the output, then I thought about it, and realized the sense of ignore is "don't remove from input stream if present". I'm not sure what would happen if the BOM is not in the input stream, though, would it get added to the decoded result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignoreBOM
is fixed within the spec. To change this would require a change in the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I know this is all experimental, but FYI, its not clear to me from current docs whether a BOM is added if not present, or just passed through ("ignored") if present.
lib/internal/encoding.js
Outdated
throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'input', | ||
['ArrayBuffer', 'ArrayBufferView']); | ||
} | ||
if (options === null || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Web IDL spec treats null
the same way as undefined
. See https://heycam.github.io/webidl/#es-dictionary, step 4.1.2.
lib/internal/encoding.js
Outdated
throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'options', 'object'); | ||
} | ||
|
||
var flags = (options.stream === true) ? 0 : CONVERTER_FLAGS_FLUSH; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDL recognizes all truthy values as true
, so just options.stream ?
.
lib/internal/encoding.js
Outdated
class TextDecoder { | ||
constructor(encoding = 'utf-8', options = {}) { | ||
if (typeof encoding !== 'string') | ||
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'encoding', 'string'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should coerce encoding
to string rather than check its type, per IDL rules. See here for how it's done in URL.
lib/internal/encoding.js
Outdated
constructor(encoding = 'utf-8', options = {}) { | ||
if (typeof encoding !== 'string') | ||
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'encoding', 'string'); | ||
if (options !== undefined && typeof options !== 'object') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too: null
should be treated just like undefined
. Also, options
cannot be undefined
here due to the default parameter above.
lib/internal/encoding.js
Outdated
} | ||
|
||
decode(input = empty, options = {}) { | ||
if (isAnyArrayBuffer(input)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SharedArrayBuffer is not supported at this moment in Web platform APIs that do not have [AllowShared]
extended attribute. See https://heycam.github.io/webidl/#es-buffer-source-types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, understood. Unfortunately, the separate process.binding('util').isArrayBuffer()
and process.binding('util').isSharedArrayBuffer()
method no longer exist. It looks like those were condensed recently into a single IsAnyArrayBuffer()
. I think this is one we can safely ignore for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems bad to ignore, as becoming spec-complaint later will be a backward-incompatible breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed. I've added the isArrayBuffer()
util method back in to process.binding('util')
src/node_i18n.cc
Outdated
static void Decode(const FunctionCallbackInfo<Value>& args) { | ||
Environment* env = Environment::GetCurrent(args); | ||
|
||
CHECK_GE(args.Length(), 3); // Converter, ArrayBuffer, Flags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ArrayBuffer/Buffer/
src/node_i18n.cc
Outdated
|
||
CHECK_GE(args.Length(), 2); | ||
Utf8Value label(env->isolate(), args[0]); | ||
int flags = args[1]->Uint32Value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the Local<Context>
version as well.
src/node_i18n.cc
Outdated
const char* source = input_obj_data; | ||
size_t source_length = input_obj_length; | ||
|
||
if (converter->unicode_ && !converter->bomSeen_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put && !converter->ignoreBOM_
here as well?
src/node_i18n.cc
Outdated
} | ||
source += bomOffset; | ||
source_length -= bomOffset; | ||
converter->bomSeen_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag should only be set when there actually is a BOM, i.e. bomOffset != 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite ...
In: https://encoding.spec.whatwg.org/#interface-textdecoder
* If encoding is UTF-8, UTF-16BE, or UTF-16LE, and ignore BOM flag and BOM seen flag are unset, then:
* If token is U+FEFF, then set BOM seen flag.
* Otherwise, if token is not end-of-stream, then set BOM seen flag and append token to output.
* Otherwise, return output.
UChar* target = *result; | ||
ucnv_toUnicode(converter->conv, | ||
&target, target + (limit * sizeof(UChar)), | ||
&source, source + source_length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ICU docs say:
For most Unicode charsets it is also possible to ignore the indicated number of initial stream bytes and start converting after them. However, there are stateful Unicode charsets (UTF-7 and BOCU-1) for which this will not work. Therefore, it is best to ignore the first output UChar instead of the input signature bytes.
AFAICT source
is incremented by the signature bytes, so we are doing what the ICU docs advise not to do. Are the negative consequences applicable to us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not believe so. TextDecoder
does not support UTF-7 nor BOCU-1. We are only skipping the signature bytes for UTF-8, UTF-16le, and UTF-16be, none of which have this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, in that case LGTM.
lib/internal/encoding.js
Outdated
return 'utf-8'; | ||
} | ||
|
||
encode(input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input
is an optional parameter too. See https://encoding.spec.whatwg.org/#interface-textencoder
lib/internal/encoding.js
Outdated
} | ||
|
||
encode(input) { | ||
return new Uint8Array(lazyBuffer().from(String(input))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to use template literal syntax to prevent symbols from getting converted to strings. See
Line 56 in 448c4c6
const str = `${val}`; |
lib/internal/encoding.js
Outdated
} | ||
|
||
class TextEncoder { | ||
constructor() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be omitted?
1e8402f
to
97f8476
Compare
Updated :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking much better, thanks a lot!
doc/api/buffer.md
Outdated
|
||
### textDecoder.decode([input[, options]]) | ||
|
||
* `input` {ArrayBuffer|TypedArray} An `ArrayBuffer` or `%TypedArray%` instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also DataView
. The %TypedArray%
notation is generally reserved to specs, so do you think it is appropriate to use it here?
doc/api/buffer.md
Outdated
|
||
* `input` {ArrayBuffer|TypedArray} An `ArrayBuffer` or `%TypedArray%` instance | ||
containing the encoded data. | ||
* `options` {object} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Object
doc/api/buffer.md
Outdated
```js | ||
const { TextEncoder } = require('buffer'); | ||
const encoder = new TextEncoder(); | ||
const int8array = encoder.encode('this is some data'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint8array
?
lib/internal/encoding.js
Outdated
|
||
class TextDecoder { | ||
constructor(encoding = 'utf-8', options = {}) { | ||
encoding = String(encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Template string literal is needed here as well (to disallow symbols).
lib/internal/encoding.js
Outdated
|
||
[inspect](depth, opts) { | ||
if (this == null || this[kEncoding] === undefined) { | ||
throw new errors.TypeError('ERR_INVALID_THIS', 'TextDecoder'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should technically be used for the user-visible getters and the methods as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, for now I'd rather leave it just here tho as it's not a check we commonly do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend adding appropriate type-checking here if possible; not a big deal, but seems unnecessary to diverge from the spec and from URL in that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@domenic ... just want to make sure I'm clear on what kind of checking you're recommending? instanceof
or is the more efficient hidden-symbol check ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hidden symbol check for sure. Although ideally one that distinguishes encoder vs. decoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping this.
lib/internal/encoding.js
Outdated
// This is not perfect, but there's really nothing else to key off since | ||
// there are no internal properties specific to TextEncoder | ||
if (this == null || this[Symbol.toStringTag] !== 'TextEncoder') { | ||
throw new errors.TypeError('ERR_INVALID_THIS', 'TextEncoder'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. encoding
and encode
should have this check too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being, I'd prefer not.
lib/internal/encoding.js
Outdated
|
||
[inspect](depth, opts) { | ||
// This is not perfect, but there's really nothing else to key off since | ||
// there are no internal properties specific to TextEncoder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:/ In this case I'd rather add a dummy property through a constructor, just so that subclasses that extend TextEncoder but override Symbol.toStringTag
can work.
} | ||
|
||
Local<ObjectTemplate> t = ObjectTemplate::New(env->isolate()); | ||
t->SetInternalFieldCount(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping this question...
var s = 0; | ||
var e = label.length; | ||
while (s < e && ( | ||
label[s] === '\u0009' || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a regex like /[\t\n\f\r ]/
be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in my experience.
A bug I noticed: https://encoding.spec.whatwg.org/#dom-textdecoder-decode step 1 unsets the BOM seen flag if do not flush flag resulted from a previous run is not set (i.e. this is either the first call to > var dec = new buffer.TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ 'feff', '20' ] The correct behavior seems to also be the one implemented by Chrome: > var dec = new TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more observation...
lib/internal/encoding.js
Outdated
// ['hz-gb-2312', 'replacement'], | ||
// ['iso-2022-cn', 'replacement'], | ||
// ['iso-2022-cn-ext', 'replacement'], | ||
// ['iso-2022-kr', 'replacement'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than mentioning these as "unsupported", we can in fact say they don't necessarily need to be supported, since TextDecoder
errors out on 'replacement'
.
lib/internal/encoding.js
Outdated
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'options', 'object'); | ||
|
||
const enc = getEncodingFromLabel(encoding); | ||
if (enc === undefined || enc === 'replacement') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make sure no entries in encodings
Map actually map to 'replacement'
as value, we don't have to explicitly check for it here.
@TimothyGu ... updated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as is, but some further observations are below. WPT integration can be worked on after this PR is merged.
I do have one more suggested edit: TimothyGu@0d0f7fc, which eliminates the need to copy from a Buffer
to an Uint8Array
.
lib/internal/encoding.js
Outdated
|
||
[inspect](depth, opts) { | ||
if (this == null || this[kEncoding] === undefined) { | ||
throw new errors.TypeError('ERR_INVALID_THIS', 'TextDecoder'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.
lib/internal/encoding.js
Outdated
} | ||
|
||
[inspect](depth, opts) { | ||
if (this == null || this[kEncoding] === undefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unfortunately will not distinguish TextEncoder and TextDecoder objects, as both have this property defined.
This doesn’t land cleanly on 8.x; if you can, please follow the guide and raise a backport PR. |
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (`TextDecoder` and `TextEncoder`). The is the same API implemented on the browser side. By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported. This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default. A process warning will be emitted on first use to indicate that the API is still experimental. No runtime flag is required to use the feature. Refs: https://encoding.spec.whatwg.org/ PR-URL: nodejs#13644 Reviewed-By: Timothy Gu <timothygu99@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (`TextDecoder` and `TextEncoder`). The is the same API implemented on the browser side. By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported. This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default. A process warning will be emitted on first use to indicate that the API is still experimental. No runtime flag is required to use the feature. Backport-PR-URL: #14585 Backport-Reviewed-By: Anna Henningsen <anna@addaleax.net> Refs: https://encoding.spec.whatwg.org/ PR-URL: #13644 Reviewed-By: Timothy Gu <timothygu99@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
'defines': [ | ||
# ICU cannot swap the initial data without this. | ||
# http://bugs.icu-project.org/trac/ticket/11046 | ||
'UCONFIG_NO_LEGACY_CONVERSION=1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, Coverity is none too happy about this being re-enabled.
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof Conflicts: src/node_version.h
PR-URL: nodejs#13916 Refs: nodejs#13644 (comment) Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
PR-URL: #13916 Refs: #13644 (comment) Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
Release team decided not to land on v6.x, if you disagree let us know. |
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (
TextDecoder
andTextEncoder
). The is the same API implemented on the browser side.By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported.
This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default.
The implementation is added without changing any of the existing encoding support in core.
Refs: https://encoding.spec.whatwg.org/
/cc @domenic @TimothyGu @addaleax
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)
buffer