Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document how to use StringDecoder in combination with stream.Transform #15369

Closed
mcollina opened this issue Sep 12, 2017 · 13 comments
Closed

Document how to use StringDecoder in combination with stream.Transform #15369

mcollina opened this issue Sep 12, 2017 · 13 comments
Labels
doc Issues and PRs related to the documentations. good first issue Issues that are suitable for first-time contributors. stream Issues and PRs related to the stream subsystem.

Comments

@mcollina
Copy link
Member

In #7315 and #7425, it was discussed how to add string decoding capabilities to Writable. However, that is very tricky to implement without a performance regression. At the bare minimum, we should document this inside the streams API docs.

See https://github.com/mcollina/split2/blob/master/index.js as an example.

@mcollina mcollina added doc Issues and PRs related to the documentations. good first issue Issues that are suitable for first-time contributors. stream Issues and PRs related to the stream subsystem. labels Sep 12, 2017
@chungngoops
Copy link
Contributor

I would like to work on this one.

@code4cake
Copy link

is this issue still open? I would like to tackle it , but I see @chungngoops, you might me already working on it?

@chungngoops
Copy link
Contributor

@dantesolis just go ahead, I'm working on it but it's ok.

@dicearr
Copy link
Contributor

dicearr commented Oct 19, 2017

@mcollina I am digging into this issue and I would love to know if some of my conclusions are right before proposing a PR.

  1. Major performance regressions comes due to multiple checks needed (type of chunk and encoding), rather than the decoding process.
  2. The decoding process is tricky because of incomplete multibyte characters. They are not handled by StringDecoder.
  3. Documentation should reflect previous conclusions and show an example of a custom Writable that decodes the Buffers when writing. It could be a new section inside implementing-a-writable-stream.

@chungngoops @dantesolis are you actively working on it?

@mcollina
Copy link
Member Author

  1. the major perf regression happens if we implement the decoding part within `Writable``
  2. incomplete multibyte characters are handled within StringDecoder, and they must be flushed down in _final.
  3. very likely yes ;).

@dicearr
Copy link
Contributor

dicearr commented Oct 20, 2017

the major perf regression happens if we implement the decoding part within Writable

AFAIK this is due to the increasing number of checks that should be done in decodeChunk. At least that is what I have been able to conclude watching #7425.

very likely yes ;).

Something like this?

class StringWritable extends Writable {
  constructor (options) {
    super(options)
    const state = this._writableState
    this._decoder = new StringDecoder(state.defaultEncoding)
    this._data = ''
  }
  _write (chunk, encoding, callback) {
    if (encoding === 'buffer') {
      chunk = this._decoder.write(chunk)
    }
    this._data += chunk
    callback()
  }
}

@mcollina
Copy link
Member Author

You are missing some data that might be left within the stringdecoder. Call this._decoder.end() to get it.

@jasnell
Copy link
Member

jasnell commented Oct 20, 2017

I'd be curious to see the performance comparison when using util.TextDecoder instead. It is making use of the ICU converter which has historically been much more efficient.

@addaleax
Copy link
Member

@jasnell util.TextDecoder is not a streaming decoder, unlike StringDecoder, and you really need to be careful when you’re comparing perf to take that into account :)

@jasnell
Copy link
Member

jasnell commented Oct 20, 2017

@addaleax ... sure it is :-)

const { TextDecoder } = require('util');

class StringWritable extends Writable {
  constructor (options) {
    super(options)
    const state = this._writableState
    this._decoder = new TextDecoder()
    this._data = ''
  }

  _write (chunk, encoding, callback) {
    if (encoding === 'buffer') {
      chunk = this._decoder.decode(chunk, { stream: true })
    }
    this._data += chunk
    callback()
  }
}

See the stream option documented here: https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode

@addaleax
Copy link
Member

@jasnell Oh, nice. In that case we might actually want to do away with StringDecoder at some point/merge the implementations?

@TimothyGu
Copy link
Member

@addaleax FYI currently StringDecoder is actually used as a fallback for TextDecoder when ICU is disabled:

class TextDecoder {
constructor(encoding = 'utf-8', options = {}) {
if (!warned) {
warned = true;
process.emitWarning(experimental, 'ExperimentalWarning');
}
encoding = `${encoding}`;
if (typeof options !== 'object')
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'options', 'object');
const enc = getEncodingFromLabel(encoding);
if (enc === undefined || !hasConverter(enc))
throw new errors.RangeError('ERR_ENCODING_NOT_SUPPORTED', encoding);
var flags = 0;
if (options !== null) {
if (options.fatal) {
throw new errors.TypeError('ERR_NO_ICU', '"fatal" option');
}
flags |= options.ignoreBOM ? CONVERTER_FLAGS_IGNORE_BOM : 0;
}
this[kDecoder] = true;
// StringDecoder will normalize WHATWG encoding to Node.js encoding.
this[kHandle] = new (lazyStringDecoder())(enc);
this[kFlags] = flags;
this[kEncoding] = enc;
this[kBOMSeen] = false;
}
decode(input = empty, options = {}) {
if (this == null || this[kDecoder] !== true)
throw new errors.TypeError('ERR_INVALID_THIS', 'TextDecoder');
if (isArrayBuffer(input)) {
input = lazyBuffer().from(input);
} else if (isArrayBufferView(input)) {
input = lazyBuffer().from(input.buffer, input.byteOffset,
input.byteLength);
} else {
throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'input',
['ArrayBuffer', 'ArrayBufferView']);
}
if (typeof options !== 'object') {
throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'options', 'object');
}
if (this[kFlags] & CONVERTER_FLAGS_FLUSH) {
this[kBOMSeen] = false;
}
if (options !== null && options.stream) {
this[kFlags] &= ~CONVERTER_FLAGS_FLUSH;
} else {
this[kFlags] |= CONVERTER_FLAGS_FLUSH;
}
if (!this[kBOMSeen] && !(this[kFlags] & CONVERTER_FLAGS_IGNORE_BOM)) {
if (this[kEncoding] === 'utf-8') {
if (input.length >= 3 &&
input[0] === 0xEF && input[1] === 0xBB && input[2] === 0xBF) {
input = input.slice(3);
}
} else if (this[kEncoding] === 'utf-16le') {
if (input.length >= 2 && input[0] === 0xFF && input[1] === 0xFE) {
input = input.slice(2);
}
}
this[kBOMSeen] = true;
}
if (this[kFlags] & CONVERTER_FLAGS_FLUSH) {
return this[kHandle].end(input);
}
return this[kHandle].write(input);
}
}

@jasnell
Copy link
Member

jasnell commented Oct 20, 2017

Can't do that entirely yet because TextDecoder does not support hex or base64

Qard pushed a commit to ayojs/ayo that referenced this issue Nov 2, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: nodejs/node#15369
PR-URL: nodejs/node#16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Qard pushed a commit to ayojs/ayo that referenced this issue Nov 2, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: nodejs/node#15369
PR-URL: nodejs/node#16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
cjihrig pushed a commit to cjihrig/node that referenced this issue Nov 6, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: nodejs#15369
PR-URL: nodejs#16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
gibfahn pushed a commit that referenced this issue Nov 14, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: #15369
PR-URL: #16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
MylesBorins pushed a commit that referenced this issue Nov 14, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: #15369
PR-URL: #16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
addaleax pushed a commit to ayojs/ayo that referenced this issue Dec 7, 2017
Improved stream documentation with an example of how to decode buffers
to strings within a custom Writable.

Fixes: nodejs/node#15369
PR-URL: nodejs/node#16403
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Issues and PRs related to the documentations. good first issue Issues that are suitable for first-time contributors. stream Issues and PRs related to the stream subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants