-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use TextDecoder API for decoding UTF-8 from binary data #184
Comments
"native" is
|
For short strings, |
Wow, an order of magnitude worse for arguably the most common use (short strings). I bet 95% of all strings in all protobufs ever created are less than 256 characters. |
To be clear, I haven't actually run into this bug myself. This came up as I was trying to improve the test coverage of the protobuf-ts/runtime package. There were no specific tests for the On a side note, I wasn't exactly sure how to properly test this in a way that would work for both test environments (node + browser). The protobuf.js tests rely on reading the two files from disk into a nodejs buffer and comparing the decoded results. But we can't really do the same thing as it would fail in the browser. The way I got it working at all was to follow these steps for each text file fixture: utf8.txt and surrogate_pair_bug.txt:
Then the actual test file looks like this: import { base64decode, utf8read } from '../src';
import { surrogatePairBug_b64, surrogatePairBug_text } from './support/surrogate_pair_bug';
import { utf8_b64, utf8_text } from './support/utf8';
const utf8data = base64decode(utf8_b64);
const surrogatePairData = base64decode(surrogatePairBug_b64);
describe('utf8read()', () => {
it('should decode an empty buffer to an empty string', () => {
expect(utf8read(new Uint8Array())).toBe('');
});
it('should decode utf8 properly', () => {
expect(utf8read(utf8data)).toBe(utf8_text);
});
it('should avoid the surrogate pair bug', () => {
expect(utf8read(surrogatePairData)).toBe(surrogatePairBug_text);
});
}); The first two tests pass :) so good news there. If you have any ideas on how this stuff could be tested in both environments w/o needing to duplicate all the data as base64 strings I'd love to hear it. |
According to nodejs/node#39879 the const textDecoder = new TextDecoder();
// Use Buffer#toString() if running inside nodejs because it's slightly faster
export const utf8read = typeof global === 'object' && typeof Buffer === 'function'
? (bytes: Uint8Array): string => Buffer.from(bytes).toString()
: (bytes: Uint8Array): string => textDecoder.decode(bytes) This code passes all the tests. I copied the benchmarking code from the linked issue and just added console.time('small Buffer toString')
for (let i = 0; i < 100000; i++) Buffer.from(smallUint8).toString();
console.timeEnd('small Buffer toString')
console.time('big Buffer toString')
for (let i = 0; i < 100000; i++) Buffer.from(bigUint8).toString()
console.timeEnd('big Buffer toString') These results on node 14 show it to be about 3x faster than using
|
The naive
From the Node.js docs:
Which means it should be possible to use For the surrogate pair bug, storing the data in a protobuf-ts/packages/runtime/spec/support/wire-data.ts Lines 83 to 84 in fbce12b
|
for (let fix of fixtures) {
console.log(`### ${fix.characters} characters`);
const encoded = nativeEncode(fix.text);
bench('@protobufjs/utf8 ', () => protobufJsDecode(encoded));
bench('Buffer#toString ', () => Buffer.from(encoded).toString());
bench('UInt8Array#toString ', () => encoded.toString());
bench('TextDecoder ', () => nativeDecode(encoded));
} Edit: So it's not longer an order of magnitude difference. Slightly better than half speed at 8 char and becomes faster at 64 chars. It would be interesting to see how Edit 2: Wow, the @protobufjs/utf8 package is actually using the old implementation, benching it again against the new one shows that the new one is even faster than the old one. These number are pretty unbelievable.
Full bench code: const protobufJsUtf8 = require("@protobufjs/utf8");
function utf8_read(buffer, start, end) {
if (end - start < 1) {
return "";
}
var str = "";
for (var i = start; i < end;) {
var t = buffer[i++];
if (t <= 0x7F) {
str += String.fromCharCode(t);
} else if (t >= 0xC0 && t < 0xE0) {
str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
} else if (t >= 0xE0 && t < 0xF0) {
str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
} else if (t >= 0xF0) {
var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
str += String.fromCharCode(0xD800 + (t2 >> 10));
str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
}
}
return str;
}
function bench(name, fn, durationSeconds = 2) {
let startTs = performance.now();
let endTs = startTs + durationSeconds * 1000;
let samples = 0;
while (performance.now() < endTs) {
fn();
samples++;
}
let durationMs = performance.now() - startTs;
let opsPerSecond = 1000 / (durationMs / samples);
console.log(`${name}: ${Math.round(opsPerSecond * 100) / 100} ops/s`);
}
let textDecoder = new TextDecoder();
let textEncoder = new TextEncoder();
function nativeEncode(text) {
return textEncoder.encode(text);
}
function nativeDecode(bytes) {
return textDecoder.decode(bytes, {stream: false})
}
function protobufJsDecode(bytes) {
return protobufJsUtf8.read(bytes, 0, bytes.length);
}
const fixtures = [];
for (let i = 3; i <= 14; i++) {
let j = Math.pow(2, i);
fixtures.push({
characters: j,
text: "hello 🌍".repeat(j / 8),
});
}
for (let fix of fixtures) {
console.log(`### ${fix.characters} characters`);
const encoded = nativeEncode(fix.text);
bench('NEW protobufjs utf8 ', () => utf8_read(encoded));
bench('@protobufjs/utf8 ', () => protobufJsDecode(encoded));
bench('Buffer#toString ', () => Buffer.from(encoded).toString());
bench('TextDecoder#decode ', () => nativeDecode(encoded));
} |
While investigating this I also discovered that the original implementation actually belongs to Google's closure library licensed under Apache-2.0 and was just copied over into protobuf.js. The tests available for the Google's closure library pass using protobuf.js's new implementation. What I'm really looking for is some comprehensive test of the utf8 decoding that would fail when using the non-native code, but succeed when using If we cannot find such a test to prove that the new protobuf.js implementation is wrong I propose that we simply adopt this newer (and somehow much faster) implementation. Edit: Perhaps the reason that shortcuts are allowed to be made in the non-native code is that it only has to be responsible for decoding complete utf8 strings and thus it doesn't need to concern itself with handling partial ones that might end in the middle of a surrogate pair which would then require replacement characters. |
I've found an example that decodes differently for the native decoder vs. both non-native versions: // "Overlong" sequence should never be generated by a valid UTF-8 encoder
const bytes = new Uint8Array([0xc1, 0xbf]); // [ 193, 191 ]
const native = new TextDecoder();
const encoder = new TextEncoder();
const native_decoded = native.decode(bytes); // "��" - "replacement" characters U+FFFD
const old_pbjs_decode = old_pbjs_read(bytes); // "\u007f"
const new_pbjs_decode = new_pbjs_read(bytes); // "\u007f"
const native_round_trip = encoder.encode(native_decoded); // [ 239, 191, 189, 239, 191, 189 ]
const old_pbjs_round_trip = encoder.encode(old_pbjs_decode); // [ 127 ]
const new_pbjs_round_trip = encoder.encode(new_pbjs_decode); // [ 127 ] Based on my reading of https://www.unicode.org/faq/private_use.html#nonchar9 the answer to the question of "Which behavior is correct?" seem to be "It depends." But neither native nor non-native versions preserve the input bytes during a round trip. Edit: I also verified that the official google protobuf javascript implementation produces the same results as both pbjs versions. |
Does Node.js respect the In Chrome: new TextDecoder("utf-8", {fatal: true}).decode(new Uint8Array([0xc1, 0xbf]));
Uncaught TypeError: Failed to execute 'decode' on 'TextDecoder': The encoded data was not valid. |
I wonder if this could be a compiler option like in nanopb:
But we'd need to replace "small [...] penalty" with "large [...] penalty" :) Perhaps a compile option like:
|
So it turns out that the benchmark code I posted above had a giant bug in it which basically made the benchmark for the new protobufjs utf8 decoder a no-op (no wonder it was so fast). This line bench('NEW protobufjs utf8 ', () => utf8_read(encoded)); The problem is that Turns out that the new function is basically the same as the old version when properly benchmarked. I've also run the same benchmark code in Firefox 95.0.2, Safari 15.0.0, and Chrome 96.0.4664.110. In Firefox and Safari the native I think it may be worth just using Chrome
Firefox
Safari
Here is the new benchmark code you can run yourself: function utf8_read(buffer, start, end) {
if (end - start < 1) {
return "";
}
var str = "";
for (var i = start; i < end;) {
var t = buffer[i++];
if (t <= 0x7F) {
str += String.fromCharCode(t);
} else if (t >= 0xC0 && t < 0xE0) {
str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
} else if (t >= 0xE0 && t < 0xF0) {
str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
} else if (t >= 0xF0) {
var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
str += String.fromCharCode(0xD800 + (t2 >> 10));
str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
}
}
return str;
}
function bench(name, fn, durationSeconds = 2) {
let startTs = performance.now();
let endTs = startTs + durationSeconds * 1000;
let samples = 0;
while (performance.now() < endTs) {
fn();
samples++;
}
let durationMs = performance.now() - startTs;
let opsPerSecond = 1000 / (durationMs / samples);
console.log(`${name}: ${Math.round(opsPerSecond * 100) / 100} ops/s`);
}
let textDecoder = new TextDecoder();
let textEncoder = new TextEncoder();
function nativeEncode(text) {
return textEncoder.encode(text);
}
function nativeDecode(bytes) {
return textDecoder.decode(bytes);
}
function newJsDecode(bytes) {
return utf8_read(bytes, 0, bytes.length);
}
const fixtures = [];
for (let i = 3; i <= 14; i++) {
let j = Math.pow(2, i);
fixtures.push({
characters: j,
text: "hello 🌍".repeat(j / 8),
});
}
for (let fix of fixtures) {
console.log(`### ${fix.characters} characters`);
const encoded = nativeEncode(fix.text);
bench('NEW protobufjs utf8 ', () => newJsDecode(encoded));
bench('TextDecoder#decode ', () => nativeDecode(encoded));
} |
Thank you for the benchmarks, @jcready! Taking a step back: 1. invalid UTF-8 Going with
2. Performance We let users pass a I think we probably want to be a bit more lax with the typings to make it easier to use, and make defaults accessible at some point in the future. |
Your link points to the |
e1cd360 switches to the TextDecoder API. I'll close this issue when it is released. Updated the manual: JavaScript uses UTF-16 for strings, but protobuf uses UTF-8. In order Note that the protobuf language guide states:
If an invalid UTF-8 string is encoded in the binary format, protobuf-ts As of January 2022, performance of TextDecoder on Node.js falls behind const nodeBinaryReadOptions = {
readerFactory: (bytes: Uint8Array) => new BinaryReader(bytes, {
decode(input?: Uint8Array): string {
return input ? (input as Buffer).toString("utf8") : "";
}
})
};
MyMessage.fromBinary(bytes, nodeBinaryReadOptions); |
Released as v2.2.0-alpha.0 in the |
Just notice this issue was linked to my issue on the nodejs repo. I always prefer native feature that isn't tied to using internal NodeJS modules as it will make it more cross comp friendly and work in more environments |
Thanks for the shout, @jimmywarting. Agree with your stance. Node.js is picking up the Web Streams API and the fetch API too. |
Just an update here. It looks like as of node v18.13 the performance of
|
TextEncoder/Decoder are not supported in react native. Is there any workaround or does protobuf-ts have any flag to use Buffer.toString? |
You can use a polyfill like https://github.com/samthor/fast-text-encoding or you can specify your own import {BinaryReader, BinaryWriter} from "@protobuf-ts/runtime";
export const binaryReadOptions = {
readerFactory: (bytes: Uint8Array) => new BinaryReader(bytes, {
decode(input?: Uint8Array): string {
return input ? (input as Buffer).toString("utf8") : "";
}
})
};
export const binaryWriteOptions = {
writerFactory: () => new BinaryWriter({
encode(input?: string): Uint8Array {
return Buffer.from(input || "", "utf8");
}
})
};
// elsewhere
object = MyMessage.fromBinary(bytes, binaryReadOptions);
bytes = MyMessage.toBinary(object, binaryWriteOptions); |
@jcready Sorry for the late response. I thought I wrote to you before my summer vacations :(
Will try anothe polyfill. |
I added fatal mode to the polyfill. |
MDN comp table shows that Text encoder/decoder are pretty wildly available. is a polyfill really necessary? |
@jimmywarting React Native. |
React Native might be considered more nitched compared to web and backend technologies, and its user base might not be as extensive. when i install/use protobuf then i don't need a polyfill. React Native developers could including a TextEncoder/TextDecoder polyfill while also installing protobuf. I personally prefer not to add unnecessary dependencies that I don't actually require. |
@jimmywarting please see the whole last 4-5 messages why a polyfill in RN is needed. |
Identical bug to protobufjs/protobuf.js#1473 as protobuf-ts is using the same (old) implementation. Look like it may have been fixed in protobufjs/protobuf.js#1486, but this comment claims otherwise: protobufjs/protobuf.js#1473 (comment)
I'm not sure what the perf hit would be for just going with TextDecoder, but it may be worth investigating.
I did test out the current implementation against the test fixture: https://github.com/protobufjs/protobuf.js/blob/master/lib/utf8/tests/data/surrogate_pair_bug.txt and it did seem to suffer from the same issue.
The text was updated successfully, but these errors were encountered: