src: allow simdutf::convert_* functions to return zero #47471

lemire · 2023-04-07T19:57:57Z

When transcoding with the simdutf library, you first scan the input to determine the size of the output (e.g., you scan the UTF-8 input to determine the size of the UTF-16 output). In a second step, you call a transcoding function. This transcoding function normally returns how many words were written. This number of words should match the size of the output computed during the first scan.

So you get three-line routines like the following (scan, allocate, transcode):

size_t expected_utf16_length =
        simdutf::utf16_length_from_utf8(string.data(), string.length());
MaybeStackBuffer<char16_t> buffer(expected_utf16_length);
size_t utf16_length = simdutf::convert_utf8_to_utf16(
        string.data(), string.length(), buffer.out());

The scan that determines the size of the output does not validate the Unicode input; validation occurs during transcoding. For performance reasons, the scan only tells you how much memory you need to allocate, and it counts on the transcoding step to do the validation.
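To illustrate why the scan cannot detect bad input, here is a minimal scalar sketch of such a length computation. The function name utf16_length_from_utf8_sketch is hypothetical and this is not simdutf's implementation; it only shows the idea of counting from the top bits of each byte without checking that the bytes form well-formed sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical scalar stand-in for the length scan (not simdutf's code).
// Every byte that is not a continuation byte starts one code point and
// needs one UTF-16 unit; a 4-byte lead (0xF0 and above) needs a second
// unit for the surrogate pair. Nothing here checks well-formedness.
std::size_t utf16_length_from_utf8_sketch(std::string_view in) {
  std::size_t units = 0;
  for (unsigned char byte : in) {
    if ((byte & 0xC0) != 0x80) ++units;  // not a continuation byte
    if (byte >= 0xF0) ++units;           // 4-byte sequence: surrogate pair
  }
  return units;
}

// Usage: valid and invalid inputs both produce a length.
void scan_does_not_validate_demo() {
  assert(utf16_length_from_utf8_sketch("h\xC3\xA9llo") == 5);      // "héllo"
  assert(utf16_length_from_utf8_sketch("\xF0\x9F\x98\x80") == 2);  // U+1F600
  assert(utf16_length_from_utf8_sketch("\xFF") == 2);  // invalid, still counted
}
```

The last assertion is the point: 0xFF can never appear in UTF-8, yet the scan reports a size for it, so only the transcoding step can reveal the error.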

When the transcoding fails, the simdutf::convert_utf8_to_utf16 and simdutf::convert_utf16_to_utf8 functions return zero by convention, indicating an error. So you either have a successful transcoding (from valid Unicode to valid Unicode) in which case the transcoding function returns the number of written words, which matches exactly the expected number of output words, or you get zero, indicating that the input is invalid Unicode.

Currently, the simdutf library is used within src/inspector/node_string.cc with checks such as CHECK_EQ(expected_utf16_length, utf16_length);. In effect, these checks are true if and only if the inputs are valid Unicode. That should almost always be the case within Node. However, @danpeixoto reports that the check fails in their case; see #47457.

I cannot reproduce @danpeixoto's issue. See my comments on the issue. Nevertheless, it seems warranted to make the code more robust in case we do have bad Unicode inputs.

This is what this PR does: it checks whether the transcoding functions return 0, and if they do, it assumes that the input was invalid.

By convention, the routines return the empty string or a null when the input is invalid. This could be changed to some other convention.
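The pattern can be sketched as follows. This is a self-contained illustration, not the code in src/inspector/node_string.cc: the two *_sketch functions are hypothetical scalar stand-ins that only mimic the return-zero-on-error convention described above, std::u16string replaces MaybeStackBuffer, and utf8_to_utf16_or_empty is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the length scan: no validation.
std::size_t utf16_length_from_utf8_sketch(std::string_view in) {
  std::size_t units = 0;
  for (unsigned char b : in) units += ((b & 0xC0) != 0x80) + (b >= 0xF0);
  return units;
}

// Hypothetical stand-in for the transcoder: returns the number of
// char16_t units written, or 0 if the input is not valid UTF-8.
std::size_t convert_utf8_to_utf16_sketch(std::string_view in, char16_t* out) {
  static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t written = 0;
  for (std::size_t i = 0; i < in.size();) {
    unsigned char lead = in[i];
    std::size_t len;
    char32_t cp;
    if (lead < 0x80) { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return 0;                      // stray continuation or bad lead
    if (in.size() - i < len) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char c = in[i + k];
      if ((c & 0xC0) != 0x80) return 0;  // expected a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and out-of-range values.
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return 0;
    if (cp < 0x10000) {
      out[written++] = static_cast<char16_t>(cp);
    } else {
      cp -= 0x10000;
      out[written++] = static_cast<char16_t>(0xD800 + (cp >> 10));
      out[written++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    i += len;
  }
  return written;
}

// Scan, allocate, transcode; treat a zero return as invalid input and
// hand back an empty string instead of aborting. An empty input also
// yields zero, which maps to the same (correct) empty result.
std::u16string utf8_to_utf16_or_empty(std::string_view in) {
  std::u16string buffer(utf16_length_from_utf8_sketch(in), u'\0');
  std::size_t written = convert_utf8_to_utf16_sketch(in, buffer.data());
  if (written == 0) return {};
  assert(written == buffer.size());  // holds whenever the input was valid
  return buffer;
}
```

With this shape, the equality check only runs on the success path, so invalid input no longer aborts the process; the cost is that the caller cannot tell an empty input from an invalid one unless a different convention is chosen.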

Trott

Should we write a test that fails with the current main branch but passes with this change?

Trott · 2023-04-07T20:28:34Z

Should we write a test that fails with the current main branch but passes with this change?

I guess the answer depends on what is found once someone is able to reproduce #47457 and find the source of the invalid UTF-8.

lemire · 2023-04-07T20:47:33Z

Should we write a test that fails with the current main branch

Such a test would trigger a CHECK_EQ on the current main branch and abort Node there, whereas with this PR execution would continue.

It seems that the current tests, those that stress the functions in src/inspector/node_string.cc, assume that the input is valid Unicode. For all I know, we want exactly this assumption and my PR is undesirable.

It is possible that in issue #47457, we want the user to come to node and complain, rather than hide the issue as the current PR would do.

nodejs-github-bot · 2023-04-08T01:18:21Z

CI: https://ci.nodejs.org/job/node-test-pull-request/51061/

nodejs-github-bot · 2023-04-08T02:52:57Z

CI: https://ci.nodejs.org/job/node-test-pull-request/51062/

nodejs-github-bot · 2023-04-09T20:06:35Z

Landed in 63ee335

PR-URL: #47471 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Apr 7, 2023

lemire changed the title ~~src: allow simdutf::convert_* functions to return zero in case of invalid unicode inputs~~ src: allow simdutf::convert_* functions to return zero Apr 7, 2023

lemire added 4 commits April 7, 2023 16:01

src: allow simdutf::convert_* functions to return zero

1879bf9

src: replacing some CHECK_EQ with assert to simplify the code

5c8cb88

src: adding spaces before comments

848782b

src: reformat

2e851ec

lemire mentioned this pull request Apr 7, 2023

Assertion `(expected_utf16_length) == (utf16_length)' failed #47457

Closed

Trott approved these changes Apr 7, 2023

anonrig approved these changes Apr 7, 2023

anonrig reviewed Apr 7, 2023

lemire and others added 2 commits April 7, 2023 18:24

Update src/inspector/node_string.cc

a2f8863

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

Update src/inspector/node_string.cc

89c85b3

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

anonrig approved these changes Apr 8, 2023

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 8, 2023

Trott added the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 8, 2023

lpinca approved these changes Apr 8, 2023

nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 9, 2023

nodejs-github-bot merged commit 63ee335 into nodejs:main Apr 9, 2023

RafaelGSS mentioned this pull request Apr 11, 2023

2023-04-18, Version 20.0.0 (Current) #47381

Merged

github-actions bot mentioned this pull request Apr 12, 2023

CI Reliability 2023-04-12 nodejs/reliability#541

Open

31 tasks

RafaelGSS pushed a commit that referenced this pull request Apr 13, 2023

src: allow simdutf::convert_* functions to return zero

78c7475

PR-URL: #47471 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

danielleadams pushed a commit that referenced this pull request Jul 6, 2023

src: allow simdutf::convert_* functions to return zero

dafea39

PR-URL: #47471 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

danielleadams mentioned this pull request Jul 10, 2023

v18.17.0 release proposal #48694

Merged
