Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issue with UTF8ToString (#12517) #12544

Conversation

LanderlYoung
Copy link

@LanderlYoung LanderlYoung commented Oct 19, 2020

- add extra parameter exactStringLength to UTF8ToString,
This parameter is optional, and if given, the maxBytesToRead parameter is ignored.
to keep consistent API flavor, UTF16ToString, UTF32ToString, is also changed.

Fix performance issue with UTF8ToString (#12517)

introduce new functions
- UTF8ToStringWithLength
- UTF16ToStringWithLength
- UTF32ToStringWithLength

Decode string exactly length with given length, any '\0' in between will
be kept as-is.

Those functions require an argument `lengthInBytes`, so no need to iterator
the heap to find a null-terminator, thus have better performance.

@welcome
Copy link

welcome bot commented Oct 19, 2020

Thank you for submitting a pull request! If this is your first PR, make sure to add yourself to AUTHORS.

@juj
Copy link
Collaborator

juj commented Oct 19, 2020

This regresses generated code size for users that do not use the new variant.

Also unless I am mistaken, this does not really fix any performance issue, because no caller code site is actually changed to use the exact length form. The PR just slows down the functions when exact length is not specified.

Maybe this fixes a performance issue in your JS library code calls that are updated to use this form, at the expense of slowing down generated code for every other user?

I would recommend implementing a separate set of functions UTF*ToStringExactNBytes where maxBytesToRead would be replaced with a non-optional parameter exactBytesToRead. Then callers can use that variant when they need to marshal exactly requested number of bytes.

For the new variant, add a clear comment that if there are null bytes in the middle of the string, then those null bytes will also be deserialized into the generated string.

@LanderlYoung LanderlYoung marked this pull request as ready for review October 19, 2020 07:57
@LanderlYoung
Copy link
Author

@juj Thanks for the suggestion. BTW: previous discussion here #12517

This does not fix the performance issue, because the function still needs to be backward compatible.
But it provides a way to be more efficient.

...at the expense of slowing down generated code for every other user?

I don't think this PR would slow down other's code. Just added some simple if checks.

I would recommend implementing a separate set of functions

This is nice.

@juj
Copy link
Collaborator

juj commented Oct 19, 2020

I don't think this PR would slow down other's code. Just added some simple if checks.

The loop condition changed from

while (idx < endPtr)

to

    while (hasExactLength ? idx < endPtr : !(idx >= endIdx)) {

which will slow down code for all users.

Let's go with a separate set of functions, that will be best for code size and performance in each case.

@LanderlYoung
Copy link
Author

Ok, new commit latter

@LanderlYoung LanderlYoung force-pushed the feature/UTF8ToString_performance_issue_12517 branch from 41ce84e to d626038 Compare October 20, 2020 08:47
@LanderlYoung
Copy link
Author

@juj I rewrote this PR and added corresponding UnitTest, please review.

@LanderlYoung
Copy link
Author

BTW, the tests/utf32.cpp file didn't properly formated with clang-format, and since this file is small, and my test now makes up more lines, I formatted it with project's .clang-format config.

@LanderlYoung LanderlYoung force-pushed the feature/UTF8ToString_performance_issue_12517 branch 2 times, most recently from 7bac16d to 7dfe671 Compare October 20, 2020 11:00
@@ -118,6 +118,85 @@ function UTF8ToString(ptr, maxBytesToRead) {
#endif
}


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move all the added content in this file over to src/library_strings.js. The file runtime_strings.js should not receive any more content, it exists for historical reasons. All new functions can go to library_strings.js.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted

Copy link
Author

@LanderlYoung LanderlYoung Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to move code to src/library_strings.js, but it seems to have something to be related to the MINIMAL_RUNTIME flag, no matter how the unit test complains ReferenceError: UTF8ToStringNBytes is not defined.

The half-done commit is here LanderlYoung@f9149d7

If there is any doc related, please let me know, thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the library_xxx.js system works is that symbols are only ever included on on demand.

There are several ways a library symbol can get included:

  1. By being directly referenced by native code
  2. By being add to -s DEFAULT_LIBRARY_FUNCS_TO_INCLUDE list.
  3. Via the __deps entry to another symbol that is incuded.

(1) doesn't apply to JS functions that are only callable from JS such as these ones so you can ignore that. To signal these are JS-only functions you should prefix them with $.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like src/library_strings.js is only currently included in MINIMAL_RUNTIME mode. Its job is currently to make the functions in runtime_strings_extra.js available as JS library, but only MINIMAL_RUNTIME mode.

runtime_strings_extra.js is include directly in normal (non-MINIMAL) mode.

So some refactoring would need to be doing before these functions can be make universally available via the library system.

I'd be tempted to add all these new functions to runtime_strings_extra.js + src/library_strings.js for now. At least then they will be used in library form under MINIMAL_RUNTIME.

The alternative is that we do some refactoring such at src/library_strings.js is used in all modes.

@@ -77,6 +77,37 @@ function UTF16ToString(ptr, maxBytesToRead) {
#endif // TEXTDECODER
}

function UTF16ToStringWithLength(ptr, lengthInBytes) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move all the added code from this file into file src/library_strings.js. This file exists only for historical compatibility reasons, but all new code can be added to src/library_strings.js.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted

@juj
Copy link
Collaborator

juj commented Oct 20, 2020

Great work! Thanks for championing this. One code organization comment only, we don't need to use the runtime_x.js files for new code, but can use the proper library_x.js structuring there.

@LanderlYoung
Copy link
Author

Thanks for the advice, will do it tomorrow. (I'm from China, it's night now😃)

@LanderlYoung LanderlYoung force-pushed the feature/UTF8ToString_performance_issue_12517 branch from 7dfe671 to ad88e7d Compare October 21, 2020 12:01
@LanderlYoung
Copy link
Author

update:

Fixed cr issues:

  1. renamed UTF*ToStringWithLength to UTF*ToStringNBytes
  2. fixed typo in doc
  3. add docs to preamble.js.rst

BUT, failed to move code to library_strings.js, more detail explained in cr conversation.

@LanderlYoung LanderlYoung force-pushed the feature/UTF8ToString_performance_issue_12517 branch 2 times, most recently from 9599cb2 to 10fce0d Compare October 23, 2020 02:43
introduce new functions:
- UTF8ToStringNBytes
- UTF16ToStringNBytes
- UTF32ToStringNBytes

add docs to preamble.js.rst

Decode string exactly length with given length, any '\0' in between will
be kept as-is.

Those functions require an argument `lengthInBytes`, so no need to iterator
the heap to find a null-terminator, thus have better performance.
@LanderlYoung LanderlYoung force-pushed the feature/UTF8ToString_performance_issue_12517 branch from 10fce0d to 5a20826 Compare October 26, 2020 02:20
Base automatically changed from master to main March 8, 2021 23:49
@stale
Copy link

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Apr 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants