Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jsrt: Modify signature of JsCopyString #3433

Merged
merged 1 commit into from
Jul 27, 2017
Merged

Conversation

kunalspathak
Copy link
Contributor

@kunalspathak kunalspathak commented Jul 26, 2017

Modified signature of JsCopyString to also return actual count of UTF8 bytes present in jsString.
With this information, host can simply allocate a buffer assuming all characters are ascii and
based on writtenLength and actualLength values returned by the API, it can decide if the assumption was correct
i.e. writtenLength == actualLength or it should take slow path to call JsCopyString again by passing bigger buffer
equal to size ofactualLength.

Today, if host wants to copy UTF8 representation of javascript string into a buffer, here is how it is done:

   size_t length = 0;
   JsCopyString(jsString, nullptr, 0, &length);

   char* buffer = malloc(length);

   size_t written = 0;
   JsCopyString(jsString, buffer, length, &written);
   assert(written == length);

can be changed to

   size_t actualLength = 0;
   size_t writtenLength = 0;
   size_t strLength = 0;
   JsStringToPointer(strRef, nullptr, &strLength);
   char* buffer = malloc(strLength + 1);
   JsCopyString(jsString, buffer, strLength, &writtenLength, &actualLength);

   // slow path if jsString contains non-ascii characters
   if(writtenLength != actualLength) {
      free(buffer);
      buffer = malloc(actualLength + 1);

      JsCopyString(jsString, buffer, actualLength, &writtenLength, &actualLength);
   }

@MSLaguana
Copy link
Contributor

Looks like there are still some references to JsCopyString with 4 arguments in ch, and you need to initialize writtenLength to appease prefast.

/// <param name="writtenLength">Total number of characters written. This is only
/// populated when passed with non-null `buffer`.
/// </param>
/// <param name="actualLength">Total number of UTF8 characters present in `value`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this "utf8 characters"/codepoints, or "bytes"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be number of UTF8 decoded "bytes" corresponding to code-points present in "value". i will fix that up.

@kunalspathak
Copy link
Contributor Author

Yes, the fix for ch is on the way.

@@ -292,7 +292,13 @@ CHAKRA_API
/// <param name="value">JavascriptString value</param>
/// <param name="buffer">Pointer to buffer</param>
/// <param name="bufferSize">Buffer size</param>
/// <param name="written">Total number of characters written</param>
/// <param name="writtenLength">Total number of characters written. This is only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also be clarified in the same way as the actualLength parameter.

Might also be worth clarifying that there is no trailing null character?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, also just noticed in the remarks section above it still refers to written instead of actualLength.

@MSLaguana
Copy link
Contributor

For some reason the xplat builds can't see JsStringToPointer

@kunalspathak
Copy link
Contributor Author

kunalspathak commented Jul 26, 2017

Yes, just realized that JsStringToPointer is windows only API. The reason i was using it was to easily get the string length (no. of ascii characters) on which my assumption is based on. I could use JsGetStringLength API, but that returns length in int rather than size_t which i think is dangerous. Other option i could use is read property length of string object, but it might trigger getter, trap, etc.
Instead I am proposing a new API JsGetStringLengthSizeT (I am open to have a better name for this). I will add that API under experimental and use it for ch to do what I am trying to do.

@liminzhu
Copy link
Collaborator

liminzhu commented Jul 26, 2017

@kunalspathak since JsCopyString would return the actual length, how about calling JsCopyString(jsString, nullptr, 0, &wlength, &alength) to determine the buffer size needed instead of exposing JsGetStringLengthSizeT? It would look like this.

   size_t wlength = 0;
   size_t alength = 0;
   JsCopyString(jsString, nullptr, 0, &wlength, &alength);
   char* buffer = malloc(alength+1);
   JsCopyString(jsString, buffer, alength+1, &wlength, &alength);

@obastemur
Copy link
Collaborator

@kunalspathak sorry for joining conversation late but isn't this very expensive to check if the source string is all ascii?

@obastemur
Copy link
Collaborator

I just didn't understand the scenario.

@MSLaguana
Copy link
Contributor

Currently in node we do roughly what @liminzhu mentions, calling JsCopyString twice, once with no buffer to get the length, and then once with a sufficiently large buffer to get the string out. However, JsCopyString internally performs allocations to put the utf8-encoded string somewhere, since we store as utf-16 internally. One of the main motivations of this change is to try and reduce the number of those allocations by having fewer calls to JsCopyString.

If we have the length (in utf16 characters) of the string, then we can guess that maybe the string is made of characters with a 1 byte utf8 encoding (often true), and so if we provide a buffer of that size up front we may be able to avoid a second call unless it turns out we actually need more space.

@obastemur
Copy link
Collaborator

Our initial version was a single call (previous method was allocating necessary space)

This version will call Utf8Str and that will end up allocating memory anyways? If you know the string length (as in number of letters (ascii and utf8)) You can safely pass a buffer with (length * 3) + 1 (if buffer is char) otherwise (length * 2) + 1 (if buffer is utf16)

Actually internal Utf8Str does a similar trick to make things faster.

Checking a string (whether if it's all ASCII or not) is a basic loop that you may also introduce before calling this too?

@liminzhu
Copy link
Collaborator

liminzhu commented Jul 26, 2017

If there's significant overhead of calling JsCopyString, maybe we can have a fast path for when buffer is nullptr to just get the written/actual length (you need to pay the cost somewhere and get those anyways from Kunal's opening comment)? The calling JsCopyString twice pattern is relatively concise and consistent with how we use other APIs like JsSerializeScript.

Do we need to do anything for JsCopyStringUtf16?

@obastemur
Copy link
Collaborator

IMHO, calling malloc once or twice doesn't make much difference. Loop is the most important part here. If we know the string length, we don't need to change anything. Otherwise, I'm not sure the perf gain here.

@kunalspathak
Copy link
Contributor Author

kunalspathak commented Jul 26, 2017

IMHO, calling malloc once or twice doesn't make much difference.

@obastemur - it makes a difference. From what I see there is 300MB worth of heap allocation and free happening inside JsCopyString which can be reduced to half. I thought about a solution to just initialize a buffer of (strLen * 3) + 1, but in most of the cases we will be allocating more memory than needed. To give you an example, in one of the sample app that I was running, 108K times string contains all ascii characters and only 60 times they contain Unicode characters (mainly characters in script file like this). So definitely avoiding a call to JsCopyString helps.

Checking a string (whether if it's all ASCII or not) is a basic loop that you may also introduce before calling this too?

I am not sure what you mean when you mean by this. Could you elaborate?

I spoke offline with @MSLaguana about this and we think here is the problem. Today to perform a copy, we do 3 allocations

  1. By NarrowWideConverter to calculate the length
  2. Allocate the buffer of size length
  3. By NarrowWideConverter to copy the contents into buffer

What should happen inside JsCopyString is

  1. If user doesn't pass a buffer, skip allocation inside NarrowWideConverter and have different way of calculating the length.
  2. If user pass a buffer, use it to copy contents to, instead of creating own inside NarrowWideConverter.
  3. If the buffer that was passed was not big enough, then return back the offset of utf16 upto which contents were copied. Host can then do a realloc with proper length and again call JsCopyString with the offset and resume copying. In this step too, we will avoid allocating buffer inside NarrowWideConverter.

Thus, we can replace 3 malloc/free with just 1 in best case but that is beyond the scope of this PR. I still would like to get in new JsGetStringLengthSizeT and changes to JsCopyString into 1.7 and from there into node-chakracore. We can come back and fine tune JsCopyString later.

@obastemur
Copy link
Collaborator

See https://github.com/Microsoft/ChakraCore/blob/master/lib/Common/Codex/Utf8Codex.cpp#L326 for checking ASCII. If we just need to know whether a string has any multi byte chars in it, this could be the approach? Current design also help for fail-safe. In case of a basic ASCII check fails.

it makes a difference. From what I see there is 300MB worth of heap allocation and free happening inside JsCopyString which can be reduced to half.

I would be really surprised if this costs something tangible. IIRC; we were loosing majority of the time on double looping inside the Utf8ToStr. Now we don't.

@jianchun
Copy link

jianchun commented Jul 27, 2017

@kunalspathak Sorry I chime in late too. To me adding a JsGetStringLengthSizeT looks a bit awkward when we already have JsGetStringLength. Although int isn't ideal, it works at least for now and probably long in future. We have other APIs taking int parameters (JsCopyStringUtf16).

Also there is another workaround for your purpose: JsCopyStringUtf16(strVar, 0, -1, nullptr, &length).

If you really prefer an API... We don't have an API to get Array length, do we? How about a common JsGetLength, that works on String, Array, (TypedArray)... any object?

[Update]: The String APIs and implementations were put up quickly and likely not optimized for many scenarios. I recall node has a code path that only interested in utf8 byte length? For that you can skip NarrowWideConverter, instead change utf8 converter function into a template that takes a dummy output iterator which only counts output bytes. Use that when only requesting utf8 byte length. No memory allocation at all.

One question: You mentioned add API under experimental -- What's our mechanism for experimental? Do we #ifdef or mark some compile time attributes or anything?

@kunalspathak
Copy link
Contributor Author

Thanks @jianchun for the suggestion. I think you are right about ok to have length as int rather than introducing a new API for size_t. I will use JsGetStringLength() instead.
Regarding optimizing converter, I agree and what I mentioned here but I would keep that as a separate PR.

@kunalspathak
Copy link
Contributor Author

Regarding experimental - I haven't seen any flag except mentioning it in wiki.

Modified signature of `JsCopyString` to also return actual count of UTF8 bytes present in jsString.
With this information, host can simply allocate a buffer assuming all characters are ascii and
based on `writtenLength` and `actualLength` values returned by the API, it can decide if the assumption was correct
i.e. `writtenLength == actualLength` or it should take slow path to call `JsCopyString` again by passing bigger buffer
equal to size of`actualLength`.

Today, if host wants to copy UTF8 representation of javascript string into a buffer, here is how it is done:

```c++
   size_t length = 0;
   JsCopyString(jsString, nullptr, 0, &length);

   char* buffer = malloc(length);

   size_t written = 0;
   JsCopyString(jsString, buffer, length, &written);
   assert(written == length);
```

can be changed to

```c++
   size_t actualLength = 0;
   size_t writtenLength = 0;
   size_t strLength = 0;
   JsStringToPointer(strRef, nullptr, &strLength);
   char* buffer = malloc(strLength + 1);
   JsCopyString(jsString, buffer, strLength, &writtenLength, &actualLength);

   // slow path if jsString contains non-ascii characters
   if(written != actualLength) {
      free(buffer);
      buffer = malloc(actualLength + 1);

      JsCopyString(jsString, buffer, actualLength, &writtenLength, &actualLength);
   }
```
@kunalspathak
Copy link
Contributor Author

@dotnet-bot test OSX static_osx_osx_release please

@kunalspathak
Copy link
Contributor Author

Thanks everyone for the feedback and review!

@chakrabot chakrabot merged commit 356656c into release/1.7 Jul 27, 2017
chakrabot pushed a commit that referenced this pull request Jul 27, 2017
Merge pull request #3433 from pr/kunalspathak/1.7

Modified signature of `JsCopyString` to also return actual count of UTF8 bytes present in jsString.
With this information, host can simply allocate a buffer assuming all characters are ascii and
based on `writtenLength` and `actualLength` values returned by the API, it can decide if the assumption was correct
i.e. `writtenLength == actualLength` or it should take slow path to call `JsCopyString` again by passing bigger buffer
equal to size of`actualLength`.

Today, if host wants to copy UTF8 representation of javascript string into a buffer, here is how it is done:

```c++
   size_t length = 0;
   JsCopyString(jsString, nullptr, 0, &length);

   char* buffer = malloc(length);

   size_t written = 0;
   JsCopyString(jsString, buffer, length, &written);
   assert(written == length);
```

can be changed to

```c++
   size_t actualLength = 0;
   size_t writtenLength = 0;
   size_t strLength = 0;
   JsStringToPointer(strRef, nullptr, &strLength);
   char* buffer = malloc(strLength + 1);
   JsCopyString(jsString, buffer, strLength, &writtenLength, &actualLength);

   // slow path if jsString contains non-ascii characters
   if(writtenLength != actualLength) {
      free(buffer);
      buffer = malloc(actualLength + 1);

      JsCopyString(jsString, buffer, actualLength, &writtenLength, &actualLength);
   }
```
chakrabot pushed a commit that referenced this pull request Jul 27, 2017
…CopyString

Merge pull request #3433 from pr/kunalspathak/1.7

Modified signature of `JsCopyString` to also return actual count of UTF8 bytes present in jsString.
With this information, host can simply allocate a buffer assuming all characters are ascii and
based on `writtenLength` and `actualLength` values returned by the API, it can decide if the assumption was correct
i.e. `writtenLength == actualLength` or it should take slow path to call `JsCopyString` again by passing bigger buffer
equal to size of`actualLength`.

Today, if host wants to copy UTF8 representation of javascript string into a buffer, here is how it is done:

```c++
   size_t length = 0;
   JsCopyString(jsString, nullptr, 0, &length);

   char* buffer = malloc(length);

   size_t written = 0;
   JsCopyString(jsString, buffer, length, &written);
   assert(written == length);
```

can be changed to

```c++
   size_t actualLength = 0;
   size_t writtenLength = 0;
   size_t strLength = 0;
   JsStringToPointer(strRef, nullptr, &strLength);
   char* buffer = malloc(strLength + 1);
   JsCopyString(jsString, buffer, strLength, &writtenLength, &actualLength);

   // slow path if jsString contains non-ascii characters
   if(writtenLength != actualLength) {
      free(buffer);
      buffer = malloc(actualLength + 1);

      JsCopyString(jsString, buffer, actualLength, &writtenLength, &actualLength);
   }
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants