src: implement FastByteLengthUtf8 with simdutf::utf8_length_from_latin1 #50840
Conversation
The Benchmark Result:
Probably the speedup will only be noticeable when the CPU supports certain instruction sets. Is there a way to detect when they are supported? If so, we can add a fast path when they are supported.
Thanks. The PR only changes one function. Let us profile it.
So I submit to you that it is not a good benchmark to examine the effect of an optimization on this function.
In the worst case, on very old CPUs (e.g., 15 years old), it will fall back to something equivalent to the current code. On reasonable CPUs, it will provide an accelerated kernel. It is easy to query which CPU type it detects, but I don't think we currently expose it in Node.
This function directly affects the performance of that function because of: Line 623 in b044369
On my machine using Ryzen 9 5950X:
Strings with lengths below 256 will probably be slower than with the old version, but larger strings will be faster. Well, I'm not against this change since people smarter than me approved this PR, but I just want to give more context.
@lemire how did you generate the profile? The benchmark runs on multiple different data sets while this PR only affects the one-byte & two byte ones (which are Latin 1 but not ASCII), if you profile all the datasets then it’s going to be dominated by other data sets that don’t hit the path being modified here, but the perf hit here on the ASCII dataset looks real. To select the regressed data set you need to pass something like “ n=4000000 repeat=1 encoding='utf8' type='one_byte'” to the benchmark runner. |
@joyeecheung I’ll add a performance analysis in the coming days. :-)
It should not lead to a regression, but issues are always possible. I will investigate.
Also one thing to note: in the benchmark results, datasets that are not affected by this change did not show significant performance differences (when the characters are 3-4 bytes in Unicode, or when they are repeated, i.e., no longer flat and therefore aren't FastOneByteStrings). Only the datasets that are affected by this change (flat strings with ASCII or Latin-1 characters) showed significant performance differences. (It could also have something to do with the reduction of the fast calls in V8, but it's difficult for me to see how changing only what's inside the fast call could make a difference in the optimizations in V8.)
Are you sure?
This is not related to this PR for two reasons: I think that the only case that should be affected is |
So I have examined the issue a bit, and we have a case where computing the UTF-8 length of the string is only a small part of the work. For example, if I just profile the benchmark...
You can see the profile here. Let us try something more significant...
Profiling this string, I get the following...
That's more reasonable. Even so, we see that the whole program cannot be sped up by more than 30% when optimizing FastByteLengthUtf8 (i.e., imagine we made it free). The code is not instrumented so it is a bit difficult to do a more serious analysis, but it is a ballpark upper bound. Anyhow, now we have a chance for this PR to be beneficial. And, sure enough, I get that this PR is faster on this less trivial string: PR:
Main:
So how do we deal with this? I suggest doing something akin to what @H4ad proposed: filter the queries according to length. Short strings go through the old code, and long strings go through the simdutf kernel. PR:
Main:
(I ran the benchmark just once, but it is a quiet machine with no more than 2% variance.) In the last commit, I added an extra test to the benchmark corresponding to this case. My thanks to @H4ad for the benchmarks.
Nice work, thanks for the PR!
Hmm, indeed, I looked at the dataset again and they are not Latin-1. I think the fix should be enough to avoid the extra overhead for short strings; here's another benchmark CI to verify it. (I also noticed that in the CI results (#50840 (comment)), two_bytes were not affected, but @H4ad's local results were; it could be machine-dependent.) https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1474/
@joyeecheung Should we change the benchmark further in another PR? I am happy to issue a second PR if this one gets merged.
Yes, #50840 (comment) is more of a "thinking out loud" comment. |
@lemire Can you fix the linting errors? |
Sorry about that. I forgot to check after modifying the JavaScript code. Done. |
Landed in 891bd5b |
PR-URL: #50840
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Vinícius Lourenço Claro Cardoso <contact@viniciusl.com.br>
This PR proposes to replace the conventional FastByteLengthUtf8 implementation with simdutf::utf8_length_from_latin1. Internally, the simdutf library implements specialized functions for this computation. The results are presented in the following blog posts... The speed can be 30x better in some instances.
This PR actually reduces the total number of lines of code.