[mono] unify and vectorize implementation of decode_value metadata API #100048
Conversation
cc @radekdoulik re our earlier conversation about simd in other parts of the runtime
Fix some mistakes
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
@lewing are there plans to enable SIMD in the wasm runtime for .NET 10? It would be needed in order to land this. Doing it would also make simdhash much faster.
Do you have numbers for how much faster? If the gain is large enough, there are a few options we could explore.
IIRC, when I tested it, this PR was roughly a 5-10% improvement to decode_value. For simdhash it makes lookups something like 10-40% faster depending on a number of factors, with the caveat that hashtable ops are only 10-15% of time spent during a startup profile, so we'd be looking at more like a 1-4% actual savings.
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
We have multiple copies of the same decode_value function spread across the mono runtime. This PR unifies them all into a single implementation under metadata/, and then vectorizes it using clang vector builtins.
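For context, here is a scalar sketch of the kind of variable-length integer decoder being unified. The bit layout below is a plausible approximation in the style of mono's AOT metadata encoding; the exact encoding the PR's decode_value handles may differ:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a scalar variable-length integer decoder in the style of the
 * runtime's decode_value. The layout here is an assumption for illustration:
 *   0xxxxxxx            -> 1 byte,  7-bit value
 *   10xxxxxx xxxxxxxx   -> 2 bytes, 14-bit value
 *   110xxxxx + 3 bytes  -> 4 bytes, 29-bit value
 *   11111111 + 4 bytes  -> 5 bytes, full 32-bit value
 */
static int32_t
decode_value_scalar (const uint8_t *ptr, const uint8_t **endp)
{
    uint8_t b = ptr[0];
    uint32_t value;

    if ((b & 0x80) == 0) {                /* 1-byte form */
        value = b;
        ptr += 1;
    } else if ((b & 0x40) == 0) {         /* 2-byte form */
        value = ((uint32_t)(b & 0x3f) << 8) | ptr[1];
        ptr += 2;
    } else if (b != 0xff) {               /* 4-byte form */
        value = ((uint32_t)(b & 0x1f) << 24) | ((uint32_t)ptr[1] << 16)
              | ((uint32_t)ptr[2] << 8) | ptr[3];
        ptr += 4;
    } else {                              /* 5-byte escape: raw 32-bit big-endian */
        value = ((uint32_t)ptr[1] << 24) | ((uint32_t)ptr[2] << 16)
              | ((uint32_t)ptr[3] << 8) | ptr[4];
        ptr += 5;
    }
    if (endp)
        *endp = ptr;
    return (int32_t)value;
}
```

The branch chain on the leading byte is what makes this a candidate for vectorization: a SIMD version can load all five candidate bytes at once and select the width and shifted value branchlessly.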
A basic benchmark on x64 using clang -O3 showed a 13% time reduction, and the generated wasm for this is pretty efficient, so I'm hoping it will be a small startup time win for any target where we enable it. It's hard to measure in practice locally, though.
I verified that the vectorized implementation works correctly by comparing its output against the scalar version for all possible 5-byte sequences, so this should be a safe switch-over as long as there aren't endianness issues.
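A cut-down sketch of that style of verification, written here as an encode/decode round-trip check over the encoding's boundary values rather than the full 5-byte input space. Both helpers and the bit layout are hypothetical stand-ins for the PR's actual implementations:

```c
#include <stdint.h>
#include <stddef.h>

/* Round-trip checker sketch. encode_value/decode_value below use one
 * plausible variable-length layout (hypothetical, for illustration):
 * values < 2^7 take 1 byte, < 2^14 take 2, < 2^29 take 4, else 5. */

static size_t
encode_value (uint32_t v, uint8_t *out)
{
    if (v < 0x80u) {
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < 0x4000u) {
        out[0] = 0x80 | (uint8_t)(v >> 8);
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v < 0x20000000u) {
        out[0] = 0xc0 | (uint8_t)(v >> 24);
        out[1] = (uint8_t)(v >> 16);
        out[2] = (uint8_t)(v >> 8);
        out[3] = (uint8_t)v;
        return 4;
    }
    out[0] = 0xff;
    out[1] = (uint8_t)(v >> 24);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 8);
    out[4] = (uint8_t)v;
    return 5;
}

static uint32_t
decode_value (const uint8_t *p, size_t *len)
{
    uint8_t b = p[0];
    if ((b & 0x80) == 0) { *len = 1; return b; }
    if ((b & 0x40) == 0) { *len = 2; return ((uint32_t)(b & 0x3f) << 8) | p[1]; }
    if (b != 0xff) {
        *len = 4;
        return ((uint32_t)(b & 0x1f) << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8) | p[3];
    }
    *len = 5;
    return ((uint32_t)p[1] << 24) | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 8) | p[4];
}

/* Returns 1 if v survives an encode/decode round trip with matching length. */
static int
roundtrip_ok (uint32_t v)
{
    uint8_t buf[5];
    size_t enc_len = encode_value (v, buf);
    size_t dec_len;
    uint32_t back = decode_value (buf, &dec_len);
    return back == v && dec_len == enc_len;
}
```

Exhaustively comparing the vectorized decoder against the scalar one over every input, as the PR did, is stronger than a round trip like this, since it also pins down behavior on inputs no encoder would produce.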
Pending items to fix for this PR: