Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up java-to-managed typemap lookups #6905

Merged
merged 1 commit into from
Apr 13, 2022

Conversation

grendello
Copy link
Contributor

@grendello grendello commented Apr 7, 2022

Up until now, Xamarin.Android used string comparison when finding a
Managed type corresponding to a given Java type. Even though the
strings were pre-sorted at build time, multiple string comparisons
costed more time than necessary. To improve comparison speed, this
commit implements lookups based on hash values (using the xxHash
algorithm) calculated for all the Java names at build time. This allows
us to process each Java type once at run time - to generate its hash.
After that, the hash is used to binary search an array of hashes and the
result (if found) is an index into array with the appropriate
Java-to-Managed mapping.

This change also allows us to move Java type names from the mapping
structure (TypeMapJava) to a separate array. We used to keep Java
type name in the structure to make matching slightly faster, but it
required unnecessarily complicated structure size calculation at
runtime, so that binary search can properly work on an array of
TypeMapJava structures whose size would differ from application to
application (and sometimes even between builds). The change also saves
space, because when the Java type name was stored in the structure, all
the structures had to have the same size, and thus all type names
shorter than the longest one had to be padded with NUL characters.

A handful of other optimizations are implemented as well. Namely:

  • the JNIEnv.RegisterJniNatives method is now called
    directly (thanks to the [UnmanagedCallersOnly] attribute) when
    running under .NET6
  • a conceptually simpler binary search function was implemented, which
    doesn't use C++ templates and also appears to generate faster code.
    There are two versions of the function, one "simple" using the
    standard branching binary search algorithm and the other
    "branchless". The latter is currently not used, needing a better
    timing infrastructure to make sure it's actually faster on Android
    devices (microbenchmarks suggest its faster, application
    measurements when the branchless version is used suggest it's slower
    than the simple one)
  • the typemap_managed_to_java and typemap_java_to_managed internal
    calls are now registered directly from the EmbeddedAssemblies
    class instead of from the MonodroidRuntime class
  • a number of native functions are now forcibly inlined
  • a number of native functions are now static instead of instance

Startup performance was measured a .NET6 MAUI application created with
the dotnet new maui template and the gains vary depending on where we
look. The Displayed time sees changes that are negligible, however the
most affected area of the startup sequence (the call to
JNIEnv.Initialize) which registers types and involves the biggest
number of lookups sees improvements of up to 12%. The measurements have
a degree of uncertainty and instability to them because of our use of
Android logcat to report timings as they are taken (logcat calls
need to send messages to a system daemon which involves a lot of steps
and allows for a large variation in time spent processing each call) and
also because the Displayed time is not a very stable reporting
system (it depends on CPU and GPU load among other factors)

The changes will also positively affect application performance after
startup:

On Pixel 3 XL running Android 12:

Before After Δ Notes
14.967 13.586 -9.23% ✓ preload enabled; 32-bit build
15.312 14.343 -6.33% ✓ preload enabled; 32-bit build; no compression
13.577 12.792 -5.78% ✓ preload enabled; 64-bit build
13.677 12.894 -5.73% ✓ preload disabled; 64-bit build; no compression
13.601 12.838 -5.61% ✓ preload disabled; 64-bit build
13.656 12.953 -5.15% ✓ preload enabled; 64-bit build; no compression
14.638 14.070 -3.88% ✓ preload disabled; 32-bit build
15.053 14.526 -3.50% ✓ preload disabled; 32-bit build; no compression

On Pixel 6 XL running Android 12:

Before After Δ Notes
8.972 7.826 -12.78% ✓ preload enabled; 32-bit build
8.833 7.823 -11.43% ✓ preload enabled; 32-bit build; no compression
8.611 8.031 -6.74% ✓ preload disabled; 32-bit build; no compression
6.533 6.104 -6.57% ✓ preload disabled; 64-bit build; no compression
6.504 6.119 -5.92% ✓ preload enabled; 64-bit build; no compression
6.426 6.052 -5.83% ✓ preload disabled; 64-bit build
6.493 6.125 -5.67% ✓ preload enabled; 64-bit build
8.446 8.088 -4.23% ✓ preload disabled; 32-bit build

Copy link
Contributor

@dellis1972 dellis1972 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok. I'd need to read up on the LLVM code bits in order to review those changes properly, but the rest looks ok.

@grendello
Copy link
Contributor Author

This looks ok. I'd need to read up on the LLVM code bits in order to review those changes properly, but the rest looks ok.

If you want to know how they work, I suggest reading the code outside the diff and I'm glad to explain and answer any questions :)

@dellis1972
Copy link
Contributor

If you want to know how they work, I suggest reading the code outside the diff and I'm glad to explain and answer any questions :)

OK, how does LLVM work? 🤣

@grendello
Copy link
Contributor Author

If you want to know how they work, I suggest reading the code outside the diff and I'm glad to explain and answer any questions :)

OK, how does LLVM work? rofl

It's complicated 😆

@grendello
Copy link
Contributor Author

grendello commented Apr 11, 2022

Seriously though, what we (and, of course, other clang frontends - that is specific language compilers) generate is LLVM IR (Intermediate Representation) code which allows one to describe both data and code in a type-safe manner that is abstract enough to allow for different translations into native assemblers and also allows easy(ier) optimization of the IR code before it's translated into the target architecture code. The target code generator takes the already somewhat optimized IR code and generates the most optimal native assembler for the given abstract construct. That's a very, very rough approximation of how it works :)


void WriteArrayString (string str, string symbolSuffix)
{
string name = WriteUniqueString ($"__{symbolName}_{symbolSuffix}", str, ref arrayStringCounter, LlvmIrVariableOptions.LocalConstexprString, out ulong size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have WriteUniqueString() return a StringSymbolInfo instance? This would allow it to drop the out ulong size parameter as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

@jonpryor
Copy link
Member

The commit message states:

  • a simpler binary search function was implemented

I think this could use some elaboration. :-) What makes it simpler? Why change it at all?

}
}

var javaMapHashes = new List<ulong> (javaMap.Count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a List<ulong>, and not e.g. a HashSet<ulong>? (Is ordering important?)

I think for "defense in depth" we should check the "impossible" of duplicate hashes. Using a HashSet<ulong> would allow us to do this by checking the HashSet<T>.Add(T) return value. Otherwise we'd be looking at List<T>.Contains(T) calls to verify that there were no duplicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to be a list, we must sort it before writing because it's searched with binary search at the run time. Duplicate hashes are extremely unlikely, so I decided not to search for them, but I can add a check if you insist :)

return UInt64.MaxValue;
}

// Native code will operate on wchar_t cast to a byte array, we need to do the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention the native code method which does this operation.


const TypeMapModule map_modules[] = {};
TypeMapModule map_modules[] = {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be const?

Copy link
Contributor Author

@grendello grendello Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, we write to it (the image member of the struct).

log_warn (LOG_ASSEMBLY, "typemap: empty Java type name passed to 'typemap_java_to_managed'");
// We need to generate hash for all the bytes, and since MonoString is Unicode, we double the length to get the
// number of bytes.
int name_len = mono_string_length (java_type) << 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*2 is clearer than <<1. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but I've seen that some compilers (admittedly not the ones in the current NDK) produce a multiplication assembler instruction for * (and division for /) instead of a left shift (right shift), so I chose to use << (and >> elsewhere) just to be safe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're kidding. It's 2022, and compilers don't optimize that?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it might have to do with context, but yeah, I've seen that happen

Up until now, Xamarin.Android used string comparison when finding a
Managed type corresponding to a given Java type.  Even though the
strings were pre-sorted at build time, multiple string comparisons
costed more time than necessary.  To improve comparison speed, this
commit implements lookups based on hash values (using the `xxHash`
algorithm) calculated for all the Java names at build time.  This allows
us to process each Java type once at run time - to generate its hash.
After that, the hash is used to binary search an array of hashes and the
result (if found) is an index into array with the appropriate
Java-to-Managed mapping.

This change also allows us to move Java type names from the mapping
structure (`TypeMapJava`) to a separate array.  We used to keep Java
type name in the structure to make matching slightly faster, but it
required unnecessarily complicated structure size calculation at
runtime, so that binary search can properly work on an array of
`TypeMapJava` structures whose size would differ from application to
application (and sometimes even between builds).  The change also saves
space, because when the Java type name was stored in the structure, all
the structures had to have the same size, and thus all type names
shorter than the longest one had to be padded with NUL characters.

A handful of other optimizations are implemented as well.  Namely:

  * the `JNIEnv.RegisterJniNatives` method is now called
    directly (thanks to the `[UnmanagedCallersOnly]` attribute) when
    running under .NET6
  * a conceptually simpler binary search function was implemented, which
    doesn't use C++ templates and also appears to generate faster code.
    There are two versions of the function, one "simple" using the
    standard branching binary search algorithm and the other
    "branchless". The latter is currently not used, needing a better
    timing infrastructure to make sure it's actually faster on Android
    devices (microbenchmarks suggest its faster, application
    measurements when the branchless version is used suggest it's slower
    than the simple one)
  * the `typemap_managed_to_java` and `typemap_java_to_managed` internal
    calls are now registered directly from the `EmbeddedAssemblies`
    class instead of from the `MonodroidRuntime` class
  * a number of native functions are now forcibly inlined
  * a number of native functions are now static instead of instance

Startup performance was measured a .NET6 MAUI application created with
the `dotnet new maui` template and the gains vary depending on where we
look. The `Displayed` time sees changes that are negligible, however the
most affected area of the startup sequence (the call to
`JNIEnv.Initialize`) which registers types and involves the biggest
number of lookups sees improvements of up to 12%.  The measurements have
a degree of uncertainty and instability to them because of our use of
Android `logcat` to report timings as they are taken (`logcat` calls
need to send messages to a system daemon which involves a lot of steps
and allows for a large variation in time spent processing each call) and
also because the `Displayed` time is not a very stable reporting
system (it depends on CPU and GPU load among other factors)

The changes will also positively affect application performance after
startup:

On Pixel 3 XL running Android 12:

| Before | After  | Δ        | Notes                                          |
| ------ | ------ | -------- | ---------------------------------------------- |
| 14.967 | 13.586 | -9.23% ✓ | preload enabled; 32-bit build                  |
| 15.312 | 14.343 | -6.33% ✓ | preload enabled; 32-bit build; no compression  |
| 13.577 | 12.792 | -5.78% ✓ | preload enabled; 64-bit build                  |
| 13.677 | 12.894 | -5.73% ✓ | preload disabled; 64-bit build; no compression |
| 13.601 | 12.838 | -5.61% ✓ | preload disabled; 64-bit build                 |
| 13.656 | 12.953 | -5.15% ✓ | preload enabled; 64-bit build; no compression  |
| 14.638 | 14.070 | -3.88% ✓ | preload disabled; 32-bit build                 |
| 15.053 | 14.526 | -3.50% ✓ | preload disabled; 32-bit build; no compression |

On Pixel 6 XL running Android 12:

| Before | After | Δ         | Notes                                          |
| ------ | ----- | --------- | ---------------------------------------------- |
| 8.972  | 7.826 | -12.78% ✓ | preload enabled; 32-bit build                  |
| 8.833  | 7.823 | -11.43% ✓ | preload enabled; 32-bit build; no compression  |
| 8.611  | 8.031 | -6.74% ✓  | preload disabled; 32-bit build; no compression |
| 6.533  | 6.104 | -6.57% ✓  | preload disabled; 64-bit build; no compression |
| 6.504  | 6.119 | -5.92% ✓  | preload enabled; 64-bit build; no compression  |
| 6.426  | 6.052 | -5.83% ✓  | preload disabled; 64-bit build                 |
| 6.493  | 6.125 | -5.67% ✓  | preload enabled; 64-bit build                  |
| 8.446  | 8.088 | -4.23% ✓  | preload disabled; 32-bit build                 |
@grendello
Copy link
Contributor Author

Fix for the failing tests: #6922

@jonpryor
Copy link
Member

Context: https://en.algorithmica.org/hpc/

Up until now, Xamarin.Android used string comparison when finding a
Managed type corresponding to a given Java type.  Even though the
strings were pre-sorted at build time, multiple string comparisons
cost more time than necessary.  To improve comparison speed, implement
lookups based on hash values using the `xxHash` algorithm (c9270261),
calculated for all bound Java names at build time.  This allows us to
process each Java type once at run time, to generate its hash.  After
that, the hash is used to binary search an array of hashes and the
result (if found) is an index into array with the appropriate
Java-to-Managed mapping.

This change also allows us to move Java type names from the mapping
structure (`TypeMapJava`) and array (`map_java`) to a separate
`java_type_names` array.  We used to keep Java type name in the
structure to make matching slightly faster, but it required
unnecessarily complicated structure size calculation at runtime, so
that binary search can properly work on an array of `TypeMapJava`
structures whose size would differ from application to application
(and sometimes even between builds).  The change also saves space,
because when the Java type name was stored in the structure, all the
structures had to have the same size, and thus all type names shorter
than the longest one had to be padded with NUL characters.

A handful of other optimizations are implemented as well.  Namely:

  * the `JNIEnv.RegisterJniNatives()` method is now called
    directly (thanks to the `[UnmanagedCallersOnly]` attribute) when
    running under .NET6+; see also 16680700.

  * A conceptually simpler binary search function was implemented,
    which doesn't use C++ templates and also appears to generate
    faster code.  There are two versions of the function, one "simple"
    using the standard branching binary search algorithm, and the other
    "branchless". The latter is currently not used, needing a better
    timing infrastructure to verify it's actually faster on Android
    devices.  (Microbenchmarks suggest its faster, application
    measurements when the branchless version is used suggest it's
    slower than the simple one)

  * the `typemap_managed_to_java()` and `typemap_java_to_managed()`
    internal calls are now registered directly from the
    `EmbeddedAssemblies` class instead of from the `MonodroidRuntime`
    class

  * a number of native functions are now forcibly inlined

  * a number of native functions are now `static` instead of instance.

~~ File Formats ~~

The `TypeMapJava::java_name` string field (ce2bc689) is now an
`TypeMapJava::java_name_index` int32 field, which is an index into
the `java_type_names` global array:

	extern "C" const char* const java_type_names[];

A new `map_java_hashes` global array is also introduced, which contains
the xxHash value of each entry within `java_type_names`.
`map_java_hashes` is sorted for binary search purposes:

	extern "C" const xamarin::android::hash_t map_java_hashes[];

~~ Performance ~~

Startup performance was measured on a .NET6 MAUI application created
with the `dotnet new maui` template.  Gains vary depending on where
we look.

The `Displayed` time sees changes that are negligible, however the
most affected area of the startup sequence (`JNIEnv.Initialize()`)
which registers types and involves the biggest number of lookups sees
improvements of up to 12%.  The measurements have a degree of
uncertainty and instability to them because of our use of Android
`logcat` to report timings as they are taken.  (`adb logcat` calls
need to send messages to a system daemon which involves a lot of
steps and allows for a large variation in time spent processing each
call.)  The `Displayed` time is also not a very stable reporting
system (it depends on CPU and GPU load among other factors).

The changes will also positively affect application performance after
startup.  All times are from devices running Android 12.

| Runtime.init time Scenario            | Before ms |  After ms |         Δ |
| ------------------------------------- | --------: | --------: | --------: |
| Pixel 3 XL, 32-bit, Preload enabled   |    14.967 |    13.586 |  -9.23% ✓ |
| Pixel 3 XL, 64-bit, Preload disabled  |    13.601 |    12.838 |  -5.61% ✓ |
| Pixel 6 XL, 32-bit, Preload enabled   |     8.972 |     7.826 | -12.78% ✓ |
| Pixel 6 XL, 64-bit, Preload disabled  |     6.426 |     6.052 |  -5.83% ✓ |

@jonpryor jonpryor merged commit f48b97c into dotnet:main Apr 13, 2022
@grendello grendello deleted the faster-search branch April 14, 2022 15:58
@github-actions github-actions bot locked and limited conversation to collaborators Jan 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants