[NativeAOT] Replace GVMLookupForSlot internal cache with generic cache similar to CastCache #89331

Merged
13 commits merged into dotnet:main from the castcacheImpl branch, Jul 24, 2023

Conversation

VSadov
Member

@VSadov VSadov commented Jul 21, 2023

Fixes: #85927

  • The new cache is essentially a generic form of CastCache. It is completely lock-free and slightly faster than the caching scheme it replaces (a simplified sketch of the idea follows below).
  • Unlike the ad-hoc caching scheme previously used for GVM lookup, the new cache is a standalone implementation and can be reused in other similar scenarios.
  • The refactored CastCache can now also be used for other purposes (it no longer assumes a single static instance).
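For illustration only, here is a greatly simplified, self-contained sketch of the general shape of such a cache. This is not the GenericCache added by this change, and every name in it is hypothetical; the point is only the essential property that readers never take a lock: entries are fully constructed before being published with a single reference write, and a miss simply falls back to the slow lookup path.

using System;
using System.Threading;

// Hypothetical, simplified stand-in for a generic lock-free lookup cache
// (not the actual GenericCache<Key, Value> implementation in this PR).
internal sealed class SimpleGenericCache<TKey, TValue>
    where TKey : struct, IEquatable<TKey>
{
    private sealed class Entry
    {
        public readonly TKey Key;
        public readonly TValue Value;
        public Entry(TKey key, TValue value) { Key = key; Value = value; }
    }

    private readonly Entry[] _table;

    public SimpleGenericCache(int size) => _table = new Entry[size];

    private int Bucket(TKey key) => (key.GetHashCode() & 0x7FFFFFFF) % _table.Length;

    // Lock-free read: a single volatile load of the bucket. An entry is published
    // only after it is fully constructed, so readers see the whole entry or none.
    public bool TryGet(TKey key, out TValue value)
    {
        Entry e = Volatile.Read(ref _table[Bucket(key)]);
        if (e != null && e.Key.Equals(key))
        {
            value = e.Value;
            return true;
        }
        value = default;
        return false;
    }

    // Last-writer-wins publication; losing a race only costs a future cache miss.
    public void TrySet(TKey key, TValue value) =>
        Volatile.Write(ref _table[Bucket(key)], new Entry(key, value));
}

A GVM cache would key something like this by a (type handle, method handle) pair and cache the resolved target, while the cast cache keys by (source, target) type handles and caches the cast result.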

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label label Jul 21, 2023
@ghost ghost assigned VSadov Jul 21, 2023
@ghost

ghost commented Jul 21, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.


@VSadov VSadov removed the needs-area-label label Jul 21, 2023
@VSadov
Member Author

VSadov commented Jul 22, 2023

I am using the following crude microbenchmark to estimate perf differences:

using System.Diagnostics;

namespace ConsoleApp29
{
    class c0
    {
        public virtual void M1<T>()
        {
            if (Program.i == 0) throw null;
        }
    }

    class c1  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c2  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c3  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c4  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c5  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c6  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c7  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c8  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }
    class c9  : c0 { public override void M1<T>() {if (Program.i == 0) throw null;} }


    internal class Program
    {
        public static int i = 1;

        c0 v0 = new c0();
        c0 v1 = new c1();
        c0 v2 = new c2();
        c0 v3 = new c3();
        c0 v4 = new c4();
        c0 v5 = new c5();
        c0 v6 = new c6();
        c0 v7 = new c7();
        c0 v8 = new c8();
        c0 v9 = new c9();


        void Test2(c0 arg)
        {
            arg.M1<object>();
            arg.M1<int>();
            arg.M1<long>();
            arg.M1<Exception>();
            arg.M1<int[]>();
            arg.M1<string>();
            arg.M1<decimal>();
            arg.M1<Guid>();
            arg.M1<int?>();
            arg.M1<object[]>();
        }

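        // 10 receiver types x 10 generic instantiations each = 100 distinct generic virtual calls per Test1() invocation.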
        void Test1()
        {
            Test2(v0);
            Test2(v1);
            Test2(v2);
            Test2(v3);
            Test2(v4);
            Test2(v5);
            Test2(v6);
            Test2(v7);
            Test2(v8);
            Test2(v9);
        }

        void Test()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000000; i++)
            {
                Test1();
            }

            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void Main(string[] args)
        {
            Program p = new Program();

            for(; ; )
            {
                p.Test();
            }
        }
    }
}

Results vary a bit between runs, but the new implementation is generally faster.

Here are the results of the two best runs (elapsed time for the same number of iterations, so smaller is better), measured on Windows x64:

==== Original (recent main)

775
773
776
771
771
779

==== New (about 30% faster)

585
585
582
588
589
589
592

@VSadov VSadov requested a review from jkotas July 22, 2023 01:09
@VSadov
Member Author

VSadov commented Jul 22, 2023

The slow helper is still not tail-called, BTW.
@jakobbotsch - do we have the same reasons here as in #85901 (comment), or can something be tweaked to convince the JIT to tail-call in GVMLookupForSlot?


// Initialize the cache eagerly to avoid null checks.
private static GenericCache<Key, Value> s_cache;
internal static void Initialize()
Member

Use EagerStaticClassConstruction for this so that it has the same structure as cast cache?

EagerStaticClassConstruction should be better for trimming. The compiler should be able to trim the initialization if the type is not otherwise used.
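For context, a minimal sketch of the suggested pattern, under the assumption that the attribute in question is the internal EagerStaticClassConstructionAttribute used elsewhere in the NativeAOT corelib; the attribute and the cache type below are declared locally only so the sketch is self-contained, and all names are hypothetical.

using System;

// Local stand-in; inside the NativeAOT corelib the real internal attribute would be used.
[AttributeUsage(AttributeTargets.Class)]
internal sealed class EagerStaticClassConstructionAttribute : Attribute { }

// With eager construction the static field initializer runs at startup, so callers can
// rely on s_cache being non-null without an explicit Initialize() call, and the compiler
// can drop the whole type (including its initializer) if nothing references it.
[EagerStaticClassConstruction]
internal static class GvmCacheHolder // hypothetical name
{
    private static readonly HypotheticalCache s_cache = new HypotheticalCache(size: 128);

    internal static bool TryLookup(int key, out string value) => s_cache.TryGet(key, out value);
}

// Minimal placeholder so the sketch compiles on its own.
internal sealed class HypotheticalCache
{
    public HypotheticalCache(int size) { }
    public bool TryGet(int key, out string value) { value = null; return false; }
}

Compared to the Initialize() pattern, the trimming benefit comes from the initialization living inside the type it initializes rather than in a method that some startup path must call unconditionally.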

Member Author

I did not know about the differences. I just used what this cache used before, assuming there was a reason for the Initialize pattern. Makes sense to switch to EagerStaticClassConstruction.

@@ -775,6 +775,7 @@
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\CompilerServices\AsyncValueTaskMethodBuilderT.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\CompilerServices\AsyncVoidMethodBuilder.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\CompilerServices\CastCache.cs" Condition="'$(FeatureCoreCLR)' == 'true' or '$(FeatureNativeAot)' == 'true'" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\CompilerServices\GenericCache.cs" Condition="'$(FeatureCoreCLR)' == 'true' or '$(FeatureNativeAot)' == 'true'" />
Member

Nit: You can use Condition="'$(FeatureMono)' != 'true'" to make this condition a bit simpler.


#endif // CORECLR
// wraps existing table
Member

How does this wrapping work? It won't initialize the other fields like _lastFlushSize, etc.

Member Author

@VSadov VSadov Jul 22, 2023

I looked at the jitted code and, if I am not mistaken, the struct gets promoted and the unused fields disappear. When we operate on a cache that wraps an existing table, we only do TryGet, and TryGet only needs the _table field.

I looked at the code in the debugger, which is not trivial because of the "no debugging this" attributes. There is a strange call there, most likely related to accessing a static field, but otherwise the codegen looked like the struct gets completely erased and _table is used as a scalar variable.

It would be nice if someone could confirm that this is the case. There might be a "proper" way to just dump the jitted methods and see what happens there and how the code changed.

Member

I am more worried that this constructor creates a CastCache instance that is partially initialized. It makes it hard to verify that none of the places that use this instance touch the uninitialized fields (_maxCacheSize, etc.).

Member Author

The code paths that add/resize/flush are not supposed to be used on a cache that wraps an existing table, since logically it does not own the table. An eventual crash due to uninitialized fields would be expected.

I guess we could add more explicit asserts: off the TryGet path, the min/max sizes should be nonzero.

Member

It would be better to structure the code such that these partially initialized CastCache instances do not need to be created.

Member Author

@VSadov VSadov Jul 23, 2023

It would be better to structure the code such that these partially initialized CastCache instances do not need to be created.

That can be done if we pass the table directly to TryGet. I think that will work better. We would not need to add asserts or worry about the struct being optimized away.

Member Author

I’ve removed the wrapping scenario. Just pass the existing table, if available, to ‘TryGet’.
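Roughly the shape of the change, as described in this thread (hypothetical names and a toy table layout, not the PR's actual CastCache code): the read path becomes a static helper that takes the table explicitly, so code that only borrows a table never constructs a cache instance whose resize/flush fields are left at their defaults.

internal static class TableReader
{
    // Toy table layout standing in for the real cache table: a slot is valid only
    // when Occupied is set, so default-initialized slots never match.
    internal static bool TryGet((bool Occupied, int Key, int Value)[] table, int key, out int value)
    {
        ref readonly var slot = ref table[(key & 0x7FFFFFFF) % table.Length];
        if (slot.Occupied && slot.Key == key)
        {
            value = slot.Value;
            return true;
        }
        value = 0;
        return false;
    }
}

The owning cache keeps all of its sizing and flush state, and code that merely holds a reference to the table calls the static read path directly.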

@jakobbotsch
Member

jakobbotsch commented Jul 22, 2023

@jakobbotsch - do we have the same reasons here as in #85901 (comment), or can something be tweaked to convince the JIT to tail-call in GVMLookupForSlot?

Yes, I think so.

It's hard to work around this particular limitation -- as soon as we end up with ldloca for a local or inlined local, we end up hitting it. You could try various patterns to avoid this, but I see many locals in the jit disasm output marked with ld-addr-op due to a bunch of inlining, so it's likely not easy.
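For readers who have not hit this before, a rough stand-alone illustration of the limitation (my own example, not the GVMLookupForSlot code): once a local's address is taken, for instance by passing it by ref, the JIT treats the local as address-exposed and conservatively will not turn a call in tail position into a fast tail call.

using System;
using System.Runtime.CompilerServices;

internal static class TailCallIllustration
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool TryFastPath(ref int scratch, int key)
    {
        scratch = key;      // pretend to probe a cache through the by-ref scratch slot
        return false;       // always miss, to force the slow path
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int SlowPath(int key) => key * 2;

    private static int Lookup(int key)
    {
        int scratch = 0;
        if (TryFastPath(ref scratch, key))  // 'ref scratch' emits ldloca: the local is now address-exposed
            return scratch;

        // A call in tail position, but with an address-exposed local in the frame the
        // JIT conservatively emits a regular call rather than a fast tail call.
        return SlowPath(key);
    }

    private static void Main() => Console.WriteLine(Lookup(21));  // prints 42
}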

@VSadov
Member Author

VSadov commented Jul 22, 2023

as soon as we end up with ldloca for a local or inlined local, we end up hitting it

I see. That seems hard to avoid when working with structs.

_auxResult = auxResult;
}
}

public static unsafe void ActivatorCreateInstanceAny(ref object ptrToData, IntPtr pEETypePtr)
Member

This method looks like dead code. Delete it?

Member Author

It does seem unused.

Member

@jkotas jkotas left a comment

Thanks!

@VSadov
Member Author

VSadov commented Jul 24, 2023

Thanks!

@VSadov VSadov merged commit b06b325 into dotnet:main Jul 24, 2023
169 checks passed
@VSadov VSadov deleted the castcacheImpl branch July 24, 2023 13:14
@ghost ghost locked as resolved and limited conversation to collaborators Aug 23, 2023
This pull request closes: [NativeAOT] consider using casting cache in GVMLookupForSlot scenario (#85927)
3 participants