
[NativeAOT] A few scalability fixes for the fat Lock #88633

Merged
merged 1 commit into dotnet:main on Aug 2, 2023

Conversation

@VSadov VSadov (Member) commented Jul 10, 2023

Recent investigations into productizing the Lock type revealed a few issues in the "fat" lock implementation on NativeAOT that negatively impact scalability under contention.

This PR addresses the following issues:

  1. On the fast path we assume that the lock is completely free and attempt an "Unlocked" -> "Locked" transition (a sketch of the load-before-CAS alternative follows this list).

A blind CAS saves one load of the state before the CAS, in the hope of saving on bus interactions. That does not seem to make any difference on current hardware.
The downside is that we cannot make a quick "unfair" one-shot acquire when a waiter is present (since the lock state is not a clean "Unlocked"), and thus the initial attempt is wasted. Preventing failures to acquire/release on the quick path appears to be more impactful than avoiding a load.

  2. Missing backoff when retrying the acquire.

The spin loop uses backoff when interlocked operations fail, but on the most common path, when we simply retry the acquire, there is only a minimal pause. As a result we retry too aggressively, which is unproductive. It becomes even more expensive when other threads keep modifying the lock state: retrying threads read many modified states, and each read causes a cache miss plus unnecessary sharing of state that will soon need to be invalidated.
We need to pace the retries with backoff.

  3. Locks generally fall into two kinds: those that benefit from spinning and those that do not. We try to detect whether, for a given lock instance, spinning or blocking is preferable and adjust the spin limit accordingly. However, the detection logic has flaws.

As the lock becomes more crowded, the cost of spinning quickly goes up because of cache misses and more threads wasting more cycles. Ideally the spin limit should be dialed inversely to the cost of spinning.
Instead, we assume that if we acquired the lock while spinning, the spin limit should be increased.

Such logic is not very sensitive to how crowded the lock is. What is worse, it is self-fulfilling. We may acquire the lock after long and expensive spinning with many cache misses and failed interlocked operations (and not only on the current thread), yet we conclude that we "succeeded" and everyone should spin even more, thus making the situation worse, while also making spinning more likely to "succeed" in the future. A crowded lock can degenerate into an expensive spinlock and stay in that mode, while a better choice would be the opposite: reduce crowdedness by reducing the spin limit.

It appears that detecting whether the lock changed ownership while we were spinning is a much better signal for dialing the spin limit.
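
To make the first point concrete, below is a minimal sketch of a "load before CAS" acquire attempt. The class and member names (LockSketch, _state, Locked, TryAcquireOneShot) are illustrative only and are not the identifiers used by the runtime's Lock.

using System.Threading;

// Minimal sketch of the "load before CAS" fast path described in point 1.
// Illustrative names, not the actual fields used by the runtime's Lock.
internal sealed class LockSketch
{
    // low bit: lock is held; the remaining bits may carry waiter counts etc.
    private const int Locked = 1;
    private int _state;

    internal bool TryAcquireOneShot()
    {
        // read the state first instead of blindly CAS-ing "Unlocked" -> "Locked"
        int state = _state;
        if ((state & Locked) == 0)
        {
            // the lock looks free: set only the Locked bit and preserve any waiter bits,
            // which permits an "unfair" one-shot acquire even while waiters are present
            return Interlocked.CompareExchange(ref _state, state | Locked, state) == state;
        }

        // the lock looks held; the caller falls back to the slow path (spin with backoff, then block)
        return false;
    }
}

Reading the state first costs one extra load, but it lets the one-shot attempt succeed even when waiter bits are set, which is the trade-off described above.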

@ghost ghost commented Jul 10, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr

Milestone: -

@@ -25,13 +25,7 @@ private static void MonitorEnter(object obj, ref bool lockTaken)
ObjectHeader.GetLockObject(obj) :
SyncTable.GetLockObject(resultOrIndex);

if (lck.TryAcquire(0))
@VSadov VSadov (Member, Author) Jul 10, 2023

This acquire attempt is unnecessary and is also made too soon. It was probably here to pick up "unusual" cases like recursive locking, because TryAcquire(0) would defer internally to the slow path for one iteration. Calling the slow path directly is simpler, though, and just as good.

return;
}

Monitor.TryAcquireContended(lck, obj, Timeout.Infinite);
@VSadov VSadov (Member, Author) Jul 10, 2023

I've renamed TryAcquireContended -> TryAcquireSlow. The old name was deceiving: this method is not just for handling contention, it handles all possible scenarios, while the initial quick acquire attempts can handle only a few common cases.

@VSadov VSadov (Member, Author) commented Jul 10, 2023

A simple benchmark to check the lock response to growing contention could be the following:

using System.Diagnostics;

namespace ConsoleApp12
{
    internal class Program
    {
        static Program fatLock = new Program();
        static Program thinLock = new Program();

        private const int iters = 1000000;

        static void Main(string[] args)
        {
            fatLock = new Program();
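            // taking the hash code first is presumably what forces this object onto the "fat" lock path when it is locked below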
            fatLock.GetHashCode();
            for (; ; )
            {
                for (int i = 0; i < 7; i++) 
                {
                    int thrCount = 1 << i;
                    System.Console.WriteLine("Threads:" + thrCount);

                    for (int j = 0; j < 4; j++)
                    {
                        System.Console.Write("Fat Lock: ");
                        RunMany(() => FatLock(thrCount), thrCount);

                        //System.Console.Write("Thin Lock :");
                        //thinLock = new Program();
                        //RunMany(() => ThinLock(thrCount), thrCount);
                    }
                }

                System.Console.WriteLine();   
            }
        }

        static void RunMany(Action f, int threadCount)
        {
            Thread[] threads = new Thread[threadCount];
            bool start = false;

            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(
                    () => 
                    {
                        while (!start) Thread.SpinWait(1);
                        f();
                    } 
                );

                threads[i].Start();
            }
             
            Thread.Sleep(10);
            start = true;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i].Join();
            }
            
            System.Console.WriteLine("Ops per msec: " + iters / sw.ElapsedMilliseconds);
        }

        static int sharedCounter = 0;

        private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);

        private static int ComputeSomething(Random random)
        {
            int delay = random.Next(4, 10);
            return Fib(delay);
        }

        static void FatLock(int thrCount)
        { 
            Random random = new Random();
            for (int i = 0; i < iters / thrCount; i++)
            {
                // do some computation
                int value = ComputeSomething(random);

                lock (fatLock)
                {
                    // update shared state
                    sharedCounter+= value;
                }
            }
        }
    }
}

@VSadov VSadov (Member, Author) commented Jul 10, 2023

Here is the output on an x64 machine with 32 logical cores:
(more is better)

=== Before the fix

Threads:1
Fat Lock: Ops per msec: 16666
Fat Lock: Ops per msec: 16666
Fat Lock: Ops per msec: 16666
Fat Lock: Ops per msec: 16949
Threads:2
Fat Lock: Ops per msec: 27777
Fat Lock: Ops per msec: 27777
Fat Lock: Ops per msec: 27777
Fat Lock: Ops per msec: 27777
Threads:4
Fat Lock: Ops per msec: 9708
Fat Lock: Ops per msec: 9803
Fat Lock: Ops per msec: 9615
Fat Lock: Ops per msec: 9523
Threads:8
Fat Lock: Ops per msec: 4000
Fat Lock: Ops per msec: 4016
Fat Lock: Ops per msec: 3906
Fat Lock: Ops per msec: 3831
Threads:16
Fat Lock: Ops per msec: 2341
Fat Lock: Ops per msec: 2341
Fat Lock: Ops per msec: 2336
Fat Lock: Ops per msec: 2331
Threads:32
Fat Lock: Ops per msec: 2298
Fat Lock: Ops per msec: 2304
Fat Lock: Ops per msec: 2288
Fat Lock: Ops per msec: 2283
Threads:64
Fat Lock: Ops per msec: 2293
Fat Lock: Ops per msec: 2293
Fat Lock: Ops per msec: 2283
Fat Lock: Ops per msec: 2192

=== After the fix

Threads:1
Fat Lock: Ops per msec: 16949
Fat Lock: Ops per msec: 16949
Fat Lock: Ops per msec: 16666
Fat Lock: Ops per msec: 16949
Threads:2   (for a pure "2 threads in the entire system" case the best strategy is infinite spinning, so being a bit slower here is expected)
Fat Lock: Ops per msec: 22222  
Fat Lock: Ops per msec: 22727
Fat Lock: Ops per msec: 23255
Fat Lock: Ops per msec: 22727
Threads:4
Fat Lock: Ops per msec: 14492
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14492
Threads:8
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14492
Threads:16
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14285
Threads:32
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084
Threads:64
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084
Fat Lock: Ops per msec: 14084

@VSadov VSadov (Member, Author) commented Jul 10, 2023

For reference, here is the same benchmark with the native CoreCLR lock.
It looks like it has a similar problem with excessive spinning under contention. Perhaps worth making a similar fix (separately).

=== CoreCLR

Threads:1
Ops per msecond: 12987
Ops per msecond: 13157
Ops per msecond: 13698
Ops per msecond: 16949
Threads:2
Ops per msecond: 24390
Ops per msecond: 23809
Ops per msecond: 24390
Ops per msecond: 24390
Threads:4
Ops per msecond: 15151
Ops per msecond: 14084
Ops per msecond: 14084
Ops per msecond: 13698
Threads:8
Ops per msecond: 8064
Ops per msecond: 9090
Ops per msecond: 8928
Ops per msecond: 8547
Threads:16
Ops per msecond: 7812
Ops per msecond: 8620
Ops per msecond: 8000
Ops per msecond: 8695
Threads:32
Ops per msecond: 8474
Ops per msecond: 8547
Ops per msecond: 8064
Ops per msecond: 8196
Threads:64
Ops per msecond: 8771
Ops per msecond: 8620
Ops per msecond: 8403
Ops per msecond: 8620

@VSadov VSadov (Member, Author) commented Jul 12, 2023

/azp run runtime-extra-platforms

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@VSadov VSadov (Member, Author) commented Jul 19, 2023

Rebased onto recent main, to keep up with changes.

@VSadov VSadov (Member, Author) commented Jul 24, 2023

Rebased onto main, to keep up to date.

@VSadov VSadov (Member, Author) commented Jul 30, 2023

Rebased onto main, to keep up to date.

@VSadov VSadov requested a review from jkotas July 30, 2023 15:09
@jkotas jkotas (Member) commented Jul 30, 2023

@kouvel has deep understanding of this problem space and I believe that you have been discussing these issues together. I think @kouvel is the right person to review and sign-off on these changes.

@kouvel kouvel (Member) commented Jul 31, 2023

I'm taking a look, will try to get back soon

Comment on lines +270 to +288
// now we can estimate how busy the lock is and adjust spinning accordingly
ushort spinLimit = _spinLimit;
if (ownerChanged != 0)
{
    // The lock has changed ownership while we were trying to acquire it.
    // It is a signal that we might want to spin less next time.
    // Pursuing a lock that is being "stolen" by other threads is inefficient
    // due to cache misses and unnecessary sharing of state that keeps invalidating.
    if (spinLimit > MinSpinLimit)
    {
        _spinLimit = (ushort)(spinLimit - 1);
    }
}
else if (spinLimit < MaxSpinLimit && iteration > spinLimit / 2)
{
    // we used more than 50% of allowed iterations, but the lock does not look very contested,
    // we can allow a bit more spinning.
    _spinLimit = (ushort)(spinLimit + 1);
}
kouvel (Member)
A useful aspect of a spin-tuning heuristic would be to avoid spin-waiting when it just uses CPU time but does not help to avoid context switching. I'm not sure that this is addressed by the spin-tuning heuristic in this change. I suggest creating a test case where spin-waiting is wasted, and comparing the CPU time under profiling with and without spin-waiting. Ideally, a self-tuning spin-wait heuristic would show roughly the same CPU time usage as disabling spin-waiting.
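
A minimal sketch of the kind of test case being suggested is below (my illustration, not code from the PR): the lock is held far longer than a block/wake round trip, so any spin-waiting is wasted CPU time, and the process CPU time can be compared with spin-waiting enabled versus disabled.

using System;
using System.Diagnostics;
using System.Threading;

internal class WastedSpinTest
{
    private static readonly object lockObj = new object();

    private static void Main()
    {
        // as in the benchmarks above, taking the hash code first steers locking toward the fat lock
        lockObj.GetHashCode();

        const int threadCount = 8;
        const int holdsPerThread = 200;

        var threads = new Thread[threadCount];
        var sw = Stopwatch.StartNew();

        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < holdsPerThread; j++)
                {
                    lock (lockObj)
                    {
                        // hold the lock much longer than a block/wake round trip,
                        // so spinning by the other threads cannot help
                        Thread.Sleep(1);
                    }
                }
            });
            threads[i].Start();
        }

        foreach (var t in threads) t.Join();

        Console.WriteLine($"Wall time: {sw.ElapsedMilliseconds} ms, " +
            $"CPU time: {Process.GetCurrentProcess().TotalProcessorTime.TotalMilliseconds} ms");
    }
}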

@VSadov VSadov (Member, Author) Aug 2, 2023

As I understand it, the concern is about a lock that is typically long-held.
If the lock is held for much longer than it takes to block/awaken a thread, then spinning is useless.

Note that even if a thread decides to block right away, it will still pay thousands of cycles for blocking/waking. To a first approximation, if the maximum spinning we allow does not cost much more than blocking/awaking, then the overall overhead even with the max spin should not be much worse than ideal.
Also, since such a long-held lock will typically be acquired by an awoken waiter without any spinning, it would not raise the spin limit, so we will likely stay at the initial minimum spin setting.

We may have an occasional new thread arrive in the precise time window around a lock release, spin, and acquire, and thus cause the spin count to slowly drift up and reach the max. But it is also possible for such "new" threads to contest the lock with an awoken waiter and reduce the spin.

Since the lock is long-held, there cannot be many spinners. If the lock is very popular, eventually all the interested threads will block.

I think the possibility of really bad behavior is not high, but it may be possible to further improve the heuristics for long-held locks. It could be interesting to construct some adversarial scenarios to see if we can get into trouble intentionally, and to see how likely and how bad it can get.

@kouvel kouvel (Member) Aug 4, 2023

> Note that even if a thread decides to block right away, it will still pay thousands of cycles for blocking/waking. To a first approximation, if the maximum spinning we allow does not cost much more than blocking/awaking, then the overall overhead even with the max spin should not be much worse than ideal.

It seems like the max spin count would use a not-insignificant amount of CPU time relative to context switching, though I haven't measured with this heuristic.

> Also, since such a long-held lock will typically be acquired by an awoken waiter without any spinning, it would not raise the spin limit, so we will likely stay at the initial minimum spin setting.

True, but if some other less-frequent lock contention on the same lock happens to increase the spin count, the spin-tuning would not tune down the spin-waiting for a more-frequent lock contention that doesn't benefit from spin-waiting. It may not be a significant issue at the moment since the spin-tuning heuristic seems to favor less spin-waiting anyway.

@@ -169,13 +191,14 @@ private static unsafe void ExponentialBackoff(uint iteration)
uint rand = ((uint)&iteration + iteration) * 2654435769u;
// set the highmost bit to ensure minimum number of spins is exponentialy increasing
// that is in case some stack location results in a sequence of very low spin counts
rand |= (1 << 32);
// it basically gurantees that we spin at least 1, 2, 4, 8, 16, times, and so on
rand |= (1u << 31);
kouvel (Member)

Not relevant to this particular change, but why is this additional randomization necessary? Given that the upper bit is set, it seems the randomization effect would be minuscule, especially at higher iterations. To me it doesn't seem like the added complication would be necessary.

@VSadov VSadov (Member, Author) Aug 2, 2023

> Given that the upper bit is set, it seems the randomization effect would be minuscule, especially at higher iterations.

With the upper bit set and the other bits random, the range of possible waits should still be ~50% of the minimum for the given iteration.

The cost of this "randomization" is basically free next to the cost of occupying the core for even one spinwait cycle. It is cheap insurance against threads arriving at the same time or synchronizing with each other via the effects of cache misses.

Some scenarios/environments may not need this, and those that do may need it less after this PR, since we are trying to ensure that we do not see too much contentious spinning in the first place.
It is hard to prove that there would be no benefit, though, since this code will run on a very wide range of hardware. So nearly free insurance still seems good to have.
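
To make the shape of the backoff concrete, here is a rough sketch of a jittered exponential backoff built around the quoted lines. This is a reconstruction for illustration only; the method body, the Thread.SpinWait call, and the helper/constant names are assumptions, apart from the 2^10 cap that is discussed further down in this thread.

using System;
using System.Threading;

internal static class BackoffSketch
{
    // the cap on backoff bits discussed later in this thread (2^10 spinwaits)
    private const int MaxExponentialBackoffBits = 10;

    // Rough sketch only; the real helper lives in the runtime and may differ in details.
    // (Compile with unsafe enabled for the stack-address trick.)
    internal static unsafe void ExponentialBackoff(uint iteration)
    {
        if (iteration == 0)
            return;

        // cheap jitter: hash a stack address together with the iteration (Fibonacci hashing),
        // so threads that arrive together do not keep retrying in lockstep
        uint rand = ((uint)&iteration + iteration) * 2654435769u;

        // force the top bit so the minimum wait still doubles on every iteration: 1, 2, 4, 8, ...
        rand |= (1u << 31);

        // keep only the top "iteration" bits (capped), giving a wait in [2^(n-1), 2^n)
        int bits = (int)Math.Min(iteration, (uint)MaxExponentialBackoffBits);
        Thread.SpinWait((int)(rand >> (32 - bits)));
    }
}

Forcing the top bit keeps the doubling guarantee, while the remaining random bits spread simultaneous arrivals across roughly a 2x range of waits.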

kouvel (Member)

I see, it seems the spin-wait backoff can yield fairly large gaps between checks at higher iterations, whereas CoreCLR's monitor's spin-wait backoff maxes out fairly quickly in favor of reducing latency. I guess a small amount of randomization may not hurt, and as you said it would probably be difficult to prove or disprove if it actually helps.

// We will use a pause that doubles up on every iteration. It will not be more than 2x worse
// than the ideal guess, while minimizing the number of retries.
// We will allow pauses up to 64~128 spinwaits, or more if there are collisions.
ExponentialBackoff(Math.Min(iteration, 6) + collisions);
kouvel (Member)

It seems the added collisions counter would make this exponential backoff a lot more than power-of-2-exponential, and perhaps fairly quickly depending on the situation. Is there a reason for this change?

@VSadov VSadov (Member, Author) Aug 2, 2023

There are two factors that inform the backoff here: iteration and collisions.
The retry period is normally driven by iterations. As we spin more, we reduce the rate of reads, as the comment explains. The goal is to acquire the lock within 2x of the optimal wait if no one else wants it, while avoiding possible interference with other threads acquiring/releasing the lock. This is a power-of-2 exponential backoff with a limit.

When we see collisions (i.e. our interlocked operations start failing), it is a signal that we may still have more competition for the lock than we would like, so we back off more.
Collisions should be relatively rare; they are basically a "panic mode" to resolve unexpected bursts of activity.
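
For illustration, here is a rough sketch of how the two counters could interact in a spin loop of this shape. The structure and names are illustrative, not the runtime's actual TryAcquireSlow, and the Backoff stub stands in for the jittered helper sketched earlier in this thread.

using System;
using System.Threading;

internal sealed class SpinLoopSketch
{
    private const int Locked = 1;
    private int _state;
    private ushort _spinLimit = 16;   // illustrative starting limit

    // stand-in for the jittered ExponentialBackoff sketched earlier in this thread
    private static void Backoff(uint n) => Thread.SpinWait(1 << (int)Math.Min(n, 10u));

    internal bool TrySpinAcquire()
    {
        uint collisions = 0;
        for (uint iteration = 0; iteration < _spinLimit; iteration++)
        {
            int state = _state;
            if ((state & Locked) == 0)
            {
                if (Interlocked.CompareExchange(ref _state, state | Locked, state) == state)
                    return true;                   // acquired

                collisions++;                      // a failed interlocked op: unexpected competition
            }

            // normal pacing doubles with the iteration (capped at 64~128 spinwaits);
            // collisions push the backoff further, as in the quoted line above
            Backoff(Math.Min(iteration, 6) + collisions);
        }

        return false;                              // give up spinning; the caller blocks
    }
}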

kouvel (Member)

Should there be a cap on the collisions too, similar to the iteration cap, so as to avoid perhaps rare but potentially very large backoffs?

VSadov (Member, Author)

ExponentialBackoff has a total limit controlled by MaxExponentialBackoffBits.
It would be extremely unlikely to have so many collisions, but we have a limit just in case.

kouvel (Member)

Unless I'm missing something, 10 bits seems like a lot to me as a cap for one spin iteration. If we assume SpinWait(1) takes on the order of 10 ns, SpinWait((1 << 10) / 2) would take on the order of 5 us, which is quite a lot for one spin-wait iteration, despite it being rare.

VSadov (Member, Author)

Because the pauses double on every iteration, by the time a single pause reaches 5 us we have already spent roughly 5 us spinning in total (the earlier pauses sum to about the same amount). Basically, the horribly long pauses will take a horribly long time to reach. The math kind of tells us that the limit is unnecessary, yet for many people, me included, an unbounded spin would trigger an alarm, so we need some limit. Because, who knows, we may not be considering everything...

There is not much science behind this number though. I wanted to pick something that does not stand in the way in a typical scenario, yet protects from degenerate cases (i.e. not spinning for seconds). Thus I picked 2^10. I did not see it reached in experiments (not even close), yet it feels safer than 2^16, another number I could have picked.

If there is a way to better estimate this number, we could change it.

It would be interesting to get access to a 1024-core machine, or however high the core count goes these days, and see how high the collision counts could possibly get when intentionally trying for it (also to see if the randomization helps).
I think experimenting on 32 cores will mostly tell us that this number does not matter.

@kouvel kouvel (Member) commented Aug 2, 2023

The above test seems to show some negative processor cache effects when lock ownership transitions very quickly between more than a few threads. It seems the best way to get high throughput there is to have only 2-3 threads active and to have all other threads wait. Spin-waiting less appears to have a similar effect. This would benefit a scenario where each thread is re-acquiring the lock a great many times in very quick succession, quickly enough that processor cache effects start coming into play for throughput measurements, which doesn't seem very realistic to me. I'm not sure that tuning for this test case is worthwhile. If it can be done without tradeoffs, then great, but otherwise for scaling on this test I think it's good enough if the throughput stabilizes somewhere reasonable with increasing threads.

The spin-tuning heuristic in this change seems to favor less spin-waiting even when more spin-waiting is beneficial. Consider the following test case, which measures how quickly a burst of lock contentions on a number of threads is resolved, in terms of throughput of bursts. I think it's a reasonably realistic test case, and can occur for instance periodically with timer ticks.

using System;
using System.Diagnostics;
using System.Threading;

namespace ConsoleApp12
{
    internal class Program
    {
        static Program fatLock = new Program();

        private static bool stop;
        private static int iters;

        static void Main(string[] args)
        {
            fatLock = new Program();
            fatLock.GetHashCode();
            for (int thrCount = 1; thrCount <= Environment.ProcessorCount; thrCount *= 2)
            {
                Console.WriteLine("Threads:" + thrCount);

                for (int j = 0; j < 4; j++)
                {
                    Console.Write("Fat Lock: ");
                    RunMany(() => FatLock(thrCount), thrCount);
                }
            }
        }

        static void RunMany(Action f, int threadCount)
        {
            Thread[] threads = new Thread[threadCount];
            bool start = false;
            stop = false;
            iters = 0;
            locksTaken = 0;

            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(
                    () =>
                    {
                        while (!start) Thread.SpinWait(1);
                        f();
                    }
                );

                threads[i].Start();
            }

            Thread.Sleep(10);
            start = true;

            Stopwatch sw = Stopwatch.StartNew();
            Thread.Sleep(500);
            int completedIters = iters;
            sw.Stop();
            stop = true;
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i].Join();
            }

            Console.WriteLine($"Ops per msec: {completedIters / sw.ElapsedMilliseconds,7}");
        }

        private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);

        private static int ComputeSomething(Random random)
        {
            int delay = random.Next(4, 10);
            return Fib(delay);
        }

        static int locksTaken;

        static void FatLock(int thrCount)
        {
            Random random = new Random();
            while (!stop)
            {
                int locksTakenValue;
                lock (fatLock)
                {
                    // update shared state
                    iters++;
                    locksTakenValue = ++locksTaken;

                    // do some computation
                    int value = ComputeSomething(random);
                }

                if (locksTakenValue == thrCount)
                {
                    // this is the last thread to take the lock in this burst
                    locksTaken = -thrCount + 1;
                }
                else
                {
                    // wait for other threads to take the lock
                    while (!stop && locksTaken > 0) Thread.SpinWait(1);
                    Interlocked.Increment(ref locksTaken);
                }

                // wait for other threads to finish waiting above
                while (!stop && locksTaken < 0) Thread.SpinWait(1);
            }
        }
    }
}

Below are some results on my 16-core 32-proc AMD processor on Windows.

CoreCLR:

Threads:1
Fat Lock: Ops per msec:   20137
Fat Lock: Ops per msec:   23662
Fat Lock: Ops per msec:   23828
Fat Lock: Ops per msec:   23645
Threads:2
Fat Lock: Ops per msec:    8960
Fat Lock: Ops per msec:    8864
Fat Lock: Ops per msec:    8903
Fat Lock: Ops per msec:    8851
Threads:4
Fat Lock: Ops per msec:    5514
Fat Lock: Ops per msec:    5547
Fat Lock: Ops per msec:    5526
Fat Lock: Ops per msec:    5570
Threads:8
Fat Lock: Ops per msec:    3688
Fat Lock: Ops per msec:    3697
Fat Lock: Ops per msec:    3353
Fat Lock: Ops per msec:    3807
Threads:16
Fat Lock: Ops per msec:     133
Fat Lock: Ops per msec:     135
Fat Lock: Ops per msec:     134
Fat Lock: Ops per msec:     135
Threads:32
Fat Lock: Ops per msec:      63
Fat Lock: Ops per msec:      66
Fat Lock: Ops per msec:      67
Fat Lock: Ops per msec:      50

NativeAOT after this change:

Threads:1
Fat Lock: Ops per msec:   16550
Fat Lock: Ops per msec:   16522
Fat Lock: Ops per msec:   16434
Fat Lock: Ops per msec:   16420
Threads:2
Fat Lock: Ops per msec:    7248
Fat Lock: Ops per msec:    7441
Fat Lock: Ops per msec:    7349
Fat Lock: Ops per msec:    7309
Threads:4
Fat Lock: Ops per msec:    2438
Fat Lock: Ops per msec:    2565
Fat Lock: Ops per msec:    2582
Fat Lock: Ops per msec:    2716
Threads:8
Fat Lock: Ops per msec:     300
Fat Lock: Ops per msec:     329
Fat Lock: Ops per msec:     322
Fat Lock: Ops per msec:     331
Threads:16
Fat Lock: Ops per msec:      94
Fat Lock: Ops per msec:      91
Fat Lock: Ops per msec:      93
Fat Lock: Ops per msec:      92
Threads:32
Fat Lock: Ops per msec:      69
Fat Lock: Ops per msec:      48
Fat Lock: Ops per msec:      69
Fat Lock: Ops per msec:      68

NativeAOT before this change:

Threads:1
Fat Lock: Ops per msec:   16455
Fat Lock: Ops per msec:   16509
Fat Lock: Ops per msec:   16438
Fat Lock: Ops per msec:   16416
Threads:2
Fat Lock: Ops per msec:    7763
Fat Lock: Ops per msec:    7885
Fat Lock: Ops per msec:    7897
Fat Lock: Ops per msec:    7879
Threads:4
Fat Lock: Ops per msec:    4888
Fat Lock: Ops per msec:    4819
Fat Lock: Ops per msec:    4825
Fat Lock: Ops per msec:    4742
Threads:8
Fat Lock: Ops per msec:    3074
Fat Lock: Ops per msec:    3010
Fat Lock: Ops per msec:    3055
Fat Lock: Ops per msec:    3017
Threads:16
Fat Lock: Ops per msec:    1468
Fat Lock: Ops per msec:    1459
Fat Lock: Ops per msec:    1465
Fat Lock: Ops per msec:    1449
Threads:32
Fat Lock: Ops per msec:      83
Fat Lock: Ops per msec:      56
Fat Lock: Ops per msec:      55
Fat Lock: Ops per msec:      57

In this test, allowing more spin-waiting threads avoids excessive context switching, up to a point after which it doesn't help much unless the spin-wait duration is increased, which would have other negative tradeoffs. Granted, this is also only a microbenchmark and I wouldn't suggest tuning specifically for it, but it shows how spin-waiting can reduce context-switching. After this change in NativeAOT, the context-switching overhead seems to start showing up with fewer threads involved in the contention (fairly early, around 4), indicating that the spin-tuning heuristic is favoring less spin-waiting even when more is beneficial.

@kouvel kouvel (Member) left a comment

Despite my other comments, this change seems to have fixed a number of other issues in the previous implementation in NativeAOT, and it seems likely that it would work better than NativeAOT's lock implementation in .NET 7, so it's probably a good improvement to take for .NET 8. For .NET 9, I hope that we can consolidate the implementations with #87672, which is mostly a port of CoreCLR's lock implementation, an implementation that we know to have been working without any significant issues for quite some time, and any further improvements can be made on top of that.

@VSadov VSadov (Member, Author) commented Aug 2, 2023

> For .NET 9, I hope that we can consolidate the implementations with #87672

Completely agree on that!

@VSadov VSadov (Member, Author) commented Aug 2, 2023

Thanks!!!

@VSadov VSadov merged commit c5f9806 into dotnet:main Aug 2, 2023
106 checks passed
@VSadov VSadov deleted the lckFixes branch August 2, 2023 22:58
@ghost ghost locked as resolved and limited conversation to collaborators Sep 4, 2023