IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

Dhugal · 2023-08-30T20:17:22Z

Is there an existing issue for this?

I have searched the existing issues

Describe the bug

The asp.net core data protection API's appear to fail intermittently in some load-balanced scenarios.
Specifically key generation across instances does not involve locking so keys generated on different instances are not always shared correctly.

In our case, we're using the asp.net core data protection APIs in a load-balanced environment (Azure App Service with scaled-out instances running asp.net core 7 app).

Occasionally we're hitting an issue with multiple instances starting simultaneously and doing the following:

Read the existing Keys - there may be none or the existing keys may have expired.
Each instance creates a new key and adds it to the store - and then reads the store again.
The keys added by the latter instances are not available on earlier ones.

As we're using these for auth cookies, this means that authenticated users hitting some instances are not validated and therefore have to log back into the application.

The following shows an example what I believe is happening in a simple scenario of 2 instances:

Instance A	Instance B
Read Keys (No Valid Keys Found)
	Read Keys (No Valid Keys Found)
Create Key A
Read Keys
	Create Key B
	Read Keys
Key A loaded	Keys A & B loaded

This means a user authenticated on Instance B may get issued a cookie using Key B and their subsequent requests to Instance A (via a load balancer) fail authentication (but requests to B work).

We're trying to mitigate this by implementing a cross-instance lock (in our case we use a DistrubutedLock, backed by SQL Server).

The problem is that we don't appear to have a good place to lock the process of reading and writing the new key, as they're separate methods. We can wrap the call to read the keys or create a key in a lock but that doesn't help as we can still get the overlap shown above between these two steps.

What we'd like to be able to do is apply a lock around the process which both reads the keys and generates a new one if required, so that the next instance will always get the key generated by the first one to start and will therefore not generate another one.

Instance A	Instance B
Aquire Lock
Read Keys (No Valid Keys Found)	Attempt to Aquire Lock
Create Key A	Wait for lock
Read Keys	Wait for lock
Release Lock	Wait for lock
	Aquire Lock
	Read Keys
	Release Lock
Key A loaded	Key A loaded

The following section of the KeyRingProvider implements locking, which implies this scenario has been considered, but it doesn't cater for multiple instance scenarios:

https://github.com/dotnet/aspnetcore/blob/2dc991393c29a65df82efdb75e8467b7ace5bb74/src/DataProtection/DataProtection/src/KeyManagement/KeyRingProvider.cs#L180C17-L180C17

Expected Behavior

I believe this could be solved if the IKeyManager (and the default XmlKeyManager) provided a means to wrap/lock the process of reading the keys and determining if they're valid as well as generating a new one.

E.G. IKeyManager.ProcessingKeys (which calls GetAllKeys and, when required, also calls CreateNewKey)

We could then override this and place a lock around the process.

Steps To Reproduce

No response

Exceptions (if any)

No response

.NET Version

7.0.400

Anything else?

No response

pacorreia · 2023-11-29T15:52:33Z

Any updates on this? Since August no reply from anyone

amcasey · 2024-01-23T21:35:47Z

Sorry we didn't respond sooner - this is one of a family of issues and more of the chatter has happened on others.

Are you seeing this only on initialization, or while the app is running too?

Initialization is tricky because Instance B doesn't necessarily want to block requests until Instance A is ready.
In steady state, the expected behavior is that, while A and B might simultaneously notice that a new key is required, they would each generate a key to be activated far enough into the future that all other instances should have picked it up before it begins to be used

Depending on how your orchestration works, one option might be to run with a single instance until the first request has been processed, at which point any new instances will be able to see the key generated by the first instance. (Obviously, this is a quick-and-dirty mitigation and not a real solution.)

We're working on a more general approach, which you can track via #52915 and #52916.

amcasey · 2024-01-23T21:36:50Z

Here's the issue describing how the steady-state refresh can fail: #52678

ghost · 2024-01-26T21:46:36Z

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

amcasey · 2024-10-04T17:42:12Z

I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and don't drop any.

dotnet-issue-labeler bot added the area-dataprotection Includes: DataProtection label Aug 30, 2023

Dhugal changed the title ~~IKeyManager / XMLManager - Fail intermittently in load balanced scenarios~~ IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios Aug 30, 2023

amcasey mentioned this issue Jan 24, 2024

Meta Issue - Data Protection keyring sync/creation race investigation #36157

Open

amcasey added this to the .NET 9 Planning milestone Jan 26, 2024

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024

wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024

dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024

wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 13, 2024

dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024

amcasey closed this as completed Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

Dhugal commented Aug 30, 2023 •

edited

Loading

pacorreia commented Nov 29, 2023

amcasey commented Jan 23, 2024

amcasey commented Jan 23, 2024

ghost commented Jan 26, 2024

amcasey commented Oct 4, 2024

IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

Comments

Dhugal commented Aug 30, 2023 • edited Loading

Is there an existing issue for this?

Describe the bug

Expected Behavior

Steps To Reproduce

Exceptions (if any)

.NET Version

Anything else?

pacorreia commented Nov 29, 2023

amcasey commented Jan 23, 2024

amcasey commented Jan 23, 2024

ghost commented Jan 26, 2024

amcasey commented Oct 4, 2024

Dhugal commented Aug 30, 2023 •

edited

Loading