Prevent persistent cache data races #402
Conversation
Can you describe the actual user scenario that got broken? i.e. how does this race manifest?

A few thoughts:

MSAL.NET chose a slightly more granular approach to locking. This ensures that we don't lock for the entire duration of the HTTP request.

Eventual consistency vs race issues

In MSAL.NET we experimented with disabling that lock. Unfortunately this seems to lead to a situation where a cache node (e.g. for Redis) contains too many tokens. @pmaytak is looking at this.

I think the fix looks fine, but it would be good to have a rough idea of the perf impact.
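As a rough Go illustration of that granular approach (tokenCache, doTokenRequest, and acquireToken are placeholders here, not MSAL's actual types), the lock covers only cache reads and writes and is never held across the HTTP call:

```go
package sketch

import (
	"context"
	"sync"
)

// tokenCache is a stand-in for the client's in-memory cache; the mutex guards
// only cache reads and writes, never the network round trip.
type tokenCache struct {
	mu     sync.RWMutex
	tokens map[string]string // key -> serialized token (illustrative only)
}

func newTokenCache() *tokenCache {
	return &tokenCache{tokens: make(map[string]string)}
}

func (c *tokenCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	tok, ok := c.tokens[key]
	return tok, ok
}

func (c *tokenCache) put(key, tok string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tokens[key] = tok
}

// doTokenRequest stands in for the STS call; in the real client this is where
// most of the time is spent, so it must run with no lock held.
func doTokenRequest(ctx context.Context, key string) (string, error) {
	return "token-for-" + key, nil
}

func acquireToken(ctx context.Context, c *tokenCache, key string) (string, error) {
	if tok, ok := c.get(key); ok { // short read lock
		return tok, nil
	}
	tok, err := doTokenRequest(ctx, key) // HTTP call runs unlocked
	if err != nil {
		return "", err
	}
	c.put(key, tok) // short write lock
	return tok, nil
}
```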
The race I described above causes unnecessary reauthentication because some tokens don't make it to the persistent cache and are lost the next time the client loads persisted data. Reauthentication could be acceptable if it were rare, but empirically I see data loss ~25% of the time with only 2 goroutines authenticating concurrently.

Benchmarks

The scenario is 10 goroutines concurrently authenticating and retrieving a cached token, with the cache persisted to a file. "Baseline" is the latest published version, i.e. none of the locking added in this PR. "Read lock" is this PR as it is now, with a read lock on the cache read path; "exclusive lock" takes the exclusive lock there instead.

It's interesting that this PR actually improves the benchmark despite adding locks. I believe this is because it also removes unnecessary writes; apparently those are more expensive in this scenario than the new locks.
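For reference, the shape of such a benchmark might look like the sketch below (intended for a _test.go file; authClient and newClient are hypothetical stand-ins rather than the module's exact API, and the file-backed cache setup is elided):

```go
package cache

import (
	"context"
	"sync"
	"testing"
)

// authClient is a hypothetical stand-in for an MSAL client wired to a
// file-backed persistent cache.
type authClient interface {
	AcquireTokenSilent(ctx context.Context) (string, error)
	AcquireTokenByCredential(ctx context.Context) (string, error)
}

// benchmarkConcurrentAuth runs the scenario described above: several
// goroutines authenticate concurrently, preferring the cache and falling back
// to a full token request when the silent call misses.
func benchmarkConcurrentAuth(b *testing.B, newClient func() authClient, goroutines int) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		client := newClient()
		var wg sync.WaitGroup
		wg.Add(goroutines)
		for g := 0; g < goroutines; g++ {
			go func() {
				defer wg.Done()
				ctx := context.Background()
				if _, err := client.AcquireTokenSilent(ctx); err != nil {
					if _, err := client.AcquireTokenByCredential(ctx); err != nil {
						b.Error(err) // Error (not Fatal) is safe from a goroutine
					}
				}
			}()
		}
		wg.Wait()
	}
}
```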
@pmaytak - can we do something similar for MSAL.NET?
@chlowell - 30% perf improvement for cached token acquisition is massive! Ship it :) Are there any changes to the scenario where we rely just on MSAL's memory storage? (i.e. no token cache serialization)
Given that this PR improves the baseline, that hypothesis is reasonable, and I am also giving a Ship-It! :-) The following comments are mostly FYIs for open discussion.

MSAL .NET's approach was also adopted in MSAL Python and MSAL Java. The MSAL Go implementation in this PR is granular in the same way, as opposed to locking across the entire token acquisition HTTP request. So we are all good here.

Do the "read lock" and "exclusive lock" refer to an actual file lock? FWIW, the MSAL Python (EX) implementation used to acquire an exclusive lock during both token reads and token writes. A later improvement kept the lock only for token writes and removed the lock attempt on token reads; it simply assumes that a successful read (meaning the token cache file is valid JSON) means the data is clean. That also seems to work well (a rough Go sketch of this pattern appears below).

Out of curiosity, was the "empirical ~25% data loss with only 2 goroutines authenticating concurrently" a real-world "end user occasionally runs 2 instances of the same app manually" scenario, or an in-lab "spin up 2 instances and have them loop authentication repeatedly" scenario? ~25% data loss sounds higher than we would expect for the former.
They refer to whether the client takes a shared (read) or exclusive lock on its own in-process mutex while reading the cache, not to a file lock.
👍 I'll try this in the extensions module.
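For illustration, the write-only locking strategy described above might look roughly like this in Go (a hypothetical sketch of the pattern using a simple lock file, not MSAL Python's or this module's actual implementation):

```go
package sketch

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// writeCache persists cache data under a simple cross-process lock implemented
// with an O_EXCL lock file. Only writers take the lock.
func writeCache(path string, data []byte) error {
	lockPath := path + ".lock"
	for {
		f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
		if err == nil {
			f.Close()
			break
		}
		if !os.IsExist(err) {
			return err
		}
		time.Sleep(10 * time.Millisecond) // another process holds the lock; retry
	}
	defer os.Remove(lockPath) // release the lock

	// Write to a temp file and rename so readers never observe a partial file.
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// readCache skips locking entirely; a successful JSON parse is taken as
// evidence that the snapshot is consistent, mirroring the MSAL Python EX
// approach described above.
func readCache(path string) (map[string]json.RawMessage, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var contents map[string]json.RawMessage
	if err := json.Unmarshal(raw, &contents); err != nil {
		return nil, fmt.Errorf("cache file is not valid JSON: %w", err)
	}
	return contents, nil
}
```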
The test was a single process with two goroutines concurrently acquiring a token and then silently authenticating, once. This is a realistic simulation, though conservative, because Go programs commonly launch many goroutines. On my machine, after ~25% of test runs the external cache had only one token. The upshot is that at the lowest level of concurrency there was a significant chance the client would delete a token from its in-memory cache before writing it to the external cache. To be clear, the problem was in the client, not the external cache; an external cache can prevent concurrent reads and writes of its data but not the client destructively overwriting its own in-memory data.
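To make that interleaving concrete, here is a self-contained toy version of the problem (it models the behavior with plain maps rather than the module's real types; every individual map operation is locked, so there is no Go-level data race, yet a token can still be lost between writing it to memory and exporting memory to the persistent store):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a map with per-operation locking; it stands in for both the
// client's in-memory cache and the external persistent cache.
type store struct {
	mu   sync.Mutex
	data map[string]string
}

func (s *store) snapshot() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	cp := make(map[string]string, len(s.data))
	for k, v := range s.data {
		cp[k] = v
	}
	return cp
}

func (s *store) replace(m map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data = m
}

func (s *store) put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[k] = v
}

func main() {
	persistent := &store{data: map[string]string{}} // the file/Redis cache
	memory := &store{data: map[string]string{}}     // the client's in-memory cache

	var wg sync.WaitGroup
	for _, user := range []string{"alice", "bob"} {
		wg.Add(1)
		go func(user string) {
			defer wg.Done()
			memory.replace(persistent.snapshot()) // load persisted data, replacing memory
			memory.put(user, "token-for-"+user)   // authenticate (elided) and cache the token
			time.Sleep(time.Millisecond)          // widen the window before the export
			persistent.replace(memory.snapshot()) // export memory to the persistent cache
		}(user)
	}
	wg.Wait()

	// Often prints 1: one goroutine's load-and-replace discarded a token the
	// other had written to memory but not yet exported.
	fmt.Println("tokens persisted:", len(persistent.snapshot()))
}
```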
Concurrent authentication causes data loss when using a persistent cache because base.Client doesn't synchronize its access. The data flow of authentication looks like this:

1. Read the persistent cache, replacing all in-memory cache data
2. Authenticate
3. Write the new tokens to the in-memory cache
4. Export the in-memory cache to the persistent cache

When multiple goroutines authenticate concurrently, one may execute step 1 while another is between steps 3 and 4, deleting new data in memory before it's persisted. My solution in this PR is simply to lock around accessing the persistent cache.
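As a rough model of that locking scheme (illustrative types only, not base.Client's real fields; the split between a read lock on the silent path and an exclusive lock on the write path follows the "read lock" discussion earlier in the thread):

```go
package sketch

import "sync"

// externalCache stands in for the user-supplied persistence layer (file, Redis, ...).
type externalCache interface {
	Load() map[string]string
	Store(map[string]string)
}

type client struct {
	cacheMu sync.RWMutex      // serializes interaction with ext
	memMu   sync.Mutex        // guards the in-memory map itself
	mem     map[string]string // in-memory token cache
	ext     externalCache
}

// Silent path: step 1 (replace memory from the persistent cache) runs under a
// read lock, so it can't interleave with a writer that is between steps 3 and
// 4, while concurrent readers still don't block one another.
func (c *client) acquireTokenSilent(key string) (string, bool) {
	c.cacheMu.RLock()
	defer c.cacheMu.RUnlock()
	snapshot := c.ext.Load()
	c.memMu.Lock()
	defer c.memMu.Unlock()
	c.mem = snapshot
	tok, ok := c.mem[key]
	return tok, ok
}

// Token-writing path: steps 3 and 4 form a single critical section under the
// exclusive lock, so the new token reaches the persistent cache before any
// goroutine can replace the in-memory data.
func (c *client) cacheToken(key, tok string) {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	c.memMu.Lock()
	c.mem[key] = tok
	out := make(map[string]string, len(c.mem))
	for k, v := range c.mem {
		out[k] = v
	}
	c.memMu.Unlock()
	c.ext.Store(out)
}
```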
To prevent repetition, I simplified several methods by consolidating the (internal) interfaces for partitioned and flat caches. These interfaces had almost the same Read/Write API; the difference was an unnecessary parameter. While I was at it, I also eliminated unnecessary writes to persistence--methods that don't write the in-memory cache no longer export it to the persistent cache.
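As a rough illustration of that consolidation (the names below are placeholders inferred from this description, not the module's real internal identifiers):

```go
package cache

import "context"

// Before (hypothetical shapes): two near-identical internal interfaces, the
// partitioned one taking an extra key parameter on every call.
//
//	Read(ctx context.Context, partitionKey string) ([]byte, error)
//	Write(ctx context.Context, partitionKey string, data []byte) error
//
// After: a single interface both cache types implement, so the client's
// load/export plumbing is written once.
type accessor interface {
	Read(ctx context.Context) ([]byte, error)
	Write(ctx context.Context, data []byte) error
}
```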