Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failing test: Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry #45868

Closed
radical opened this issue Dec 9, 2020 · 14 comments · Fixed by #72821
Assignees
Labels
area-Extensions-Caching blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@radical
Copy link
Member

radical commented Dec 9, 2020

Frequency during 4/1-7/22:

  • 34 failures (incl. 1x office run) (since 7/1 - 1 failure every couple of days)

Failing on PR: #45849 on net6.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open, which IIUC, shouldn't be affected by the PR changes.

  Starting:    Microsoft.Extensions.Caching.Memory.Tests (parallel test collections = on, max threads = 2)
    Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry [FAIL]
      Assert.Equal() Failure
      Expected: 0
      Actual:   6
      Stack Trace:
        /_/src/libraries/Microsoft.Extensions.Caching.Memory/tests/CapacityTests.cs(292,0): at Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry()
  Finished:    Microsoft.Extensions.Caching.Memory.Tests
=== TEST EXECUTION SUMMARY ===
   Microsoft.Extensions.Caching.Memory.Tests  Total: 89, Errors: 0, Failed: 1, Skipped: 0, Time: 15.832s
~/work/D3CA0B78/w/AD02093F/e
----- end Wed Dec 9 20:49:15 UTC 2020 ----- exit code 1 ----------------------------------------------------------
Waiting a few seconds for any dump to be written..
Looking around for any Linux dump..
... found no dump in /home/helixbot/work/D3CA0B78/w/AD02093F/e
+ export _commandExitCode=0

build: https://dnceng.visualstudio.com/public/_build/results?buildId=914578&view=results

console: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45849-merge-bbac7bcbe2e74adca7/Microsoft.Extensions.Caching.Memory.Tests/console.e1720aa2.log?sv=2019-07-07&se=2020-12-29T20%3A30%3A56Z&sr=c&sp=rl&sig=f2gjogjMSfevE3okDjeWwFG4pv5G077HR4EMxg9k76o%3D

let failedTests = (methodName : string, includePR : bool, messageSubstr: string, includePassedOnRerun : bool) {
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').AzureDevOpsTests
    | where TestName contains methodName
    | where includePassedOnRerun or (Outcome == 'Failed')
    | extend startOfTestName = indexof_regex(TestName, @"[^.]+$")
    | extend Method = substring(TestName, startOfTestName)
    | extend Type = substring(TestName, 0, startOfTestName - 1)
    | project-away startOfTestName
    | where Method == methodName
    | where Message contains messageSubstr
    | distinct JobId, WorkItemId, Message, StackTrace, Method, Type, Arguments, Outcome
    | join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
        | where ((Branch == 'refs/heads/main') or (Branch == 'refs/heads/master') or (includePR and (Source startswith "pr/")))
        | where Type startswith "test/functional/cli/"
            and not(Properties contains "runtime-staging")
        | summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName) by JobId
        | project-rename JobType = Type) on JobId
    | extend PropertiesJson = parse_json(Properties)
    | extend OS = replace_regex(tostring(PropertiesJson.operatingSystem), @'\((.*)\).*|([^\(].*)', @'\1\2')
    | extend Runtime = iif(PropertiesJson.runtimeFlavor == "mono", "Mono", iif(PropertiesJson.DefinitionName contains "coreclr", "CoreCLR", ""))
    | extend TargetBranch = extractjson("$.['System.PullRequest.TargetBranch']", Properties)
    | extend Architecture = PropertiesJson.architecture
    | extend Scenario = iif(isempty(PropertiesJson.scenario), "--", PropertiesJson.scenario)
    | project-away PropertiesJson
};
failedTests('AddingReplacementExceedsCapacityRemovesOldEntry', true, '', true);
@radical radical added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Dec 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Extensions-Caching untriaged New issue has not been triaged by the area owner labels Dec 9, 2020
@ghost
Copy link

ghost commented Dec 9, 2020

Tagging subscribers to this area: @eerhardt, @maryamariyan
See info in area-owners.md if you want to be subscribed.

Issue Details

Failing on PR: #45849 on net6.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open, which IIUC, shouldn't be affected by the PR changes.

  Starting:    Microsoft.Extensions.Caching.Memory.Tests (parallel test collections = on, max threads = 2)
    Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry [FAIL]
      Assert.Equal() Failure
      Expected: 0
      Actual:   6
      Stack Trace:
        /_/src/libraries/Microsoft.Extensions.Caching.Memory/tests/CapacityTests.cs(292,0): at Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry()
  Finished:    Microsoft.Extensions.Caching.Memory.Tests
=== TEST EXECUTION SUMMARY ===
   Microsoft.Extensions.Caching.Memory.Tests  Total: 89, Errors: 0, Failed: 1, Skipped: 0, Time: 15.832s
~/work/D3CA0B78/w/AD02093F/e
----- end Wed Dec 9 20:49:15 UTC 2020 ----- exit code 1 ----------------------------------------------------------
Waiting a few seconds for any dump to be written..
Looking around for any Linux dump..
... found no dump in /home/helixbot/work/D3CA0B78/w/AD02093F/e
+ export _commandExitCode=0

build: https://dnceng.visualstudio.com/public/_build/results?buildId=914578&view=results

console: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45849-merge-bbac7bcbe2e74adca7/Microsoft.Extensions.Caching.Memory.Tests/console.e1720aa2.log?sv=2019-07-07&se=2020-12-29T20%3A30%3A56Z&sr=c&sp=rl&sig=f2gjogjMSfevE3okDjeWwFG4pv5G077HR4EMxg9k76o%3D

Author: radical
Assignees: -
Labels:

area-Extensions-Caching, blocking-clean-ci, untriaged

Milestone: -

@eerhardt
Copy link
Member

eerhardt commented Dec 9, 2020

@adamsitnik - any chance your recent changes to MemoryCache affected this?

@adamsitnik
Copy link
Member

@eerhardt I can't reproduce it on my machine. I've looked at the source code and Size is definitely not related to scope handling.

I'll try to reproduce it on my other machines

@adamsitnik
Copy link
Member

@radical how can I find out how many times this test has failed? #702 shows N/A in the Build Count column

@ChadNedzlek
Copy link
Member

ChadNedzlek commented Dec 11, 2020

It has never failed in master, it's only failed 3 times in PR as far as we have history (which is 3 months) in PRs #45079 and #44980 and then again in this PR.

@ChadNedzlek
Copy link
Member

The previous two failures are also on Nov 22, so quite some time in the past.

@adamsitnik
Copy link
Member

@ChadNedzlek thank you for providing all the details!

@eerhardt my first PR (#44797) got merged on the 17th of Nov, but it was a very low-risk change with just better inlining and I don't think that it could have introduced a bug. We should watch it closely and see if it reproduces more often (and in what config)

@maryamariyan maryamariyan removed the untriaged New issue has not been triaged by the area owner label Mar 11, 2021
@maryamariyan maryamariyan added this to the Future milestone Mar 11, 2021
@vonzshik
Copy link
Contributor

vonzshik commented Jun 1, 2022

It looks like there is a race with overcapacity compaction here

entry.SetExpired(EvictionReason.Capacity);
TriggerOvercapacityCompaction();
entry.InvokeEvictionCallbacks();
if (priorEntry != null)
{
coherentState.RemoveEntry(priorEntry, _options);
}

which runs on a threadpool

private void TriggerOvercapacityCompaction()
{
if (_logger.IsEnabled(LogLevel.Debug))
_logger.LogDebug("Overcapacity compaction triggered");
// Spawn background thread for compaction
ThreadPool.QueueUserWorkItem(s => ((MemoryCache)s!).OvercapacityCompaction(), this);
}

If the thread from the threadpool will remove the entry from the CoherentState._entries before the main thread, but get stalled just before updating the size, it should lead to this problem

internal void RemoveEntry(CacheEntry entry, MemoryCacheOptions options)
{
if (EntriesCollection.Remove(new KeyValuePair<object, CacheEntry>(entry.Key, entry)))
{
if (options.SizeLimit.HasValue)
{
Interlocked.Add(ref _cacheSize, -entry.Size);
}
entry.InvokeEvictionCallbacks();
}
}

@danmoseley
Copy link
Member

@vonzshik thank you for the analysis (I haven't looked and confirmed it). Do you have any interest in working on a fix? Perhaps it is possible to create a test that with enough iterations and threads more often repros this locally.

@vonzshik
Copy link
Contributor

@danmoseley I can 'sorta' emulate reproduction by adding a few Thread.Sleep here and there:

ThreadPool.QueueUserWorkItem(s => ((MemoryCache)s!).OvercapacityCompaction(), this);

ThreadPool.QueueUserWorkItem(s => ((MemoryCache)s!).OvercapacityCompaction(), this);
Thread.Sleep(1000);

Interlocked.Add(ref _cacheSize, -entry.Size);

Thread.Sleep(2000);
Interlocked.Add(ref _cacheSize, -entry.Size);

But no luck without that yet (though I was running on .net 7 build in debug, might have a better luck with .net 6).

I think the best fix here (which also should prove whether that's the actual issue I've found) is to lower the Size here to 4:

cache.Set("key", "value", new MemoryCacheEntryOptions { Size = 6 });
Assert.Equal("value", cache.Get("key"));
Assert.Equal(6, cache.Size);

@karelz
Copy link
Member

karelz commented Jul 22, 2022

@maryamariyan I have updated frequency above -- it fails every couple of days on random PR in last month.
While not super critical, it would be good to do something about it. Any chance it could happen in 7.0?

@jkotas jkotas modified the milestones: Future, 7.0.0 Jul 22, 2022
@jkotas
Copy link
Member

jkotas commented Jul 22, 2022

Updated the milestone. Blocking clean CI issues cannot be future.

@danmoseley
Copy link
Member

Can confirm that this reproes consistently with the 2 sleeps suggested above.

  Microsoft.Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry [FAIL]
        Assert.Equal() Failure
        Expected: 0
        Actual:   6
        Stack Trace:
          C:\git\runtime\src\libraries\Microsoft.Extensions.Caching.Memory\tests\CapacityTests.cs(292,0): at Microsoft.
  Extensions.Caching.Memory.CapacityTests.AddingReplacementExceedsCapacityRemovesOldEntry()

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Extensions-Caching blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants