Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array might not be zero-initialized #79882

Closed
threbejk opened this issue Dec 21, 2022 · 24 comments · Fixed by #79912
Closed

Array might not be zero-initialized #79882

threbejk opened this issue Dec 21, 2022 · 24 comments · Fixed by #79912
Milestone

Comments

@threbejk
Copy link

threbejk commented Dec 21, 2022

Description

We have found, that elements of an array might not be always initialized to the default value. For example, after calling new double[size], some elements might be nonzero. We have started noticing this problem after upgrading to .net7. It seems, that this happens especially when the GC heap size limit is set. (Initially, we observed this with 8 GB limit.)

Reproduction Steps

Run this C# program:

using System;

var rand = new Random();
try
{
    for (int k = 0; k < 1_000_000; k++)
    {
        Test(k);
    }
    Console.Write("Succeeded");
}
catch (Exception ex)
{
    Console.WriteLine("Failed");
    Console.WriteLine(ex.Message);    
}

void Test(int iteration)
{
    int size = (int)Math.Pow(10, 5 + rand.NextDouble() * (1 + rand.NextDouble()));

    var m = new double[size];
    for (int i = 0; i < size; i++)
        if (m[i] != 0)
            throw new Exception(m[i].ToString() + "; " + i + "/" + size + "; iteration " + iteration.ToString());

    for (int i = 0; i < size; i++)
        m[i] = 21;
}

With GC HeapHardLimit set to 1 GB

{
  "configProperties": {
    "System.GC.HeapHardLimit": 1000000000
  }
}

Expected behavior

We expect that Exception is not thrown, because elements of the array should be initialized to zero.

Actual behavior

With high probability, Exception is thrown, because an element is not zero.

Regression?

We haven't noticed this issue in .net6.

Known Workarounds

No response

Configuration

.net version 7.0.100
MS Windows 10 Pro, x64

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 21, 2022
@ghost
Copy link

ghost commented Dec 21, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

We have found, that elements of an array might not be always initialized to the default value. For example, after calling new double[size], some elements might be nonzero. We have started noticing this problem after upgrading to .net7. It seems, that this happens especially when the GC heap size limit is set. (Initially, we observed this with 8 GB limit.)

Reproduction Steps

Run this C# program:

using System;

var rand = new Random();
try
{
    for (int k = 0; k < 1_000_000; k++)
    {
        Test(k);
    }
    Console.Write("Succeeded");
}
catch (Exception ex)
{
    Console.WriteLine("Failed");
    Console.WriteLine(ex.Message);    
}

void Test(int iteration)
{
    int size = (int)Math.Pow(10, 5 + rand.NextDouble() * (1 + rand.NextDouble()));

    var m = new double[size];
    for (int i = 0; i < size; i++)
        if (m[i] != 0)
            throw new Exception(m[i].ToString() + "; " + i + "/" + size + "; iteration " + iteration.ToString());

    for (int i = 0; i < size; i++)
        m[i] = 21;
}

With GC HeapHardLimit set to 1 GB

{
  "configProperties": {
    "System.GC.HeapHardLimit": 1000000000
  }
}

Expected behavior

We expect that Exception is not thrown, because elements of the array should be initialized to zero.

Actual behavior

With high probability, Exception is thrown, because an element is not zero.

Regression?

We haven't noticed this issue in .net6.

Known Workarounds

No response

Configuration

.net version 7.0.100
MS Windows 10 Pro, x64

Other information

No response

Author: threbejk
Assignees: -
Labels:

area-GC-coreclr

Milestone: -

@vcsjones
Copy link
Member

Anecdotally, I can reproduce this on Windows 11 ARM. I cannot reproduce this on macOS with Apple Silicon.

@mangod9
Copy link
Member

mangod9 commented Dec 21, 2022

Thanks for reporting this issue. Unfortunately, its not reproing on my machine. Are you running with WKS or SVR GC? Would it be possible to share a dump when its non-zero (calling Debugger.Break instead of throwing an exception) and using .dump -ma to generate a full memory dump). Thanks!

@cshung @PeterSolMS fyi.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Dec 21, 2022
@mangod9 mangod9 added this to the 8.0.0 milestone Dec 21, 2022
@vcsjones
Copy link
Member

vcsjones commented Dec 21, 2022

@mangod9 Dump is too big for GitHub even zipped, so attached here. https://1drv.ms/u/s!AqFBIHiaPFEtk0Q6Wf5oOTFCj1KW?e=kN6uHm

I used a slightly modified version of the code, which is below.

using System;
using System.Diagnostics;

Console.WriteLine(Environment.ProcessId);
var rand = new Random();
for (int k = 0; k < 1_000_000; k++)
{
    Test(k);
}
Console.Write("Succeeded");


void Test(int iteration)
{
    int size = (int)Math.Pow(10, 5 + rand.NextDouble() * (1 + rand.NextDouble()));

    var m = new double[size];
    for (int i = 0; i < size; i++)
    {
        if (m[i] != 0)
        {
            throw new Exception(m[i].ToString() + "; " + i + "/" + size + "; iteration " + iteration.ToString());
        }
    }

    for (int i = 0; i < size; i++)
    {
        m[i] = 21;
    }
}

With a runtimeconfig.template.json:

{
  "configProperties": {
    "System.GC.HeapHardLimit": 1000000000
  }
}

Are you running with WKS or SVR GC?

This does not reproduce when I use the Server GC. When I run with "System.GC.Server": true in the runtime config, it does not fail. When I use workstation GC, it fails for me in a couple of seconds.

@HonzaMD
Copy link

HonzaMD commented Dec 21, 2022

Hi, its WorkstationGC, Console App, all default settings, except the System.GC.HeapHardLimit.
I tried to create a dump but its compressed size is bigger than this form allows.

Than you, @vcsjones

@mangod9
Copy link
Member

mangod9 commented Dec 21, 2022

thanks @vcsjones / @HonzaMD, does it repro without the array item assignment in the second for loop? From the exception in the dump I notice that the message is 21; 0/8457670; iteration 572 meaning its possibly getting assigned out of order? Or do you notice it failing with other values as well?

@vcsjones
Copy link
Member

@mangod9

does it repro without the array item assignment in the second for loop?

Yes.

Result:

Unhandled exception. System.Exception: 1.1401181653923E-311; 1962135/4542241; iteration 722
   at Program.<<Main>$>g__Test|0_0(Int32 iteration, <>c__DisplayClass0_0&) in C:\code\scratch\Program.cs:line 22
   at Program.<Main>$(String[] args) in C:\code\scratch\Program.cs:line 8

@vcsjones
Copy link
Member

It just seems to be some uninitialized value.

@vcsjones
Copy link
Member

Also of note: I can still reproduce this with set COMPlus_JITMinOpts=1 and with a debug build.

@mangod9
Copy link
Member

mangod9 commented Dec 21, 2022

ok, could you please share the dump for that case then? The one you shared has the array elements all set to 21, which is weird too. I want to check if only a few elements are uninitialized within it (most likely some heap corruption under low memory conditions)

@vcsjones
Copy link
Member

@HonzaMD
Copy link

HonzaMD commented Dec 22, 2022

Changing GC to older version with "System.GC.Name": "clrgc.dll" setting helps. When set, I am unable to reproduce.

@PeterSolMS
Copy link
Contributor

This doesn't repro for me on x64 either Intel or AMD. As the dumps are from ARM64, will try that next.

@threbejk
Copy link
Author

Have you set the GC heap limit? We have easily reproduced this on different machines, both amd and intel. I will send memory dump for x64, if it helps. Btw, it usually fails withing a few seconds, so the number of iterations may be lowered if it runs too long.

@PeterSolMS
Copy link
Contributor

Interesting... Yes, I set the memory limit, but with an environment variable. Could you attach your .runtimeconfig.json file?

@threbejk
Copy link
Author

Here is the dump for x64. I had to zip it, split into two files, and then change the suffix from zip.001 to 001.zip (you might need to change it back and use 7-zip to unpack it), because otherwise github refused to upload it, so sorry for the inconvenience.
ArrayAllocTest.001.zip
ArrayAllocTest.002.zip

@threbejk
Copy link
Author

Environment variable works as well, but I think the value must be hexadecimal in that case.
runtimeconfig.json:

{
  "runtimeOptions": {
    "tfm": "net7.0",
    "framework": {
      "name": "Microsoft.NETCore.App",
      "version": "7.0.0"
    },
    "configProperties": {
      "System.GC.HeapHardLimit": 1000000000,
      "System.Reflection.Metadata.MetadataUpdater.IsSupported": false
    }
  }
}

@PeterSolMS
Copy link
Contributor

Thank you for the dump! Right, for the environment variable, value must be hex, I put 3B9ACA00 which should be the same thing...

@PeterSolMS
Copy link
Contributor

Ok, now I can repro the issue...

@vcsjones
Copy link
Member

Ok, now I can repro the issue...

Purely curiosity, but what was the factor(s) in you not being able to reproduce it originally?

@mangod9
Copy link
Member

mangod9 commented Dec 22, 2022

Not quite sure why it wasn't reproing for me earlier. Could have been an incorrect hard limit specification. The symptoms indicate some incorrect handling at region boundaries. I modified the code to assign the array in the Test method like so:

    for (int i = 0; i < size; i++)
        m[i] = iteration;

with that I see this output, indicating a previous array instance is somehow getting assigned to the new instance possibly as part of GC relocation. Interestingly it doesnt repro with GCStress.

602; 0/2348445; iteration 667

And the offending array object is usually at the region boundary:

         Address               MT     Size
000001d7db800040 00007ff96232b0c0 18787584     
000001d7d7400040 00007ff96232b0c0 37205344   

@cshung
Copy link
Member

cshung commented Dec 23, 2022

The bug is understood and fixed. Feel free to skip to the end for the fixes if the causes section appears too technical.

Causes

There are two underlying issues there, both have something to do with overlapped commit.

Overlapped commits:

Suppose some memory address range was reserved but not committed, by the time we commit it, the operating system will make sure the memory we get is zero-initialized.

Later on, we can de-commit them. The current process is no longer eligible to access the de-committed memory (accessing it would lead to AV), and the memory is eligible to be used by other processes.

Suppose we commit again, we may or may not get back the same physical memory, but the operating system will guarantee the memory is zero-initialized again.

Given the API, we could commit the same address range before de-committing it. Technically, this is a bug. But most operating systems will simply forgive us and did nothing about it. However, that does not guarantee the memory is zero-initialized. Instead, the memory just stays what it was, and that's why we observe the dirty content in the arrays.

What's wrong in our particular case is that we forgot to de-commit and leading the future overlapped commits.

Missed de-commits:

  1. Missed de-committing the corresponding mark array when de-committing the region, and
  2. Missed de-committing the region in case we are running out of memory for the mark array.

On the mark arrays

When we are done with a GC, we determine if we should return some memory back to the operating system through de-commits. In this test case, we often find large chunks of memory left unused, so we often de-commit. Mark array is a data structure that we use to maintain information related to a region. So whenever we de-commit a region, we should also de-commit the corresponding mark array. But we didn't.

Later on, when the region is needed again, we commit the corresponding mark array again. Whenever we commit, however, we increase a counter so that we know how much memory we committed, and when it reaches the heap hard limit, we need to back off. Without de-committing, the counter is double-counted. After some number of iterations, the double counting will lead us to the memory limit, and we started to hit the second issue.

On the regions failing to commit mark arrays

After exhausting the committed counter by the mark array, we are starting to fail to commit for the mark array. In that case, the region is no longer useful, and we have to delete it. Before my bug fix, the region is deleted directly to the global region allocator without de-committing first, so if the region is used again, the code assumes whatever we get from the region allocator is cleanly committed (because we did call commit on those memory addresses, assuming it is not committed, but that's an overlapping commit because we didn't de-commit), so we end up with dirty data in the user arrays.

Fixes

I have a work-in-progress PR here that should fix the problem. The change simply de-commits the memory as we should have.

Before the change, the code hit the exception almost right away. After the change (on the same base), the code survived at least 20k iterations on debug build, so it should be good functionally.

Performance-wise, however, this is doing more de-commit operations, so it is likely to regress some scenarios. Although it isn't clear to me if that is avoidable. This particular test case is churning the LOH, and it is well known that churning the LOH is a bad idea for the GC, so I wouldn't worry too much about the performance of this test case in particular, but we need to make sure the general case is not regressed.

If we are not in a hurry. I'd love to do more performance analysis on this before we merge this PR. Preferably after the vacation when the full team is available to work on it.

@mangod9
Copy link
Member

mangod9 commented Dec 23, 2022

Thanks @cshung for investigating. Wonder why this issue doesnt repro under GCStress?

@cshung
Copy link
Member

cshung commented Dec 23, 2022

Thanks @cshung for investigating. Wonder why this issue doesn't repro under GCStress?

No idea why. This isn't the first time I see a situation like this. In general, I observed that these memory corruption repros are often quite fragile. Little changes to it often lead to not repro. Therefore, in practice, I tend to make as little change to the repro or the runtime as necessary for the investigation.

For example, in this investigation - I hit an assert while running under debug mode and figured out the first bug related to the de-committing of the mark array. So the natural thing to do is to fix it. However, once I fixed it, on my box, the test case no longer repro after running for a whole night with that fix on.

But it is not clear to me how can we get to a state where we have a dirty array content with just the mark array issue. In order to figure that out, I chose a less invasive approach to simply comment out the assert, and that's how I found the second bug.

It is not perfect, and it doesn't explain why, but this pragmatic approach worked. I recommend it.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 11, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 11, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants