
Buffer in CGroups parser can get corrupted #5114

Closed
mobratil opened this issue Apr 17, 2024 · 3 comments · Fixed by #5129
Labels
area-fundamentals bug This issue describes a behavior which is not expected - a bug.

Comments

@mobratil
Contributor

Description

The cgroups parsers use a buffer for reading from the filesystem. There is a single buffer, shared between calls.
If the methods are called in parallel, the buffer can easily get corrupted.

The issue is present in 2 classes:
https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationParserCgroupV1.cs
https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationParserCgroupV2.cs
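The problematic pattern can be sketched as follows. This is a simplified, hypothetical illustration (the names `SharedBufferParser` and `ReadValue` are made up, not the actual types in the library): a buffer held in an instance field is reused across calls, so two threads writing into it concurrently can interleave their contents.

```csharp
using System;
using System.Threading.Tasks;

public sealed class SharedBufferParser
{
    // One buffer for the lifetime of the parser -- shared, not thread-safe.
    private readonly char[] _buffer = new char[256];
    private int _written;

    public string ReadValue(string content)
    {
        _written = 0;                    // reset shared state
        foreach (char c in content)
        {
            _buffer[_written++] = c;     // a second thread can interleave here
        }
        return new string(_buffer, 0, _written);
    }
}

public static class Demo
{
    public static async Task Main()
    {
        var parser = new SharedBufferParser();
        var t1 = Task.Run(() => parser.ReadValue("cpu 6163 0 3853"));
        var t2 = Task.Run(() => parser.ReadValue("cpu 9137 0 9296"));
        string[] results = await Task.WhenAll(t1, t2);
        // With unlucky scheduling, a result can mix characters of both inputs,
        // producing garbage like the error message shown below.
        Console.WriteLine(string.Join(Environment.NewLine, results));
    }
}
```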

Reproduction Steps

The issue is hard to reproduce: the parser can be called from multiple threads in parallel.

One call can come from the observable gauge via the CpuUtilization() method. A second call can occur in the GetSnapshot() method invoked by the publisher.

The buffer in the parser can then get corrupted, resulting in error messages like:

Unable to gather utilization statistics.
Expected proc/stat to start with 'cpu ' but it was ' 8: 0000000000000000FFFF0000Ecpu 21390382 598466 10047926 536502883 17443424 0 2449856 0 0 0C043C0A:1F90 0000000000000000FFFF0000B8003C0A:D21D 01 00000000:00000000 00:00000000 00000000 1000 0 67727390 1 0000000000000000 22 0 0 10 -1 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inodecpu 21390556 598466 10048040 536505465 17443694 0 2449865 0 0 0cpu 21390674 598466 10048097 536508217 17443923 0 2449873 0 0 0cpu 21390836 598466 10048186 536511097 17443926 0 2449896 0 0 0cpu 21390946 598466 10048228 536514094 17443926 0 2449911 0 0 0cpu 21391034 598466 10048258 536517117 17443926 0 2449919 0 0 0cpu 21391131 598466 10048290 536520119 17443926 0 2449931 0 0 0'.

Expected behavior

Buffers shouldn't get corrupted when calls are performed in parallel.

Actual behavior

The buffers in LinuxUtilizationParserCgroupV1 and LinuxUtilizationParserCgroupV2 get corrupted, preventing the utilization values from being read.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

The issue could be fixed by pooling buffers in the parser classes.
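A pooled approach could look roughly like this. This is only a sketch of the idea, not the actual fix from #5129 (the class and method names here are hypothetical): each call rents a buffer from `ArrayPool<char>.Shared` instead of writing into a shared instance field, so concurrent calls never touch the same array.

```csharp
using System;
using System.Buffers;

public sealed class PooledBufferParser
{
    public string ReadValue(string content)
    {
        // Rent a buffer per call; concurrent callers each get their own array.
        char[] buffer = ArrayPool<char>.Shared.Rent(content.Length);
        try
        {
            content.CopyTo(0, buffer, 0, content.Length);
            return new string(buffer, 0, content.Length);
        }
        finally
        {
            // Always return the buffer so the pool can reuse it.
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

Note that `Rent` may hand back an array larger than requested, which is why the length is tracked explicitly rather than using `buffer.Length`.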

@mobratil mobratil added untriaged bug This issue describes a behavior which is not expected - a bug. area-fundamentals and removed untriaged labels Apr 17, 2024
@RussKie
Member

RussKie commented Apr 18, 2024

It took me some time to wrap my head around the issue and, I think, now I get it: the instance of a parser passed to LinuxUtilizationProvider may be invoked concurrently, which results in a race condition and corruption of the buffer... Thanks for the analysis, it does require some mental work :)

That said, the issue brings back the half-century-old design debate: "stateful" vs. "stateless". If we were to follow the "stateless" design (which I'm a proponent of), we wouldn't have had this issue, as there would have been no shared state to corrupt. Perhaps that would be the best course of resolution.
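For illustration, a stateless variant of such a parse step could keep all state in locals, so there is simply nothing for concurrent callers to corrupt. This is a hypothetical sketch (not code from the library); it extracts the first integer from a line like `cpu 6163 0 3853`:

```csharp
using System;

public static class StatelessParser
{
    public static long ParseFirstNumber(ReadOnlySpan<char> line)
    {
        // All state lives on the stack; concurrent calls cannot interfere.
        Span<char> digits = stackalloc char[32];
        int count = 0;
        foreach (char c in line)
        {
            if (char.IsDigit(c) && count < digits.Length)
            {
                digits[count++] = c;   // collect consecutive digits
            }
            else if (count > 0)
            {
                break;                 // stop at the end of the first number
            }
        }

        if (count == 0)
        {
            throw new FormatException("No number found in line.");
        }

        return long.Parse(digits[..count]);
    }
}
```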

@RussKie
Member

RussKie commented Apr 22, 2024

This test replicates the issue quite reliably (it needs to be run on a Linux box, e.g., under WSL2):

    [ConditionalFact]
    public async Task ThreadSafetyAsync()
    {
        var f1 = new HardcodedValueFileSystem(new Dictionary<FileInfo, string>
        {
            { new FileInfo("/proc/stat"), "cpu  6163 0 3853 4222848 614 0 1155 0 0 0\r\ncpu0 240 0 279 210987 59 0 927 0 0 0" },
        });
        var f2 = new HardcodedValueFileSystem(new Dictionary<FileInfo, string>
        {
            { new FileInfo("/proc/stat"), "cpu  9137 0 9296 13972503 1148 0 2786 0 0 0\r\ncpu0 297 0 431 698663 59 0 2513 0 0 0" },
        });

        // Alternate between the two snapshots so concurrent reads race on the shared buffer.
        int callCount = 0;
        Mock<IFileSystem> fs = new();
        fs.Setup(x => x.ReadFirstLine(It.IsAny<FileInfo>(), It.IsAny<BufferWriter<char>>()))
             .Callback<FileInfo, BufferWriter<char>>((fileInfo, buffer) =>
             {
                 callCount++;
                 if (callCount % 2 == 0)
                 {
                     f1.ReadFirstLine(fileInfo, buffer);
                 }
                 else
                 {
                     f2.ReadFirstLine(fileInfo, buffer);
                 }
             })
             .Verifiable();

        var p = new LinuxUtilizationParserCgroupV1(fs.Object, new FakeUserHz(100));

        Task[] tasks = new Task[100];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(p.GetHostCpuUsageInNanoseconds);
        }

        await Task.WhenAll(tasks);

        // No explicit assertion is needed: the test fails if any of the
        // parallel parses throws due to buffer corruption.
        Assert.True(true);
    }

RussKie added a commit to RussKie/extensions that referenced this issue Apr 29, 2024
@mobratil
Contributor Author

@RussKie thanks for looking into this!
Yes, this test indeed replicates the issue, so we can use it to validate the fix.
I tried to create a test using public APIs.

RussKie added a commit to RussKie/extensions that referenced this issue May 17, 2024
RussKie added a commit to RussKie/extensions that referenced this issue May 24, 2024
RussKie added a commit that referenced this issue May 28, 2024
* Correct tests decorations

* Avoid buffer race conditions in CGroups

Fixes #5114
@github-actions github-actions bot locked and limited conversation to collaborators Jun 28, 2024