Buffer in CGroups parser can get corrupted #5114
It took me some time to wrap my head around the issue and, I think, now I get it - the instance of a … That said, the issue brings back the half-century-old design debate: "stateful" vs. "stateless". If we were to follow the "stateless" design (which I'm a proponent of), we wouldn't have had this issue, as there wouldn't have been any shared state to corrupt. Perhaps that would be the best course of resolution.
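To make the stateless shape concrete, here is a minimal, hypothetical sketch (the class names and signatures are made up and are not the actual LinuxUtilizationParserCgroupV1 API): the stateful variant shares one buffer across all calls, while the stateless variant gives every call its own.
using System;
// Hypothetical sketch of the two designs; not the actual parser code.
internal sealed class StatefulParser
{
    // One buffer per parser instance: parallel callers interleave their writes here.
    private readonly char[] _buffer = new char[512];
    public string ReadLine(ReadOnlySpan<char> source)
    {
        source.CopyTo(_buffer); // another thread may overwrite this mid-read
        return new string(_buffer, 0, source.Length);
    }
}
internal sealed class StatelessParser
{
    public string ReadLine(ReadOnlySpan<char> source)
    {
        Span<char> buffer = stackalloc char[512]; // per-call buffer, nothing shared
        source.CopyTo(buffer);
        return new string(buffer[..source.Length]);
    }
}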
This test replicates the issue quite reliably (the test needs to be run on a Linux box, e.g., under WSL2):

[ConditionalFact]
public async Task ThreadSafetyAsync()
{
var f1 = new HardcodedValueFileSystem(new Dictionary<FileInfo, string>
{
{ new FileInfo("/proc/stat"), "cpu 6163 0 3853 4222848 614 0 1155 0 0 0\r\ncpu0 240 0 279 210987 59 0 927 0 0 0" },
});
var f2 = new HardcodedValueFileSystem(new Dictionary<FileInfo, string>
{
{ new FileInfo("/proc/stat"), "cpu 9137 0 9296 13972503 1148 0 2786 0 0 0\r\ncpu0 297 0 431 698663 59 0 2513 0 0 0" },
});
int callCount = 0;
Mock<IFileSystem> fs = new();
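// Alternate between the two fake file systems on each read so that concurrent
// calls into the parser observe different /proc/stat content.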
fs.Setup(x => x.ReadFirstLine(It.IsAny<FileInfo>(), It.IsAny<BufferWriter<char>>()))
.Callback<FileInfo, BufferWriter<char>>((fileInfo, buffer) =>
{
callCount++;
if (callCount % 2 == 0)
{
f1.ReadFirstLine(fileInfo, buffer);
}
else
{
f2.ReadFirstLine(fileInfo, buffer);
}
})
.Verifiable();
var p = new LinuxUtilizationParserCgroupV1(fs.Object, new FakeUserHz(100));
Task[] tasks = new Task[100];
for (int i = 0; i < tasks.Length; i++)
{
tasks[i] = Task.Run(p.GetHostCpuUsageInNanoseconds);
}
await Task.WhenAll(tasks);
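// If the shared buffer gets corrupted, GetHostCpuUsageInNanoseconds throws and
// Task.WhenAll rethrows that exception, failing the test before this assert.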
Assert.True(true);
}
@RussKie thanks for looking into this!
* Correct test decorations
* Avoid buffer race conditions in CGroups

Fixes #5114
Description
The CGroups parsers use a buffer when reading from the filesystem. There is a single buffer, shared between calls.
When the methods are called in parallel, the buffer can easily get corrupted (see the standalone sketch after the links below).
The issue is present in two classes:
https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationParserCgroupV1.cs
https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationParserCgroupV2.cs
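A standalone sketch of the failure mode (illustrative only, not the library code): two tasks writing different lines into one shared char buffer interleave, and a subsequent read sees a mix of both.
using System;
using System.Threading;
using System.Threading.Tasks;
// Illustrative only: two writers share one buffer and their content interleaves.
char[] shared = new char[64];
void Write(string line)
{
    for (int i = 0; i < line.Length; i++)
    {
        shared[i] = line[i];
        Thread.Yield(); // widen the race window so the interleaving is easy to see
    }
}
await Task.WhenAll(
    Task.Run(() => Write("cpu 6163 0 3853 4222848 614 0 1155 0 0 0")),
    Task.Run(() => Write("cpu 9137 0 9296 13972503 1148 0 2786 0 0 0")));
// Typically prints a mix of both lines, the same kind of garbage seen in the error message below.
Console.WriteLine(new string(shared).TrimEnd('\0'));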
Reproduction Steps
The issue is hard to reproduce. The LinuxUtilizationParser can be called from multiple threads in parallel:
one call can come from the observable gauge into the CpuUtilization() method, while a second call can occur in the GetSnapshot() method invoked by the publisher.
The buffer in the parser can then get corrupted, resulting in weird error messages like:
Unable to gather utilization statistics.
Expected proc/stat to start with 'cpu ' but it was ' 8: 0000000000000000FFFF0000Ecpu 21390382 598466 10047926 536502883 17443424 0 2449856 0 0 0C043C0A:1F90 0000000000000000FFFF0000B8003C0A:D21D 01 00000000:00000000 00:00000000 00000000 1000 0 67727390 1 0000000000000000 22 0 0 10 -1 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inodecpu 21390556 598466 10048040 536505465 17443694 0 2449865 0 0 0cpu 21390674 598466 10048097 536508217 17443923 0 2449873 0 0 0cpu 21390836 598466 10048186 536511097 17443926 0 2449896 0 0 0cpu 21390946 598466 10048228 536514094 17443926 0 2449911 0 0 0cpu 21391034 598466 10048258 536517117 17443926 0 2449919 0 0 0cpu 21391131 598466 10048290 536520119 17443926 0 2449931 0 0 0'.
Expected behavior
Buffers shouldn't get corrupted when parallel calls are performed.
Actual behavior
The buffers in LinuxUtilizationParserCgroupV1 and LinuxUtilizationParserCgroupV2 get corrupted, preventing the utilization values from being read.
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
The issue could be fixed by pooling buffers in the parser classes.
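A minimal sketch of what "pooling buffers" could look like, assuming ArrayPool<char> is acceptable here; this is an illustration of the idea, not necessarily the shape of the actual fix. Each call rents its own buffer, so there is no per-instance state left to race on (the readFirstLine delegate and its signature are hypothetical stand-ins for the real file-system read).
using System;
using System.Buffers;
// Sketch only: rent a buffer per call instead of sharing one field across calls.
internal static class ProcStatReaderSketch
{
    // `readFirstLine` stands in for the real file-system read; it fills the buffer
    // and returns the number of characters written (a hypothetical signature).
    public static void ParseFirstCpuLine(Func<char[], int> readFirstLine)
    {
        char[] buffer = ArrayPool<char>.Shared.Rent(512);
        try
        {
            int written = readFirstLine(buffer);
            ReadOnlySpan<char> line = buffer.AsSpan(0, written);
            if (!line.StartsWith("cpu ".AsSpan(), StringComparison.Ordinal))
            {
                throw new InvalidOperationException(
                    $"Expected proc/stat to start with 'cpu ' but it was '{line.ToString()}'.");
            }
            // ... parse the tick counters from `line` ...
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}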