2. Issue description
We use node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases. The check is: Target Number == nvidia.com/gpu.count == Node Allocatable.
We have noticed that sometimes (albeit rarely) all the features (labels nvidia.com/*) exported by gpu-feature-discovery disappear from the node. During one of these moments, we noticed the following line in the nfd-worker logs:
E0629 17:29:25.340196 1 local.go:266] source local failed reading file 'gfd': open /etc/kubernetes/node-feature-discovery/features.d/gfd: permission denied
We suspect that this error is caused by an incorrect implementation of the atomic file writing logic in the function writeFileAtomically. The issue is that os.Chmod is called after os.Rename, so there is a brief window in which the features.d/gfd file has the wrong permissions: 0600 (the default for a file created with os.CreateTemp) instead of 0644.
3. Additional information that might help better understand your environment and reproduce the bug
The incorrectness of the current implementation of writeFileAtomically can be easily reproduced with the following example:
For more deterministic behavior, I added time.Sleep(1 * time.Millisecond) between the os.Rename and os.Chmod calls in writer/main.go. The error also reproduces without the sleep; it just takes a bit longer to occur.
The nfd-worker, when reading the file, simply calls the os.ReadFile function (reader/main.go). Let's run it and see the expected error:
# terminal 1
go run writer/main.go
# terminal 2: run as another user, since tmpFile has mode 0600 by default
sudo -u <another user> go run reader/main.go
# 2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
# 2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
# 2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
# ...
4. Additional information
Fixing this issue is simple: swap the order of the calls in the writeFileAtomically function so that os.Chmod runs before os.Rename (#792).
I found another issue that mentions this problem (#325). However, it also covers a problem with SELinux files, so I decided to open this separate issue.
fix NVIDIA#791 Corrected the incorrect order of operations in atomic writing of the feature file, which could lead to a "permission denied" error observed in the nfd-worker logs
Signed-off-by: belo4ya <41exey.kov41ev@gmail.com>
belo4ya added a commit to belo4ya/k8s-device-plugin that referenced this issue on Jul 9, 2024
fix NVIDIA#791 Corrected the incorrect order of operations in atomic writing of the feature file, which could lead to a "permission denied" error observed in the nfd-worker logs
note: now the temporary file is created in $TMPDIR instead of /etc/kubernetes/node-feature-discovery/features.d/gfd-tmp/
Signed-off-by: belo4ya <41exey.kov41ev@gmail.com>