Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gfd] Incorrect implementation of atomic writing to a file when exporting features by gpu-feature-discovery #791

Closed
belo4ya opened this issue Jun 29, 2024 · 0 comments · Fixed by #792

Comments

@belo4ya
Copy link
Contributor

belo4ya commented Jun 29, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 18.04.6, Ubuntu 22.04.2

2. Issue description

We use the node-feature-discovery and gpu-feature-discovery features to monitor GPU issues, including cases when the number of available GPUs on a node unexpectedly decreases: Target Number == nvidia.com/gpu.count == Node Allocatable.

We have noticed that sometimes (albeit rarely) all the features (labels nvidia.com/*) exported by gpu-feature-discovery disappear from the node. During one of these moments, we noticed the following line in the nfd-worker logs:

E0629 17:29:25.340196       1 local.go:266] source local failed reading file 'gfd': open /etc/kubernetes/node-feature-discovery/features.d/gfd: permission denied

We suspect that this error occurs due to an incorrect implementation of the atomic file writing logic in the function writeFileAtomically. The issue is that the os.Chmod command is executed after the os.Rename command, causing a moment when the features.d/gfd file has incorrect permissions (0600 (default for a file created with os.CreateTemp) instead of 0644).

3. Additional information that might help better understand your environment and reproduce the bug

The incorrectness of the current implementation of writeFileAtomically can be easily reproduced with the following example:

For more deterministic behavior, I added time.Sleep(1 * time.Millisecond) between the os.Rename and os.Chmod calls (but the error also reproduces without this, it just takes a bit longer to occur):

writer/main.go
...
const (
	filePath = "/<abs/path/to>/gfd"
	content  = `
nvidia.com/gpu.machine=VMware7,1
nvidia.com/gpu.product=Tesla-V100-PCIE-32GB
`
)

func main() {
	for {
		_ = writeFileAtomically(filePath, []byte(content), 0644)
	}
}

// from https://github.com/NVIDIA/k8s-device-plugin/blob/35ad18080eded1889dc1eaee1132debddfd6757c/internal/lm/output.go#L105
func writeFileAtomically(path string, contents []byte, perm os.FileMode) error {
	absPath, err := filepath.Abs(path)
	if err != nil {
		return fmt.Errorf("failed to retrieve absolute path of output file: %v", err)
	}

	absDir := filepath.Dir(absPath)
	tmpDir := filepath.Join(absDir, "gfd-tmp")

	err = os.MkdirAll(tmpDir, os.ModePerm)
	if err != nil && !os.IsExist(err) {
		return fmt.Errorf("failed to create temporary directory: %v", err)
	}
	defer func() {
		if err != nil {
			os.RemoveAll(tmpDir)
		}
	}()

	tmpFile, err := os.CreateTemp(tmpDir, "gfd-")
	if err != nil {
		return fmt.Errorf("fail to create temporary output file: %v", err)
	}
	defer func() {
		if err != nil {
			tmpFile.Close()
			os.Remove(tmpFile.Name())
		}
	}()

	err = os.WriteFile(tmpFile.Name(), contents, perm)
	if err != nil {
		return fmt.Errorf("error writing temporary file '%v': %v", tmpFile.Name(), err)
	}

	err = os.Rename(tmpFile.Name(), path)
	if err != nil {
		return fmt.Errorf("error moving temporary file to '%v': %v", path, err)
	}

	time.Sleep(1 * time.Millisecond) // added for better error reproduction

	err = os.Chmod(path, perm)
	if err != nil {
		return fmt.Errorf("error setting permissions on '%v': %v", path, err)
	}

	return nil
}

The nfd-worker when reading the file simply calls the os.ReadFile function:

reader/main.go
...
const filePath = "/<abs/path/to>/gfd"

func main() {
	for {
		_, err := os.ReadFile(filePath)
		if err != nil {
			log.Println(err)
		}
	}
}

Let's run it and see the expected error:

#terminal 1
go run writer/main.go

#terminal 2: run as another user since tmpFile has mode 0600 by default
sudo -u <another user> go run reader/main.go

#2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
#2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
#2024/06/30 01:55:00 open /<abs/path/to>/gfd: permission denied
#...

4. Additional information

Fixing this issue is simple. Just change the order of the os.Rename and os.Chmod calls to os.Chmod -> os.Rename in the writeFileAtomically function - #792

I found another Issue that mentions this problem - #325. However, it also mentions an issue with selinux files, so I decided to open this separate Issue.

belo4ya added a commit to belo4ya/k8s-device-plugin that referenced this issue Jul 9, 2024
fix NVIDIA#791 Corrected the incorrect order of operations in atomic writing of the feature file, which could lead to a "permission denied" error observed in the nfd-worker logs

Signed-off-by: belo4ya <41exey.kov41ev@gmail.com>
belo4ya added a commit to belo4ya/k8s-device-plugin that referenced this issue Jul 9, 2024
fix NVIDIA#791 Corrected the incorrect order of operations in atomic writing of the feature file, which could lead to a "permission denied" error observed in the nfd-worker logs

note: now the temporary file is created in $TMPDIR instead of /etc/kubernetes/node-feature-discovery/features.d/gfd-tmp/

Signed-off-by: belo4ya <41exey.kov41ev@gmail.com>
belo4ya added a commit to belo4ya/k8s-device-plugin that referenced this issue Jul 9, 2024
fix NVIDIA#791 Corrected the incorrect order of operations in atomic writing of the feature file, which could lead to a "permission denied" error observed in the nfd-worker logs

note: now the temporary file is created in $TMPDIR instead of /etc/kubernetes/node-feature-discovery/features.d/gfd-tmp/

Signed-off-by: belo4ya <41exey.kov41ev@gmail.com>
@elezar elezar closed this as completed in 442edd4 Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant