improve netns cleanup #2112
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Luap99. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.

LGTM
```go
// the file will work and the kernel will destroy the bind mount in the
// other ns because of this. We also need it so pasta doesn't leak.
rErr = fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
// EINVAL means the path exists but is not mounted, just try to remove the path below
```
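Pulled out of the diff context, the error handling in this excerpt roughly amounts to the following sketch; the function name, the example path, and the bare-bones plumbing around the excerpt are assumptions, not the actual podman code:

```go
package main

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

// unmountNS is a hypothetical reconstruction of the excerpt above:
// detach-unmount the netns bind mount, treat EINVAL ("exists but not
// mounted") as non-fatal, and wrap anything else in an error.
func unmountNS(nsPath string) error {
	if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil {
		// EINVAL means the path exists but is not mounted; in that
		// case the caller can just remove the path directly.
		if errors.Is(err, unix.EINVAL) {
			return nil
		}
		return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
	}
	return nil
}

func main() {
	fmt.Println(unmountNS("/run/user/1000/netns/netns-example"))
}
```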
could we just use containers/storage/pkg/system.EnsureRemoveAll() here?
Mhh, I guess it would work, but the logic there seems much more complicated, especially in the normal case where an unmount+remove works. We don't need any of the recursive dir logic, and EnsureRemoveAll() would then read /proc/thread-self/mountinfo every time, which seems wasteful.
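For comparison, the suggested alternative would look something like this; it assumes the EnsureRemoveAll helper from github.com/containers/storage/pkg/system and a made-up netns path:

```go
package main

import (
	"fmt"

	"github.com/containers/storage/pkg/system"
)

func main() {
	// EnsureRemoveAll recursively unmounts and retries removal, reading
	// mount info on each pass (per the discussion above), which is
	// heavier than a single unmount+remove of one file.
	if err := system.EnsureRemoveAll("/run/user/1000/netns/netns-example"); err != nil {
		fmt.Println(err)
	}
}
```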
When I wrote this originally I thought we must avoid leaking the netns, so I tried to decrement first. However, now I think this is wrong, because podman actually calls into the cleanup function again on the next cleanup attempt if it returned an error. As such we ended up doing a double decrement, and the ref counter going below zero caused all sorts of issues[1]. Now, if we have a bug the other way around, where we do not decrement correctly, this is much less of a problem. It simply means we leak one netns file and the pasta/slirp4netns process, which isn't a problem other than needing a bit of resources.

[1] containers/podman#21569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
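A minimal sketch of the ordering change this commit describes, with hypothetical type and field names (in podman the ref count is persisted elsewhere, not an in-memory int):

```go
package main

import "fmt"

type netnsRef struct {
	count int
}

func (n *netnsRef) cleanup(teardown func() error) error {
	// Tear down first. If this fails, podman calls cleanup again on the
	// next attempt, and the counter must still be intact for that retry.
	if err := teardown(); err != nil {
		return err
	}
	// Only now give up our reference. A bug in this direction leaks one
	// netns file and a pasta/slirp4netns process instead of driving the
	// counter below zero.
	n.count--
	return nil
}

func main() {
	n := &netnsRef{count: 1}
	fmt.Println(n.cleanup(func() error { return fmt.Errorf("unmount failed") })) // retried later; count stays 1
	fmt.Println(n.cleanup(func() error { return nil }), n.count)                 // success; count drops to 0
}
```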
The Run() function is used to run a long-running command in the netns; namely, podman unshare --rootless-netns uses it. As such the function actually unlocks for the main command, as otherwise a user could hold the lock forever, effectively causing deadlocks. Now, because we unlock, the ref count might change during that time, and just because we created the netns doesn't mean there are no other users of it. Therefore the cleanup in runInner() was wrong in that case, causing problems for other running containers. To fix this, make sure we do not clean up in the Run() case unless the count is 0.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
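A sketch of the Run() flow this describes, using a plain sync.Mutex in place of the real file lock and hypothetical names throughout:

```go
package main

import "sync"

type rootlessNetns struct {
	mu       sync.Mutex
	refCount int
}

func (r *rootlessNetns) Run(cmd func() error) error {
	r.mu.Lock()
	r.refCount++ // we are now a user of the netns
	// Unlock for the long-running command; holding the lock here would
	// let a user block every other container, effectively a deadlock.
	r.mu.Unlock()

	cmdErr := cmd()

	r.mu.Lock()
	defer r.mu.Unlock()
	r.refCount--
	// The count may have changed while we were unlocked, and having
	// created the netns does not make us its only user. Only tear it
	// down when nobody is left.
	if r.refCount == 0 {
		// netns/pasta/slirp4netns teardown would go here
	}
	return cmdErr
}

func main() {
	r := &rootlessNetns{}
	_ = r.Run(func() error {
		// stand-in for e.g. `podman unshare --rootless-netns <cmd>`
		return nil
	})
}
```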
Podman might call us more than once on the same path. If the path is not mounted or does not exist, simply return no error. Second, retry the unmount/remove until the unmount succeeds. For some reason we must use MNT_DETACH, as otherwise the unmount call will fail all the time. However, MNT_DETACH means it unmounts asynchronously in the background. Now, if we call remove on the file before the unmount is done, it will fail with EBUSY. In this case we try again until it works or we get another error.

This should help containers/podman#19721

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
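Putting the whole flow from this commit message together, a sketch could look like the following; the function name, the example path, the sleep, and its interval are assumptions, not the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func removeAfterDetach(nsPath string) error {
	// MNT_DETACH unmounts lazily, so the bind mount may still be going
	// away in the background when we reach the remove below.
	if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil && !errors.Is(err, unix.EINVAL) {
		return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
	}
	for {
		err := os.Remove(nsPath)
		switch {
		case err == nil, errors.Is(err, os.ErrNotExist):
			// already gone is fine: podman may call us more than
			// once on the same path
			return nil
		case errors.Is(err, unix.EBUSY):
			// the async unmount has not finished yet; try again
			time.Sleep(10 * time.Millisecond)
		default:
			return err
		}
	}
}

func main() {
	fmt.Println(removeAfterDetach("/run/user/1000/netns/netns-example"))
}
```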
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Ok this should be good to go, it will not fix the flake though.
@mheon PTAL
LGTM
/lgtm
Merged commit dc70ee3 into containers:main (see commits).