Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve netns cleanup #2112

Merged
merged 4 commits into from
Aug 8, 2024
Merged

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 6, 2024

see commits

Copy link
Contributor

openshift-ci bot commented Aug 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 6, 2024
@Luap99 Luap99 marked this pull request as draft August 6, 2024 14:34
pkg/netns/netns_linux.go Outdated Show resolved Hide resolved
@rhatdan
Copy link
Member

rhatdan commented Aug 7, 2024

LGTM

// the file will work and the kernel will destroy the bind mount in the
// other ns because of this. We also need it so pasta doesn't leak.
rErr = fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
// EINVAL means the path exists but is not mounted, just try to remove the path below
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just use containers/storage/pkg/system.EnsureRemoveAll() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhh, I guess it would work but the logic there seems much more complicated especially in the normal case were a unmount+remove works. We don't need any of the recursive dir logic, EnsureRemoveAll() would then read /proc/thread-self/mountinfo every time which seems wasteful

Luap99 added 4 commits August 8, 2024 16:45
When I wrote this originally I thought we must avoid leaking the netns
so I tried to decrement first. However now I think this wrong because
podman actially calls into the cleanup function again if it returned an
error on the next cleanup attempt. As such we ended up doing a double
decrement and the ref counter went below zero causing a sort of issues[1].

Now if we have a bug the other way around were we not decrement
correctly this is much less of a problem. It simply means we leak once
netns file and the pasta/slirp4netns process which isn't a problem other
than needed a bit of resources.

[1] containers/podman#21569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The Run() function is used to run long running command in the netns,
namly podman unshare --rootless-netns used that. As such the function
actually unlocks for the main command as otherwise a user could hold the
lock forever effectively causing deadlocks.

Now because we unlock the ref count might change during that time. And
just because we create the netns doesn't mean there are now other users
of the netns. Therefore the cleanup in runInner() was wrong in that
case causing problems for other running containers.

To fix this make sure we do not cleanup in the Run() case unless the
count is 0.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Podman might call us more than once on the same path. If the path is not
mounted or does not exists simply return no error.

Second, retry the unmount/remove until the unmount succeeded. For some
reason we must use MNT_DETACH as otherwise the unmount call will fail
all time the time. However MNT_DETACH means it unmounts async in the
background. Now if we call remove on the file and the unmount was not
done yet it will fail with EBUSY. In this case we try again until it
works or we get another error.

This should help containers/podman#19721

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 marked this pull request as ready for review August 8, 2024 14:47
@Luap99 Luap99 changed the title try to improve netns cleanup to fix podman CI flakes improve netns cleanup Aug 8, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 8, 2024

Ok this should be good to go, it will not fix the flake though

@Luap99
Copy link
Member Author

Luap99 commented Aug 8, 2024

@mheon PTAL

@mheon
Copy link
Member

mheon commented Aug 8, 2024

LGTM

@rhatdan
Copy link
Member

rhatdan commented Aug 8, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 8, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit dc70ee3 into containers:main Aug 8, 2024
14 of 15 checks passed
@Luap99 Luap99 deleted the netns-cleanup branch August 9, 2024 08:49
@TomSweeneyRedHat TomSweeneyRedHat added the 5.2 Wanted for Podman v5.2 label Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.2 Wanted for Podman v5.2 approved lgtm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants