Testing Flake: iptables chain already exists #3447

Closed
cevich opened this issue Jun 27, 2019 · 21 comments · Fixed by #4028 or #3901
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@cevich
Member

cevich commented Jun 27, 2019

/kind bug

Description

Occasionally, automated testing fails due to a race condition involving CNI iptables (though other CNI tools could also race). Since firewalld is intended to mitigate this but its use is forbidden, synchronization needs to happen either within CNI or within libpod.

Steps to reproduce the issue:

  1. Run CI tests hundreds of billions of times under slightly varying conditions.

Describe the results you received:

Multiple tests fail with an error from CNI claiming "iptables chain FOO already exists".

Describe the results you expected:

Testing continuously passes despite slightly varying conditions.

Additional information you deem important (e.g. issue happens only occasionally):

containernetworking-plugins-0.7.5-1.fc30.x86_64

Output of podman version:

Version:            1.4.2
RemoteAPI Version:  1
Go Version:         go1.12.5
OS/Arch:            linux/amd64

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.12.5
  podman version: 1.4.2
host:
  BuildahVersion: 1.9.0
  Conmon:
    package: podman-1.4.2-1.fc30.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 0.2.0, commit: d7234dc01ae2ef08c42e3591e876723ad1c914c9'
  Distribution:
    distribution: fedora
    version: "30"
  MemFree: 2929360896
  MemTotal: 4133994496
  OCIRuntime:
    package: runc-1.0.0-93.dev.gitb9b6cc6.fc30.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc8+dev
      commit: e3b4c1108f7d1bf0d09ab612ea09927d9b59b4e3
      spec: 1.0.1-dev
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 2
  hostname: cevich-fedora-30-libpod-5081463649730560
  kernel: 5.1.12-300.fc30.x86_64
  os: linux
  rootless: false
  uptime: 5m 23.78s
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - quay.io
  - registry.fedoraproject.org
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 0
  GraphDriverName: overlay
  GraphOptions:
  - overlay.mountopt=nodev,metacopy=on
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  ImageStore:
    number: 0
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Additional environment details (AWS, VirtualBox, physical, etc.):

Fedora 30 VM running in GCE

/etc/cni/net.d/87-podman-bridge.conflist:

{
    "cniVersion": "0.3.0",
    "name": "podman",
    "plugins": [
      {
        "type": "bridge",
        "bridge": "cni0",
        "isGateway": true,
        "ipMasq": true,
        "ipam": {
            "type": "host-local",
            "subnet": "10.88.0.0/16",
            "routes": [
                { "dst": "0.0.0.0/0" }
            ]
        }
      },
      {
        "type": "portmap",
        "capabilities": {
          "portMappings": true
        }
      }
    ]
}
@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2019
@cevich
Member Author

cevich commented Jul 12, 2019

@mheon @baude either of you seen errors coming from CNI iptables that complain about CHAIN ALREADY EXISTS?

@mheon
Member

mheon commented Jul 12, 2019

I've heard of this mostly through Ed's tests. I have not personally seen these errors.

@mheon
Member

mheon commented Jul 12, 2019

Though, I will note Ed has a consistent reproducer.

@cevich
Member Author

cevich commented Jul 12, 2019

Ohhh, that's good to hear. @edsantiago would it be possible to use a fresh(er) build of CNI to see if the problem is fixed? (see the CNI issue I opened for details) I only see it very inconsistently in libpod CI 😞

@edsantiago
Member

would it be possible to use a fresh(er) build of CNI

I only see the problem on RHEL8, and I have no idea how to get a new CNI there

@adrianreber
Collaborator

Of the two RHEL8 VMs I have, I always see this on one of them and never on the other.

@cevich
Member Author

cevich commented Jul 15, 2019

@edsantiago darn, I was afraid you'd say that. I'd guesstimate I was seeing this hit maybe 1/10 CI jobs here (when I opened the issues). Across all distros IIRC. Now it seems much reduced, but I dislike leaving that up to chance.

I remember Matt saying it would be really, really expensive if podman did the locking for iptables.

The crappy thing is, CNI can literally just ignore this error. I can't imagine why anything should ever care about a chain existing on create, since that was the intention. There's no other data tied to existence or absence that I know of.

(AFAIK, the suspected upstream "fix" in golang doesn't ignore the error; it does something else.)

@edsantiago
Member

Feel free to follow along at rhbz1627561

@cevich
Member Author

cevich commented Jul 17, 2019

That BZ is private, but I will share that it's tracking the same issue and the fix I reported upstream in containernetworking/plugins#335. So I think we can be fairly confident that when that finalizes for RHEL, we can close all these issues as well.

@cevich
Member Author

cevich commented Jul 24, 2019

Dangit...saw this happen again on master:

[+0422s] Podman run 
[+0422s]   podman run a container based on a complex local image name
[+0422s]   /var/tmp/go/src/github.com/containers/libpod/test/e2e/run_test.go:53
[+0422s] 
[+0422s] [BeforeEach] Podman run
[+0422s]   /var/tmp/go/src/github.com/containers/libpod/test/e2e/run_test.go:30
[+0422s] [It] podman run a container based on a complex local image name
[+0422s]   /var/tmp/go/src/github.com/containers/libpod/test/e2e/run_test.go:53
[+0422s] Running: /usr/bin/podman --storage-opt vfs.imagestore=/tmp/podman/imagecachedir --root /tmp/podman_test990930550/crio --runroot /tmp/podman_test990930550/crio-run --runtime /usr/bin/runc --conmon /usr/libexec/podman/conmon --cni-config-dir /etc/cni/net.d --cgroup-manager cgroupfs --tmpdir /tmp/podman_test990930550 --storage-driver=vfs run libpod/alpine_nginx:latest ls
[+0422s] time="2019-07-24T11:07:42Z" level=error msg="unable to write system event: \"write unixgram @0014d->/run/systemd/journal/socket: sendmsg: no such file or directory\""
[+0422s] time="2019-07-24T11:07:42Z" level=error msg="unable to write pod event: \"write unixgram @0014d->/run/systemd/journal/socket: sendmsg: no such file or directory\""
[+0422s] Error: error adding firewall rules for container 4656cd271c0f45affe8868176783bd733c22c6b210e6fcc7e8337498f9dc3fa9: running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.

@mheon
Member

mheon commented Jul 24, 2019

We need a fixed release of containernetworking-plugins vendoring the patched go-iptables.

@baude
Member

baude commented Aug 2, 2019

@mheon did that happen?

@mheon
Member

mheon commented Aug 2, 2019

We are still waiting on a plugins tag

@mheon
Member

mheon commented Aug 7, 2019

We're carrying a patch for RHEL/Cent 8.1. We should look into doing the same for Fedora.

@cevich
Member Author

cevich commented Sep 23, 2019

Note to me: We have fresh images for all platforms in master. I will check and verify the version of CNI plugins that are present.

@cevich
Member Author

cevich commented Sep 24, 2019

This is what we have today:

  • fedora-30-libpod-5664838702858240
    • containernetworking-plugins-0.8.1-1.fc30.x86_64
    • Last change-log entry: Fri Jun 07 2019
  • fedora-29-libpod-5664838702858240
    • containernetworking-plugins-0.8.1-1.fc29.x86_64
    • Last change-log entry: <didn't look>
  • ubuntu-18-libpod-5664838702858240
    • containernetworking- 0.8.2-1~ubuntu18
    • Last change-log entry: Mon, 19 Aug 2019 17:32:57 +0000
  • ubuntu-19-libpod-5664838702858240
    • containernetworking-plugins 0.8.2-1~ubuntu19
    • Last change-log entry: Mon, 19 Aug 2019 17:32:55 +0000

I'm not able to tell if any of those were built with the iptables vendor code required to fix the problem. @lsm5 do you have a way to know what vendor code was used to make those packages? (specifically go-iptables 1.8.1 or later)

@cevich
Member Author

cevich commented Sep 25, 2019

The problem is definitely not fixed in Fedora: https://api.cirrus-ci.com/v1/task/6382599951351808/logs/integration_test.log

@cevich
Member Author

cevich commented Sep 25, 2019

@lsm5 can we please get new containernetworking-plugins packages that are built with vendor go-iptables 1.8.1 or later? I believe latest upstream should already specify that.

@edsantiago
Member

Looks like this is still happening:

[+0103s] [It] podman run add dns server
[+0103s]   /var/tmp/go/src/github.com/containers/libpod/test/e2e/run_dns_test.go:57
[+0103s] Running: podman [options] --events-backend file --storage-driver vfs run --dns=1.2.3.4 docker.io/library/alpine:latest cat /etc/resolv.conf
[+0103s] time="2020-02-20T14:42:34-05:00" level=error msg="Error adding network: running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.\n"
[+0103s] time="2020-02-20T14:42:34-05:00" level=error msg="Error while adding pod to CNI network \"podman\": running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.\n"
[+0103s] Error: error configuring network namespace for container e499a0072ca242a431b101da81b2484008e4d6f970deb9915c2ce5044ac00e3d: running [/usr/sbin/iptables -t filter -N CNI-FORWARD --wait]: exit status 1: iptables: Chain already exists.

Cirrus says this is fedora-30 but I have no way of knowing which version of containernetworking-plugins is installed.

@edsantiago edsantiago reopened this Feb 20, 2020
@edsantiago
Member

UPDATE: package_versions.log shows:

containernetworking-plugins-0.8.2-2.git485be65.fc30-x86_64

@rhatdan
Member

rhatdan commented Feb 21, 2020

@rhvgoyal was just showing some similar errors to this. We had to do a podman system reset to clear them up.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
7 participants