
DNS Stops working after some time in bridge network #500

Closed
czarnas opened this issue Aug 28, 2024 · 12 comments · Fixed by #503

czarnas commented Aug 28, 2024

Issue Description

The issue I have is that my podman containers stop resolving internal and external DNS after some time, roughly one hour.
If I restart podman or reboot the system, I can resolve all of the DNS records and ping between containers or the external network.
After ~1h I can no longer resolve DNS, ping between containers, or reach the outside network.

I'm running a named bridge network.

The issue started to show up after upgrading podman 5:5.1.2-1.fc40 -> 5:5.2.1-1.fc40 and, maybe most importantly, netavark 1.11.0-1.fc40 -> 2:1.12.1-1.fc40 and aardvark-dns 1.11.0-1.fc40 -> 2:1.12.1-1.fc40.

Steps to reproduce the issue

  1. Start the system
  2. Bring the containers up with compose
  3. After ~1h, the containers can no longer ping each other

Describe the results you received

Right after container start:

DNS Check:
root@ca958a38fca0:/# nslookup gitea
Server:         10.89.0.1
Address:        10.89.0.1:53

Non-authoritative answer:
Name:   gitea.dns.podman
Address: 10.89.0.8
Name:   gitea.dns.podman
Address: 10.89.0.8

Non-authoritative answer:

root@ca958a38fca0:/# ping gitea
PING gitea (10.89.0.8): 56 data bytes
64 bytes from 10.89.0.8: seq=0 ttl=42 time=0.106 ms
64 bytes from 10.89.0.8: seq=1 ttl=42 time=0.148 ms
^C
--- gitea ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.106/0.127/0.148 ms

After ~1h:

root@f7e795eac0a0:/# nslookup gitea
;; connection timed out; no servers could be reached

root@f7e795eac0a0:/# ping google.com
ping: bad address 'google.com'
root@f7e795eac0a0:/#

Journalctl contains following entries:

Aug 27 16:56:59 Nighthawk systemd[1]: Started run-rf52e2a5ee0fb4ca9916ea051963e2963.scope - /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run.
Aug 27 17:13:44 Nighthawk aardvark-dns[1377]: 45306 dns request got empty response
Aug 27 21:57:59 Nighthawk aardvark-dns[1377]: No configuration found stopping the sever
Aug 27 22:02:36 Nighthawk systemd[1]: Started run-r7e803c9914cd438983b19242454a19a2.scope - /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run.
Aug 27 22:24:46 Nighthawk aardvark-dns[35457]: No configuration found stopping the sever
Aug 27 22:25:13 Nighthawk systemd[1]: Started run-rc7f6cb4647da460c8c4bd5890036e989.scope - /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run.

Describe the results you expected

Network working all the time

podman info output

host:
  arch: amd64
  buildahVersion: 1.37.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc40.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 98.85
    systemPercent: 0.36
    userPercent: 0.79
  cpus: 4
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: iot
    version: "40"
  eventLogger: journald
  freeLocks: 2009
  hostname: Nighthawk
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.10.6-200.fc40.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 28301148160
  memTotal: 33379938304
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.12.1-1.fc40.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.1
    package: netavark-1.12.1-1.fc40.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.1
  ociRuntime:
    name: crun
    package: crun-1.15-1.fc40.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240821.g1d6142f-1.fc40.x86_64
    version: |
      pasta 0^20240821.g1d6142f-1.fc40.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-2.fc40.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.5
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 16h 48m 54.00s (Approximately 0.67 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 16
    paused: 0
    running: 15
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.imagestore: /usr/lib/containers/storage
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 73391005696
  graphRootUsed: 20967243776
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 33
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.2.1
  Built: 1723593600
  BuiltTime: Wed Aug 14 02:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.5
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.1

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

Running Fedora IoT 40 latest
Running through compose



Luap99 commented Aug 28, 2024

Is the aardvark-dns process still running when the DNS stops working? Does ss -tulpn show it listening on port 53?


czarnas commented Aug 28, 2024

Yes, in both cases:

❯ ss -tulpn | grep :53
udp   UNCONN 213248 0          10.89.0.1:53         0.0.0.0:*    users:(("aardvark-dns",pid=42323,fd=11))
udp   UNCONN 0      0         127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=19))
udp   UNCONN 0      0      127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=17))
udp   UNCONN 0      0            0.0.0.0:5355       0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=10))
udp   UNCONN 0      0               [::]:5355          [::]:*    users:(("systemd-resolve",pid=973,fd=12))
tcp   LISTEN 0      4096      127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=20))
tcp   LISTEN 0      1024       10.89.0.1:53         0.0.0.0:*    users:(("aardvark-dns",pid=42323,fd=12))
tcp   LISTEN 0      4096   127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=18))
tcp   LISTEN 0      4096         0.0.0.0:5355       0.0.0.0:*    users:(("systemd-resolve",pid=973,fd=11))
tcp   LISTEN 0      4096            [::]:5355          [::]:*    users:(("systemd-resolve",pid=973,fd=13))
❯ ps -aux |grep aardvark
root       42323  0.0  0.0 276424  3384 ?        Ssl  Aug27   0:00 /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run


Luap99 commented Aug 28, 2024

udp UNCONN 213248 0 10.89.0.1:53 0.0.0.0:* users:(("aardvark-dns",pid=42323,fd=11))

The Recv-Q value seems way too high, which suggests we no longer read anything off the socket, so this is some form of logic bug.

If you drop the -l from the ss call, so ss -tupn, what open connections do you see for aardvark-dns?


czarnas commented Aug 28, 2024

There is only one active connection:

❯ ss -tupn | grep aardvark
tcp   ESTAB 0      0                 10.89.0.1:53              10.89.0.23:40292 users:(("aardvark-dns",pid=42323,fd=5))


Luap99 commented Aug 28, 2024

Can you check which container uses 10.89.0.23? podman network inspect <network_name> should show you the attached containers with their IP addresses listed. It seems odd that this tcp connection stays open for so long.

This makes sense to me then: we listen asynchronously for either an incoming udp or tcp connection, so the server never processes two connections at the same time. As long as the tcp connection doesn't send any data, we simply do nothing. That is most likely something we should fix.

But it would be good to know where it hangs. Can you run gdb -p <aardvark-dns-pid> -ex="thread apply all bt" -batch and show me the output?

Also, it sounds like there is a tcpkill tool that you could try to use to close the open tcp connection; I would think that makes it work again.
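The serial handling described above can be sketched in a few lines of Python. This is illustrative only (aardvark-dns is written in Rust; none of the names here are from its code): a server that accepts one TCP connection at a time and blocks reading from it lets a single idle client stall every other request, which matches the symptoms in this issue.

```python
import socket
import threading

def serial_server(srv):
    # Serve connections strictly one at a time: accept, then block in
    # recv() until that client sends data. A client that never sends
    # anything keeps us stuck here, and nobody else gets served.
    while True:
        conn, _ = srv.accept()
        data = conn.recv(1024)       # blocks forever on an idle client
        if data:
            conn.sendall(data)       # echo back, standing in for a DNS answer
        conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))           # any free port
srv.listen(8)
port = srv.getsockname()[1]
threading.Thread(target=serial_server, args=(srv,), daemon=True).start()

# A misbehaving client opens a connection but never sends data...
bad = socket.create_connection(("127.0.0.1", port))

# ...so a well-behaved client's query is never even read.
good = socket.create_connection(("127.0.0.1", port))
good.sendall(b"query")
good.settimeout(0.5)
try:
    good.recv(1024)
    stalled = False
except socket.timeout:
    stalled = True                   # server is still stuck in recv() on `bad`

print("good client stalled:", stalled)   # True

# Closing the idle connection (roughly what tcpkill would do) unblocks
# the server, and the pending query is finally answered:
bad.close()
good.settimeout(2.0)
reply = good.recv(1024)
print("reply after closing idle conn:", reply)   # b'query'
```

This also illustrates why stopping the offending container (done later in this thread) immediately restored DNS: it closed the idle connection the server was blocked on.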


czarnas commented Aug 28, 2024

10.89.0.23 is occupied by the stalwart container:

               "6e83fc72587b6d61da0f907672570afb29f7ccd1275ff9454033646926f8fc11": {
                    "name": "stalwart",
                    "interfaces": {
                         "eth0": {
                              "subnets": [
                                   {
                                        "ipnet": "10.89.0.23/24",
                                        "gateway": "10.89.0.1"
                                   }
                              ],
                              "mac_address": "8a:6c:1b:4c:d8:23"
                         }
                    }
               },

I don't have GDB installed on the machine. I will install it, which requires a machine restart (rpm-ostree), and come back with the output after it "hangs" again.


Luap99 commented Aug 28, 2024

I don't have GDB installed on the machine. I will install it, which requires a machine restart (rpm-ostree), and come back with the output after it "hangs" again.

You can start a container with --pid=host --privileged and install/use gdb there


czarnas commented Aug 28, 2024

podman run --pid=host --privileged haggaie/gdb gdb -p 42323 -ex="thread apply all bt" -batch
[New LWP 42324]
[New LWP 42325]
[New LWP 42326]
[New LWP 42327]

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
0x00007fbd86e7a3dd in ?? ()

Thread 5 (LWP 42327):
#0  0x00007fbd86e7a3dd in ?? ()
#1  0x0000558c7ec353c2 in ?? ()
#2  0x00000000ffffffff in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 4 (LWP 42326):
#0  0x00007fbd86e7a3dd in ?? ()
#1  0x0000558c7ec353c2 in ?? ()
#2  0x00000000ffffffff in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 3 (LWP 42325):
#0  0x00007fbd86e7ca32 in ?? ()
#1  0x00007fbd867ff890 in ?? ()
#2  0xffffffff7ebb93b2 in ?? ()
#3  0x0000558cab1380e0 in ?? ()
#4  0x0000000400000400 in ?? ()
#5  0x00007fbd867ff7f0 in ?? ()
#6  0x0000558c7ebc3a2f in ?? ()
#7  0x0000000000000000 in ?? ()

Thread 2 (LWP 42324):
#0  0x00007fbd86e7a3dd in ?? ()
#1  0x0000558c7ec353c2 in ?? ()
#2  0x00000000ffffffff in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 1 (LWP 42323):
#0  0x00007fbd86e7a3dd in ?? ()
#1  0x0000558c7ec353c2 in ?? ()
#2  0x00007ffeffffffff in ?? ()
#3  0x0000000000000000 in ?? ()
[Inferior 1 (process 42323) detached]


czarnas commented Aug 28, 2024

I've stopped the stalwart container, and it immediately fixed the DNS issues. Now I'm wondering why it hangs after some time; it used to work flawlessly.


Luap99 commented Aug 28, 2024

podman run --pid=host --privileged haggaie/gdb gdb -p 42323 -ex="thread apply all bt" -batch
...

Oh sorry, I think you must make sure to use the exact same Fedora version image (fedora:40) and then install gdb there, so that the linker and such match.

I've stopped the stalwart container, and it immediately fixed the DNS issues. Now I'm wondering why it hangs after some time; it used to work flawlessly.

Large parts of aardvark-dns were rewritten by me for 1.12; most importantly, aardvark-dns didn't even support tcp connections at all before. It is not clear to me why the tcp connection stays open; it may be our fault or the client's, but either way we need to fix this in aardvark-dns, because a single client should never be allowed to make the server non-functional.


Luap99 commented Aug 28, 2024

I know how to reproduce the tcp hang myself, so I do not need the full stack trace from you. I am moving the issue to the aardvark-dns repo as it is a bug there.

Luap99 transferred this issue from containers/podman Aug 28, 2024

czarnas commented Aug 28, 2024

Thank you very much for your help. Let me know if I can assist in any way further.
Any workaround for now would be more than welcome :)

Luap99 self-assigned this Sep 2, 2024
Luap99 added a commit to Luap99/aardvark-dns that referenced this issue Sep 4, 2024
Right now, for a single network, all requests were processed serially, and
with tcp a caller is able to block us for a long time if it just opens
the connection but sends very little or no data. To avoid this, always
spawn a new task when we accept a new tcp connection.

We could do the same for udp; however, my testing with contrib/perf/run.sh
has shown that it slows things down, as the overhead of spawning a task
is greater than the few quick, simple map lookups, so we only spawn where
needed. We still have to spawn when forwarding external requests, as these
can take a long time.

Fixes containers#500

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
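The pattern the commit message describes, spawning a task per accepted tcp connection, can be sketched in Python (illustrative only; the actual fix is in aardvark-dns's Rust code, and these names are not from it). With a handler spawned per connection, an idle client only parks its own handler, and the accept loop keeps serving everyone else:

```python
import socket
import threading

def handle(conn):
    # Each connection gets its own thread. A client that never sends
    # data blocks only this handler, not the accept loop below.
    data = conn.recv(1024)
    if data:
        conn.sendall(data)           # echo back, standing in for a DNS answer
    conn.close()

def concurrent_server(srv):
    # The accept loop never blocks on any one client: it hands each
    # accepted connection off to a new thread immediately.
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))           # any free port
srv.listen(8)
port = srv.getsockname()[1]
threading.Thread(target=concurrent_server, args=(srv,), daemon=True).start()

bad = socket.create_connection(("127.0.0.1", port))   # idle client, sends nothing

good = socket.create_connection(("127.0.0.1", port))
good.sendall(b"query")
good.settimeout(2.0)
reply = good.recv(1024)              # answered despite the idle connection
print("reply:", reply)               # b'query'
bad.close()
```

The commit's trade-off also shows up here: spawning per request costs something, which is why the fix spawns only for tcp (and for forwarded external requests), where a single request can block for a long time, and keeps the cheap udp map lookups serial.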