
[BUG] Deadlock on recreation of logging container #12103

Closed
smoehrle opened this issue Sep 5, 2024 · 8 comments · Fixed by #12122

smoehrle commented Sep 5, 2024

Description

See the example below. We have a fluentbit container and an app container that uses the fluentbit container in its logging configuration. If anything triggers a recreation of the fluentbit container, the following happens:

  • Fluentbit is recreated; it sits in status created and is not running
  • depends_on triggers a recreation of the app, which tries to terminate but cannot, since the fluentd log driver blocks termination (the app can only terminate once fluentbit is available)

Workaround: run docker start [project-name]-fluentbit-1 in a second terminal

Expected behaviour: No deadlock

Steps To Reproduce

# compose.yml
services:
  fluentbit:
    image: fluent/fluent-bit:3.1.7-debug
    ports:
      - "24224:24224"
      - "24224:24224/udp"
    environment:
      FOO: ${BAR}

  app:
    image: nginx
    depends_on:
      - fluentbit
    logging:
      driver: fluentd
      options:
        fluentd-address: 127.0.0.1:24224
# test.txt
BAR=test

  1. docker compose up -d
  2. docker compose --env-file test.txt up -d
    • The second command can be anything that triggers a recreate of the fluentbit container; here, the env file sets BAR, which changes FOO in the fluentbit service config (see the hash check below).
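
To confirm that the env file really changes the fluentbit service configuration, you can compare the service config hashes with and without it (a sketch, assuming a Compose version that supports the config --hash flag):

$ docker compose config --hash fluentbit
$ docker compose --env-file test.txt config --hash fluentbit
# the two hashes differ, so the next up recreates the fluentbit container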

Compose Version

Docker Compose version v2.24.5

Docker Environment

Client: Docker Engine - Community
 Version:    25.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.5
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 373
  Running: 3
  Paused: 0
  Stopped: 370
 Images: 770
 Server Version: 25.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc runsc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-44-generic
 Operating System: Ubuntu 23.10
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.12GiB
 Name: stefan-lenovo-x13
 ID: 9fa19f06-02fc-4ec3-a95b-28a56edf479e
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.17.0.0/12, Size: 20
   Base: 192.168.0.0/16, Size: 24

Anything else?

No response

ndeloof (Contributor) commented Sep 5, 2024

This is expected. Compose can't manage a container that is used to extend the Docker Engine's capabilities. As you noticed, the ability to manage a service container's lifecycle depends on this logging service being available. Using compose to manage this engine extension is not the right approach.

smoehrle (Author) commented Sep 6, 2024

And what would be the right way?

As of today, compose considers container lifecycles for commands like up and down, and that works as expected. So at least for me, it's not straightforward to understand why it doesn't do the same for recreation.

ndeloof (Contributor) commented Sep 6, 2024

As long as fluentbit is used to extend docker engine features, it should be managed the same way as dockerd: as a systemd service or comparable. Running it in a container just brings a chicken-and-egg challenge: how do you run dockerd so it can run fluentbit, if fluentbit isn't already running?
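
For illustration, such a unit might look roughly like this (a sketch only; the unit name and file path are hypothetical, while the image and ports are taken from the report above):

# /etc/systemd/system/fluent-bit.service (hypothetical path)
[Unit]
Description=Fluent Bit log collector for the fluentd log driver
After=docker.service
Requires=docker.service

[Service]
# remove any leftover container from a previous run; the "-" prefix ignores failure
ExecStartPre=-/usr/bin/docker rm -f fluent-bit
ExecStart=/usr/bin/docker run --rm --name fluent-bit \
  -p 24224:24224 -p 24224:24224/udp \
  fluent/fluent-bit:3.1.7-debug
Restart=always

[Install]
WantedBy=multi-user.target

Because the unit is ordered after docker.service, dockerd is up before fluent-bit starts, and fluent-bit's lifecycle is no longer entangled with the compose project that depends on it.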

smoehrle (Author) commented Sep 6, 2024

Aaah, ok. I can understand your point of view if fluentbit were used system-wide for all containers, or at least for additional containers outside of that single compose file. Then it makes sense and I see the chicken-and-egg problem.

The example above is a shortened version of a local dev environment. For developers, it should be a single, isolated environment to use. The fluentbit image is a custom image with the configuration already built in, to mimic the production environment as closely as possible. It's only used by other containers in that compose file and works really well if you just use up and down.

So at least from my point of view, all the required information is present to handle this situation, and I don't see the chicken-and-egg problem.

ndeloof (Contributor) commented Sep 9, 2024

Tested your example compose.yaml and can't reproduce:

$  docker compose --progress=plain restart
 Container machin-app-1  Restarting
 Container machin-fluentbit-1  Restarting
 Container machin-app-1  Started
 Container machin-fluentbit-1  Started

smoehrle (Author) commented Sep 9, 2024

I also don't see the deadlock with docker compose --progress=plain restart (but restart != recreate). Can you retry with the commands specified in the first post?

  1. docker compose up -d
  2. docker compose --env-file test.txt up -d

ndeloof (Contributor) commented Sep 10, 2024

Sure, docker compose up --force-recreate gets stuck as you describe in your issue, but as I explained, this is expected: when compose stops the fluentbit service for recreation, the other service (which is scheduled for recreation as well) gets stuck, because you have just broken your docker engine by killing a required extension.
To support your use case, the up command would need not just to manage (re)creating services in dependency order, but also to manage clean termination the way down does. This is feasible, but would introduce a significant increase in implementation complexity.
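
In the meantime, one way to sidestep the deadlock may be to recreate the logging service on its own first, so it is running again before its dependents are touched (a sketch using up's --no-deps flag; not an official recommendation from this thread):

# recreate only fluentbit, without touching its dependents
$ docker compose --env-file test.txt up -d --no-deps fluentbit
# then recreate the rest as usual
$ docker compose --env-file test.txt up -d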

smoehrle (Author):

> To support your use case, the up command would need not just to manage (re)creating services in dependency order, but also to manage clean termination the way down does.

This is exactly where my expectation came from: since down handles clean termination, I expected recreate to do the same.

But I can understand that it's up to you to decide if this is a bug or expected behavior. The technical parts of my report should be clear now. Thank you for taking the time to understand my problem!
