
[BUG] Deadlock on recreation of logging container #12103

Closed
smoehrle opened this issue Sep 5, 2024 · 8 comments · Fixed by #12122

smoehrle commented Sep 5, 2024

Description

See the example below. We have a fluentbit container and an app container that uses the fluentbit container in its logging configuration. If anything triggers a recreation of the fluentbit container, the following happens:

  • Fluentbit is recreated; it sits in status created and is not running
  • depends_on triggers a recreation of the app, which tries to terminate but cannot, since the fluentd log driver blocks termination (the app can only terminate once fluentbit is available)

Workaround: run docker start [project-name]-fluentbit-1 in a second terminal

Expected behaviour: No deadlock

Steps To Reproduce

# compose.yml
services:
  fluentbit:
    image: fluent/fluent-bit:3.1.7-debug
    ports:
      - "24224:24224"
      - "24224:24224/udp"
    environment:
      FOO: ${BAR}

  app:
    image: nginx
    depends_on:
      - fluentbit
    logging:
      driver: fluentd
      options:
        fluentd-address: 127.0.0.1:24224
# test.txt
BAR=test

  1. docker compose up -d
  2. docker compose --env-file test.txt up -d
    • The second command can be anything that triggers a recreate of the fluentbit container; here, the env file sets BAR, which changes FOO in the fluentbit service config (see the hash check below).
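
To confirm that the env file really changes the fluentbit service configuration, you can compare the service config hashes with and without it (a sketch, assuming a Compose version that supports the config --hash flag):

$ docker compose config --hash fluentbit
$ docker compose --env-file test.txt config --hash fluentbit
# the two hashes differ, so the next up recreates the fluentbit container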

Compose Version

Docker Compose version v2.24.5

Docker Environment

Client: Docker Engine - Community
 Version:    25.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.5
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 373
  Running: 3
  Paused: 0
  Stopped: 370
 Images: 770
 Server Version: 25.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc runsc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-44-generic
 Operating System: Ubuntu 23.10
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.12GiB
 Name: stefan-lenovo-x13
 ID: 9fa19f06-02fc-4ec3-a95b-28a56edf479e
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.17.0.0/12, Size: 20
   Base: 192.168.0.0/16, Size: 24

Anything else?

No response

ndeloof (Contributor) commented Sep 5, 2024

This is expected. Compose can't manage a container that is used to extend the Docker Engine's capabilities. As you noticed, the ability to manage a service container's lifecycle depends on this logging service being available. Using compose to manage this engine extension is not the right approach.

smoehrle (Author) commented Sep 6, 2024

And what would be the right way?

As of today, compose considers container lifecycles for commands like up and down, and that works as expected. So at least for me, it's not straightforward to understand why it doesn't do the same for recreation.

ndeloof (Contributor) commented Sep 6, 2024

As long as fluentbit is used to extend docker engine features, it should be managed the same way as dockerd: as a systemd service or comparable. Running it in a container just brings a chicken-and-egg challenge: how do you run dockerd so it can run fluentbit, if fluentbit isn't already running?
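
For illustration, such a unit might look roughly like this (a sketch only; the unit name and file path are hypothetical, while the image and ports are taken from the report above):

# /etc/systemd/system/fluent-bit.service (hypothetical path)
[Unit]
Description=Fluent Bit log collector for the fluentd log driver
After=docker.service
Requires=docker.service

[Service]
# remove any leftover container from a previous run; the "-" prefix ignores failure
ExecStartPre=-/usr/bin/docker rm -f fluent-bit
ExecStart=/usr/bin/docker run --rm --name fluent-bit \
  -p 24224:24224 -p 24224:24224/udp \
  fluent/fluent-bit:3.1.7-debug
Restart=always

[Install]
WantedBy=multi-user.target

Because the unit is ordered after docker.service, dockerd is up before fluent-bit starts, and fluent-bit's lifecycle is no longer entangled with the compose project that depends on it.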

smoehrle (Author) commented Sep 6, 2024

Aaah, ok. I can understand your point of view if fluentbit were used system-wide for all containers, or at least for additional containers outside of that single compose file. Then it makes sense and I see the chicken-and-egg problem.

The example above is a shortened version of a local dev environment. For developers, it should be a single, isolated environment to use. The fluentbit image is a custom image with the configuration already built in, to mimic the production environment as closely as possible. It's only used by other containers in that compose file and works really well if you just use up and down.

So at least from my point of view, all the required information is present to handle this situation, and I don't see the chicken-and-egg problem.

ndeloof (Contributor) commented Sep 9, 2024

Tested your example compose.yaml and can't reproduce:

$  docker compose --progress=plain restart
 Container machin-app-1  Restarting
 Container machin-fluentbit-1  Restarting
 Container machin-app-1  Started
 Container machin-fluentbit-1  Started

smoehrle (Author) commented Sep 9, 2024

I also don't see the deadlock with docker compose --progress=plain restart (but restart != recreate). Can you retry with the commands specified in the first post?

  1. docker compose up -d
  2. docker compose --env-file test.txt up -d

ndeloof (Contributor) commented Sep 10, 2024

Sure, docker compose up --force-recreate gets stuck as you describe in your issue, but as I explained, this is expected: when compose stops the fluentbit service for recreation, the other service (which is scheduled for recreation as well) gets stuck, because you have just broken your docker engine by killing a required extension.
To support your use case, the up command would need not just to manage (re)creating services in dependency order, but also to manage clean termination the way down does. This is feasible, but would introduce a significant increase in implementation complexity.
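
In the meantime, one way to sidestep the deadlock may be to recreate the logging service on its own first, so it is running again before its dependents are touched (a sketch using up's --no-deps flag; not an official recommendation from this thread):

# recreate only fluentbit, without touching its dependents
$ docker compose --env-file test.txt up -d --no-deps fluentbit
# then recreate the rest as usual
$ docker compose --env-file test.txt up -d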

smoehrle (Author):

> To support your use case, the up command would need not just to manage (re)creating services in dependency order, but also to manage clean termination the way down does.

This is exactly where my expectation came from: since down handles clean termination, I expected recreate to do the same.

But I can understand that it's up to you to decide if this is a bug or expected behavior. The technical parts of my report should be clear now. Thank you for taking the time to understand my problem!
