
docker task driver overrides max_kill_timeout #17023

Closed
louievandyke opened this issue Apr 28, 2023 · 8 comments · Fixed by #17731
Assignees
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/driver/docker type/bug
Milestone

Comments

@louievandyke
Contributor

Describe the bug

 
The Docker task driver's hard-coded dockerTimeout setting supersedes Nomad's max_kill_timeout setting.
 

Steps to reproduce the behavior.

Configure a max_kill_timeout longer than 5 minutes. Run a Nomad job that uses the docker driver, then initiate a stop. After the initial signal is sent via nomad alloc stop, the docker driver sends another signal after 5 * time.Minute, the driver's hard-coded timeout.
 

Expected behavior

The expected behavior is that Nomad's max_kill_timeout will override task driver timeout settings.
 
Product Version: Current 1.5.x

ubuntu@ip-172-31-17-63:~$ cat /etc/nomad.d/nomad.hcl | grep max
  max_kill_timeout = "10m"
ubuntu@ip-172-31-17-63:~$
job "simple" {
  datacenters = ["dc1"]
  type = "batch"

  periodic {
    cron             = "*/1 * * * *"
    prohibit_overlap = true
  }

  group "simple" {
    task "simple" {
      driver       = "docker"
      kill_timeout = "10m"

      config {
        image   = "ubuntu:latest"
        command = "bash"
        args    = ["-c", "/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/bin/bash

# set trap
trap 'echo "Received SIGTERM doing nothing..."' SIGTERM
i=1

for i in {1..250}; do
#Print message with counter i
echo "[$(date -u)] running the loop for $i times; sleeping"
sleep 5
#Increment the counter by one
((i++))
done
EOF
        perms       = "755"
        destination = "local/script.sh"
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}
ubuntu@ip-172-31-17-63:~$ nomad job status
ID                          Type            Priority  Status   Submit Date
example                     service         50        running  2022-07-22T15:15:01Z
plugin-aws-ebs-controller   service         50        running  2022-07-23T01:11:39Z
plugin-aws-ebs-nodes        system          50        running  2022-07-23T01:11:35Z
plugin-efs                  system          50        running  2022-07-23T01:08:48Z
simple                      batch/periodic  50        running  2022-08-02T05:45:57Z
simple/periodic-1659454200  batch           50        running  2022-08-02T15:30:00Z
ubuntu@ip-172-31-17-63:~$ nomad job status simple/
ID            = simple/periodic-1659454200
Name          = simple/periodic-1659454200
Submit Date   = 2022-08-02T15:30:00Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
simple      0       0         1        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
0d0aca38  4885b6de  simple      0        run      running  7s ago   4s ago

ubuntu@ip-172-31-17-63:~$ nomad alloc status 0d
ID                  = 0d0aca38-4195-e8e0-cd70-a8785a5ef2a6
Eval ID             = d28788f4
Name                = simple/periodic-1659454200.simple[0]
Node ID             = 4885b6de
Node Name           = ip-172-31-17-63
Job ID              = simple/periodic-1659454200
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 30s ago
Modified            = 27s ago

Task "simple" is "running"
Task Resources
CPU        Memory           Disk     Addresses
3/100 MHz  416 KiB/100 MiB  300 MiB

Task Events:
Started At     = 2022-08-02T15:30:02Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2022-08-02T15:30:02Z  Started     Task started by client
2022-08-02T15:30:00Z  Driver      Downloading image
2022-08-02T15:30:00Z  Task Setup  Building Task Directory
2022-08-02T15:30:00Z  Received    Task received by client

ubuntu@ip-172-31-17-63:~$ nomad alloc stop 0d
==> 2022-08-02T15:30:50Z: Monitoring evaluation "eef8571c"
    2022-08-02T15:30:50Z: Evaluation triggered by job "simple/periodic-1659454200"
    2022-08-02T15:30:50Z: Allocation "d66f9e99" created: node "4885b6de", group "simple"
    2022-08-02T15:30:51Z: Evaluation status changed: "pending" -> "complete"
==> 2022-08-02T15:30:51Z: Evaluation "eef8571c" finished with status "complete"
ubuntu@ip-172-31-17-63:~$ nomad alloc status 0d
ID                   = 0d0aca38-4195-e8e0-cd70-a8785a5ef2a6
Eval ID              = d28788f4
Name                 = simple/periodic-1659454200.simple[0]
Node ID              = 4885b6de
Node Name            = ip-172-31-17-63
Job ID               = simple/periodic-1659454200
Job Version          = 0
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being migrated
Created              = 1m ago
Modified             = 10s ago
Replacement Alloc ID = d66f9e99

Task "simple" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  420 KiB/100 MiB  300 MiB

Task Events:
Started At     = 2022-08-02T15:30:02Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2022-08-02T15:30:50Z  Killing     Sent interrupt. Waiting 10m0s before force killing
2022-08-02T15:30:02Z  Started     Task started by client
2022-08-02T15:30:00Z  Driver      Downloading image
2022-08-02T15:30:00Z  Task Setup  Building Task Directory
2022-08-02T15:30:00Z  Received    Task received by client
So we see the Killing event message "Sent interrupt. Waiting 10m0s before force killing", which looks good. However, if I tail the alloc logs for the container, I see an additional SIGTERM in the output before 10m has elapsed. That looks like it's coming from the hard-coded dockerTimeout in Nomad 1.5.3's docker/config.go.
Here is the log output. The first SIGTERM is from when I ran nomad alloc stop 0d. I didn't touch Nomad after that, yet I observed another SIGTERM at [Tue Aug 2 15:35:52 UTC 2022] running the loop for 71 times; sleeping:
ubuntu@ip-172-31-17-63:~$ nomad alloc logs -f 0d
[Tue Aug  2 15:30:02 UTC 2022] running the loop for 1 times; sleeping
[Tue Aug  2 15:30:07 UTC 2022] running the loop for 2 times; sleeping
[Tue Aug  2 15:30:12 UTC 2022] running the loop for 3 times; sleeping
[Tue Aug  2 15:30:17 UTC 2022] running the loop for 4 times; sleeping
[Tue Aug  2 15:30:22 UTC 2022] running the loop for 5 times; sleeping
[Tue Aug  2 15:30:27 UTC 2022] running the loop for 6 times; sleeping
[Tue Aug  2 15:30:32 UTC 2022] running the loop for 7 times; sleeping
[Tue Aug  2 15:30:37 UTC 2022] running the loop for 8 times; sleeping
[Tue Aug  2 15:30:42 UTC 2022] running the loop for 9 times; sleeping
[Tue Aug  2 15:30:47 UTC 2022] running the loop for 10 times; sleeping
Received SIGTERM doing nothing...
[Tue Aug  2 15:30:52 UTC 2022] running the loop for 11 times; sleeping
[Tue Aug  2 15:30:57 UTC 2022] running the loop for 12 times; sleeping
[Tue Aug  2 15:31:02 UTC 2022] running the loop for 13 times; sleeping
[Tue Aug  2 15:31:07 UTC 2022] running the loop for 14 times; sleeping
[Tue Aug  2 15:31:12 UTC 2022] running the loop for 15 times; sleeping
[Tue Aug  2 15:31:17 UTC 2022] running the loop for 16 times; sleeping
[Tue Aug  2 15:31:22 UTC 2022] running the loop for 17 times; sleeping
[Tue Aug  2 15:31:27 UTC 2022] running the loop for 18 times; sleeping
[Tue Aug  2 15:31:32 UTC 2022] running the loop for 19 times; sleeping
[Tue Aug  2 15:31:37 UTC 2022] running the loop for 20 times; sleeping
[Tue Aug  2 15:31:42 UTC 2022] running the loop for 21 times; sleeping
[Tue Aug  2 15:31:47 UTC 2022] running the loop for 22 times; sleeping
[Tue Aug  2 15:31:52 UTC 2022] running the loop for 23 times; sleeping
[Tue Aug  2 15:31:57 UTC 2022] running the loop for 24 times; sleeping
[Tue Aug  2 15:32:02 UTC 2022] running the loop for 25 times; sleeping
[Tue Aug  2 15:32:07 UTC 2022] running the loop for 26 times; sleeping
[Tue Aug  2 15:32:12 UTC 2022] running the loop for 27 times; sleeping
[Tue Aug  2 15:32:17 UTC 2022] running the loop for 28 times; sleeping
[Tue Aug  2 15:32:22 UTC 2022] running the loop for 29 times; sleeping
[Tue Aug  2 15:32:27 UTC 2022] running the loop for 30 times; sleeping
[Tue Aug  2 15:32:32 UTC 2022] running the loop for 31 times; sleeping
[Tue Aug  2 15:32:37 UTC 2022] running the loop for 32 times; sleeping
[Tue Aug  2 15:32:42 UTC 2022] running the loop for 33 times; sleeping
[Tue Aug  2 15:32:47 UTC 2022] running the loop for 34 times; sleeping
[Tue Aug  2 15:32:52 UTC 2022] running the loop for 35 times; sleeping
[Tue Aug  2 15:32:57 UTC 2022] running the loop for 36 times; sleeping
[Tue Aug  2 15:33:02 UTC 2022] running the loop for 37 times; sleeping
[Tue Aug  2 15:33:07 UTC 2022] running the loop for 38 times; sleeping
[Tue Aug  2 15:33:12 UTC 2022] running the loop for 39 times; sleeping
[Tue Aug  2 15:33:17 UTC 2022] running the loop for 40 times; sleeping
[Tue Aug  2 15:33:22 UTC 2022] running the loop for 41 times; sleeping
[Tue Aug  2 15:33:27 UTC 2022] running the loop for 42 times; sleeping
[Tue Aug  2 15:33:32 UTC 2022] running the loop for 43 times; sleeping
[Tue Aug  2 15:33:37 UTC 2022] running the loop for 44 times; sleeping
[Tue Aug  2 15:33:42 UTC 2022] running the loop for 45 times; sleeping
[Tue Aug  2 15:33:47 UTC 2022] running the loop for 46 times; sleeping
[Tue Aug  2 15:33:52 UTC 2022] running the loop for 47 times; sleeping
[Tue Aug  2 15:33:57 UTC 2022] running the loop for 48 times; sleeping
[Tue Aug  2 15:34:02 UTC 2022] running the loop for 49 times; sleeping
[Tue Aug  2 15:34:07 UTC 2022] running the loop for 50 times; sleeping
[Tue Aug  2 15:34:12 UTC 2022] running the loop for 51 times; sleeping
[Tue Aug  2 15:34:17 UTC 2022] running the loop for 52 times; sleeping
[Tue Aug  2 15:34:22 UTC 2022] running the loop for 53 times; sleeping
[Tue Aug  2 15:34:27 UTC 2022] running the loop for 54 times; sleeping
[Tue Aug  2 15:34:32 UTC 2022] running the loop for 55 times; sleeping
[Tue Aug  2 15:34:37 UTC 2022] running the loop for 56 times; sleeping
[Tue Aug  2 15:34:42 UTC 2022] running the loop for 57 times; sleeping
[Tue Aug  2 15:34:47 UTC 2022] running the loop for 58 times; sleeping
[Tue Aug  2 15:34:52 UTC 2022] running the loop for 59 times; sleeping
[Tue Aug  2 15:34:57 UTC 2022] running the loop for 60 times; sleeping
[Tue Aug  2 15:35:02 UTC 2022] running the loop for 61 times; sleeping
[Tue Aug  2 15:35:07 UTC 2022] running the loop for 62 times; sleeping
[Tue Aug  2 15:35:12 UTC 2022] running the loop for 63 times; sleeping
[Tue Aug  2 15:35:17 UTC 2022] running the loop for 64 times; sleeping
[Tue Aug  2 15:35:22 UTC 2022] running the loop for 65 times; sleeping
[Tue Aug  2 15:35:27 UTC 2022] running the loop for 66 times; sleeping
[Tue Aug  2 15:35:32 UTC 2022] running the loop for 67 times; sleeping
[Tue Aug  2 15:35:37 UTC 2022] running the loop for 68 times; sleeping
[Tue Aug  2 15:35:42 UTC 2022] running the loop for 69 times; sleeping
[Tue Aug  2 15:35:47 UTC 2022] running the loop for 70 times; sleeping
[Tue Aug  2 15:35:52 UTC 2022] running the loop for 71 times; sleeping
Received SIGTERM doing nothing...
[Tue Aug  2 15:35:57 UTC 2022] running the loop for 72 times; sleeping
[Tue Aug  2 15:36:02 UTC 2022] running the loop for 73 times; sleeping
[Tue Aug  2 15:36:07 UTC 2022] running the loop for 74 times; sleeping
[Tue Aug  2 15:36:12 UTC 2022] running the loop for 75 times; sleeping
[Tue Aug  2 15:36:18 UTC 2022] running the loop for 76 times; sleeping
[Tue Aug  2 15:36:23 UTC 2022] running the loop for 77 times; sleeping
@louievandyke louievandyke added type/bug hcc/cst Admin - internal labels Apr 28, 2023
@jrasell jrasell added stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/driver/docker labels May 4, 2023
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation May 4, 2023
@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage May 4, 2023
@jrasell
Member

jrasell commented May 4, 2023

Hi @louievandyke and thanks for raising this detailed issue.

@joshuaclausen

Just to confirm the behavior: I can see the same thing happening on v1.4.2. Exactly 5 minutes after an allocation is given the "stop" command, the container exits unexpectedly with exit code 137. Exit code 137 can indicate an OOM kill, but that certainly has not happened in this case.

In the nomad client logs I can see the following:

client.driver_mgr.docker: failed to stop container: container_id=a320656bdfe16b224a4e88e48815164e9c5a1bb877eb3e038ea445d2b1bbac0d driver=docker error="Post \"http://unix.sock/containers/a320656bdfe16b224a4e88e48815164e9c5a1bb877eb3e038ea445d2b1bbac0d/stop?t=3610\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

If I run a manual curl command to stop a container (I believe this is the same call the Nomad docker driver makes):

curl --silent -XPOST --unix-socket /run/docker.sock -H 'Content-Type: application/json' http://localhost/containers/a320656bdfe16b224a4e88e48815164e9c5a1bb877eb3e038ea445d2b1bbac0d/stop?t=1200

I see that the command blocks in bash until either the container exits or the timeout is reached, so my sense is that Nomad assumes the docker engine has become unresponsive and reverts to the 5-minute default fallback Alex Dadgar mentioned here: #2119 (comment)

Docker driver has an http client with an explicit 5 minute timeout. We use this because occasionally docker engine stops being responsive and we don't want it to hang the Nomad client.

My guess is that max_kill_timeout should have been applied to that "explicit 5 minute timeout".

My use case is that one or more end users could be interacting with the processes running in the allocation, and I have some logic in my application to intercept the kill signal coming from docker and Nomad and then wait until the last end user disconnects before shutting down the process (thus allowing the allocation to fully stop). This can take 1-2 hours in some cases.

@srisch

srisch commented May 30, 2023

@joshuaclausen We're also seeing this in production and have similar shutdown times (1-2 hours). I was able to reproduce it consistently with Docker 23; I've since reverted us to Docker 20.10 and it's working like it should. Currently on Nomad 1.5.5.

@mikenomitch
Contributor

I had an afternoon free and started to dig into this a bit. I had a "fix" PR up, but there were still open questions about why that fix actually worked.

I documented the open questions in this comment.

Hope this can give somebody else a head start on getting an official fix up. I'm unassigning myself as I'll have to focus on producty things in the near future. Will work with engineering to get this in the queue for right after the 1.6 beta goes out.

@srisch

srisch commented Jun 12, 2023

Thanks @mikenomitch. I actually opened an issue this morning, #17499, which may be related. I also filed one with Docker after the rabbit hole I went down, as I'm not entirely sure whether it's a Nomad issue or a Docker API issue.

@shoenig shoenig self-assigned this Jun 21, 2023
@shoenig shoenig added this to the 1.6.0 milestone Jun 21, 2023
@shoenig shoenig moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Jun 22, 2023
@shoenig
Member

shoenig commented Jun 22, 2023

@louievandyke and @mikenomitch, can you note the docker version you are using? So far I haven't been able to reproduce on:


➜ docker version
Client:
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.1
 Git commit:        20.10.21-0ubuntu1~22.04.3
 Built:             Thu Apr 27 05:57:17 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.1
  Git commit:       20.10.21-0ubuntu1~22.04.3
  Built:            Thu Apr 27 05:37:25 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.12-0ubuntu1~22.04.1
  GitCommit:
 runc:
  Version:          1.1.4-0ubuntu1~22.04.3
  GitCommit:
 docker-init:
  Version:          0.19.0
  GitCommit:

@mikenomitch
Contributor

@shoenig here you go

Client:
 Cloud integration: v1.0.22
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:46:56 2021
 OS/Arch:           darwin/arm64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:07 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

If you want to pair on a repro, let me know. Maybe we'll catch something we're doing differently. Or it could be Mac related?

@shoenig
Member

shoenig commented Jun 23, 2023

Fixed my repro, had a typo in the template shell script 😨

I tried running against a few versions of docker to see what's up, and it turns out that with recent versions the symptom goes from bad to worse. Starting with 23.0.0, instead of a SIGTERM at 5 minutes it sends SIGKILL.

nomad  | docker-ce | docker-client | status
v1.4.8 | 19.03.15  | 24.0.2        | SIGTERM @ 5 minutes
main   | 19.03.15  | 24.0.2        | SIGTERM @ 5 minutes
main   | 20.10.24  | 24.0.2        | SIGTERM @ 5 minutes
main   | 23.0.0    | 24.0.2        | DIED (!) @ 5 minutes
main   | 24.0.2    | 24.0.2        | DIED (!) @ 5 minutes

It is still unclear whether docker is issuing the signal on its own or on behalf of an erroneous instruction coming from Nomad, but I just wanted to note that upgrading Nomad or Docker doesn't help.

Here is what happens from the docker events perspective (24.0.2). The first kill is the SIGTERM from the alloc stop; the second kill, 5 minutes later, is a SIGKILL.


2023-06-23T16:37:07.667122693Z container kill bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38 (com.hashicorp.nomad.alloc_id=2d670a69-3416-d85c-804d-1ac6a5dcd807, image=ubuntu:22.04, name=task-2d670a69-3416-d85c-804d-1ac6a5dcd807, org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=22.04, signal=15)

2023-06-23T16:42:07.667387039Z container kill bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38 (com.hashicorp.nomad.alloc_id=2d670a69-3416-d85c-804d-1ac6a5dcd807, image=ubuntu:22.04, name=task-2d670a69-3416-d85c-804d-1ac6a5dcd807, org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=22.04, signal=9)
2023-06-23T16:42:07.833312265Z network disconnect b64bbd45ae6e42e1ac95809580ebe3600cd6a972c1486efb5ef3a8fadd0db02e (container=bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38, name=bridge, type=bridge)
2023-06-23T16:42:07.859385914Z container stop bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38 (com.hashicorp.nomad.alloc_id=2d670a69-3416-d85c-804d-1ac6a5dcd807, image=ubuntu:22.04, name=task-2d670a69-3416-d85c-804d-1ac6a5dcd807, org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=22.04)
2023-06-23T16:42:07.870921068Z container die bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38 (com.hashicorp.nomad.alloc_id=2d670a69-3416-d85c-804d-1ac6a5dcd807, execDuration=377, exitCode=137, image=ubuntu:22.04, name=task-2d670a69-3416-d85c-804d-1ac6a5dcd807, org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=22.04)
2023-06-23T16:42:07.889131223Z container destroy bb1be4eb3fb54a677cbe065720b872cb14b5092a1c7a6f3f55c1900b88a65e38 (com.hashicorp.nomad.alloc_id=2d670a69-3416-d85c-804d-1ac6a5dcd807, image=ubuntu:22.04, name=task-2d670a69-3416-d85c-804d-1ac6a5dcd807, org.opencontainers.image.ref.name=ubuntu, org.opencontainers.image.version=22.04)

My next step is to decorate Nomad with a whole bunch of extra trace logging to see if we can't observe it doing something unexpected.

shoenig added a commit that referenced this issue Jun 26, 2023
This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients
- one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short lived / async calls to docker. The other
has no timeout and is the responsibility of the caller to set a context
that will ensure the call eventually terminates.

The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.

This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.

Fixes #17023
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 26, 2023
shoenig added a commit that referenced this issue Jun 26, 2023
* drivers/docker: refactor use of clients in docker driver

This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients
- one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short lived / async calls to docker. The other
has no timeout and is the responsibility of the caller to set a context
that will ensure the call eventually terminates.

The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.

This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.

Fixes #17023

* cr: followup items