Add jobs support to CLI #2262

Merged 1 commit on May 6, 2020

Contributor

@dperny dperny commented Jan 16, 2020

- What I did

Add support to the CLI for swarm jobs (moby/moby#40307).

Does not include compose support.

- How I did it

  • Added two new modes accepted by the --mode flag
    • replicated-job creates a replicated job
    • global-job creates a global job.
  • When using replicated-job mode, the replicas flag sets the TotalCompletions parameter of the job. This is the total number of tasks that will run.
  • Added a new flag, max-concurrent, for use with replicated-job mode. This flag sets the MaxConcurrent parameter of the job, which is the maximum number of replicas the job will run simultaneously.
  • When using replicated-job or global-job mode, using any of the update parameter flags will result in an error, as jobs cannot be updated in the traditional sense.
  • Updated the docker service ls UI to include the completion status (completed vs total tasks) if the service is a job.
  • Updated the progress bars UI for service creation and update to support jobs. For jobs, a bar shows the overall progress of the job (the number of tasks completed over the total number of tasks to complete).
  • Added documentation explaining the use of the new flags, and of jobs in general.
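To make the flag handling in the list above concrete, here is a minimal sketch in Go. It is not the code from this PR: the types `ServiceMode`, `Replicated`, `ReplicatedJob` and `Options`, the function `ToServiceMode`, and all error messages are hypothetical stand-ins, and rejecting `max-concurrent` outside `replicated-job` mode is an assumption drawn from the description "for use with replicated-job mode".

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceMode is a simplified, hypothetical stand-in for the swarm service
// mode. Exactly one field is non-nil after a successful conversion.
type ServiceMode struct {
	Replicated    *Replicated
	Global        *struct{}
	ReplicatedJob *ReplicatedJob
	GlobalJob     *struct{}
}

// Replicated describes a regular replicated service.
type Replicated struct{ Replicas *uint64 }

// ReplicatedJob carries the two job parameters named in the description.
type ReplicatedJob struct {
	MaxConcurrent    *uint64
	TotalCompletions *uint64
}

// Options holds the flag values. A nil pointer means the flag was not set.
type Options struct {
	Mode           string
	Replicas       *uint64
	MaxConcurrent  *uint64
	UpdateFlagsSet bool // true if any update parameter flag was given
}

// ToServiceMode maps the flags to a service mode (illustrative only).
func ToServiceMode(o Options) (ServiceMode, error) {
	var m ServiceMode
	isJob := o.Mode == "replicated-job" || o.Mode == "global-job"
	if isJob && o.UpdateFlagsSet {
		return m, errors.New("update flags cannot be used with jobs")
	}
	// Assumption: max-concurrent is rejected outside replicated-job mode.
	if o.MaxConcurrent != nil && o.Mode != "replicated-job" {
		return m, errors.New("max-concurrent requires replicated-job mode")
	}
	switch o.Mode {
	case "", "replicated":
		m.Replicated = &Replicated{Replicas: o.Replicas}
	case "replicated-job":
		// For jobs, the replicas flag becomes TotalCompletions.
		m.ReplicatedJob = &ReplicatedJob{
			MaxConcurrent:    o.MaxConcurrent,
			TotalCompletions: o.Replicas,
		}
	case "global", "global-job":
		if o.Replicas != nil {
			return m, errors.New("replicas cannot be used with global modes")
		}
		if o.Mode == "global" {
			m.Global = &struct{}{}
		} else {
			m.GlobalJob = &struct{}{}
		}
	default:
		return m, fmt.Errorf("unknown mode: %s", o.Mode)
	}
	return m, nil
}

func main() {
	total, concurrent := uint64(10), uint64(2)
	m, err := ToServiceMode(Options{
		Mode: "replicated-job", Replicas: &total, MaxConcurrent: &concurrent,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("total completions: %d, max concurrent: %d\n",
		*m.ReplicatedJob.TotalCompletions, *m.ReplicatedJob.MaxConcurrent)
}
```

The sketch keeps validation ahead of the mode switch so that a job combined with update flags fails before any mode is built, which mirrors the behaviour described above without claiming to match the real implementation.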

- How to verify it

Includes automated tests for all major changes.

- Description for the changelog

Added CLI support for swarm jobs.

@dperny dperny requested a review from thaJeztah as a code owner January 16, 2020 15:40
@dperny dperny force-pushed the swarm-jobs branch 2 times, most recently from fc6a2cd to d34f273 Compare January 16, 2020 15:58
@dperny
Contributor Author

dperny commented Jan 16, 2020

Added nolint: gocyclo to the (*serviceOptions).ToService method. In exchange for this concession, I have added a documentation comment to that method.

@mathroc

mathroc commented Jan 17, 2020

@dperny I see that a job will have to be force-updated to run again. Would it be possible to have a --rm flag that removes the job once completed (like docker run --rm does)?

That way we don’t have to worry about whether the job should be created or force-updated (when running a database migration job, for example).

Alternatively, would it be possible to create a job service without starting it, and have a command to run this job when we like? The benefits here would be that the job configuration always stays the same and can easily be included in a stack, and docker would keep the task history (unlike with the --rm proposal above).

@dperny
Contributor Author

dperny commented Jan 17, 2020

@mathroc for whatever reason, I had not considered the possibility of an --rm flag, actually. I'm unsure how to implement it correctly, and it certainly won't make it into this release. That said, if my distant memories of being a mediocre Ruby on Rails developer are somewhat intact, database migrations should be idempotent, mitigating the possibility of screwing things up by accidentally re-running a job.

Second, to create a job without starting it, you can just set --replicas to 0, which should be valid. Or, for a global job, set a placement constraint that cannot be met.

@mathroc

mathroc commented Jan 18, 2020

Initializing a service job with --replicas 0 might be enough.

The problem with the database migration in my example was not that it could run twice, but that I thought I would have to query docker to see whether the service job already exists and then either create the service or force-update it. But with --replicas 0, I can initialize the service once (e.g. with docker stack deploy), and then I don’t have to worry: I just have to run docker service update --replicas 1 --force to run the job, and it should work every time.

thx @dperny

@dperny dperny changed the title WIP: Add jobs support to CLI Add jobs support to CLI Jan 20, 2020
@dperny
Contributor Author

dperny commented Jan 20, 2020

Removed WIP. The support for jobs upstream was merged.

@SvenDowideit
Contributor

@thaJeztah Is there some chance / way this can be expected in the next release? I'm presuming there will be a docker-v20.04?

@dperny dperny force-pushed the swarm-jobs branch 2 times, most recently from 67d4da0 to e1dabab Compare February 20, 2020 17:59
@dperny
Contributor Author

dperny commented Feb 20, 2020

Rebased to hopefully fix merge conflicts.

@silvin-lubecki
Contributor

The code itself looks good to me, but I need to take it for a spin to check the UX and the whole feature 👍
Anyway thank you @dperny for this awesome work 🐱

@nkabbara

Hello! Is there an ETA for this feature?

@nkabbara

> Initializing a service job with --replicas 0 might be enough.
>
> The problem with the database migration in my example was not that it could run twice, but that I thought I would have to query docker to see whether the service job already exists and then either create the service or force-update it. But with --replicas 0, I can initialize the service once (e.g. with docker stack deploy), and then I don’t have to worry: I just have to run docker service update --replicas 1 --force to run the job, and it should work every time.
>
> thx @dperny

Hi @mathroc, curious about which migration strategy you settled on.

I'm thinking about something similar to what you've suggested:

  1. Create a migrations service with 0 replicas and restart-condition set to none.
  2. Update the image with the new release.
  3. Scale replicas up to 1.


```bash
$ docker service create --name mythrottledjob \
--mode replicated-job \
```

Collaborator

Maybe --kind=job --mode=replicated?

Collaborator

Alternatively this could be docker job create even if the API is for a service.

Member

While docker job create is probably clean(er), there are trade-offs. Advantages:

  • separate subcommand
  • we could hide/remove flags that don't apply to jobs

Downsides:

  • given that they're both backed by a service, we need to filter out jobs from docker service ls (etc.) and vice versa.
  • that might become confusing if we don't apply the same filter very strictly (docker service rm <job service> could otherwise remove a job)
  • also think of docker service create myjob, which could show an error that service myjob already exists, while docker service ls wouldn't show it

@thaJeztah
Member

thaJeztah commented Apr 21, 2020

Reviewing this together with @silvin-lubecki; I'll post notes along the way (sorry for the extra noise).

I tried running the example you included in the documentation:

docker service create --name myjob \
   --mode replicated-job \
   bash "true"

Output looks good to me

job progress: 1 out of 1 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/1: complete  [==================================================>]
job complete

What I think is slightly confusing, is that "REPLICAS" shows 0/1: we expect this job to run once, so should it show /1 ?

docker service ls
ID                  NAME                MODE                REPLICAS              IMAGE               PORTS
1tzs32id63hz        myjob               replicated job      0/1 (1/1 completed)   bash:latest

Thinking about whether I can come up with a better presentation for that 🤔

Trying with more replicas:

docker service rm myjob
docker service create --name myjob --mode replicated-job --replicas=4 bash "true"
vbtoewcdxz17hfa14p2kua96r
job progress: 4 out of 4 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/4: complete  [==================================================>]
2/4: complete  [==================================================>]
3/4: complete  [==================================================>]
4/4: complete  [==================================================>]
job complete
docker service ls
ID                  NAME                MODE                REPLICAS              IMAGE               PORTS
vbtoewcdxz17        myjob               replicated job      0/4 (4/4 completed)   bash:latest

@thaJeztah
Member

Slightly confusing:

$ docker service scale myjob=2
myjob: scale can only be used with replicated mode

The job is replicated, so perhaps we should change this to "cannot be used with jobs" instead of mentioning the replicated mode?

@thaJeztah
Member

thaJeztah commented Apr 21, 2020

Looks like the compose schema (or validation) needs some updating; using this compose file:

version: "3.9"
services:
  job:
    image: bash
    command: "true"
    deploy:
      mode: "replicated-job"
      replicas: 6

I get an error:

docker stack deploy -c docker-compose.yml mystack
Creating network mystack_default
service job: Unknown mode: replicated-job

I tried updating the compose code:

diff --git a/cli/compose/convert/service.go b/cli/compose/convert/service.go
index da182bbf..9ce91b90 100644
--- a/cli/compose/convert/service.go
+++ b/cli/compose/convert/service.go
@@ -609,12 +609,12 @@ func convertDeployMode(mode string, replicas *uint64) (swarm.ServiceMode, error)
 	serviceMode := swarm.ServiceMode{}

 	switch mode {
-	case "global":
+	case "global", "global-job":
 		if replicas != nil {
 			return serviceMode, errors.Errorf("replicas can only be used with replicated mode")
 		}
 		serviceMode.Global = &swarm.GlobalService{}
-	case "replicated", "":
+	case "replicated", "replicated-job", "":
 		serviceMode.Replicated = &swarm.ReplicatedService{Replicas: replicas}
 	default:
 		return serviceMode, errors.Errorf("Unknown mode: %s", mode)

After that, docker stack deploy "worked", but behind the scenes, it keeps failing; probably because docker stack deploy updates the service after it's been created (adding a network alias). By the time it tries doing so, the task exited already, so a new task is created;

docker service ls
ID                  NAME                MODE                REPLICAS                IMAGE               PORTS
uq7b0h3v6ghf        mystack_job         replicated          0/6                     bash:latest
DEBU[2020-04-21T13:23:24.888699099Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(tasks.job, 10.0.1.69, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888713104Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(mystack_job, 10.0.1.2, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888719214Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(job, 10.0.1.2, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888725306Z] addServiceBinding from addServiceInfoToCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.888921206Z] addServiceInfoToCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.889020616Z] EnableService 745ed888b6fc250aa77129acc92efe3c86ae56ea0b2d9d9cf2e78acc92c561ac DONE
DEBU[2020-04-21T13:23:24.889211225Z] deleteServiceInfoFromCluster from sbLeave START for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.889277881Z] rmServiceBinding from deleteServiceInfoFromCluster START for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab p:0xc00092bd00 nid:t1holdww2ke12dtrzc8u5cf76 sKey:{uq7b0h3v6ghf9t6su2svo3o6r } deleteSvc:true
DEBU[2020-04-21T13:23:24.889579250Z] rmServiceBinding fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab delete t1holdww2ke12dtrzc8u5cf76, p:0xc001fcc300 in loadbalancers len:0
DEBU[2020-04-21T13:23:24.891846283Z] state for task nzfaz6g6xq9je6isbt5zc2qh1 updated to COMPLETE  method="(*Dispatcher).processUpdates" module=dispatcher node.id=cof80xs5o119fw86iyoj4g1eb state.transition="STARTING->COMPLETE" task.id=nzfaz6g6xq9je6isbt5zc2qh1
DEBU[2020-04-21T13:23:24.892457593Z] dispatcher committed status update to store   method="(*Dispatcher).processUpdates" module=dispatcher node.id=cof80xs5o119fw86iyoj4g1eb state.transition="STARTING->COMPLETE" task.id=nzfaz6g6xq9je6isbt5zc2qh1
ERRO[2020-04-21T13:23:24.896670528Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.934970842Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.936440772Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.937721697Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.938833516Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.945949923Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.947207460Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.948256879Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.949100243Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.951210680Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.951912778Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.952897854Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.953746905Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.954535721Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.956941017Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.958569629Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.959791120Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.960749464Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.962392561Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.963731630Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.965130174Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.966460429Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.967377257Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.968119396Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.968828863Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
ERRO[2020-04-21T13:23:24.969258109Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:24.986545081Z] deleteEndpointNameResolution fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab mystack_job rm_service:true suppress:false sAliases:[job] tAliases:[745ed888b6fc]
DEBU[2020-04-21T13:23:24.986952947Z] delContainerNameResolution fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab mystack_job.6.n9ein9uqdaal1iz9ofnrlp5mj
DEBU[2020-04-21T13:23:24.987067454Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(mystack_job.6.n9ein9uqdaal1iz9ofnrlp5mj, 10.0.1.69, <nil>, true) rmServiceBinding sid:fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.987134811Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(745ed888b6fc, 10.0.1.69, <nil>, true) rmServiceBinding sid:fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.987266665Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(tasks.mystack_job, 10.0.1.69, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.987487926Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(tasks.job, 10.0.1.69, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.987826764Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(mystack_job, 10.0.1.2, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.988210127Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(job, 10.0.1.2, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.988428037Z] rmServiceBinding from deleteServiceInfoFromCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.988553281Z] deleteServiceInfoFromCluster from sbLeave END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:25.032479740Z] Revoking external connectivity on endpoint gateway_c8a85dc2ca97 (1852b1637272d761266368a9aaa7ae92db121dba94dd081f7d05de816eccf998)
DEBU[2020-04-21T13:23:25.034617553Z] DeleteConntrackEntries purged ipv4:0, ipv6:0
DEBU[2020-04-21T13:23:25.036293286Z] (*worker).Update                              len(assignments)=30 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036400099Z] (*worker).reconcileSecrets                    len(removedSecrets)=0 len(updatedSecrets)=0 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036425854Z] (*worker).reconcileConfigs                    len(removedConfigs)=0 len(updatedConfigs)=0 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036762896Z] (*worker).reconcileTaskState                  len(removedTasks)=25 len(updatedTasks)=5 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036865221Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=41ctnmnrtlxmy0y25zeypuvbb
DEBU[2020-04-21T13:23:25.036991513Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=uve1k8wnxu6f9aqjeyyaj5rby
DEBU[2020-04-21T13:23:25.037227835Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=dexskffhpf6kdg3fcq5425ly1
DEBU[2020-04-21T13:23:25.037288598Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=n9ein9uqdaal1iz9ofnrlp5mj
DEBU[2020-04-21T13:23:25.037554188Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=fwdyyzse2za6250edat9ipb8m
DEBU[2020-04-21T13:23:25.037741656Z] Could not find network sandbox for container mystack_job.1.41ctnmnrtlxmy0y25zeypuvbb on service binding deactivation request
DEBU[2020-04-21T13:23:25.037855034Z] Could not find network sandbox for container mystack_job.3.jzdojq6rhvjstq497ivin4t6f on service binding deactivation request
DEBU[2020-04-21T13:23:25.039183320Z] Could not find network sandbox for container mystack_job.1.858lwuzkldxpnsitrddc3xufv on service binding deactivation request
DEBU[2020-04-21T13:23:25.038237993Z] Could not find network sandbox for container mystack_job.6.mhnw31hqedi0suglvi6hlagqf on service binding deactivation request
DEBU[2020-04-21T13:23:25.038373469Z] Could not find network sandbox for container mystack_job.2.yrouuun8qlcteih936149fvmf on service binding deactivation request
DEBU[2020-04-21T13:23:25.038400634Z] Could not find network sandbox for container mystack_job.5.5ji21r2vthmqld15zandztqz6 on service binding deactivation request
DEBU[2020-04-21T13:23:25.038465065Z] Could not find network sandbox for container mystack_job.6.oruzoa6aps23e5vfo4ga1winl on service binding deactivation request
DEBU[2020-04-21T13:23:25.039286354Z] Could not find network sandbox for container mystack_job.6.b5lwk0fovf9693pglumovqd1g on service binding deactivation request
DEBU[2020-04-21T13:23:25.038113392Z] Could not find network sandbox for container mystack_job.1.a526t4vylb020ailytz3ec1uo on service binding deactivation request
DEBU[2020-04-21T13:23:25.038690939Z] Could not find network sandbox for container mystack_job.3.lgd2xt92l7y05tcu291ho0xkq on service binding deactivation request
DEBU[2020-04-21T13:23:25.038730482Z] Could not find network sandbox for container mystack_job.1.cgmzrchzdewstm5yzk2xh80fw on service binding deactivation request
DEBU[2020-04-21T13:23:25.038734658Z] Could not find network sandbox for container mystack_job.2.n0g0rb4pvuuchel57owaihqqx on service binding deactivation request

@silvin-lubecki
Contributor

Tried with multiple replicas:

$ docker service create --name test --replicas=10 --max-concurrent=2  --mode=replicated-job bash true
idja6562brblp995gnq7ps78i
job progress: 10 out of 10 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/10: complete  [==================================================>]
2/10: complete  [==================================================>]
3/10: complete  [==================================================>]
4/10: complete  [==================================================>]
5/10: complete  [==================================================>]
6/10: complete  [==================================================>]
7/10: complete  [==================================================>]
8/10: complete  [==================================================>]
9/10: complete  [==================================================>]
10/10: complete  [==================================================>]
job complete

Then tried a ps

$ docker service ps test
ID                  NAME                             IMAGE                NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
tti15rwuoglx        test.1                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
u0ef31o2mgyq        test.2                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
avvo2zxxysu8        test.3                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
l022qhhmz148        test.4                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
u3sc4nht7va9        test.5                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
zsgfzezhit0y        test.6                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
t1dx8jej05lq        test.7                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
teon4syxhul9        test.8                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
e7qb4wn8ew7h        test.9                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
lankjaond9iy        test.9h335bnxjxm95yf2n8uuz4kk3   hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago

This part of the UI/UX looks nice so far 👍

@thaJeztah
Member

Trying with a long running container as job; creating (as expected) continues waiting for it to complete, so I had to CTRL-C;

docker service create --mode=replicated-job --name=longy nginx:alpine
qx9ol95pxk0i7ztz5291wspc8
job progress: 0 out of 1 complete [>                                                  ]
active tasks: 1 out of 1 tasks
1/1: running   [=============================================>     ]
^COperation continuing in background.

After that, I killed the container:

docker kill fdad641d54a6

Looking at docker service ls

docker service ls
ID                  NAME                MODE                REPLICAS                IMAGE               PORTS
qx9ol95pxk0i        longy               replicated job      1/1 (0/1 completed)     nginx:alpine

I can see a new task was created for the service

docker service ps longy
ID                  NAME                                  IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR                         PORTS
3fwmn61dybah        longy.cof80xs5o119fw86iyoj4g1eb       nginx:alpine        1db0546f51e6        Complete            Running 4 minutes ago
l0qiutahfdmr         \_ longy.cof80xs5o119fw86iyoj4g1eb   nginx:alpine        1db0546f51e6        Shutdown            Failed 4 minutes ago    "task: non-zero exit (137)"

Is it expected that a new task is created if one fails, or should it terminate the job, and mark it as "failed" ?

c8wgl7q4ndfd  frontend  replicated      5/5                  nginx:alpine
dmu1ept4cxcf  redis     replicated      3/3                  redis:3.0.6
iwe3278osahj  mongo     global          7/7                  mongo:3.3
hh08h9uu8uwr  job       replicated-job  1/1 (3/5 completed)  nginx:latest
Member

is 1/1 correct here?

Contributor Author

Yes. It implies that 1 task is still running, 3 tasks are completed, and 5 tasks are desired. This would imply the job is running 5 iterations one after another.
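The reading given in the reply above can be sketched as a small formatter. This is only an illustration: the `JobStatus` type, its field names and `FormatReplicas` are hypothetical and are not taken from the CLI code. The first pair is "running / desired at once", the second is "completed / total completions".

```go
package main

import "fmt"

// JobStatus is a hypothetical summary of a job's progress.
type JobStatus struct {
	Running   uint64 // tasks currently running
	Desired   uint64 // tasks that should be running at once
	Completed uint64 // tasks that have finished successfully
	Total     uint64 // total completions requested for the job
}

// FormatReplicas renders the REPLICAS column for a job in the same shape
// as the documentation excerpt: "running/desired (completed/total completed)".
func FormatReplicas(s JobStatus) string {
	return fmt.Sprintf("%d/%d (%d/%d completed)",
		s.Running, s.Desired, s.Completed, s.Total)
}

func main() {
	// One task running, three finished, five wanted overall.
	fmt.Println(FormatReplicas(JobStatus{Running: 1, Desired: 1, Completed: 3, Total: 5}))
	// prints: 1/1 (3/5 completed)
}
```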


Jobs are a special kind of service designed to run an operation to completion
and then stop, as opposed to running long-running daemons. When a Task
belonging to a job exits successfully (return value 0), the Task is marked as
Member

@thaJeztah thaJeztah Apr 21, 2020

See my other comment; do we want failed tasks to be started / replaced / tried again by default? Or should it have --restart-condition=none ?

@thaJeztah
Member

One thing I'm thinking of; should we have an alias (on the CLI) for replicated-job to allow people to just set --mode=job (which is the same as --mode=replicated-job)?

@thaJeztah
Member

I'm overall good with the current UX. I think that fully separating "jobs" from regular services would not be possible (because they share the same constructs under the hood). That said; it would be possible to add docker job subcommands in future (or just write a simple plugin for this). When doing so, we should (I think) not try to hide that jobs are services (just show both in docker service ps, and just make docker job create a shorthand / convenience function).

Some things I think should be addressed:

  • better call out that the default is --restart-condition=failure, so users must set a custom restart policy if they do not want to have a job run multiple times if it fails
  • perhaps discuss the --mode=job alias (open to input)
  • ideally we'd have docker stack deploy working with this (but could be added in future if it's problematic to get working)

@thaJeztah
Member

Discussing with @tonistiigi @cpuguy83 - haven't checked yet, but we need to check what happens if I create a job with --restart-condition=any (is it rejected, or ok? because that would make it exactly the same as a regular service)

@dperny
Contributor Author

dperny commented Apr 24, 2020

  1. The problem where completed jobs are showing 0/4 (4/4 Completed) is actually a bit of a bug in Swarmkit. In the ServiceStatus, Swarmkit should not be setting the denominator to MaxReplicas, but should instead be setting it to the lesser of MaxReplicas or TotalCompletions - CompletedTasks. It's an easy fix, but it's not in this code.
  2. docker service scale should be usable with jobs, and the fact that it's not is a consequence of me overlooking it.
  3. Compose support for jobs isn't in this PR. I was going to open a second PR with compose support. I can add it to this PR if desired.
  4. It is expected that, if a Task fails, a new task should be spawned, until the desired number of completions is reached. The exception should be if --restart-condition=none is set.
  5. RestartOnAny is treated the same as RestartOnFailure if the service is a Job. This needs to be added both to the documentation here and to the swagger docs in the main repo, actually. This behavior isn't accidental; it was a deliberate decision (IIRC, it was part of the jobs design spec).
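
The arithmetic in point 1 can be sketched as follows. This is only an illustration of the proposed fix, not Swarmkit's code: the function name `desired_running` and its parameter names are hypothetical, with `max_replicas` standing in for the value called MaxReplicas above.

```python
def desired_running(max_replicas: int, total_completions: int, completed_tasks: int) -> int:
    """Denominator proposed in point 1: the lesser of MaxReplicas and
    the number of completions still outstanding."""
    remaining = total_completions - completed_tasks
    return min(max_replicas, remaining)


# A finished job (4 of 4 completions) has nothing left to run,
# so the denominator would be 0 rather than 4.
print(desired_running(max_replicas=4, total_completions=4, completed_tasks=4))

# A job with 5 total completions, 3 already done, running one task at a time.
print(desired_running(max_replicas=1, total_completions=5, completed_tasks=3))
```

The sketch does not clamp negative values or handle global jobs; it only shows why the denominator shrinks as completions accumulate.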

@dperny
Contributor Author

dperny commented Apr 24, 2020

I'm opposed to the alias of --mode=job for --mode=replicated-job primarily because it makes the documentation unwieldy.

* Added two new modes accepted by the `--mode` flag
  * `replicated-job` creates a replicated job
  * `global-job` creates a global job.
* When using `replicated-job` mode, the `replicas` flag sets the
  `TotalCompletions` parameter of the job. This is the total number of
  tasks that will run
* Added a new flag, `max-concurrent`, for use with `replicated-job`
  mode. This flag sets the `MaxConcurrent` parameter of the job, which
  is the maximum number of replicas the job will run simultaneously.
* When using `replicated-job` or `global-job` mode, using any of the
  update parameter flags will result in an error, as jobs cannot be
  updated in the traditional sense.
* Updated the `docker service ls` UI to include the completion status
  (completed vs total tasks) if the service is a job.
* Updated the progress bars UI for service creation and update to
  support jobs. For jobs, a single bar shows the overall progress of
  the job (the number of tasks completed over the total number of
  tasks to complete).
* Added documentation explaining the use of the new flags, and of jobs
  in general.

Signed-off-by: Drew Erny <derny@mirantis.com>
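
To illustrate the flag handling described in the commit message, here is a rough sketch in Python (the CLI itself is written in Go, and this is not its implementation). The function `job_spec_from_flags`, its arguments, and the dictionary it returns are all hypothetical; only the mode names and the `TotalCompletions` / `MaxConcurrent` parameter names come from the commit message. How the real CLI treats `--replicas` or `--max-concurrent` in `global-job` mode is not stated here, so the sketch simply leaves them out.

```python
JOB_MODES = {"replicated-job", "global-job"}


def job_spec_from_flags(mode, replicas=None, max_concurrent=None, update_flags=()):
    """Map CLI-style flag values onto a job description (hypothetical shape)."""
    if mode not in JOB_MODES:
        raise ValueError(f"not a job mode: {mode}")
    # Jobs cannot be updated in the traditional sense, so any update
    # parameter flag is an error in either job mode.
    if update_flags:
        raise ValueError(f"update flags not allowed with {mode}: {sorted(update_flags)}")
    if mode == "global-job":
        # Assumption: replicas/max_concurrent are not applied to global jobs.
        return {"mode": mode}
    return {
        "mode": mode,
        "TotalCompletions": replicas,     # set by --replicas
        "MaxConcurrent": max_concurrent,  # set by --max-concurrent
    }


spec = job_spec_from_flags("replicated-job", replicas=5, max_concurrent=2)
print(spec)
```

Passing any update flag name, for example `update_flags=["update-parallelism"]`, raises `ValueError` in this sketch, mirroring the error described above.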
@thaJeztah
Member

Sorry for the delay (again)

> Compose support for jobs isn't in this PR. I was going to open a second PR with compose support. I can add it to this PR if desired.

I think it's ok to leave it out of compose for now; we should perhaps consider if we want it to be a separate "entity" inside compose files (instead of just an option for mode?)

> It is expected that, if a Task fails, a new task should be spawned, until the desired number of completions is reached. The exception should be if --restart-condition=none is set.

There's something to be said for both sides; either I want (e.g.) a migration to run (but don't try to run it again if it failed), or have a guarantee that all my jobs will at least continue until completed.

I think it's ok in the current implementation, as long as we're explicit about this in the documentation so that users are not caught by surprise
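
For readers who want the restart behaviour spelled out, here is a minimal sketch of the rules as described in this thread. It is not the Swarmkit implementation: both function names are hypothetical, the condition strings follow the CLI's `none` / `on-failure` / `any` values (the thread abbreviates the second as "failure"), and limits such as a maximum number of restart attempts are ignored.

```python
def effective_restart_condition(condition: str, is_job: bool) -> str:
    """For jobs, "any" is treated the same as "on-failure"."""
    if is_job and condition == "any":
        return "on-failure"
    return condition


def should_replace_task(exit_code: int, condition: str) -> bool:
    """Decide whether a job task that has exited gets a replacement task."""
    condition = effective_restart_condition(condition, is_job=True)
    if exit_code == 0:
        # Successful exit: the task counts toward the job's completions.
        return False
    # Failed exit: replaced unless the restart condition is "none".
    return condition == "on-failure"


# A failed migration is retried by default, but not with "none".
print(should_replace_task(1, "on-failure"))
print(should_replace_task(1, "none"))
```

With the default condition, a failing task keeps being replaced until the desired number of completions is reached, which is the surprise the documentation needs to call out.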

> I'm opposed to the alias of --mode=job for --mode=replicated-job primarily because it makes the documentation unwieldy

👍 mostly me thinking out loud; could also be easily added in future if there's a strong need for it, so no blocker from my perspective

@thaJeztah
Member

@dperny I see you pushed after my previous comment; were there specific things you addressed/changed?

@dperny
Contributor Author

dperny commented May 1, 2020

Yes. I fixed docker service scale to work with replicated jobs, and I reworded some docs to address general comments (although I can't now remember exactly what I reworded. I think it had to do with restart-condition).

Contributor

@silvin-lubecki silvin-lubecki left a comment

LGTM, thanks a lot for that PR @dperny 🎉

Member

@thaJeztah thaJeztah left a comment

LGTM. Let's merge; we can tweak docs later if needed 👍

thanks @dperny !

@thaJeztah thaJeztah merged commit 4f05814 into docker:master May 6, 2020
@thaJeztah thaJeztah added this to the 20.03.0 milestone Jun 10, 2020
@silvin-lubecki silvin-lubecki mentioned this pull request Aug 17, 2020
@Ohtar10

Ohtar10 commented Oct 21, 2020

This is really good stuff!

Question: I see some discussion w.r.t. compose support. Would that have its own PR too, then? Would it make it into the 20.x.0 release?

Thanks!

@thaJeztah
Member

@Ohtar10 compose support has not been added yet. Some discussion may be needed on whether we implement this as a "mode" for services, or whether a new "jobs" top-level property is added. Perhaps "mode" could be implemented as a (temporary?) solution, but it may need some work; #2262 (comment)

(contributions should be welcome though!)

@Ohtar10

Ohtar10 commented Oct 22, 2020

Cool, thanks for the reply @thaJeztah.

As an end user, and after trying the feature from the test channel, I would say that if jobs remain part of the docker service create command, then the compose file should do the same for the sake of consistency: handle it as a "mode", at least as a temporary solution, as you mention.

In case the feature evolves into something more elaborate (e.g., cronjobs, additional configurable properties, etc.), then I think it would make sense to have a top-level property not only in the compose file but at the docker CLI level as well, e.g., docker job create.
