
client/driver: docker progress detection and monitoring #4192

Merged
merged 7 commits from f-docker-progress-detection into master on May 7, 2018

Conversation

@nickethier (Member) commented Apr 20, 2018:

This PR aims to add the following enhancements to the docker driver:

  • Inactivity timeout, which aborts a pull if no progress has been made during a defined period of time
  • Pull timeout, configured in the jobspec as image_pull_timeout, which will abort the pull if it does not complete in the configured duration (removed from this PR; see b6d2b75)
  • Image pull progress logging to client stdout
    • Multiplexed to all allocs pulling the same image
  • Image pull progress logged to the alloc if the pull lasts longer than 2 minutes (with subsequent messages every 2 minutes thereafter)

Example log:

2018/04/24 12:47:12.076066 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (106.7 MiB/227.3 MiB) pulling 3 layers - est 11.2s remaining
2018/04/24 12:47:22.076012 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 5/6 (191.4 MiB/227.3 MiB) pulling 1 layers - est 3.6s remaining
2018/04/24 12:47:32.076002 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 6/6 (227.3 MiB/227.3 MiB) pulling 0 layers - est 0.1s remaining
2018/04/24 12:47:33.394116 [DEBUG] driver.docker: docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.4 succeeded

@nickethier changed the title from "[WIP] client/driver: docker progress detection and monitoring" to "client/driver: docker progress detection and monitoring" on Apr 23, 2018
@@ -1546,7 +1550,8 @@ func (d *DockerDriver) pullImage(driverConfig *DockerDriverConfig, client *docke

 d.emitEvent("Downloading image %s:%s", repo, tag)
 coordinator, callerID := d.getDockerCoordinator(client)
-return coordinator.PullImage(driverConfig.ImageName, authOptions, callerID)
+return coordinator.PullImage(driverConfig.ImageName, authOptions, time.Duration(driverConfig.ImagePullTimeout)*time.Second, callerID, d.emitEvent)
Member (Author):

The casting in the return args is ugly and hard to read, making a note to self to fix this.

@@ -234,6 +234,7 @@ type DockerDriverConfig struct {
ReadonlyRootfs bool `mapstructure:"readonly_rootfs"` // Mount the container’s root filesystem as read only
AdvertiseIPv6Address bool `mapstructure:"advertise_ipv6_address"` // Flag to use the GlobalIPv6Address from the container as the detected IP
CPUHardLimit bool `mapstructure:"cpu_hard_limit"` // Enforce CPU hard limit.
ImagePullTimeout int64 `mapstructure:"image_pull_timeout"` // Timeout on the image pull after which the pull is cancelled
Member:

Might be nice to allow specifying this as a duration. You'd have to add a ImagePullTimeoutRaw string field and parse it into time.Duration in DockerDriverConfig.Validate, so might not be worth it... Either way.
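
A sketch of the suggested shape, which later revisions in this thread do adopt via ImagePullTimeoutRaw (the exact field layout and error text here are assumptions, and "fmt"/"time" imports are required):

// Sketch only: ImagePullTimeoutRaw carries the raw jobspec string
// (e.g. "5m") and Validate parses it into a time.Duration.
type DockerDriverConfig struct {
	// ... existing fields ...
	ImagePullTimeoutRaw string        `mapstructure:"image_pull_timeout"`
	ImagePullTimeout    time.Duration `mapstructure:"-"`
}

func (c *DockerDriverConfig) Validate() error {
	// ... existing validation ...
	if c.ImagePullTimeoutRaw != "" {
		d, err := time.ParseDuration(c.ImagePullTimeoutRaw)
		if err != nil {
			return fmt.Errorf("invalid image_pull_timeout: %v", err)
		}
		c.ImagePullTimeout = d
	}
	return nil
}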

if pullTimeout > 0 {
ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
}
defer cancel()
Member:

Put this above the if block so it cancels the first context and propagates to the child deadline context (if one exists). I don't think leaking a cancellable context will cause a memory leak, but better to be careful than rely on internals.
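
For reference, a sketch of the suggested ordering (Go evaluates the function value in a defer statement immediately, so the deferred call is always the parent's cancel):

ctx, cancel := context.WithCancel(context.Background())
// cancel is captured at the defer statement, so this always cancels the
// parent context; cancelling the parent also cancels the child deadline
// context created below, if one exists.
defer cancel()
if pullTimeout > 0 {
	ctx, cancel = context.WithDeadline(ctx, time.Now().Add(pullTimeout))
}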

func newImageProgressManager(
image string, cancel context.CancelFunc,
inactivityFunc, reporter progressReporterFunc) *imageProgressManager {
return &imageProgressManager{
Member:

put an empty line between multiline func definitions and the first line of their body

func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time) {
d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if timestamp.Sub(pullStart) > 2*time.Minute {
Member:

Put 2*time.Minute in a package-level const with documentation.
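
Something like the const that shows up in later revisions of this PR (the doc comment wording is an assumption):

// dockerPullProgressEmitInterval is how long a pull must be running
// before progress is also emitted as an allocation event, rather than
// only logged at DEBUG level.
const dockerPullProgressEmitInterval = 2 * time.Minute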

}
}

type imageProgress struct {
Member:

We try to comment internal structs as if they were exported.

return pm
}

func (pm *imageProgressManager) withReportInterval(t time.Duration) *imageProgressManager {
Member:

I don't see where this or the above method are used. Since they're not safe for concurrent use with start() let's not leave them lying around if they're not needed.

for {
line, err := pm.buf.ReadBytes('\n')
if err == io.EOF {
pm.buf.Write(line)
Member:

Maybe a comment here like:

// Partial write of a line; break until a full line is written

return n, err
}

if msg.Error != nil {
Member:

A comment as to what these errors mean would be useful.

Member (Author):

✔️

{"status":"Pulling fs layer","progressDetail":{},"id":"c73ab1c6897b"}
{"status":"Pulling fs layer","progressDetail":{},"id":"1ab373b3deae"}
`))
assert.NoError(t, err)
Member:

We generally use testify/require instead of testify/assert as assert doesn't fail immediately. It's no big deal but in cases like this where you're checking for errors, the very next line is likely to panic and make for confusing test output if there is an error.
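
i.e. a minimal swap (assuming testify/require is already available alongside testify/assert in the vendored dependencies):

import "github.com/stretchr/testify/require"

// require.NoError fails the test immediately, so the assertions that
// follow never run against a half-parsed result.
require.NoError(t, err)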

type progressReporterFunc func(image string, msg string, timestamp time.Time, pullStart time.Time)

type imageProgressManager struct {
*imageProgress
Member:

Not sure if I'm a fan of this embedding. It causes imageProgressManager to have RWMutex's methods despite those methods only guarding fields on imageProgress. Perhaps just give it a named field to make it clear, for uses in imageProgressManager, which struct "owns" which fields.

Not a big deal. I do like the overall approach of bundling up the fields that need to be guarded by a mutex into a little oneshot struct.
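
A sketch of the named-field alternative (the field name is illustrative):

type imageProgressManager struct {
	// progress owns all mutex-guarded pull state; callers go through
	// pm.progress explicitly instead of inheriting RWMutex's Lock and
	// Unlock methods via embedding.
	progress *imageProgress
	// ... unguarded manager fields ...
}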

ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if pullTimeout > 0 {
ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
Member:

I tricked you into introducing a bug. This creates a new context from Background(), not ctx, so we're back to leaking a context again 😬

I think just change Background() -> ctx here
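
i.e. the one-token fix:

if pullTimeout > 0 {
	// Derive the deadline context from ctx, not context.Background(),
	// so the deferred cancel on the parent releases it as well.
	ctx, cancel = context.WithDeadline(ctx, time.Now().Add(pullTimeout))
}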

Member (Author):

Doh!

@@ -303,6 +305,12 @@ func (c *DockerDriverConfig) Validate() error {
return err
}
c.Ulimit = ulimit
if len(c.ImagePullTimeoutRaw) > 0 {
Contributor:

Can we introduce a minimum timeout, something like 1m?
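
For instance (a sketch; the constant name, the time.Duration-typed field, and the error text are assumptions):

// minImagePullTimeout guards against pathologically short pull timeouts.
const minImagePullTimeout = 1 * time.Minute

if c.ImagePullTimeout > 0 && c.ImagePullTimeout < minImagePullTimeout {
	return fmt.Errorf("image_pull_timeout must be at least %s", minImagePullTimeout)
}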

// Parse the repo and tag
repo, tag := docker.ParseRepositoryTag(image)
if tag == "" {
tag = "latest"
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Contributor:

This wouldn't cancel the context created below

Member (Author):

@schmichael convinced me that canceling a parent context will cancel any child contexts.

Member (Author):

Fixed this when I removed the pull deadline

d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if timestamp.Sub(pullStart) > dockerPullProgressEmitInterval {
d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
Contributor:

This will emit an event every 10 seconds after the initial 2 minutes? I think we only want to emit at most once every two minutes.
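
One way to cap it (a sketch that tracks the last emit time; the lastProgressEmit field is hypothetical, and the PR ultimately solved this with interval counting instead):

if timestamp.Sub(pullStart) > dockerPullProgressEmitInterval &&
	timestamp.Sub(d.lastProgressEmit) >= dockerPullProgressEmitInterval {
	// d.lastProgressEmit (hypothetical field) remembers the previous
	// emission so events fire at most once per interval.
	d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
	d.lastProgressEmit = timestamp
}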

}

func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time) {
d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)
Contributor:

Should this not emit an event?

Member (Author):

Emit and log or just emit?

select {
case <-ticker.C:
msg, timestamp := pm.imageProgress.get()
if time.Now().Sub(timestamp) > pm.activityDeadline {
Contributor:

Do we want it to be based purely on the time? What if set was called without any additional bytes pulled?

Member (Author):

I think that's OK. If set is called from Write then a full JSON message was sent and decoded. This includes extracting, verifying signatures, etc., which is still activity.

// start initiates the ticker to trigger the inactivity and reporter handlers
func (pm *imageProgressManager) start() {
pm.imageProgress.pullStart = time.Now()
go func() {
Contributor:

Any thoughts on having newImageProgressManager call go pm.start() and then the interface for callers just becomes new and stop?

Member (Author):

The only possible hiccup I see is that there are some fields that we don't set in newImageProgressManager. They are currently set to defaults (e.g. reportInterval, activityDeadline) but could be overridden after the new call and before start. If we ever want to set these I suppose we just add the args to the new call.

Member (Author):

✔️ Added a call to pm.start() in the newImageProgressManager method.

@nickethier (Member, Author):

@schmichael could you give this one more look when you have a moment

@nickethier nickethier force-pushed the f-docker-progress-detection branch from 9539687 to a58a914 Compare May 1, 2018 03:16
@@ -98,6 +104,12 @@ type dockerCoordinator struct {
// only have one request be sent to Docker
pullFutures map[string]*pullFuture

// pullLoggers is used to track the LogEventFn for each alloc pulling an image
Member:

Worth adding a comment here about why it needs a slice of pull loggers if the map is keyed by docker image id.


func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time, interval int64) {
d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)

Member:

remove empty line

func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time, interval int64) {
d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if interval%int64(dockerPullProgressEmitInterval.Seconds()/dockerImageProgressReportInterval.Seconds()) == 0 {
Member:

minor - any reason not to handle the "should emit" logic in the caller? This method doesn't need to know what the interval value is or do the math with the two progress report intervals, the caller can just tell it whether it needs to emit the event in addition to logging.

Member (Author):

Good question. I think I tried to keep the imageProgressManager scoped as small as possible to start, so it currently doesn't make a distinction about whether to emit or not, just that it needs to fire a progressReporterFunc when it detects inactivity and at each reporting interval. I don't love the way this handlePullProgressReport turned out, though.

I'll go back and play with it for a few minutes and see what moving the "should emit" logic into the manager looks like.
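
For comparison, a sketch of pushing the "should emit" decision into the caller (the bool parameter is an assumption, not what the PR shipped):

// In the ticker loop, the caller computes the boundary check once:
//   emit := interval%int64(dockerPullProgressEmitInterval/dockerImageProgressReportInterval) == 0
// and the handler just logs, emitting only when told to:
func (d *dockerCoordinator) handlePullProgressReport(image, msg string, emit bool) {
	d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)
	if emit {
		d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
	}
}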

}

// set takes a status message received from the docker engine api during an image
// pull and updates the status of the coorisponding layer
Member:

typo - corresponding

case "Extracting":
return layerProgressStatusExtracting
case "Pull complete":
return layerProgressStatusComplete
Member:

This one appears to be defined but not used, is that intentional?

Member (Author):

Some statuses aren't referenced directly but are used in the comparison on line 104. I could condense them all down into the same int, but I figured it would be best to enumerate all of them.

case "Pulling fs layer":
return layerProgressStatusStarting
case "Waiting":
return layerProgressStatusWaiting
Member:

This one is also unused. Did we want to check if the status is layerProgressWaiting in the get() method below and return a nicer string message for it?

Member (Author):

I went ahead and enumerated each status even though there may be a couple that don't get used. I didn't want to make the status string too long; you can already derive the number of layers waiting (total - pulled - pulling).

If I add waiting it could look something like:

2018/04/24 12:47:12.076066 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (106.7 MiB/227.3 MiB) pulling 3 waiting on 2 - est 11.2s remaining

@nickethier (Member, Author) commented May 3, 2018:

Ok I added the waiting status to the msg and it now looks like:

2018/05/03 15:18:43.299862 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (57.65 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 29.3s remaining
2018/05/03 15:18:53.299760 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (115.2 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 19.3s remaining
2018/05/03 15:19:03.299919 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 4/6 (173 MiB/227.3 MiB) layers: 0 waiting/2 pulling - est 9.3s remaining
2018/05/03 15:19:13.299778 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 5/6 (223.3 MiB/227.3 MiB) layers: 0 waiting/1 pulling - est 0.5s remaining
2018/05/03 15:19:16.757129 [DEBUG] driver.docker: docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.4 succeeded

return b
}

// progressReporterFunc defines the method for handeling inactivity and report
Member:

handling

…d to unexpected timeouts

If two jobs are pulling the same image simultaneously, whichever starts the pull first will set the pull timeout. This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer one, causing the pull to potentially time out much sooner than the second job expects.
@nickethier nickethier force-pushed the f-docker-progress-detection branch from bef2845 to d442a44 Compare May 7, 2018 16:21
@nickethier nickethier merged commit 006c6f3 into master May 7, 2018
@nickethier nickethier deleted the f-docker-progress-detection branch May 7, 2018 20:15
@dadgar (Contributor) left a comment:

This came out great! The reporting intervals became quite clean! Good job!
