
docker: periodically reconcile containers #6325

Merged
merged 9 commits into master from b-docker-reconcile-periodically
Oct 18, 2019

Conversation

Contributor

@notnoop notnoop commented Sep 13, 2019

When running at scale, it's possible that the Docker Engine starts
containers successfully but then gets wedged in a way where API calls fail.
The Docker Engine may remain unavailable for an arbitrarily long time.

Here, we introduce a periodic reconciliation process that ensures that any
container started by Nomad is tracked, and is killed if it is running
unexpectedly.

Basically, the periodic job inspects any container that isn't tracked in
the driver's handlers. A creation grace period prevents killing newly
created containers that aren't registered yet.

We also aim to avoid killing unrelated containers started on the host or
through the raw_exec driver. The logic pattern-matches against a container's
environment variables and mounts to infer whether it is an alloc Docker
container.

Lastly, the periodic job can be disabled to avoid any interference if
need be.

On client restart, the creation grace period and the reconciliation period give the client time to restore its handles before containers are killed.
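For orientation, here is a minimal, standalone sketch of the shape of such a periodic reconciliation loop; the iteration callback stands in for the driver's actual removal step (its removeDanglingContainersIteration method), so the names and demo wiring are illustrative rather than this PR's code.

package main

import (
	"context"
	"log"
	"time"
)

// reconcileLoop runs one reconciliation iteration every period until the
// context is cancelled. The iteration callback stands in for the driver's
// real "list containers, filter untracked, remove" step.
func reconcileLoop(ctx context.Context, period time.Duration, iteration func() error) {
	timer := time.NewTimer(period)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			if err := iteration(); err != nil {
				log.Printf("dangling container reconciliation failed: %v", err)
			}
			timer.Reset(period)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	reconcileLoop(ctx, time.Second, func() error {
		log.Println("iteration: would list containers and remove untracked ones")
		return nil
	})
}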

Some background

Nomad 0.8.7 was brittle, judging by the code in [1]: if a container started but we failed to inspect it, or the Docker engine became unavailable, we would leak the container without ever stopping it, for example.

Nomad 0.9.0 tried to ensure that we remove the container on start failures [2]. However, it doesn't account for failed creation and doesn't retry: if the engine becomes unavailable at start time, it may be a while until it's available again, so a single removal call isn't sufficient.

[1] https://github.com/hashicorp/nomad/blob/v0.8.7/client/driver/docker.go#L899-L935
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/drivers/docker/driver.go#L279-L284

@notnoop notnoop added this to the 0.10.0 milestone Sep 13, 2019
Member

@tgross tgross left a comment

Overall this LGTM. I left a few questions/comments.

drivers/docker/reconciler.go (outdated; resolved)
drivers/docker/reconciler.go (outdated; resolved)
drivers/docker/driver.go (outdated; resolved)
drivers/docker/reconciler.go (outdated; resolved)
drivers/docker/reconciler.go (outdated; resolved)
drivers/docker/reconciler_test.go (outdated; resolved)
@schmichael schmichael modified the milestones: 0.10.0, 0.10.1 Sep 17, 2019
Member

@schmichael schmichael left a comment

Great work! This should help a lot of Docker users!

Add a test from at least the TaskRunner level to ensure there's no undesirable interaction between this and TaskRunner's (indirect) container management.

Not a blocker, but I'd prefer this be an independent struct like the image coordinator to help manage its dependencies and scope.

}

func (d *Driver) isNomadContainer(c docker.APIContainers) bool {
if _, ok := c.Labels["com.hashicorp.nomad.alloc_id"]; ok {
Member

Let's use a const for this.

Member

Actually where is this set? I didn't think we automatically set any labels?

Contributor Author

@notnoop notnoop Sep 18, 2019

Very observant :). This isn't being set anywhere yet; I intend to resurrect #5153 to set them and then use a const there for setting it.
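For illustration, a minimal standalone sketch of what a const for that label key could look like; the const name dockerLabelAllocID mirrors the one that shows up later in this PR, while the helper and demo values are assumptions.

package main

import "fmt"

// dockerLabelAllocID is the label key checked in the snippet above; using a
// const avoids scattering the raw string across the driver and the reconciler.
const dockerLabelAllocID = "com.hashicorp.nomad.alloc_id"

// hasNomadAllocLabel reports whether a container's labels mark it as a
// Nomad-started container.
func hasNomadAllocLabel(labels map[string]string) bool {
	_, ok := labels[dockerLabelAllocID]
	return ok
}

func main() {
	fmt.Println(hasNomadAllocLabel(map[string]string{dockerLabelAllocID: "7f2d6c8e"})) // true
	fmt.Println(hasNomadAllocLabel(map[string]string{"other": "x"}))                   // false
}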

}

// pre-0.10 containers aren't tagged or labeled in any way,
// so use cheap heauristic based on mount paths
Member

Suggested change
// so use cheap heauristic based on mount paths
// so use cheap heuristic based on mount paths
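For context, a standalone sketch of the kind of cheap mount-path heuristic that comment describes; the specific paths (/alloc, /local, /secrets) and the helper name are assumptions for illustration, not the PR's exact code.

package main

import "fmt"

// looksLikeNomadContainer is a cheap heuristic for pre-0.10 containers that
// carry no Nomad labels: require that the expected task directories are all
// mounted. The real check would also consider a Nomad-esque container name
// and environment variables.
func looksLikeNomadContainer(mountDestinations []string) bool {
	expected := map[string]bool{"/alloc": false, "/local": false, "/secrets": false}
	for _, dst := range mountDestinations {
		if _, ok := expected[dst]; ok {
			expected[dst] = true
		}
	}
	for _, found := range expected {
		if !found {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(looksLikeNomadContainer([]string{"/alloc", "/local", "/secrets", "/etc/resolv.conf"})) // true
	fmt.Println(looksLikeNomadContainer([]string{"/data"}))                                            // false
}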

return false
}

func (d *Driver) trackedContainers() map[string]bool {
Member

Instead of building a point-in-time-snapshot of tracked containers, I think this should call d.tasks.Get(...) in untrackedContainers main loop instead of checking the map. The scenario I think this avoids is:

  1. tracked map built
  2. container list request to dockerd sent
  3. 4m pass because of load
  4. container list returned
  5. cutoff is set
  6. a number of 0.9 containers exist so InspectContainer is called against a slow dockerd and 1m passes

At this point the tracked map is >5m old so any containers created since are treated as untracked and eligible for GC.

Removing the InspectContainer call may be sufficient to fix this scenario, but I don't see a reason to build a copy of tracked containers vs doing individual gets.

Contributor Author

Good catch - I'd be in favor of re-ordering operations so the cutoff is taken before any lookups. I think it's a much easier system to reason about if we can reduce mutating data.

I find it much easier to reason and test around time-snapshotted data (and mutating container state), as opposed to changing container lists, changing handlers, and changing containers. If we want to use this reconciler to also detect containers that exited without being noticed and kill containers, having both lists being mutated makes it tricky IMO.
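A standalone sketch of the re-ordering under discussion: the grace cutoff is computed before any (potentially slow) Docker API work, so a delayed container listing can't push recently created containers past the grace window. Function and parameter names here are illustrative.

package main

import (
	"fmt"
	"time"
)

// untrackedSince returns the container IDs that are not tracked and were
// created before the grace cutoff. The cutoff is computed up front, before
// any listing or inspection calls, which is the re-ordering discussed above.
func untrackedSince(now time.Time, grace time.Duration, tracked map[string]bool,
	created map[string]time.Time) []string {

	cutoff := now.Add(-grace)

	var untracked []string
	for id, createdAt := range created {
		if tracked[id] {
			continue
		}
		if createdAt.After(cutoff) {
			// inside the creation grace period; may simply not be registered yet
			continue
		}
		untracked = append(untracked, id)
	}
	return untracked
}

func main() {
	now := time.Now()
	tracked := map[string]bool{"c1": true}
	created := map[string]time.Time{
		"c1": now.Add(-10 * time.Minute), // tracked, ignored
		"c2": now.Add(-10 * time.Minute), // untracked and old: eligible
		"c3": now.Add(-30 * time.Second), // untracked but within grace: skipped
	}
	fmt.Println(untrackedSince(now, 5*time.Minute, tracked, created)) // [c2]
}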

if err != nil {
return fmt.Errorf("failed to parse 'container_delay' duration: %v", err)
}
d.config.GC.DanglingContainers.creationTimeout = dur
Member

Validate that this is >0 and probably greater than 10s or 1m or some conservative value to ensure pauses in the Go runtime don't cause us to make the wrong decision (eg a pause between starting a container and tracking it).
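A minimal sketch of that kind of lower-bound validation; the one-minute floor is an example threshold, not a value chosen by this PR.

package main

import (
	"fmt"
	"time"
)

// validateCreationGrace rejects values that are too small to absorb pauses in
// the Go runtime between starting a container and tracking it.
func validateCreationGrace(dur time.Duration) error {
	const minGrace = time.Minute // example floor; pick something conservative
	if dur < minGrace {
		return fmt.Errorf("creation grace %s is below the minimum of %s", dur, minGrace)
	}
	return nil
}

func main() {
	fmt.Println(validateCreationGrace(5 * time.Second)) // error
	fmt.Println(validateCreationGrace(5 * time.Minute)) // <nil>
}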

drivers/docker/config.go (resolved)
}

for _, id := range untracked {
d.logger.Info("removing untracked container", "container_id", id)
Member

Move to after removal has succeeded.

Suggested change
d.logger.Info("removing untracked container", "container_id", id)
d.logger.Info("removed untracked container", "container_id", id)

ctx, cancel := context.WithTimeout(d.ctx, 20*time.Second)
defer cancel()

ci, err := client.InspectContainerWithContext(c.ID, ctx)
Member

I think it's safer to skip this check. If a container has those 3 directories and a Nomad-esque name, I think we can remove it.

Contributor Author

Indeed, I was being conservative here, as a false positive might be troublesome. But it does add additional side effects; I'll remove it and document the decision.

drivers/docker/reconciler_test.go (resolved)
}

func TestDanglingContainerRemoval(t *testing.T) {
if !tu.IsCI() {
Member

Is this necessary anymore?

Contributor Author

Not sure - this is pretty much cargo-culted from other tests without my fully understanding the context.

func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
result := []string{}

cc, err := client.ListContainers(docker.ListContainersOptions{})
Member

I believe only running containers are listed by default which means we won't GC stopped containers.

Although GCing stopped containers introduces 2 other things to consider:

  1. We should only GC containers with taskHandle.removeContainerOnExit set.
  2. There is a race between this goroutine and task runner stopping and removing the container itself. We may need a stop grace period similar to the create grace period to avoid spurious error logs when this race is hit and the container is removed twice. Although maybe it's not worth the complexity?

Regardless of desired behavior around GCing stopped containers, we should add a test with a stopped container.
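For reference, a hedged sketch of listing stopped containers as well with go-dockerclient (the library behind the client calls quoted above); whether the reconciler should actually GC stopped containers is the open question raised in this comment, and the standalone client construction here is just for the demo.

package main

import (
	"fmt"
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	// All: true includes stopped containers; the default (false) lists only
	// running ones, which is why stopped containers would otherwise be missed.
	containers, err := client.ListContainers(docker.ListContainersOptions{All: true})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		fmt.Println(c.ID, c.Status, c.Names)
	}
}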


result = append(result, c.ID)
}

Member

@schmichael schmichael Sep 17, 2019

Out of scope for this PR, but I wonder if we shouldn't loop over tracked and compare against what's actually running (cc). I feel like we've had a report of a container exiting and Nomad "not noticing", but I can't find it now (and maybe it got fixed?).

I'm not sure we even have a mechanism to properly propagate that exit back up to the TaskRunner, but perhaps there's a way to force kill the dangling task handle such that TR will notice?

Anyway, a problem for another PR if ever.

hclspec.NewLiteral(`"5m"`),
),
"creation_timeout": hclspec.NewDefault(
hclspec.NewAttr("creation_timeout", "string", false),
Member

grace might be a clearer name, as we use it in the check_restart stanza

The "timeout" in creation_timeout just makes me think this has to do with API timeouts, not a grace period.

case <-timer.C:
if d.previouslyDetected() && d.fingerprintSuccessful() {
err := d.removeDanglingContainersIteration()
if err != nil && succeeded {
Member

Would you mind adding a comment here noting that this succeeded check is to deduplicate logs? Maybe also rename it to lastIterSucceeded or something more descriptive.
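A standalone sketch of the log-deduplication pattern being described, with the more descriptive lastIterSucceeded name; the iteration closures are stand-ins for removeDanglingContainersIteration.

package main

import (
	"errors"
	"log"
)

// reconcileWithDedupedLogs runs the given iterations and logs a failure only
// when the previous iteration succeeded, so a persistently wedged Docker
// engine doesn't flood the logs every period.
func reconcileWithDedupedLogs(iterations []func() error) {
	lastIterSucceeded := true
	for _, iterate := range iterations {
		err := iterate()
		if err != nil && lastIterSucceeded {
			log.Printf("failed to remove dangling containers: %v", err)
		}
		lastIterSucceeded = err == nil
	}
}

func main() {
	fail := func() error { return errors.New("engine unavailable") }
	ok := func() error { return nil }
	// only the first failure after a success is logged
	reconcileWithDedupedLogs([]func() error{ok, fail, fail, fail, ok, fail})
}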


// untrackedContainers returns the ids of containers that suspected
// to have been started by Nomad but aren't tracked by this driver
func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
Member

Do we need to pass tracked in here? Why not just get it directly from the driver store?

Contributor Author

Having the function take tracked containers as an argument makes the logic simpler and easier to test IMO. tracked is computed from the driver store directly.

Also, given that the driver store maps task IDs to handles that carry container IDs, building the tracked-containers map once is O(n) in total, while scanning the store for each listed container to check presence would result in O(n^2) work - assuming the question is why we don't look up presence while looping through the containers.
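A sketch of the complexity point: building the tracked set once from the task store keeps the reconciler at O(n) overall, with O(1) membership checks while iterating the Docker container list. The taskHandle shape here is illustrative, not the driver's real type.

package main

import "fmt"

// taskHandle is a stand-in for the driver's real handle type, which records
// the Docker container ID for each tracked task.
type taskHandle struct {
	containerID string
}

// trackedContainers builds a set of container IDs from the task store in a
// single pass, so later membership checks are constant-time map lookups.
func trackedContainers(handles map[string]*taskHandle) map[string]bool {
	tracked := make(map[string]bool, len(handles))
	for _, h := range handles {
		tracked[h.containerID] = true
	}
	return tracked
}

func main() {
	handles := map[string]*taskHandle{
		"task-1": {containerID: "c1"},
		"task-2": {containerID: "c2"},
	}
	fmt.Println(trackedContainers(handles))
}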

return nil, fmt.Errorf("failed to list containers: %v", err)
}

cutoff := time.Now().Add(-creationTimeout).Unix()
Member

👍 for using the term grace instead of timeout. Took a min to figure out why we're subtracting a timeout here 😅

@schmichael
Member

When implementing and using Docker labels on Nomad-managed containers we should consider netns pause containers and #6385. It would be nice if we could handle them with generic reconciliation logic, but if that's not possible we should just ensure this reconciler won't interact poorly with pause containers.

Mahmood Ali added 3 commits October 17, 2019 08:36
When running at scale, it's possible that Docker Engine starts
containers successfully but gets wedged in a way where API call fails.
The Docker Engine may remain unavailable for arbitrary long time.

Here, we introduce a periodic reconciliation process that ensures that any
container started by nomad is tracked, and killed if is running
unexpectedly.

Basically, the periodic job inspects any container that isn't tracked in
its handlers.  A creation grace period is used to prevent killing newly
created containers that aren't registered yet.

Also, we aim to avoid killing unrelated containers started by host or
through raw_exec drivers.  The logic is to pattern against containers
environment variables and mounts to infer if they are an alloc docker
container.

Lastly, the periodic job can be disabled to avoid any interference if
need be.
Ensure we wait for some grace period before killing Docker containers
that may have been launched before a Nomad restart and not yet restored.
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from dcf9bcb to 97f0875 on October 17, 2019 12:37
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 97f0875 to 95bc9b3 on October 17, 2019 14:29
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 95bc9b3 to 8c3136a on October 17, 2019 14:45
Contributor Author

@notnoop notnoop commented Oct 17, 2019

I have updated this PR and it is ready for re-review:

  • Refactored the reconciler into a separate struct that is a bit more generic, so we can extract it later if we want (see the sketch after this list)
  • Added Docker label tags and used consts as appropriate
  • Added more tests to cover stopped containers, and better coverage for labels and grace period handling
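A hedged sketch of the separate-struct shape mentioned in the first bullet; field and method names are illustrative, and the function-valued fields stand in for the Docker client and driver task store the real reconciler would use.

package main

import (
	"fmt"
	"time"
)

// containerReconciler sketches the reconciler pulled out into its own struct:
// it owns its configuration and the hooks it needs, rather than living as a
// set of methods on the Driver type.
type containerReconciler struct {
	period time.Duration
	grace  time.Duration

	listContainers  func() ([]string, error)
	isTracked       func(containerID string) bool
	removeContainer func(containerID string) error
}

// runOnce performs a single reconciliation pass: list, skip tracked, remove.
func (r *containerReconciler) runOnce() error {
	ids, err := r.listContainers()
	if err != nil {
		return err
	}
	for _, id := range ids {
		if r.isTracked(id) {
			continue
		}
		if err := r.removeContainer(id); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &containerReconciler{
		period:          5 * time.Minute,
		grace:           5 * time.Minute,
		listContainers:  func() ([]string, error) { return []string{"c1", "c2"}, nil },
		isTracked:       func(id string) bool { return id == "c1" },
		removeContainer: func(id string) error { fmt.Println("removed", id); return nil },
	}
	fmt.Println(r.runOnce())
}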

Member

@schmichael schmichael left a comment

Remaining issues are fairly small. Feel free to merge whenever!

config.Labels = map[string]string{
dockerLabelTaskID: task.ID,
dockerLabelTaskName: task.Name,
dockerLabelAllocID: task.AllocID,
Member

Were we just going to ship this label by default initially? I think we can ask users to pay the cost of at least 1 label, but I wasn't sure where we landed on more expanded labels.

Contributor Author

done - kept dockerLabelAllocID and removed others.

drivers/docker/driver.go (outdated; resolved)
drivers/docker/reconciler.go (outdated; resolved)
Mahmood Ali added 3 commits October 18, 2019 14:43
driver.SetConfig is not an appropriate place to start the reconciler
goroutine: some ephemeral driver instances are created just for validating
config, and we ought not to start side-effecting goroutines for those.

We currently lack a lifecycle hook to inject these, so I picked the
`Fingerprinter` function for now; the reconciler should only run after the
fingerprinter has started.

Use `sync.Once` to ensure that we only start the reconciler loop once.
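A standalone sketch of the `sync.Once` guard just described, ensuring the reconciler goroutine starts at most once even if the fingerprinter hook fires repeatedly; the type and function names are illustrative.

package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcilerStarter guards the reconciler goroutine with sync.Once so that
// repeated fingerprint calls only ever start one loop.
type reconcilerStarter struct {
	once sync.Once
}

func (s *reconcilerStarter) start(loop func()) {
	s.once.Do(func() {
		go loop()
	})
}

func main() {
	var s reconcilerStarter
	for i := 0; i < 3; i++ {
		// simulate the fingerprinter being invoked multiple times
		s.start(func() { fmt.Println("reconciler loop started") })
	}
	time.Sleep(100 * time.Millisecond) // give the single goroutine time to print
}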
Other labels aren't strictly necessary here, and we may follow up with a
better way to customize.
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 4114138 to c64647c on October 18, 2019 19:31
@notnoop notnoop merged commit 75acbcc into master Oct 18, 2019
@notnoop notnoop deleted the b-docker-reconcile-periodically branch October 18, 2019 19:53