Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker driver leaks images #7940

Closed
benbuzbee opened this issue May 12, 2020 · 2 comments · Fixed by #7947
Closed

Docker driver leaks images #7940

benbuzbee opened this issue May 12, 2020 · 2 comments · Fixed by #7947

Comments

@benbuzbee
Copy link
Contributor

benbuzbee commented May 12, 2020

Nomad version

Output from nomad version
Nomad v0.11.1 (825a2da27e778655a735804a084c0a131c6b6fec)
(has custom logging changes)

Issue

We noticed our EC2 instances (which only have a few gigs of disk) were running out of space, and narrowed the issue down to /var/lib/docker in spite of having the following GC settings in nomad

plugin "docker" {
  config {
    auth {
      config = "/opt/nomad/config/docker/client-config.json"
    }
    gc {
      image       = true
      image_delay = "3m"
      container   = true

      dangling_containers {
        enabled        = true
        dry_run        = false
        period         = "5m"
        creation_grace = "5m"
      }
    }
  }
}

In reviewing the logs I noticed some times image de-referencing would happen, but the reference count the logs showed was the same as the count before, for example

2020-05-07T23:04:04.286Z [INFO]  client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:30f730740178418c028cd8687e8f67481a9c997948a012873f371f164e28fe9b references=3
2020-05-07T23:04:04.287Z [INFO]  client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:30f730740178418c028cd8687e8f67481a9c997948a012873f371f164e28fe9b references=3

In this case, the reference count was 3 after both de-references, within a second apart, and there was no reason anything else would have increased the ref count.

So I checked the code and realized the following:

delete(references, callerID)
count := len(references)

Delete may do nothing, if callerID is not found in references
And thus I made the following changes to the build to get more data

// Decrement the reference count
	prevCount := len(references)
	if _, ok := references[callerID]; !ok {
		d.logger.Error("task ID is not contributing to ref count", "image_id", imageID, "task_id", callerID)
	}
	delete(references, callerID)
	count := len(references)
	d.logger.Info("image id reference count decremented", "image_id", imageID, "new_references", count, "old_references", prevCount, "references", references)

And made similar changes around reference count increase.

What I found was sometimes the caller ID seems to be in a different format than when it was passed to incrementImageReferenceImpl and so it does nothing

Consider the following logs with my changes:

2020-05-12T22:42:23.735Z [INFO]  client.driver_mgr.docker: image reference count incremented: driver=docker image_name=583829371371.dkr.ecr.us-west-2.amazonaws.com/s2/browser_host@sha256:c62cfe40448ef667de28dd30279d71660cd5f2a2de4ac1b056996b46428ae4d1 image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def new_references=1 task_id=ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8 references="map[ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8:{}]
2020-05-12T22:42:31.029Z [INFO]  client.driver_mgr.docker: image reference count incremented: driver=docker image_name=583829371371.dkr.ecr.us-west-2.amazonaws.com/s2/browser_host@sha256:c62cfe40448ef667de28dd30279d71660cd5f2a2de4ac1b056996b46428ae4d1 image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def new_references=2 task_id=cebff5e5-5243-78e6-59c4-cfdbdc2cbd00/browser/dd84eb9d-browser references="map[cebff5e5-5243-78e6-59c4-cfdbdc2cbd00/browser/dd84eb9d-browser:{} ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8:{}]"
2020-05-12T22:42:35.249Z [INFO]  client.driver_mgr.docker: image reference count incremented: driver=docker image_name=583829371371.dkr.ecr.us-west-2.amazonaws.com/s2/browser_host@sha256:c62cfe40448ef667de28dd30279d71660cd5f2a2de4ac1b056996b46428ae4d1 image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def new_references=3 task_id=8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c-browser references="map[8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c-browser:{} cebff5e5-5243-78e6-59c4-cfdbdc2cbd00/browser/dd84eb9d-browser:{} ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8:{}]"
2020-05-12T22:42:39.372Z [INFO]  client.driver_mgr.docker: image reference count incremented: driver=docker image_name=583829371371.dkr.ecr.us-west-2.amazonaws.com/s2/browser_host@sha256:c62cfe40448ef667de28dd30279d71660cd5f2a2de4ac1b056996b46428ae4d1 image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def new_references=4 task_id=e3925196-1485-faa4-0693-7818c938d224/browser/353e434c-browser references="map[8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c-browser:{} cebff5e5-5243-78e6-59c4-cfdbdc2cbd00/browser/dd84eb9d-browser:{} e3925196-1485-faa4-0693-7818c938d224/browser/353e434c-browser:{} ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8:{}]"

So we have 4 tasks running referencing the same image (0e88...).

3 of them end in -browser and 1 of them does not.

Now we stop all the tasks and expect reference count to hit 0

2020-05-12T22:48:42.683Z [ERROR] client.driver_mgr.docker: task ID is not contributing to ref count: driver=docker image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def task_id=8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c
2020-05-12T22:48:42.683Z [INFO]  client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:0e88eac525d26357ce48dd52ce78a40c88b7cb0353d2f8ba3210dfe624598def new_references=4 old_references=4 references="map[8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c-browser:{} cebff5e5-5243-78e6-59c4-cfdbdc2cbd00/browser/dd84eb9d-browser:{} e3925196-1485-faa4-0693-7818c938d224/browser/353e434c-browser:{} ff93c709-3208-08ba-18c2-9c9b28c26d93/browser/20a6aef8:{}]"

Already we have a problem. This call to RemoveImage did nothing because the callerID passed in was 8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c however the callerID it should have passed in is 8379c08c-ec41-f5e7-1649-2a8fe071a12c/browser/b221394c-browser

Same tasks but different format - one has -browser after it and the other does not.

It does this 3 times, only 1 of them (the one that doesn't end in -browser in the map) has the correct task ID, resulting in a final reference count of 3 when it should be 0

@benbuzbee benbuzbee changed the title Docker drive leaks images Docker driver leaks images May 12, 2020
@notnoop
Copy link
Contributor

notnoop commented May 13, 2020

Thanks @benz0 for raising and debugging the issue! That was very helpful. I PRed a fix in #7947

@github-actions
Copy link

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants