Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add driver.docker counter metric for OOM Killer events #4185

Merged
30 changes: 29 additions & 1 deletion client/driver/docker.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,13 @@ import (
"time"

"github.com/armon/circbuf"
docker "github.com/fsouza/go-dockerclient"
"github.com/fsouza/go-dockerclient"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goimports suggested this change. Please let me know if you want me to rollback this line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its because the imported package in that folder is already called docker :)


"github.com/docker/docker/cli/config/configfile"
"github.com/docker/docker/reference"
"github.com/docker/docker/registry"

"github.com/armon/go-metrics"
"github.com/hashicorp/go-multierror"
"github.com/hashicorp/go-plugin"
"github.com/hashicorp/nomad/client/allocdir"
Expand Down Expand Up @@ -478,6 +479,10 @@ type DockerHandle struct {
client *docker.Client
waitClient *docker.Client
logger *log.Logger
jobName string
taskGroupName string
taskName string
allocID string
Image string
ImageID string
containerID string
Expand Down Expand Up @@ -898,6 +903,10 @@ func (d *DockerDriver) Start(ctx *ExecContext, task *structs.Task) (*StartRespon
executor: exec,
pluginClient: pluginClient,
logger: d.logger,
jobName: d.DriverContext.jobName,
taskGroupName: d.DriverContext.taskGroupName,
taskName: d.DriverContext.taskName,
allocID: d.DriverContext.allocID,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes on this:

  • AllocID: I've added the allocation id because then we have the full context from the job to the allocation. This is useful to provide user oriented alerts.

  • We're initializing DockerHandle with 4 out of 8 fields of the DriverContext struct. Then I question myself: Would it make sense to pass the whole struct to DockerHandle?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image: d.driverConfig.ImageName,
ImageID: d.imageID,
containerID: container.ID,
Expand Down Expand Up @@ -1924,6 +1933,25 @@ func (h *DockerHandle) run() {
h.logger.Printf("[ERR] driver.docker: failed to inspect container %s: %v", h.containerID, ierr)
} else if container.State.OOMKilled {
werr = fmt.Errorf("OOM Killed")
labels := []metrics.Label{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me the interesting labels would be job, group and task :)

Copy link
Contributor Author

@jesusvazquez jesusvazquez Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for me too, but i don't know how to access them from here. I'm happy to implement a solution but I'd need some help.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jesusvazquez you could add those fields to the DockerHandle struct, and then populate them from the DriverContext (I'm preparing a PR so you can have those available).

h := &DockerHandle{
client: client,
waitClient: waitClient,
executor: exec,
pluginClient: pluginClient,
logger: d.logger,
Image: d.driverConfig.ImageName,
ImageID: d.imageID,
containerID: container.ID,
version: d.config.Version.VersionNumber(),
killTimeout: GetKillTimeout(task.KillTimeout, maxKill),
maxKillTimeout: maxKill,
doneCh: make(chan bool),
waitCh: make(chan *dstructs.WaitResult, 1),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #4196

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh really nice @jvrplmlmn Ill update this PR as soon as I get back from my vacations.

{
Name: "JobName",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value: h.jobName,
},
{
Name: "TaskGroupName",
Value: h.taskGroupName,
},
{
Name: "TaskName",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jippi this is what you meant right?

Copy link
Contributor

@jippi jippi May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - looks great!

I personally don't care for the allocation id below, its a very high cardinality field, making influxdb & friends unhappy fairly fast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that from the database and query point of view it probably has no value but from the alerting point of view the user can quickly run nomad alloc-status <alloc-id> and see what is happening.

What would you say here? Should we keep it or drop it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my personal preference would be to drop it - sounds like an alert in your logging rather than telemetry if you ask me :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop it. As @jippi points out it will make many telemetry systems unhappy

Value: h.taskName,
},
{
Name: "AllocID",
Value: h.allocID,
},
}
metrics.IncrCounterWithLabels([]string{"driver", "docker", "oom"}, 1, labels)
}

close(h.doneCh)
Expand Down