Use soft docker memory limit instead of hard one #2771

Closed

Conversation

burdandrei
Contributor

As discussed in a couple of open issues (like this one), there is sometimes a need to run Docker containers without a hard memory limit.
The mem_limit_disable option will run the Docker container, as described, with a soft memory limit. Docker will then not limit memory allocation for the container, but will honor the value to decide which container to kill if host memory is exhausted.
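For illustration, here is a minimal sketch of what the two limits mean at the Docker API level, assuming the fsouza/go-dockerclient HostConfig fields (Memory and MemoryReservation) used by the Docker driver; the helper name and the softLimit flag are illustrative, not the exact code in this PR:

package main

import (
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

// buildHostConfig illustrates the intent of the change: with a hard limit,
// the cgroup kills the container as soon as it exceeds Memory; with a soft
// limit, MemoryReservation is only consulted under host memory pressure to
// decide what to reclaim or kill.
func buildHostConfig(memoryMB int64, softLimit bool) *docker.HostConfig {
	limitBytes := memoryMB * 1024 * 1024
	hostConfig := &docker.HostConfig{}
	if softLimit {
		hostConfig.MemoryReservation = limitBytes // soft limit only
	} else {
		hostConfig.Memory = limitBytes // hard limit, enforced by the cgroup
	}
	return hostConfig
}

func main() {
	fmt.Println("hard limit bytes:", buildHostConfig(512, false).Memory)
	fmt.Println("soft reservation bytes:", buildHostConfig(512, true).MemoryReservation)
}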

@burdandrei burdandrei mentioned this pull request Jul 4, 2017
@burdandrei burdandrei force-pushed the docker-soft-memory-limit branch 2 times, most recently from 8673994 to 7410464 on July 4, 2017 14:20
@jzvelc jzvelc mentioned this pull request Jul 5, 2017
@burdandrei
Contributor Author

@dadgar any chance this will hit 0.6?
As far as I can tell, this fits more than just our use case.
Let's discuss the flag name or anything else that is questionable.
Cheers!

@diptanu
Contributor

diptanu commented Jul 25, 2017

@dadgar If you want this in, I would suggest making this a client config which the operator can control and not something which lives in the job config. I think allowing users to control such a thing in a multi-tenant environment could be bad from a QOS perspective for the tenants.

@@ -840,7 +844,15 @@ func (d *DockerDriver) createContainerConfig(ctx *ExecContext, task *structs.Task
	config.WorkingDir = driverConfig.WorkDir
}

-memLimit := int64(task.Resources.MemoryMB) * 1024 * 1024
+memLimit := int64(0)
+memReservation := int64(0)
Contributor

@diptanu diptanu Jul 25, 2017

var memLimit, memReservation int64


waitForExist(t, client, handle.(*DockerHandle))

_, err := client.InspectContainer(handle.(*DockerHandle).ContainerID())

Probably want to test that the MemReservation is set properly in this test. And test the other case too.
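A rough sketch of the kind of assertion being asked for, assuming go-dockerclient's InspectContainer and Container.HostConfig; the helper and its parameters are hypothetical, not the test that was actually added:

package driver

import (
	"testing"

	docker "github.com/fsouza/go-dockerclient"
)

// assertMemoryLimits inspects the created container and checks both the hard
// and the soft limit, covering "the other case" as well. The container ID is
// expected to come from the surrounding test's DockerHandle.
func assertMemoryLimits(t *testing.T, client *docker.Client, containerID string, wantHard, wantSoft int64) {
	container, err := client.InspectContainer(containerID)
	if err != nil {
		t.Fatalf("InspectContainer failed: %v", err)
	}
	if got := container.HostConfig.Memory; got != wantHard {
		t.Errorf("hard memory limit: want %d, got %d", wantHard, got)
	}
	if got := container.HostConfig.MemoryReservation; got != wantSoft {
		t.Errorf("soft memory limit (MemoryReservation): want %d, got %d", wantSoft, got)
	}
}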

@diptanu
Contributor

diptanu commented Jul 25, 2017

Left a few comments but probably worth waiting for @dadgar to weigh in.

@burdandrei
Contributor Author

Thanks for the review @diptanu, I updated the variable definition and added a test that checks the Docker config.

I discussed with our architect whether this should be agent configuration or a per-job flag. We came to the agreement that Nomad is a great scheduler, and it helps us use all the features the drivers provide.
On the Docker side, memory reservation can only be configured when the container is run; it can't be configured host-wide.

But you are right, and we could add a flag that permits this option at the client level, like it's done with exec in Consul.

Let's wait and see what @dadgar says, and I hope it'll land in 0.6.

@dadgar
Contributor

dadgar commented Jul 28, 2017

@burdandrei Can you explain your use case for this in more detail? With the current implementation of this PR, a single allocation with this set would be able to starve the host of memory and affect the QoS of every other allocation on the host.

@burdandrei
Contributor Author

OK, @dadgar.
In our case we have a real-time application serving HTTP requests, and Resque workers doing all the background, async, daily, and monthly jobs.
For Resque we run one worker per queue, and we have more than 150 queues. We run the Resque workers on dedicated machines, using constraints. The basic memory footprint of a worker is 200 MB, with spikes to 400-500 MB while forking to process a job (this is the Resque flow; we don't really control it).
The problem is that we have daily and weekly jobs that generate summary reports and other output, and those workers' RAM usage can jump to 12 GB. But because they run on a machine with 70 other workers that are not actively working at the time, we have a lot of spare memory to use.
Before moving to Nomad we ran this with systemd, and it worked perfectly for us.

While working on my pull request I checked the behavior of the OOM killer when using soft memory limits:
If two containers are scheduled, both with soft memory limits, one with a 1 GB limit and one with a 2 GB limit, and both are using more than their configured soft limit, then when the Docker daemon detects saturation of host memory, it kills the container with the smaller memory limit.

I can see @OferE is just using the exec driver to do the same thing, but the Docker driver fits this perfectly using the soft memory option.

I do agree that the flag name is questionable, but I think exposing this Docker driver option could be a really great feature for batch processing.

Cheers!

@jzvelc

jzvelc commented Jul 29, 2017

Our use case is similar to the one described in #606.
I would go with a more generic approach, since other options may be required in the future.

Add docker.allowed_run_opts to the client configuration, where operators could specify which additional docker run options may be set with the docker driver, e.g.:

client {
  options {
    "docker.allowed_run_opts": "memory,memory-reservation"
  }
}

This will give operators control over multi-tenant environments.
In the task's config we could then set:

task {
  resources {
    memory = 500
  }
  config {
    run_opt = [
      "memory=1g",
      "memory-reservation=200m"
    ]
  }
}

In this example Nomad would use the memory requirement specified in the resources stanza for task placement (same as it does now) and set the hard memory limit on the docker container (memory=500m). This is actually the current implementation.

All provided options in run_opt (which should be validated against allowed_run_opts) would override the currently set docker run options. So in this example the docker hard memory limit would be set to 1000m and the docker soft memory limit would be set to 200m.
This would be perfect for tasks which require around 200m of memory for normal operation, with frequent 200-500m spikes. Setting the docker hard memory limit to 1g would ensure that spikes never exceed 1g, which would otherwise starve the host.
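A rough sketch of the validation step this proposal implies (the option names are hypothetical and this is not existing Nomad code), where the operator's docker.allowed_run_opts allowlist filters the task's run_opt overrides:

package main

import (
	"fmt"
	"strings"
)

// filterRunOpts validates task-level run_opt entries ("key=value") against
// the operator's allowlist (comma-separated, e.g. "memory,memory-reservation")
// and returns only the permitted overrides.
func filterRunOpts(allowed string, runOpts []string) (map[string]string, error) {
	allowSet := map[string]bool{}
	for _, key := range strings.Split(allowed, ",") {
		allowSet[strings.TrimSpace(key)] = true
	}
	out := map[string]string{}
	for _, opt := range runOpts {
		parts := strings.SplitN(opt, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed run_opt %q", opt)
		}
		if !allowSet[parts[0]] {
			return nil, fmt.Errorf("run_opt %q not permitted by docker.allowed_run_opts", parts[0])
		}
		out[parts[0]] = parts[1]
	}
	return out, nil
}

func main() {
	opts, err := filterRunOpts("memory,memory-reservation",
		[]string{"memory=1g", "memory-reservation=200m"})
	fmt.Println(opts, err)
}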

This would make the docker driver much more flexible for our workloads. What do you think @dadgar @diptanu @burdandrei? We could still go with memory limits only, but I would do something like suggested above (client config + task config). I would definitely make both the soft and hard docker memory limits configurable.

@dadgar I understand your concerns around QoS. As @burdandrei said, we can use constraints and place such jobs on a specific subset of our nodes, which wouldn't affect others. I am OK with relying on the docker OOM killer to kill some containers, since they will be restarted anyway.

I can help you work on this.

@burdandrei
Contributor Author

@jzvelc, I understand the idea behind your suggestion, but I think the real issue is that Nomad should maintain resource allocation and not let jobs take down the cluster.
That's why I still use the limits from the resources stanza, because resource allocation is one of the most important and fragile things in a scheduler.

So maybe the best approach is to rename the option to mem_limit: "soft": by default we'd have a hard memory limit, and with soft we'd have a memory reservation. That would answer both the resource-allocation needs and our use case.

@jzvelc

jzvelc commented Jul 30, 2017

I would still like to have control over both hard and soft memory limits (something like the ECS memory and memoryReservation parameters).

@burdandrei
Contributor Author

The thing is that ECS only runs Docker, and that's why you have the memory and memoryReservation options, whereas in Nomad resources are a separate stanza. But let's hear from @dadgar as the official representative =)

@burdandrei
Contributor Author

@dadgar @diptanu I don't want to be annoying, but this is something we really want, and we really don't want to start running our own fork.
Please let me know if there is anything that should be done for this to be accepted.

Regards.

@OferE

OferE commented Aug 6, 2017

Don't know if it's worth anything, but I do think @burdandrei is right - soft limits should be an option. BTW, not just for memory but for all other resources. I am using raw-exec, but this is a hack (though I am really happy with it).

@dadgar
Contributor

dadgar commented Aug 7, 2017

Hey @burdandrei ,

Unfortunately I will be closing this. This isn't without consideration because I do understand there is value here but it has implications that I am not comfortable with.

QoS

By enabling this feature you are bypassing all memory enforcement. While this may not seem bad at first, it has an effect on every allocation running on the host. We allow this via the raw_exec driver and would like to keep it so that there is only one opt-in way to bypass Nomad's enforcement.

Operator Confidence

By enabling this feature there is less determinism in the system, and that can cause much confusion. Running a job with this set at a high count may cause some allocations to fail while others run successfully, just because of what each is colocated with. This is a highly undesirable side effect.

Architecture

In the example you gave there is an architectural problem. You are essentially running a batch job as a long-lived service but want over-subscription semantics. Over-subscription is a feature that we will add in the long term. This will allow batch jobs to use resources that are unused but potentially allocated by the service jobs on a host.

You are better off dispatching jobs on an as-needed basis to process work from the queue. This way you can configure them for their peak load, and when they finish, their resources are freed up. We have written a blog post on this pattern: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch/

@dadgar dadgar closed this Aug 7, 2017
@burdandrei
Contributor Author

Thanks for your answer @dadgar. We thought about using job dispatch, but with legacy projects, where worker startup takes more than 30 seconds and the work itself takes 200ms, workers with queues do a great job.
I do understand your concerns about memory overcommit; memory is a very pricey and fragile resource.
We'll think about how to work around this.

@shantanugadgil
Contributor

@dadgar @burdandrei
Sorry to join this discussion late, but I too have a need for the over-subscription semantics.

My requirement is more of a percentage distribution of the resource.

In my scenario, I know that 3 containers of a specific low-overhead type run fine on one machine, yet I have to keep calculating memory and CPU divided by 3 (for each job).

I have a scenario where I isolate certain types of workloads on machines by setting the node.class constraint.
I would like to start with a 2 CPU 2 GB machine.
If I feel 2 GB is inadequate for the 3 services, I would want to create a 2 CPU 4 GB machine and ideally not have to fiddle with the job file.
Just create a new bigger machine with the same node.class and stop the smaller machine.

Nomad would just restart jobs on the bigger machine.

Regards,
Shantanu

@CumpsD

CumpsD commented Apr 16, 2018

@dadgar is this still the case, or is there an option for jobs to burst over their allocated resources? Perhaps something an operator can specify at server level?

@burdandrei
Contributor Author

To be honest, @dadgar, I'm thinking of bringing this back to life now that #3825 has been merged.
What do you say? Will you give it a chance?

@zonnie

zonnie commented Jun 20, 2019

@dadgar This is super important for our use-case.

  • A lot of services running in a very dense environment which is sometimes on-premise (i.e. no ability to spin up new nodes with ease)
  • Our memory bounds can be very fluid - some services peak by a lot
  • We still want to allow Nomad to distribute tasks (we used to override the client's memory to allow high overcommitment values but lost any sensible load distribution)
  • We are getting loads of OOM kills even though our machines have plenty of resources to spare.
    Even if this is not the preferred solution, we must have some way to handle peaks to some extent.

@burdandrei
Contributor Author

@zonnie we'll probably be able to add this as a pluggable driver; that should become a feature soon 🤞

@zonnie

zonnie commented Jun 23, 2019

@burdandrei - can you give me any lead on how to implement this kind of thing?
I looked at the plugin documentation and got lost... it's very complicated and lacks real examples.
I just want a Docker-like driver that differs only in how it limits memory...

@burdandrei
Contributor Author

@schmichael, @nickethier after our talk at HashiConf, do you want me to revive this one, or open a new PR?

@github-actions

github-actions bot commented Feb 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2023