Use soft docker memory limit instead of hard one #2771

Closed

Conversation

burdandrei
Contributor

As discussed in a couple of open issues (like this one), there is sometimes a need to run Docker containers without a hard memory limit.
The mem_limit_disable option will run the Docker container, as described, with a soft memory limit. Docker will then not limit memory allocation for the container, but will honor the value to decide which container to kill if host memory is exhausted.
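For illustration, here is a minimal sketch of what the two limits mean at the Docker API level, assuming the fsouza/go-dockerclient HostConfig fields (Memory and MemoryReservation) used by the Docker driver; the helper name and the softLimit flag are illustrative, not the exact code in this PR:

package main

import (
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

// buildHostConfig illustrates the intent of the change: with a hard limit,
// the cgroup kills the container as soon as it exceeds Memory; with a soft
// limit, MemoryReservation is only consulted under host memory pressure to
// decide what to reclaim or kill.
func buildHostConfig(memoryMB int64, softLimit bool) *docker.HostConfig {
	limitBytes := memoryMB * 1024 * 1024
	hostConfig := &docker.HostConfig{}
	if softLimit {
		hostConfig.MemoryReservation = limitBytes // soft limit only
	} else {
		hostConfig.Memory = limitBytes // hard limit, enforced by the cgroup
	}
	return hostConfig
}

func main() {
	fmt.Println("hard limit bytes:", buildHostConfig(512, false).Memory)
	fmt.Println("soft reservation bytes:", buildHostConfig(512, true).MemoryReservation)
}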

@burdandrei burdandrei mentioned this pull request Jul 4, 2017
@burdandrei burdandrei force-pushed the docker-soft-memory-limit branch 2 times, most recently from 8673994 to 7410464 on July 4, 2017 14:20
@jzvelc jzvelc mentioned this pull request Jul 5, 2017
@burdandrei
Contributor Author

@dadgar any chance this will hit 0.6?
As far as I can tell, this fits more than just our use case.
Let's discuss the flag name or anything else that is questionable.
Cheers!

@diptanu
Contributor

diptanu commented Jul 25, 2017

@dadgar If you want this in, I would suggest making this a client config which the operator can control and not something which lives in the job config. I think allowing users to control such a thing in a multi-tenant environment could be bad from a QOS perspective for the tenants.

@@ -840,7 +844,15 @@ func (d *DockerDriver) createContainerConfig(ctx *ExecContext, task *structs.Task
	config.WorkingDir = driverConfig.WorkDir
}

-memLimit := int64(task.Resources.MemoryMB) * 1024 * 1024
+memLimit := int64(0)
+memReservation := int64(0)
Contributor

@diptanu diptanu Jul 25, 2017

var memLimit, memReservation int64


waitForExist(t, client, handle.(*DockerHandle))

_, err := client.InspectContainer(handle.(*DockerHandle).ContainerID())

Probably want to test that the MemReservation is set properly in this test. And test the other case too.
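A rough sketch of the kind of assertion being asked for, assuming go-dockerclient's InspectContainer and Container.HostConfig; the helper and its parameters are hypothetical, not the test that was actually added:

package driver

import (
	"testing"

	docker "github.com/fsouza/go-dockerclient"
)

// assertMemoryLimits inspects the created container and checks both the hard
// and the soft limit, covering "the other case" as well. The container ID is
// expected to come from the surrounding test's DockerHandle.
func assertMemoryLimits(t *testing.T, client *docker.Client, containerID string, wantHard, wantSoft int64) {
	container, err := client.InspectContainer(containerID)
	if err != nil {
		t.Fatalf("InspectContainer failed: %v", err)
	}
	if got := container.HostConfig.Memory; got != wantHard {
		t.Errorf("hard memory limit: want %d, got %d", wantHard, got)
	}
	if got := container.HostConfig.MemoryReservation; got != wantSoft {
		t.Errorf("soft memory limit (MemoryReservation): want %d, got %d", wantSoft, got)
	}
}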

@diptanu
Contributor

diptanu commented Jul 25, 2017

Left a few comments but probably worth waiting for @dadgar to weigh in.

@burdandrei
Contributor Author

Thanks for the review @diptanu, I updated the variable definition and added a test that checks the Docker config.

I discussed with our architect whether this should be agent configuration or a per-job flag. We came to the agreement that Nomad is a great scheduler, and it helps us use all the features the drivers provide.
On the Docker side, memory reservation can only be configured when the container is run; it can't be configured host-wide.

But you are right, and we could add a flag that permits this option at the client level, like it's done with exec in Consul.

Let's wait and see what @dadgar says, and I hope it'll land in 0.6.

@dadgar
Contributor

dadgar commented Jul 28, 2017

@burdandrei Can you explain your use case for this in more detail? With the current implementation of this PR, a single allocation with this set would be able to starve the host of memory and affect the QoS of every other allocation on the host.

@burdandrei
Contributor Author

OK, @dadgar.
In our case we have a real-time application serving HTTP requests, and Resque workers doing all the background, async, daily, and monthly jobs.
For Resque we run one worker per queue, and we have more than 150 queues. We run the Resque workers on dedicated machines, using constraints. The basic memory footprint of a worker is 200 MB, with spikes to 400-500 MB while forking to process a job (this is the Resque flow; we don't really control it).
The problem is that we have daily and weekly jobs that generate summary reports and other output, and those workers' RAM usage can jump to 12 GB. But because they run on a machine with 70 other workers that are not actively working at the time, we have a lot of spare memory to use.
Before moving to Nomad we ran this with systemd, and it worked perfectly for us.

While working on my pull request I checked the behavior of the OOM killer when using soft memory limits:
If two containers are scheduled, both with soft memory limits, one with a 1 GB limit and one with a 2 GB limit, and both are using more than their configured soft limit, then when the Docker daemon detects saturation of host memory, it kills the container with the smaller memory limit.

I can see @OferE is just using the exec driver to do the same thing, but the Docker driver fits this perfectly using the soft memory option.

I do agree that the flag name is questionable, but I think exposing this Docker driver option could be a really great feature for batch processing.

Cheers!

@jzvelc

jzvelc commented Jul 29, 2017

Our use case is similar to the one described in #606.
I would go with a more generic approach, since other options may be required in the future.

Add docker.allowed_run_opts to the client configuration, where operators could specify which additional docker run options may be set with the docker driver, e.g.:

client {
  options {
    "docker.allowed_run_opts": "memory,memory-reservation"
  }
}

This will give operators control over multi-tenant environments.
In the task's config we could then set:

task {
  resources {
    memory = 500
  }
  config {
    run_opt = [
      "memory=1g",
      "memory-reservation=200m"
    ]
  }
}

In this example Nomad would use the memory requirement specified in the resources stanza for task placement (same as it does now) and set the hard memory limit on the docker container (memory=500m). This is actually the current implementation.

All provided options in run_opt (which should be validated against allowed_run_opts) would override the currently set docker run options. So in this example the docker hard memory limit would be set to 1000m and the docker soft memory limit would be set to 200m.
This would be perfect for tasks which require around 200m of memory for normal operation, with frequent 200-500m spikes. Setting the docker hard memory limit to 1g would ensure that spikes never exceed 1g, which would otherwise starve the host.
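A rough sketch of the validation step this proposal implies (the option names are hypothetical and this is not existing Nomad code), where the operator's docker.allowed_run_opts allowlist filters the task's run_opt overrides:

package main

import (
	"fmt"
	"strings"
)

// filterRunOpts validates task-level run_opt entries ("key=value") against
// the operator's allowlist (comma-separated, e.g. "memory,memory-reservation")
// and returns only the permitted overrides.
func filterRunOpts(allowed string, runOpts []string) (map[string]string, error) {
	allowSet := map[string]bool{}
	for _, key := range strings.Split(allowed, ",") {
		allowSet[strings.TrimSpace(key)] = true
	}
	out := map[string]string{}
	for _, opt := range runOpts {
		parts := strings.SplitN(opt, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed run_opt %q", opt)
		}
		if !allowSet[parts[0]] {
			return nil, fmt.Errorf("run_opt %q not permitted by docker.allowed_run_opts", parts[0])
		}
		out[parts[0]] = parts[1]
	}
	return out, nil
}

func main() {
	opts, err := filterRunOpts("memory,memory-reservation",
		[]string{"memory=1g", "memory-reservation=200m"})
	fmt.Println(opts, err)
}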

This would make the docker driver much more flexible for our workloads. What do you think @dadgar @diptanu @burdandrei? We could still go with memory limits only, but I would do something like suggested above (client config + task config). I would definitely make both the soft and hard docker memory limits configurable.

@dadgar I understand your concerns around QoS. As @burdandrei said, we can use constraints and place such jobs on a specific subset of our nodes, which wouldn't affect others. I am OK with relying on the docker OOM killer to kill some containers, since they will be restarted anyway.

I can help you work on this.

@burdandrei
Contributor Author

@jzvelc, I understand the idea behind your suggestion, but I think the real issue is that Nomad should maintain resource allocation and not let jobs take down the cluster.
That's why I still use the limits from the resources stanza, because resource allocation is one of the most important and fragile things in a scheduler.

So maybe the best approach is to rename the option to mem_limit: "soft": by default we'd have a hard memory limit, and with soft we'd have a memory reservation. That would answer both the resource-allocation needs and our use case.

@jzvelc

jzvelc commented Jul 30, 2017

I would still like to have control over both hard and soft memory limits (something like the ECS memory and memoryReservation parameters).

@burdandrei
Contributor Author

The thing is that ECS only runs Docker, and that's why you have the memory and memoryReservation options, whereas in Nomad resources are a separate stanza. But let's hear from @dadgar as the official representative =)

@burdandrei
Contributor Author

@dadgar @diptanu I don't want to be annoying, but this is something we really want, and we really don't want to start running our own fork.
Please let me know if there is anything that should be done for this to be accepted.

Regards.

@OferE

OferE commented Aug 6, 2017

Don't know if it's worth anything, but I do think @burdandrei is right - soft limits should be an option. BTW, not just for memory but for all other resources. I am using raw-exec, but this is a hack (though I am really happy with it).

@dadgar
Contributor

dadgar commented Aug 7, 2017

Hey @burdandrei ,

Unfortunately I will be closing this. This isn't without consideration because I do understand there is value here but it has implications that I am not comfortable with.

QoS

By enabling this feature you are bypassing all memory enforcement. While this may not seem bad at first, it has an effect on every allocation running on the host. We allow this via the raw_exec driver and would like to keep it so that there is only one opt-in way to bypass Nomad's enforcement.

Operator Confidence

By enabling this feature there is less determinism in the system, and that can cause much confusion. Running a job with this set at a high count may cause some allocations to fail while others run successfully, just because of what each is colocated with. This is a highly undesirable side effect.

Architecture

In the example you gave there is an architectural problem. You are essentially running a batch job as a long-lived service but want over-subscription semantics. Over-subscription is a feature that we will add in the long term. This will allow batch jobs to use resources that are unused but potentially allocated by the service jobs on a host.

You are better off dispatching jobs on an as-needed basis to process work from the queue. This way you can configure them for their peak load, and when they finish, their resources are freed up. We have written a blog post on this pattern: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch/

@dadgar dadgar closed this Aug 7, 2017
@burdandrei
Contributor Author

Thanks for your answer @dadgar. We thought about using job dispatch, but with legacy projects, where worker startup takes more than 30 seconds and the work itself takes 200ms, workers with queues do a great job.
I do understand your concerns about memory overcommit; memory is a very pricey and fragile resource.
We'll think about how to work around this.

@shantanugadgil
Contributor

@dadgar @burdandrei
Sorry to join this discussion late, but I too have a need for the over-subscription semantics.

My requirement is more of a percentage distribution of the resource.

In my scenario, I know that 3 containers of a specific low-overhead type run fine on one machine, yet I have to keep calculating memory and CPU divided by 3 (for each job).

I have a scenario where I isolate certain types of workloads on machines by setting the node.class constraint.
I would like to start with a 2 CPU 2 GB machine.
If I feel 2 GB is inadequate for the 3 services, I would want to create a 2 CPU 4 GB machine and ideally not have to fiddle with the job file.
Just create a new bigger machine with the same node.class and stop the smaller machine.

Nomad would just restart jobs on the bigger machine.

Regards,
Shantanu

@CumpsD

CumpsD commented Apr 16, 2018

@dadgar is this still the case, or is there an option for jobs to burst over their allocated resources? Perhaps something an operator can specify at server level?

@burdandrei
Contributor Author

To be honest, @dadgar, I'm thinking of bringing this back to life now that #3825 has been merged.
What do you say? Will you give it a chance?

@zonnie

zonnie commented Jun 20, 2019

@dadgar This is super important for our use-case.

  • A lot of services running in a very dense environment which is sometimes on-premise (i.e. no ability to spin up new nodes with ease)
  • Our memory bounds can be very fluid - some services peak by a lot
  • We still want to allow Nomad to distribute tasks (we used to override the client's memory to allow high overcommitment values but lost any sensible load distribution)
  • We are getting loads of OOM kills even though our machines have plenty of resources to spare.
    Even if this is not the preferred solution, we must have some way to handle peaks to some extent.

@burdandrei
Contributor Author

@zonnie we'll probably be able to add this as a pluggable driver; that should become a feature soon 🤞

@zonnie

zonnie commented Jun 23, 2019

@burdandrei - can you give me any lead on how to implement this kind of thing?
I looked at the plugin documentation and got lost... it's very complicated and lacks real examples.
I just want a Docker-like driver that differs only in how it limits memory...

@burdandrei
Contributor Author

@schmichael, @nickethier after our talk at HashiConf, do you want me to revive this one, or open a new PR?

@github-actions

github-actions bot commented Feb 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2023