Memory oversubscription #10247

Merged (10 commits) on Mar 31, 2021

Conversation

notnoop (Contributor) commented Mar 26, 2021

This PR adds support for memory oversubscription to enable better packing and resource utilization. A task may specify two memory limits: a reserved memory limit used by the scheduler when placing the allocation, and a higher limit for actual runtime. This allows job submitters to set lower, less conservative memory reservations while still being able to use excess memory on the client when it's available.

This PR lays the foundation for oversubscription - namely the internal tracking and driver plumbing, but the UX is still in flux and I'll add additional notes.

Proposed UX

A job submitter can configure a task to use excess memory capacity by setting memory_max under the task resource:

task "task" {
  ...
      resources {
        cpu        = 500
        memory     = 200 # reserve 200MB
        memory_max = 300 # but use up to 300MB
      }
}

nomad alloc status will report the memory limit:

$ nomad alloc status 96fbeb0b
ID                  = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c
[...]

Task "task" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  176 KiB/20 MiB  300 MiB
           Max: 30 MiB

Task Events:
[...]

Notes for reviewers and technical implementation

The PR is relatively large (~1k LOC). I've attempted to organize it so that logical changes are grouped into separate commits. I'd recommend reviewing the PR by examining the individual commits.

Also, I've added inline comments with technical discussions and design choices to ease discussing them in threads.

Follow-up work post-PR

I will create new GitHub issues to track these, but I'm listing them here to set the vision for users and give context to PR reviewers.

Short term:

  • Add protection or alerting to prevent Nomad processes from being OOM-killed when tasks overuse memory on an oversubscribed client
  • Update the UI to report the maximum memory limit
  • Tweak task oom_score_adj so that aggressive tasks don't end up OOM-killing other jobs on the node
  • Add knobs to allow clients to disable oversubscription

Longer Term:

  • Update the scheduler so that it factors oversubscription into placement decisions (e.g. avoid placing very high memory_max tasks on the same node)

@@ -988,7 +993,7 @@ func (tr *TaskRunner) buildTaskConfig() *drivers.TaskConfig {
 		Resources: &drivers.Resources{
 			NomadResources: taskResources,
 			LinuxResources: &drivers.LinuxResources{
-				MemoryLimitBytes: taskResources.Memory.MemoryMB * 1024 * 1024,
+				MemoryLimitBytes: memoryLimit * 1024 * 1024,
notnoop (Contributor, Author):

Updating the Linux resources is intended to ease driver implementation and adoption of the feature: drivers that use resources.LinuxResources.MemoryLimitBytes don't need to be updated.

Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.

I don't know the breakdown of how external 3rd-party drivers check the memory limit, but I'm happy to change the default.

tgross (Member):

> Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.

So if they don't get updated, they'll just end up setting their limit equal to the memory field value, just as they do today? They just end up ignoring memory_max?

From a customer/community impact standpoint, the two I'd worry the most about are containerd and podman. Also, do we want to update qemu to take whichever is greater?

notnoop (Contributor, Author):

Yes, the failure mode is ignoring memory_max and behaving like today. I'm researching soft vs hard limits a bit now, and will ensure containerd and podman are updated to the recommended pattern.
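
For illustration of the pattern discussed in this thread, a driver that only reads NomadResources would prefer the max when it is set and fall back to the reserved value otherwise. The types below are simplified stand-ins, not Nomad's actual plugin structs:

```
package main

import "fmt"

// Simplified stand-ins for the Nomad memory resource fields discussed above.
type AllocatedMemoryResources struct {
	MemoryMB    int64 // reserved memory, used by the scheduler for placement
	MemoryMaxMB int64 // optional higher runtime limit; 0 means "not set"
}

// effectiveLimitBytes returns the hard limit a driver would enforce:
// MemoryMaxMB when oversubscription is requested, otherwise MemoryMB.
func effectiveLimitBytes(m AllocatedMemoryResources) int64 {
	limitMB := m.MemoryMB
	if m.MemoryMaxMB > limitMB {
		limitMB = m.MemoryMaxMB
	}
	return limitMB * 1024 * 1024
}

func main() {
	fmt.Println(effectiveLimitBytes(AllocatedMemoryResources{MemoryMB: 200, MemoryMaxMB: 300})) // 314572800
	fmt.Println(effectiveLimitBytes(AllocatedMemoryResources{MemoryMB: 200}))                   // 209715200
}
```

Drivers that keep reading only LinuxResources.MemoryLimitBytes get this behavior for free, since that field now carries the higher limit, as shown in the diff above.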

drivers/shared/executor/executor_linux.go (outdated thread, resolved)
}
}

task "cgroup-fetcher" {
notnoop (Contributor, Author):

Exec doesn't mount the cgroup filesystem into the exec container - so I needed this raw_exec "sidecar" to look up the relevant cgroup and memory limit instead.

tgross (Member):

Clever. Out of scope for this PR, but should we be mounting that filesystem in the exec container?

notnoop (Contributor, Author):

Yes, not sure why we didn't. Exec is hopelessly behind other drivers in goodness - it may make sense to combine all of that in an exec-v2 refresh.
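
For reference, what such a sidecar has to read boils down to the cgroup v1 memory controller files. A minimal sketch, assuming cgroup v1 and a hypothetical cgroup path (the actual e2e test resolves the task's cgroup differently):

```
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupMemoryLimit reads the hard memory limit from a cgroup v1 memory
// controller file. The path used below is illustrative, not Nomad's layout.
func readCgroupMemoryLimit(path string) (int64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	limit, err := readCgroupMemoryLimit("/sys/fs/cgroup/memory/nomad/example/memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Println("hard memory limit (bytes):", limit)
}
```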

@@ -62,5 +62,5 @@ func TestPlanNormalize(t *testing.T) {
 	}
 
 	optimizedLogSize := buf.Len()
-	assert.True(t, float64(optimizedLogSize)/float64(unoptimizedLogSize) < 0.62)
+	assert.Less(t, float64(optimizedLogSize)/float64(unoptimizedLogSize), 0.63)
notnoop (Contributor, Author):

The compression value needed to be raised to account for the new MemoryMaxMB field. It seems like a pretty odd test that will effectively fail anytime we add fields to allocs. It's nice to keep track of the value over time, but I don't know of a better way to track it, so I just changed the value here.

Comment on lines +2238 to +2250
	if r.MemoryMaxMB != 0 && r.MemoryMaxMB < r.MemoryMB {
		mErr.Errors = append(mErr.Errors, fmt.Errorf("MemoryMaxMB value (%d) should be larger than MemoryMB value (%d)", r.MemoryMaxMB, r.MemoryMB))
	}
notnoop (Contributor, Author):

The only validation we do for MemoryMaxMB is that it needs to be equal to or higher than MemoryMB. The scheduler may place the alloc on a client with less memory than MemoryMaxMB, and the client may run it.

It's unclear what the behavior should be: ideally, we'd prioritize placing the job's allocations on clients that exceed MemoryMaxMB, but IMO they should still run even if the only nodes available meet the basic memory requirement but not the max. Also, I suspect some operators will set high values as an optimistic "just be lenient and give me some excess memory" rather than setting max values through rigorous analysis and experimentation.

Member:

This is an interesting question - when clusters start to fill up, there's an incentive to set Memory absurdly low to increase your chance of getting scheduled, then lean on MemoryMax for resources to actually run. Should MemoryMax at least feed into Quota?

scheduler/generic_sched_test.go (outdated thread, resolved)
Comment on lines +3696 to +3742
	if delta.MemoryMaxMB != 0 {
		a.MemoryMaxMB += delta.MemoryMaxMB
	} else {
		a.MemoryMaxMB += delta.MemoryMB
	}
notnoop (Contributor, Author):

The updates to AllocatedMemoryResource tracking aren't strictly needed. I'm adding them for consistency and to make it easier for the scheduler to consider MemoryMaxMB in the future.
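
To make the fallback in the Add delta above concrete: when a task doesn't set a max, its reserved memory counts toward the aggregate max, so the summed MemoryMaxMB stays comparable to the summed MemoryMB. A self-contained sketch with simplified types (the real method lives on Nomad's allocated-resources structs):

```
package main

import "fmt"

// Simplified stand-in for the aggregate memory tracking discussed above.
type AllocatedMemory struct {
	MemoryMB    int64
	MemoryMaxMB int64
}

// Add mirrors the fallback shown in the diff: when the delta has no explicit
// max, its reserved memory also counts toward the aggregate max.
func (a *AllocatedMemory) Add(delta AllocatedMemory) {
	a.MemoryMB += delta.MemoryMB
	if delta.MemoryMaxMB != 0 {
		a.MemoryMaxMB += delta.MemoryMaxMB
	} else {
		a.MemoryMaxMB += delta.MemoryMB
	}
}

func main() {
	var total AllocatedMemory
	total.Add(AllocatedMemory{MemoryMB: 200, MemoryMaxMB: 300}) // task with memory_max
	total.Add(AllocatedMemory{MemoryMB: 100})                   // task without memory_max
	fmt.Printf("MemoryMB=%d MemoryMaxMB=%d\n", total.MemoryMB, total.MemoryMaxMB) // MemoryMB=300 MemoryMaxMB=400
}
```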

scheduler/generic_sched_test.go (outdated thread, resolved)
	Devices     []*RequestedDevice `hcl:"device,block"`
	CPU         *int               `hcl:"cpu,optional"`
	MemoryMB    *int               `mapstructure:"memory" hcl:"memory,optional"`
	MemoryMaxMB *int               `mapstructure:"memory_max" hcl:"memory_max,optional"`
notnoop (Contributor, Author):

In this iteration, I've opted to simply add a memory_max field in the job spec, with memory remaining as the "reserve"/base memory requirement. Happy to reconsider this and use an alternative name for the "base", e.g. memory_reserve or memory_required?

I considered memory_min - but I find it ambiguous: min suggests the minimum memory a task uses rather than how much memory we should reserve/allocate for the task.

tgross (Member):

Looks like that never really got resolved on the RFC, but I'm totally 👍 for this. It avoids any migration issues later, too.

notnoop requested review from shoenig and tgross on March 26, 2021 21:41
notnoop added this to the 1.1.0 milestone on Mar 26, 2021
burdandrei (Contributor):

waiting for this since #2771 ;)

tgross (Member) left a comment:

Solid work here. I want to take a second pass through it but I want to press Submit on this review so you can answer any questions async. Also, do we need to do anything here in OSS to pass the max memory to quota stack checking in ENT?

(It's kind of a shocking amount of plumbing code required!)

jobspec/parse_task.go (resolved thread)
nomad/fsm_test.go (resolved thread)
Mahmood Ali added 8 commits March 30, 2021 16:55
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use on the client. This allows tasks to specify a memory reservation (to be
used by the scheduler when placing the task) while using excess memory on the
client if any is available.

This commit adds the server-side tracking for the value, and ensures that
allocations' AllocatedResources fields include it.
This commit updates the API to pass the MemoryMaxMB field, and the CLI to show
the max set for the task.

Also, start parsing MemoryMaxMB in HCL2, as it's driven by the struct tags.

A sample CLI output; note the additional `Max: ` for "task":

```
$ nomad alloc status 96fbeb0b
ID                  = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c
[...]

Task "cgroup-fetcher" is "running"
Task Resources
CPU        Memory         Disk     Addresses
0/500 MHz  32 MiB/20 MiB  300 MiB

Task Events:
[...]

Task "task" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  176 KiB/20 MiB  300 MiB
           Max: 30 MiB

Task Events:
[...]
```
Allow specifying the `memory_max` field in HCL under the resources block.

Though HCLv1 is deprecated, I've updated its parser as well to ease our testing.
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease
driver implementation and adoption of the feature: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.

Drivers that use NomadResources will need to be updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
notnoop (Contributor, Author) commented Mar 30, 2021

I've updated the PR by rebasing to address merge conflicts with the core pinning changes. Also, I added a change so that we set a soft memory limit for the exec/java task cgroups.
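
For context on the soft-limit change, the cgroup v1 semantics boil down to two files: memory.soft_limit_in_bytes for the reserved amount and memory.limit_in_bytes for the hard cap. Nomad's executor configures cgroups through libcontainer rather than writing the files directly, so the following is only a sketch, with a hypothetical cgroup path:

```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// setMemoryLimits illustrates soft vs hard memory limits in cgroup v1:
// the soft limit matches the scheduled reservation (memory), the hard
// limit matches memory_max.
func setMemoryLimits(cgroupDir string, reserveBytes, maxBytes int64) error {
	write := func(file string, v int64) error {
		return os.WriteFile(filepath.Join(cgroupDir, file), []byte(strconv.FormatInt(v, 10)), 0o644)
	}
	if err := write("memory.soft_limit_in_bytes", reserveBytes); err != nil {
		return err
	}
	return write("memory.limit_in_bytes", maxBytes)
}

func main() {
	// Hypothetical cgroup path, for illustration only.
	if err := setMemoryLimits("/sys/fs/cgroup/memory/nomad/example", 200<<20, 300<<20); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```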

tgross (Member) left a comment:

LGTM!

There's a raft leadership test failure currently, but it may be an unrelated flake... not sure I've seen that one in particular but the servers don't look like they can reach each other, so that doesn't seem relevant to this PR.

			{
				Type: DiffTypeEdited,
				Name: "Resources",
				Fields: []*FieldDiff{
Member:

Looks like we need to add the Cores field here:

=== FAIL: nomad/structs TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)
    diff_test.go:7038: case 16: got:
        Task "" (Edited):
        
        "Resources" (Edited) {
        "CPU" (None): "100" => "100"
        "Cores" (None): "0" => "0"
        "DiskMB" (None): "100" => "100"
        "IOPS" (None): "0" => "0"
        "MemoryMB" (None): "100" => "100"
        "MemoryMaxMB" (Edited): "200" => "300"
        }
        
         want:
        Task "" (Edited):
        
        "Resources" (Edited) {
        "CPU" (None): "100" => "100"
        "DiskMB" (None): "100" => "100"
        "IOPS" (None): "0" => "0"
        "MemoryMB" (None): "100" => "100"
        "MemoryMaxMB" (Edited): "200" => "300"
        }
        
    --- FAIL: TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)

notnoop merged commit e3ea516 into main on Mar 31, 2021
notnoop deleted the f-memory-oversubscription-2 branch on March 31, 2021
schmichael (Member):

Should this close #606?

henrikjohansen:

Will there be a way to disable oversubscription? 🤔

Personally, I would like to see this both as an ACL policy option and as a knob under client {} to control this per node.

github-actions (bot):

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this pull request as resolved and limited conversation to collaborators on Nov 24, 2022