Memory oversubscription #10247

Merged (10 commits) on Mar 31, 2021

Conversation

notnoop (Contributor) commented Mar 26, 2021

This PR adds support for memory oversubscription to enable better packing and resource utilization. A task may specify two memory limits: a reserved memory limit used by the scheduler when placing the allocation, and a higher limit for actual runtime. This allows job submitters to set lower, less conservative memory reservations while still being able to use excess memory on the client when it's available.

This PR lays the foundation for oversubscription - namely the internal tracking and driver plumbing, but the UX is still in flux and I'll add additional notes.

Proposed UX

A job submitter can configure a task to use excess memory capacity by setting memory_max under the task resource:

task "task" {
  ...
      resources {
        cpu        = 500
        memory     = 200 # reserve 200MB
        memory_max = 300 # but use up to 300MB
      }
}

nomad alloc status will report the memory limit:

$ nomad alloc status 96fbeb0b
ID                  = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c
[...]

Task "task" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  176 KiB/20 MiB  300 MiB
           Max: 30 MiB

Task Events:
[...]

Notes for reviewers and technical implementation

The PR is relatively large (~1k LOC). I've attempted to organize it so that logical changes are grouped into separate commits. I'd recommend reviewing the PR by examining the individual commits.

Also, I've added inline comments with technical discussions and design choices to ease discussing them in threads.

Follow-up work post-PR

I will create new GitHub issues to track these, but I'm listing them here to set the vision for users and give context to PR reviewers.

Short term:

  • Add protection or alerting to prevent Nomad processes from being OOM-killed when tasks overuse memory on an oversubscribed client
  • Update the UI to report the maximum memory limit
  • Tweak task oom_score_adj so that aggressive tasks don't end up OOM-killing other jobs on the node
  • Add knobs to allow clients to disable oversubscription

Longer Term:

  • Update the scheduler so that it factors oversubscription into placement decisions (e.g. avoid placing very high memory_max tasks on the same node)

@@ -988,7 +993,7 @@ func (tr *TaskRunner) buildTaskConfig() *drivers.TaskConfig {
 		Resources: &drivers.Resources{
 			NomadResources: taskResources,
 			LinuxResources: &drivers.LinuxResources{
-				MemoryLimitBytes: taskResources.Memory.MemoryMB * 1024 * 1024,
+				MemoryLimitBytes: memoryLimit * 1024 * 1024,
notnoop (Contributor, Author):

Updating the Linux resources is intended to ease driver implementation and adoption of the feature: drivers that use resources.LinuxResources.MemoryLimitBytes don't need to be updated.

Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.

I don't know the breakdown of how external 3rd-party drivers check the memory limit, but I'm happy to change the default.

tgross (Member):

> Drivers that use NomadResources will need to be updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.

So if they don't get updated, they'll just end up setting their limit equal to the memory field value, just as they do today? They just end up ignoring memory_max?

From a customer/community impact standpoint, the two I'd worry the most about are containerd and podman. Also, do we want to update qemu to take whichever is greater?

notnoop (Contributor, Author):

Yes, the failure mode is ignoring memory_max and behaving like today. I'm researching soft vs hard limits a bit now, and will ensure containerd and podman are updated to the recommended pattern.
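
For illustration of the pattern discussed in this thread, a driver that only reads NomadResources would prefer the max when it is set and fall back to the reserved value otherwise. The types below are simplified stand-ins, not Nomad's actual plugin structs:

```
package main

import "fmt"

// Simplified stand-ins for the Nomad memory resource fields discussed above.
type AllocatedMemoryResources struct {
	MemoryMB    int64 // reserved memory, used by the scheduler for placement
	MemoryMaxMB int64 // optional higher runtime limit; 0 means "not set"
}

// effectiveLimitBytes returns the hard limit a driver would enforce:
// MemoryMaxMB when oversubscription is requested, otherwise MemoryMB.
func effectiveLimitBytes(m AllocatedMemoryResources) int64 {
	limitMB := m.MemoryMB
	if m.MemoryMaxMB > limitMB {
		limitMB = m.MemoryMaxMB
	}
	return limitMB * 1024 * 1024
}

func main() {
	fmt.Println(effectiveLimitBytes(AllocatedMemoryResources{MemoryMB: 200, MemoryMaxMB: 300})) // 314572800
	fmt.Println(effectiveLimitBytes(AllocatedMemoryResources{MemoryMB: 200}))                   // 209715200
}
```

Drivers that keep reading only LinuxResources.MemoryLimitBytes get this behavior for free, since that field now carries the higher limit, as shown in the diff above.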

drivers/shared/executor/executor_linux.go (outdated thread, resolved)
}
}

task "cgroup-fetcher" {
notnoop (Contributor, Author):

Exec doesn't mount the cgroup filesystem into the exec container - so I needed this raw_exec "sidecar" to look up the relevant cgroup and memory limit instead.

tgross (Member):

Clever. Out of scope for this PR, but should we be mounting that filesystem in the exec container?

notnoop (Contributor, Author):

Yes, not sure why we didn't. Exec is hopelessly behind other drivers in goodness - it may make sense to combine all of that in an exec-v2 refresh.
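
For reference, what such a sidecar has to read boils down to the cgroup v1 memory controller files. A minimal sketch, assuming cgroup v1 and a hypothetical cgroup path (the actual e2e test resolves the task's cgroup differently):

```
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupMemoryLimit reads the hard memory limit from a cgroup v1 memory
// controller file. The path used below is illustrative, not Nomad's layout.
func readCgroupMemoryLimit(path string) (int64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	limit, err := readCgroupMemoryLimit("/sys/fs/cgroup/memory/nomad/example/memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Println("hard memory limit (bytes):", limit)
}
```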

@@ -62,5 +62,5 @@ func TestPlanNormalize(t *testing.T) {
 	}
 
 	optimizedLogSize := buf.Len()
-	assert.True(t, float64(optimizedLogSize)/float64(unoptimizedLogSize) < 0.62)
+	assert.Less(t, float64(optimizedLogSize)/float64(unoptimizedLogSize), 0.63)
notnoop (Contributor, Author):

The compression value needed to be raised to account for the new MemoryMaxMB field. It seems like a pretty odd test that will effectively fail anytime we add fields to allocs. It's nice to keep track of the value over time, but I don't know of a better way to track it, so I just changed the value here.

Comment on lines +2238 to +2250
	if r.MemoryMaxMB != 0 && r.MemoryMaxMB < r.MemoryMB {
		mErr.Errors = append(mErr.Errors, fmt.Errorf("MemoryMaxMB value (%d) should be larger than MemoryMB value (%d)", r.MemoryMaxMB, r.MemoryMB))
	}
notnoop (Contributor, Author):

The only validation we do for MemoryMaxMB is that it needs to be equal to or higher than MemoryMB. The scheduler may place the alloc on a client with less memory than MemoryMaxMB, and the client may run it.

It's unclear what the behavior should be: ideally, we'd prioritize placing the job's allocations on clients that exceed MemoryMaxMB, but IMO they should still run even if the only nodes available meet the basic memory requirement but not the max. Also, I suspect some operators will set high values as an optimistic "just be lenient and give me some excess memory" rather than setting max values through rigorous analysis and experimentation.

Member:

This is an interesting question - when clusters start to fill up, there's an incentive to set Memory absurdly low to increase your chance of getting scheduled, then lean on MemoryMax for resources to actually run. Should MemoryMax at least feed into Quota?

scheduler/generic_sched_test.go (outdated thread, resolved)
Comment on lines +3696 to +3742
	if delta.MemoryMaxMB != 0 {
		a.MemoryMaxMB += delta.MemoryMaxMB
	} else {
		a.MemoryMaxMB += delta.MemoryMB
	}
notnoop (Contributor, Author):

The updates to AllocatedMemoryResource tracking aren't strictly needed. I'm adding them for consistency and to make it easier for the scheduler to consider MemoryMaxMB in the future.
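
To make the fallback in the Add delta above concrete: when a task doesn't set a max, its reserved memory counts toward the aggregate max, so the summed MemoryMaxMB stays comparable to the summed MemoryMB. A self-contained sketch with simplified types (the real method lives on Nomad's allocated-resources structs):

```
package main

import "fmt"

// Simplified stand-in for the aggregate memory tracking discussed above.
type AllocatedMemory struct {
	MemoryMB    int64
	MemoryMaxMB int64
}

// Add mirrors the fallback shown in the diff: when the delta has no explicit
// max, its reserved memory also counts toward the aggregate max.
func (a *AllocatedMemory) Add(delta AllocatedMemory) {
	a.MemoryMB += delta.MemoryMB
	if delta.MemoryMaxMB != 0 {
		a.MemoryMaxMB += delta.MemoryMaxMB
	} else {
		a.MemoryMaxMB += delta.MemoryMB
	}
}

func main() {
	var total AllocatedMemory
	total.Add(AllocatedMemory{MemoryMB: 200, MemoryMaxMB: 300}) // task with memory_max
	total.Add(AllocatedMemory{MemoryMB: 100})                   // task without memory_max
	fmt.Printf("MemoryMB=%d MemoryMaxMB=%d\n", total.MemoryMB, total.MemoryMaxMB) // MemoryMB=300 MemoryMaxMB=400
}
```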

scheduler/generic_sched_test.go (outdated thread, resolved)
	Devices     []*RequestedDevice `hcl:"device,block"`
	CPU         *int               `hcl:"cpu,optional"`
	MemoryMB    *int               `mapstructure:"memory" hcl:"memory,optional"`
	MemoryMaxMB *int               `mapstructure:"memory_max" hcl:"memory_max,optional"`
notnoop (Contributor, Author):

In this iteration, I've opted to simply add a memory_max field in the job spec, with memory remaining as the "reserve"/base memory requirement. Happy to reconsider this and use an alternative name for the "base", e.g. memory_reserve or memory_required?

I considered memory_min - but I find it ambiguous: min suggests the minimum memory a task uses rather than how much memory we should reserve/allocate for the task.

tgross (Member):

Looks like that never really got resolved on the RFC, but I'm totally 👍 for this. It avoids any migration issues later, too.

notnoop requested review from shoenig and tgross on March 26, 2021 21:41
notnoop added this to the 1.1.0 milestone on Mar 26, 2021
burdandrei (Contributor):

waiting for this since #2771 ;)

tgross (Member) left a comment:

Solid work here. I want to take a second pass through it but I want to press Submit on this review so you can answer any questions async. Also, do we need to do anything here in OSS to pass the max memory to quota stack checking in ENT?

(It's kind of a shocking amount of plumbing code required!)

jobspec/parse_task.go (resolved thread)
nomad/fsm_test.go (resolved thread)
Mahmood Ali added 8 commits March 30, 2021 16:55
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use on the client. This allows tasks to specify a memory reservation (to be
used by the scheduler when placing the task) while using excess memory on the
client if any is available.

This commit adds the server-side tracking for the value, and ensures that
allocations' AllocatedResources fields include it.
This commit updates the API to pass the MemoryMaxMB field, and the CLI to show
the max set for the task.

Also, start parsing MemoryMaxMB in HCL2, as it's driven by the struct tags.

A sample CLI output; note the additional `Max: ` for "task":

```
$ nomad alloc status 96fbeb0b
ID                  = 96fbeb0b-a0b3-aa95-62bf-b8a39492fd5c
[...]

Task "cgroup-fetcher" is "running"
Task Resources
CPU        Memory         Disk     Addresses
0/500 MHz  32 MiB/20 MiB  300 MiB

Task Events:
[...]

Task "task" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  176 KiB/20 MiB  300 MiB
           Max: 30 MiB

Task Events:
[...]
```
Allow specifying the `memory_max` field in HCL under the resources block.

Though HCLv1 is deprecated, I've updated its parser as well to ease our testing.
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease
driver implementation and adoption of the feature: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.

Drivers that use NomadResources will need to be updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
notnoop (Contributor, Author) commented Mar 30, 2021

I've updated the PR by rebasing to address merge conflicts with the core pinning changes. Also, I added a change so that we set a soft memory limit for the exec/java task cgroups.
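
For context on the soft-limit change, the cgroup v1 semantics boil down to two files: memory.soft_limit_in_bytes for the reserved amount and memory.limit_in_bytes for the hard cap. Nomad's executor configures cgroups through libcontainer rather than writing the files directly, so the following is only a sketch, with a hypothetical cgroup path:

```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// setMemoryLimits illustrates soft vs hard memory limits in cgroup v1:
// the soft limit matches the scheduled reservation (memory), the hard
// limit matches memory_max.
func setMemoryLimits(cgroupDir string, reserveBytes, maxBytes int64) error {
	write := func(file string, v int64) error {
		return os.WriteFile(filepath.Join(cgroupDir, file), []byte(strconv.FormatInt(v, 10)), 0o644)
	}
	if err := write("memory.soft_limit_in_bytes", reserveBytes); err != nil {
		return err
	}
	return write("memory.limit_in_bytes", maxBytes)
}

func main() {
	// Hypothetical cgroup path, for illustration only.
	if err := setMemoryLimits("/sys/fs/cgroup/memory/nomad/example", 200<<20, 300<<20); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```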

tgross (Member) left a comment:

LGTM!

There's a raft leadership test failure currently, but it may be an unrelated flake... not sure I've seen that one in particular but the servers don't look like they can reach each other, so that doesn't seem relevant to this PR.

			{
				Type: DiffTypeEdited,
				Name: "Resources",
				Fields: []*FieldDiff{
Member:

Looks like we need to add the Cores field here:

=== FAIL: nomad/structs TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)
    diff_test.go:7038: case 16: got:
        Task "" (Edited):
        
        "Resources" (Edited) {
        "CPU" (None): "100" => "100"
        "Cores" (None): "0" => "0"
        "DiskMB" (None): "100" => "100"
        "IOPS" (None): "0" => "0"
        "MemoryMB" (None): "100" => "100"
        "MemoryMaxMB" (Edited): "200" => "300"
        }
        
         want:
        Task "" (Edited):
        
        "Resources" (Edited) {
        "CPU" (None): "100" => "100"
        "DiskMB" (None): "100" => "100"
        "IOPS" (None): "0" => "0"
        "MemoryMB" (None): "100" => "100"
        "MemoryMaxMB" (Edited): "200" => "300"
        }
        
    --- FAIL: TestTaskDiff/Resources_edited_memory_max_with_context (0.00s)

notnoop merged commit e3ea516 into main on Mar 31, 2021
notnoop deleted the f-memory-oversubscription-2 branch on March 31, 2021
schmichael (Member):

Should this close #606?

henrikjohansen:

Will there be a way to disable oversubscription? 🤔

Personally, I would like to see this both as an ACL policy option and as a knob under client {} to control this per node.

github-actions (bot):

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this pull request as resolved and limited conversation to collaborators on Nov 24, 2022