docs: add job-specification docs for numa #18864

Merged (3 commits) on Oct 26, 2023
Conversation

@shoenig (Member) commented Oct 25, 2023

Separate from the more comprehensive "CPU Resources" concepts doc we spoke of.

@shoenig (Member, Author) commented Oct 25, 2023

cc @schmichael, who had opinions on the wording around fragmentation / suggesting preemption (which I left out)

@schmichael (Member) left a comment

LGTM. Left some comments, but not blockers... not even sure they're good ideas. 😅

This is just complex stuff! Not sure if we should keep the docs minimal because presumably folks who go looking for NUMA features already have a strong sense of how this stuff works, or if we should steal more from the RFC that helped educate NUMA-newbies like me.

Comment on lines 56 to 58
- `none` - Nomad is free to allocate CPU cores using any strategy. Nomad uses
this freedom to allocate cores in such a way that minimizes the amount of
fragmentation of core availability per NUMA node.

to allocate cores in such a way that minimizes the amount of fragmentation of core availability per NUMA node.

Can we steal more from the RFC's wording? It's not as succinct, but this took me a minute to parse.

The RFC for reference:

[image: excerpt from the RFC's wording and diagrams]

The images make it really immediately obvious. I wonder if we should just copy/paste the RFC in 🤷
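
For readers without the doc in front of them, the option being discussed sits inside a task's `resources` block. A minimal sketch (the job, group, task, and driver names are illustrative; the block layout follows the doc added in this PR):

```hcl
job "example" {
  group "app" {
    task "server" {
      driver = "docker"

      resources {
        # NUMA-aware scheduling works with reserved cores rather than MHz.
        cores = 8

        numa {
          # "none" (the default) lets the scheduler place the cores anywhere,
          # while still trying to minimize fragmentation per NUMA node.
          affinity = "none"
        }
      }
    }
  }
}
```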

Comment on lines 69 to 72
The `require` affinity option should be used sparingly due to
the implied fragmentation caused by reserving CPU cores based on the NUMA node
they are associated with. Use it for workloads known to be highly sensitive
to memory latencies.

...used sparingly... ...implied fragmentation...

I know people often want more prescriptive guidance from us, but I would rather frame this in terms of the tradeoff being presented than to discourage the use of a feature that could reduce overall resource consumption dramatically. (Assuming if you're avoiding 300% performance penalties due to cross-node latencies, you can run fewer instances of a service to serve the same number of requests.)

So perhaps something like:

The require affinity option may cause workload fragmentation by reserving CPU cores based on the NUMA node they are associated with. Use it for workloads known to be highly sensitive to memory latencies.

Might even be worth defining workload fragmentation somewhere as something like:

A jobspec constraint that prevents optimal binpacking of Clients. This can waste cluster resources by leaving some Client resources free but unusable. For example when numa.affinity = "require", workloads cannot be scheduled on Clients which may have ample free compute resources unless those compute resources happen to be colocated on a single NUMA node.

idk where an appropriate place for that would be though.
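
As a concrete illustration of the fragmentation tradeoff described above (a hypothetical scenario with invented node and core counts, continuing the jobspec sketch from earlier):

```hcl
# Hypothetical client: 2 NUMA nodes with 8 cores each; 5 cores are free on
# node 0 and 5 on node 1, so 10 cores are free in total.
#
# With affinity = "require", a 6-core task cannot be placed on this client,
# because no single NUMA node has 6 free cores, even though the client has
# enough free cores overall. With "none", the same task could be placed by
# splitting its cores across both nodes.
resources {
  cores = 6

  numa {
    affinity = "require"
  }
}
```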

@shoenig (Member, Author) commented Oct 26, 2023

> LGTM. Left some comments, but not blockers... not even sure they're good ideas. 😅
>
> This is just complex stuff! Not sure if we should keep the docs minimal because presumably folks who go looking for NUMA features already have a strong sense of how this stuff works, or if we should steal more from the RFC that helped educate NUMA-newbies like me.

I'm working on a Concepts/CPU doc that will cover "everything" to do with how Nomad interacts with your processor. I'm thinking we should keep the jobspec doc fairly minimal and then link to the concepts doc for further reference (once it exists).

@schmichael (Member) commented

> I'm thinking we should keep the jobspec doc fairly minimal and then link to the concepts doc for further reference (once it exists).

Yeah, I love this approach and it's what I plan on doing for workload identity. Leave the reference docs concise so folks can just find what they're looking for quickly. Concept docs provide context. Tutorials provide walkthroughs/howtos. 👍

pkazmierczak pushed a commit that referenced this pull request Oct 30, 2023
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>