docs: add job-specification docs for numa #18864

Merged (3 commits) on Oct 26, 2023
Conversation

@shoenig (Member) commented Oct 25, 2023

Separate from the more comprehensive "CPU Resources" concepts doc we spoke of.

@shoenig (Member, Author) commented Oct 25, 2023

cc @schmichael, who had opinions on the wording around fragmentation / suggesting preemption (which I left out)

@schmichael (Member) left a comment

LGTM. Left some comments, but not blockers... not even sure they're good ideas. 😅

This is just complex stuff! Not sure if we should keep the docs minimal because presumably folks who go looking for NUMA features already have a strong sense of how this stuff works, or if we should steal more from the RFC that helped educate NUMA-newbies like me.

Comment on lines 56 to 58
- `none` - Nomad is free to allocate CPU cores using any strategy. Nomad uses
this freedom to allocate cores in such a way that minimizes the amount of
fragmentation of core availability per NUMA node.

to allocate cores in such a way that minimizes the amount of fragmentation of core availability per NUMA node.

Can we steal more from the RFC's wording? It's not as succinct, but this took me a minute to parse.

The RFC for reference:

[image: excerpt from the RFC's wording and diagrams]

The images make it really immediately obvious. I wonder if we should just copy/paste the RFC in 🤷
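
For readers without the doc in front of them, the option being discussed sits inside a task's `resources` block. A minimal sketch (the job, group, task, and driver names are illustrative; the block layout follows the doc added in this PR):

```hcl
job "example" {
  group "app" {
    task "server" {
      driver = "docker"

      resources {
        # NUMA-aware scheduling works with reserved cores rather than MHz.
        cores = 8

        numa {
          # "none" (the default) lets the scheduler place the cores anywhere,
          # while still trying to minimize fragmentation per NUMA node.
          affinity = "none"
        }
      }
    }
  }
}
```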

Comment on lines 69 to 72
The `require` affinity option should be used sparingly due to
the implied fragmentation caused by reserving CPU cores based on the NUMA node
they are associated with. Use it for workloads known to be highly sensitive
to memory latencies.

...used sparingly... ...implied fragmentation...

I know people often want more prescriptive guidance from us, but I would rather frame this in terms of the tradeoff being presented than to discourage the use of a feature that could reduce overall resource consumption dramatically. (Assuming if you're avoiding 300% performance penalties due to cross-node latencies, you can run fewer instances of a service to serve the same number of requests.)

So perhaps something like:

The require affinity option may cause workload fragmentation by reserving CPU cores based on the NUMA node they are associated with. Use it for workloads known to be highly sensitive to memory latencies.

Might even be worth defining workload fragmentation somewhere as something like:

A jobspec constraint that prevents optimal binpacking of Clients. This can waste cluster resources by leaving some Client resources free but unusable. For example when numa.affinity = "require", workloads cannot be scheduled on Clients which may have ample free compute resources unless those compute resources happen to be colocated on a single NUMA node.

idk where an appropriate place for that would be though.
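
As a concrete illustration of the fragmentation tradeoff described above (a hypothetical scenario with invented node and core counts, continuing the jobspec sketch from earlier):

```hcl
# Hypothetical client: 2 NUMA nodes with 8 cores each; 5 cores are free on
# node 0 and 5 on node 1, so 10 cores are free in total.
#
# With affinity = "require", a 6-core task cannot be placed on this client,
# because no single NUMA node has 6 free cores, even though the client has
# enough free cores overall. With "none", the same task could be placed by
# splitting its cores across both nodes.
resources {
  cores = 6

  numa {
    affinity = "require"
  }
}
```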

@shoenig (Member, Author) commented Oct 26, 2023

> LGTM. Left some comments, but not blockers... not even sure they're good ideas. 😅
>
> This is just complex stuff! Not sure if we should keep the docs minimal because presumably folks who go looking for NUMA features already have a strong sense of how this stuff works, or if we should steal more from the RFC that helped educate NUMA-newbies like me.

I'm working on a Concepts/CPU doc that will cover "everything" to do with how Nomad interacts with your processor. I'm thinking we should keep the jobspec doc fairly minimal and then link to the concepts doc for further reference (once it exists).

@schmichael (Member) commented

> I'm thinking we should keep the jobspec doc fairly minimal and then link to the concepts doc for further reference (once it exists).

Yeah, I love this approach and it's what I plan on doing for workload identity. Leave the reference docs concise so folks can just find what they're looking for quickly. Concept docs provide context. Tutorials provide walkthroughs/howtos. 👍

pkazmierczak pushed a commit that referenced this pull request Oct 30, 2023
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>