consul: support for admin partitions #13139

Closed
shoenig opened this issue May 26, 2022 · 5 comments · Fixed by #19665

Comments

@shoenig
Member

shoenig commented May 26, 2022

Consul 1.11 added support for admin partitions.

Similar to the work done for adding support for Consul namespaces, Nomad should also add support for Consul admin partitions.

@jrasell jrasell added the theme/consul/connect, stage/accepted, and theme/service-discovery/consul labels May 30, 2022
@shoenig shoenig removed the stage/accepted label Sep 22, 2022
@mikenomitch
Contributor

mikenomitch commented Oct 28, 2022

(EDIT: @davidfleming's reply below spells out some issues with my suggestion below)

Wanted to comment on the current state of things for anybody interested. Currently, Nomad doesn’t “know” about Admin Partitions, but it still works with them. There’s nothing that should break using the two together assuming you split up nodes and workloads properly. The question is, how do you split that up? The rough steps would be something like:

  • Step 1: Get Consul Admin Partitions running and split up your Consul nodes into partitions as you see fit.
  • Step 2: Tag the nodes in Nomad appropriately so that you can send specific workloads to specific nodes. I think the easiest way to do this is to use datacenters. If Admin Partitions map directly to Nomad datacenters, then it is really easy to know which Nomad node maps to which admin partition (see the sketch after this list).
  • Step 3: When deploying Nomad jobs, make sure that you send in the right Consul creds for the right Admin Partition, i.e. pass in a Consul token that has access to that partition.
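
As a rough sketch of that datacenter-based mapping (all names below are just examples, not a verified setup), the Consul agent config, Nomad agent config, and jobspec could look something like this:

# Consul agent config on the "staging" nodes (Consul Enterprise)
partition = "staging"

# Nomad agent config on those same nodes
datacenter = "staging"

# Jobspec: pin the job to the nodes whose local Consul agents are in that partition
job "web" {
  datacenters = ["staging"]
  # ...
}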

Notes:

  • You don’t have to use datacenters to do this split. You could technically just use node_class or metadata to define each partition and then make sure Nomad workloads are placed onto the right nodes with those (see the sketch after these notes). In my opinion, this is more error-prone.
  • Splitting up Nomad clusters would also work. Each Admin partition could be a different Nomad cluster.
  • Nomad might introduce the concept of “node pools” at some point which would function somewhat like datacenter in terms of how you can split nodes up. This might eventually become the suggested way.
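
For the node_class variant mentioned in the first note, a minimal sketch (the class name is made up) would be to set the class on the Nomad clients and constrain the job to it:

# Nomad agent config on the nodes whose Consul agent is in the partition
client {
  node_class = "consul-staging"
}

# Jobspec constraint keeping the workload on those nodes
constraint {
  attribute = "${node.class}"
  value     = "consul-staging"
}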

More official support for admin partitions is something that we've looked at briefly, but not explored in depth. If you have a use case where Nomad having first-class knowledge of admin partitions would be helpful, please let us know. A common use case where splitting by datacenter, cluster, or node_class is not sufficient would help us make the case for this internally.

@davidfleming

davidfleming commented Oct 29, 2022

Hi @mikenomitch

First, thank you for commenting on this. We have been looking forward to seeing progress here, and a discussion on the direction this will potentially take would be appreciated.

Our use case:

First-class support for Admin Partitions would be very useful for what we are trying to accomplish. In particular, we are looking to separate workloads by jobspec and bin pack them on a cluster of clients. What that looks like for us is a breakdown of "production" vs "non-production" clients. The "non-production" workloads will consist of N environments (think dev, staging, test, PR-123, etc.). Ideally we would like to be able to easily spin environments up and down by bin packing them into existing clusters. The pattern you mentioned would require us to spin up new Consul clients for each environment. While we could hack this together with namespaces, it would become cumbersome and would not provide the same level of isolation we are looking for (i.e. the environments should not gossip or talk to each other).

We did start down this route as an interim step but already hit some roadblocks in implementing it.

Some technical problems with the steps you listed:

  1. When the Nomad client tries to query the Consul agent, it complains about a partition mismatch. Here is an example of the Consul agent being configured with a "staging" partition and the Nomad client defaulting to "default" since it is unaware of partitions:

nomad[392012]: 2022-10-28T17:46:26.521Z [ERROR] consul.sync: still unable to update services in Consul: failures=10 error="failed to query Consul services: Unexpected response code: 400 (request targets partition "default" which does not match agent partition "staging")"

  2. If the Nomad client ends up using the partition of the Consul agent, will the functionality of client_auto_join still work? I.e. will the servers be registered in the default partition while the Nomad client looks for them in the wrong partition (that of the client)?

Thanks,
David

@mikenomitch
Contributor

Hey @davidfleming, we were looking into this on the Nomad engineering team, and unfortunately to "do it right" it'll take more effort than we have time for in the near future. We do plan to get to it, but not in the next month or two.

In the meantime, I noticed that there's a CONSUL_PARTITION env var you can set. Nomad itself isn't setting "default", so I think if you set that wherever you run the Consul client, that should fix the first problem you noted.

I think this should solve problem 2 as well, but to be honest I'm not sure.

Sorry about the delay on all of this, but hope that workaround works 🤞

@tgross tgross self-assigned this Dec 14, 2023
@tgross
Member

tgross commented Dec 14, 2023

Here's our plan for implementing Admin Partitions support in Nomad. I'll break this down into three sections.

Fingerprinting

A Consul Enterprise agent can belong to exactly one partition. We require that each Nomad agent has its own Consul agent (if you're using Consul at all), so the partition becomes an easy target for us to fingerprint. You then immediately get two options for allocating Nomad workloads to Consul partitions:

  • Job authors can add a constraint to their job on the attribute attr.consul.partition (or attr.consul.$clusterName.partition for non-default clusters); a sketch follows this list.
  • Cluster administrators can set a 1:1 relationship between Consul partitions and Nomad node pools by having the Consul agent configured for the appropriate partition on the nodes where Nomad is in a particular node pool. (Ex. you can set Nomad agent node_pool = "prod" and the Consul agent partition = "prod".)
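
For example, a job-level constraint like the following (the partition name is just a placeholder) would keep a job on nodes whose local Consul agent is in that partition:

job "example" {
  constraint {
    attribute = "${attr.consul.partition}"
    value     = "prod"
  }
  # ...
}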

The partition is exposed in Consul's existing /v1/agent/self endpoint, so implementing fingerprinting turns out to be fairly trivial. One minor annoyance is that the API returns .Config.Partition = "default" only if the partition is explicitly set to "default", not when the agent is implicitly in the default partition. So when we fingerprint, we'll check the SKU, and if it's Consul Enterprise, we'll fill in the default partition when it's missing.

I've got a draft PR up for the fingerprinting here: #19485

Fingerprinting output:

Consul CE agent (no partitions):

$ curl -s "http://localhost:8500/v1/agent/self" | jq .Config
{
  "Datacenter": "dc1",
  "PrimaryDatacenter": "dc1",
  "NodeName": "nomad0",
  "NodeID": "ec86d276-1c51-edb0-ad58-c79ec07f07e2",
  "Revision": "61547a41",
  "Server": false,
  "Version": "1.13.6",
  "BuildDate": "2023-01-26T15:59:13Z"
}
$ nomad node status -verbose -self | grep consul
consul.connect                      = true
consul.datacenter                   = dc1
consul.ft.namespaces                = false
consul.grpc                         = 8502
consul.revision                     = 61547a41
consul.server                       = false
consul.sku                          = oss
consul.version                      = 1.13.6
unique.consul.name                  = nomad0

Consul Enterprise agent with non-default partition:

$ curl -s "http://localhost:8500/v1/agent/self" | jq .Config
{
  "Datacenter": "dc1",
  "PrimaryDatacenter": "dc1",
  "NodeName": "nomad0",
  "NodeID": "ec86d276-1c51-edb0-ad58-c79ec07f07e2",
  "Partition": "example",
  "Revision": "d6969061",
  "Server": false,
  "Version": "1.16.0+ent",
  "BuildDate": "2023-06-26T20:27:46Z"
}
$ nomad node status -verbose -self | grep consul
consul.connect                      = true
consul.datacenter                   = dc1
consul.ft.namespaces                = true
consul.grpc                         = 8502
consul.partition                    = example
consul.revision                     = d6969061
consul.server                       = false
consul.sku                          = ent
consul.version                      = 1.16.0+ent
unique.consul.name                  = nomad0

Jobspec

Next, we can add a partition field to the consul block in the jobspec.

If consul.partition is set in the job, we'd add an implicit constraint in one of the job mutating hooks we already have for Consul (either job_endpoint_hook_consul_ce.go#L92 or more likely job_endpoint_hooks.go#L179).
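
As a rough sketch of what that could look like in a jobspec (the partition name here is just an example, and the final syntax may differ):

group "api" {
  consul {
    partition = "example"
  }
  # ...
}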

Enterprise Considerations

Consul admin partitions are a Consul Enterprise feature, so at first glance it would make sense to restrict this option to Nomad Enterprise as well. But we currently allow users to set a Consul namespace for their Nomad CE cluster, and this feature maps rather directly to that. Once the fingerprinting is added, adding the constraint is trivial for any user, so restricting just the jobspec portion of this to ENT wouldn't make sense either. So this feature will be fully implemented in Nomad Community Edition.

tgross added a commit that referenced this issue Dec 14, 2023
Consul Enterprise agents all belong to an admin partition. Fingerprint this
attribute when available. When a Consul agent is not explicitly configured with
"default", it is in the default partition but will not report this in its
`/v1/agent/self` endpoint. Fall back to "default" when missing, but only for
Consul Enterprise.

This feature gives users the ability to add constraints so that jobs land on
Nomad nodes that have a Consul agent in that partition. It can also allow cluster
administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also
have the option to implement a future `partition` field in the jobspec's
`consul` block to create an implicit constraint.

Ref: #13139 (comment)
@tgross
Member

tgross commented Dec 15, 2023

Fingerprinting has been merged and will ship in the next regular 1.7.x release of Nomad (most likely 1.7.3).

tgross added a commit that referenced this issue Jan 8, 2024
Add support for Consul Enterprise admin partitions. We added fingerprinting in
#19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: #13139
@tgross tgross added this to the 1.7.x milestone Jan 9, 2024