
Ability to configure Envoy concurrency/worker thread count for Nomad jobs [feature request] #9341

Closed
chrisboulton opened this issue Nov 12, 2020 · 3 comments · Fixed by #9487

@chrisboulton

Feature Request

By default, Envoy configures itself with as many worker threads as there are CPU cores available (per https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-concurrency).

In some environments this is fine, but just as CPU and memory resources are assignable to Envoy, it is desirable to be able to tune the thread count. In our environment we will be running a large number of sidecar containers, and a large number of worker threads leads to increased memory utilisation and, for lower-throughput services, lower connection pool hit ratios. Lyft themselves suggest tuning Envoy sidecars down to a small number of threads, as appropriate for the proxied service.

This is (hopefully) possible today by overriding sidecar_task.config.args to supply a custom --concurrency option, like so:

args = [
  "--concurrency", "${NOMAD_envoy_threads}",
  "-c", "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
  "-l", ${meta.connect.log_level}",
  "--disable-hot-restart",
]
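
For context, here is a minimal sketch of where that override lives in the job spec. The group, service name, port, and the hard-coded concurrency value are illustrative; the remaining args mirror the workaround above:

group "api" {
  network {
    mode = "bridge"
  }

  service {
    name = "api"
    port = "8080"

    connect {
      sidecar_service {}

      sidecar_task {
        config {
          # Args mirror the workaround above; "2" is an illustrative thread count.
          args = [
            "--concurrency", "2",
            "-c", "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
            "-l", "${meta.connect.log_level}",
            "--disable-hot-restart",
          ]
        }
      }
    }
  }
}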

It'd be nice not to have to do that, and to avoid the risk of ending up out of sync with how Nomad configures Envoy.

Possible Implementation

In my mind this is something that should be tuneable per sidecar task, so either a parameter (possibly a well-known metadata key/value) or an environment variable that can be set on the task would work here. The value should probably default to 0 (the Envoy default).

@shoenig added the theme/consul/connect (Consul Connect integration) and type/enhancement labels Nov 13, 2020
@shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 13, 2020
@shoenig self-assigned this Nov 30, 2020
@shoenig (Member) commented Nov 30, 2020

Thanks for pointing this out, @chrisboulton !

Something like ${meta.connect.proxy_concurrency} would be consistent with the existing meta fields for configuring the Connect proxy task, and would allow configuring the value at the job, group, or task level. I actually think we should set the default to 1 going forward (a last-minute backwards incompatibility before 1.0), since that is more reasonable for the default amount of CPU resources assigned to the sidecar (cpu = 250 # MHz).
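
For illustration only, here is how such a key could slot into the arguments Nomad already renders for the sidecar, mirroring the existing ${meta.connect.log_level} interpolation. This is a sketch of the idea, not the final wiring:

args = [
  "-c", "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
  "-l", "${meta.connect.log_level}",
  # Proposed: resolved from meta, defaulting to 1 as suggested above.
  "--concurrency", "${meta.connect.proxy_concurrency}",
  "--disable-hot-restart",
]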

@chrisboulton (Author)

@shoenig: Yup, your proposed implementation sounds great. It sounds like it'd give us control and defaults at the client level as well, similar to how custom Envoy images, the log level, etc. are handled. The proposed default of 1 makes sense too (we ended up defaulting to 2 in our environment).

As an anecdote, we implemented the workaround I suggested, and it really helped us stabilise the memory usage of Envoy (especially at startup) in our rather large deployment.

It'd be wonderful if y'all could land this before 1.0 is cut; we'd be happy to test, as always. 🥳

shoenig added a commit that referenced this issue Dec 1, 2020
Previously, every Envoy Connect sidecar would spawn as many worker
threads as logical CPU cores. That is Envoy's default behavior when
`--concurrency` is not explicitly set. Nomad now sets the concurrency
flag to 1, which is sensible for the default cpu = 250 Mhz resources
allocated for sidecar proxies. The concurrency value can be configured
in Client configuration by setting `meta.connect.proxy_concurrency`.

Closes #9341
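
For reference, the client configuration described in that commit message would look something like this minimal sketch; the connect.proxy_concurrency key comes from the commit message, while the value, the enabled flag, and the quoted-key map syntax are illustrative assumptions:

client {
  enabled = true

  meta {
    # Limit each Envoy Connect sidecar to two worker threads (illustrative value).
    "connect.proxy_concurrency" = "2"
  }
}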
@shoenig moved this from Needs Triage to In Review in Nomad - Community Issues Triage Dec 1, 2020
Nomad - Community Issues Triage automation moved this from In Review to Done Dec 1, 2020
@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022