Add additional ECS API configurations #2527

arnaldo2792 · 2022-10-26T21:53:14Z

Issue number:

Description of changes:

With this, the following configurations will be supported through API settings:

ECS_CONTAINER_STOP_TIMEOUT
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
ECS_TASK_METADATA_RPS_LIMIT
ECS_RESERVED_MEMORY

For ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION and ECS_CONTAINER_STOP_TIMEOUT, I added the new ECSDurationValue struct used to represent a time.Duration string. I decided to use a new struct instead of refactoring KubernetesDurationValue to keep both duration values decoupled from one another, in case the type of any of them changes in future releases. The two aforementioned configurations are rendered in the /etc/ecs/ecs.config file, which holds environment variables passed to the ECS agent's daemon. This is because parsing valid time.Duration strings in golang requires calls to time.ParseDuration which isn't called by the json.Unmarshall function used by the ECS agent to parse the JSON formatted configuration at /etc/ecs/ecs.config.json. Even though it is possible to deserialize time.Duration values from a JSON string, I decided against this since the values in the JSON file must be of type int64 which isn't as user friendly as the duration strings 1m, 1s, 1h, etc.

ECS_TASK_METADATA_RPS_LIMIT contains a comma-separated string with two values used to set up the throttling configurations of the ECS agent's metadata service. The two values are assigned to two fields of the struct type used to represent the agent's configuration. These two values are independent from one another, meaning that a) the agent doesn't perform coupled validations on the values, and b) if any of them is missing or is zero, the agent will use default values. Since both values must be integers, they can be deserialized without any helper functions (unlike time.Duration). As a result, I decided to keep these two values as separate settings in the API (metadata-service-rps and metadata-service-burst) instead of one single configuration, and render them in /etc/ecs/ecs.config.json . The official AWS documentation for the agent's configurations clearly states what each value in the tuple represents, so it should be relatively simple for users to migrate from the comma-separated configuration. No validation is required to verify that the new configurations are positive integers since the ECS agent won't fail to start, also a useful message is printed.

ECS_RESERVED_MEMORY represents the amount of memory (in MB) that the agent won't use for the allocated tasks. Since this configuration is of type unit16 and it can be serialized without any helper functions, I decided to render it in /etc/ecs/ecs.config.json.

TODO:

Add documentation for the new ECS settings.

Testing done:

In my own 1.11.0 variant, I verified I could set the new API settings, and that the ECS agent picked them up:
- task-cleanup-wait: I confirmed that the container resources were cleaned up after the specified time
- container-stop-timeout: I created a container with a process that ignored SIGTERM signals; the agent sent SIGKILL after the specified time
- metadata-service-rps, metadata-service-burst: I used -1 as the values for the settings and confirmed that the ECS agent logged a warning
- reserved-memory: I confirmed the ECS agent logged the remaining memory
I did upgrade/downgrade testing and confirmed the host came back up

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

etungsten · 2022-10-27T17:55:18Z

sources/models/src/modeled_types/ecs.rs

+
+lazy_static! {
+    pub(crate) static ref ECS_DURATION_VALUE: Regex = Regex::new(
+        r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?$"


The ECS documentation for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION reads

(Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", and "h".)

Should we try and cover "ns", "us/µs" as well?

Actually, the ECS agent checks that both duration values are at least 1s: container stop timeout and task cleanup

So I think we can keep seconds as the minimum unit supported and even drop ms

arnaldo2792 · 2022-10-27T21:18:15Z

Forced push includes:

Add documentation for the new settings
Remove ms as a supported unit for ECS duration types, since the minimal value supported in the ECS agent is 1s.

README.md

arnaldo2792 · 2022-10-27T21:40:27Z

(Forced push to add missing . in documentation)

etungsten · 2022-10-27T21:45:33Z

README.md

+* `settings.ecs.metadata-service-burst-rate`: The burst rate limit of the throttling configurations set for the task metadata service
+* `settings.ecs.reserved-memory`: The amount of memory, in MB, reserved for critical system processes
+* `settings.ecs.task-cleanup-wait-duration`: Time to wait before the tasks resources are removed after they are stopped.
+Valid time units are `s`, `m`, and `h`, i.e: `1h`, `1m1s`


Sorry, should have caught this in my earlier review.
nit: i.e. should be e.g. in this context since you're listing examples and not rephrasing or elaborating the previous clause.

See https://www.merriam-webster.com/words-at-play/ie-vs-eg-abbreviation-meaning-usage-difference

jpmcb

Overall looks great - built an image off this patch and see the rendered environment variables under the ECS config file and in the json file

sources/models/src/modeled_types/ecs.rs

arnaldo2792 · 2022-10-27T22:44:48Z

(forced push replaces "i. e.:" with "e.g.")

sources/models/src/lib.rs

sources/models/src/modeled_types/ecs.rs

README.md

bcressey · 2022-10-28T15:18:20Z

Here is some additional design guidance around API settings to provide more context.

The input value should match the output value to the extent possible. Meaning, don't take strings as input for numeric output, and don't take large unsigned integers for small signed integers. This is confusing for users and creates potential security risks.

Some types, like strings, need additional constraints (e.g. SingleLineString) to avoid creating problems when rendered in a template. This is a purely host level concern, meant to guard against rendering garbage that breaks the file structure, or having the value for one setting used as a backdoor to apply a completely different setting.

Beyond that, validation has two goals which are in tension.

If the agent (kubelet, ecs-agent) dies with a fatal error when given input of the correct type that does not match some other constraint - e.g. a string that does not match a regex, or a value that's outside a specific range - then ideally the API should reject that. Otherwise, a typo in an automated command applied fleet-wide could take down a large number of nodes at once. This is not an unbreakable rule - people could just not make typos - but it's reasonable to expect the API to have your back.

The drawback with this approach is that on first boot, nodes will fail to come up at all if given this kind of input. That is not a great experience. However, at scale, my bias is toward prioritizing the health of nodes that are in service over the usability of nodes that have never been in service.

On the other hand, if the agent does not die with an error when presented with input that doesn't match the constraint, then the preferred approach can be reversed. This is especially true if the agent logs a message saying that the setting is being ignored. The automated fleet-wide command will succeed, and may appear to give the desired results ("all nodes running with the new setting!"), but the logs will tell the tale.

The advantage of this approach is that the first boot experience is better. The node will come up as a functioning system, even if it is not configured exactly as expected. There's no risk to nodes in service, so it's OK to prefer usability over exactitude.

Globally, I'd also prefer some measure of consistency - e.g. using the same KubernetesDurationValue for two settings, one of which causes a fatal error and warrants the extra safeguards, and the other of which is ignored. This is related to another aspect of usability - whether the user can learn to successfully predict the behavior of the API - which hopefully makes for fewer frustrations and surprises over time.

With that in mind, this is why I was unhappy with the rationale you provided:

We shouldn't use u64 instead of i64 just to enforce a positive number constraint, since that changes the input type and may lead to confusion or security risks.
If we add a setting, we need to also add the correct validation to the setting to protect fleets at scale. This isn't supposed to be left for a follow-up PR. If there's an operational safety risk, it should be addressed when the setting is first added.
However, if no additional validation is actually required for the agent to remain healthy, we shouldn't feel obligated to go beyond the minimum of matching up types.

arnaldo2792 · 2022-10-28T23:52:49Z

Forced push includes:

Fix supported values for duration type
Update API settings names as suggested

arnaldo2792 · 2022-11-03T19:12:05Z

(Forced push to fix merge conflicts in sources/Cargo.toml and Release.toml)

README.md

packages/ecs-agent/ecs.config

README.md

sources/api/migration/migrations/v1.11.0/ecs-additional-configurations/src/main.rs

bcressey · 2022-11-03T21:55:53Z

sources/models/src/modeled_types/ecs.rs

+
+lazy_static! {
+    pub(crate) static ref ECS_DURATION_VALUE: Regex =
+        Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(u|µ){1}s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap();


nit: regex101 called the {1} repetition "useless", and suggested using a non-capturing group for (u|µ) instead:

Suggested change

Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(u|µ){1}s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap();

Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(?:u|µ)s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap();

arnaldo2792 · 2022-11-04T02:01:24Z

Forced push includes:

Rename task-cleanup-wait-duration to task-cleanup-wait
Updated regular expression to validate duration values
Fixed comments in documentation

arnaldo2792 · 2022-11-04T02:06:48Z

(Forced push to fix merge coonflicts)

jpmcb · 2022-11-04T15:49:42Z

README.md

@@ -541,6 +543,11 @@ These settings can be changed at any time.
  Bottlerocket enables the `json-file`, `awslogs`, and `none` drivers by default.
 * `settings.ecs.loglevel`: The level of verbosity for the ECS agent's logs.
  Supported values are `debug`, `info`, `warn`, `error`, and `crit`, and the default is `info`.
+* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service
+* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service


For the metadata-service-rps and metadata-service-burst settings, I think we should add a small snippet of documentation commenting on the correlation between these two settings in the Bottlerocket API and the comma separated string for ECS_TASK_METADATA_RPS_LIMIT in the agent.

Even something like the following would suffice:

"these two settings will be combined to generate the ECS_TASK_METADATA_RPS_LIMIT ECS agent setting"

I don't think it'll be immediate clear to users that these two values are related to the ECS_TASK_METADATA_RPS_LIMIT agent setting. The ECS agent documentation is clear what this setting does, but a quick scan of setting names doesn't mention anything about "burst" so I see this as a likely place of confusion for users without abit of extra documentation.

I'm also wondering about the names here: both the names metadata-service-rps and metadata-service-burst don't strongly correlate to ECS_TASK_METADATA_RPS_LIMIT

What if we renamed

metadata-service-rps to task-metadata-rps-limit

metadata-service-burst to task-metadata-burst-limit

so it's clear these two relate to limits on each?

In #2527 (comment), it was decided to drop the rate part of the setting (which was similar to the suggestion you make with limit) to be consistent with other existing APIs.

I do think that I can make this setting a bit more "clear" by adding the prefix task-:

task-metadata-service-rps

task-metadata-service-burst

But if folks don't feel strongly about this, I can keep what I have 😅

I would prefer task-metadata-service-*, but I don't feel too strongly about it.

jpmcb · 2022-11-04T15:53:14Z

README.md

+* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service
+* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service
+* `settings.ecs.reserved-memory`: The amount of memory, in MiB, reserved for critical system processes
+* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped.


The name of this setting is confusing: which "cleanup" setting does it directly correlate to? Looking at the ECS agent documentation, there is ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, ECS_IMAGE_CLEANUP_INTERVAL, ECS_IMAGE_MINIMUM_CLEANUP_AGE, etc.

Reading deeper into the ECS agent docs, it becomes clear this is correlated to the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION setting.

It would make sense to me to rename this setting to
settings.ecs.engine-task-cleanup-wait-duration so there's an obvious 1:1 relation with it's ECS agent setting.

Again, this is in the spirit of being as clear to users as possible (and therefore provide a better UX)

There is a clear distinction between ECS_IMAGE_CLEANUP_INTERVAL/ECS_IMAGE_MINIMUM_CLEANUP_AGE and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, the latter is used for tasks (ECS_TASK) whereas the others are used for images (ECS_IMAGE). The name of the new setting (task-cleanup-wait) is part of the ECS setting name (TASK_CLEANUP_WAIT), which is unique among the ECS agent's settings.

jpmcb · 2022-11-04T16:06:52Z

On re-reading other comments before mine (#2527 (comment)) looks like maybe I'm missing something when it comes to naming our API settings.

I don't feel strongly on this and I'm happy to go with whatever prior art there is on naming bottlerocket API endpoints. But I would like to call out the potential confusion users might have when attempting to correlate settings in software we consume (and then expose in settings via our API). In the past, I've preferred to keep these kind of things as close as possible to a 1:1 ratio to avoid confusion.

arnaldo2792 · 2022-11-04T20:47:04Z

Forced push include a note to clarify how the API settings map to the ECS provided settings, and a few missing dots in the documentation for the new settings.

bcressey · 2022-11-04T21:12:24Z

README.md

+* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service.
+* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service.
+* `settings.ecs.reserved-memory`: The amount of memory, in MiB, reserved for critical system processes.
+* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped.


Suggested change

* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped.

* `settings.ecs.task-cleanup-wait`: Time to wait before the task's containers are removed after they are stopped.

bcressey · 2022-11-04T21:14:50Z

README.md

@@ -541,6 +543,11 @@ These settings can be changed at any time.
  Bottlerocket enables the `json-file`, `awslogs`, and `none` drivers by default.
 * `settings.ecs.loglevel`: The level of verbosity for the ECS agent's logs.
  Supported values are `debug`, `info`, `warn`, `error`, and `crit`, and the default is `info`.
+* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service
+* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service


I would prefer task-metadata-service-*, but I don't feel too strongly about it.

With this, 4 additional configurations for the ECS agent are supported though the API. There are two configuration files used to set up the ECS agent: - /etc/ecs/ecs.config.json - /etc/ecs/ecs.config We favor the former to add new configurations, and we only use the latter on special cases, i.e. when the configurations to be added aren't modeled as part of the struct that represents the agent's configuration, or when special deserialization is used to parse the configurations. The configurations added in this change are as follows: ECS_CONTAINER_STOP_TIMEOUT: supported through the container-stop-timeout API; this configuration is rendered in the /etc/ecs/ecs.config file since this configuration is of type Duration (1m, 1s, 1h). This type must be parsed by calling the time.ParseDuration function which isn't called under the hood by the serialization libraries used in the ECS agent. ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION: supported through the task-cleanup-wait API; this configuration is of the same type as the previous configuration and was rendered following the same reasoning. ECS_RESERVED_MEMORY: supported through the reserved-memory API; this configuration is rendered in /etc/ecs/config.ecs.json since the configuration's type can be deserialized without additional helper functions. ECS_TASK_METADATA_RPS_LIMIT: this configuration represents a comma-separated string with two values used to set the throttling rates in the metadata service exposed by the ECS agent. These values don't have to be set together, since the ECS agent will use default values if either is missing. Thus, this configuration is supported through the metadata-service-rps and metadata-service-burst APIs. Both configurations are rendered in the /etc/ecs/confing.ecs.json file, since the configurations' type can be deserialized without additional helper functions. Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>

arnaldo2792 · 2022-11-04T21:18:30Z

(forced push to fix documentation's typo)

arnaldo2792 requested review from jpmcb, etungsten and bcressey October 26, 2022 21:53

webern approved these changes Oct 27, 2022

View reviewed changes

etungsten reviewed Oct 27, 2022

View reviewed changes

arnaldo2792 force-pushed the new-ecs-settings branch from 7695f48 to 13f65b6 Compare October 27, 2022 21:16

etungsten reviewed Oct 27, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

arnaldo2792 force-pushed the new-ecs-settings branch from 13f65b6 to 68b7bf2 Compare October 27, 2022 21:40

arnaldo2792 requested review from etungsten and webern October 27, 2022 21:42

etungsten reviewed Oct 27, 2022

View reviewed changes

jpmcb reviewed Oct 27, 2022

View reviewed changes

sources/models/src/modeled_types/ecs.rs Outdated Show resolved Hide resolved

arnaldo2792 force-pushed the new-ecs-settings branch from 68b7bf2 to 1f27738 Compare October 27, 2022 22:44

arnaldo2792 requested a review from etungsten October 27, 2022 22:58

etungsten approved these changes Oct 27, 2022

View reviewed changes

bcressey reviewed Oct 28, 2022

View reviewed changes

sources/models/src/lib.rs Outdated Show resolved Hide resolved

sources/models/src/modeled_types/ecs.rs Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

arnaldo2792 force-pushed the new-ecs-settings branch from 1f27738 to 32dc9f3 Compare October 28, 2022 23:51

arnaldo2792 requested a review from bcressey October 29, 2022 00:00

arnaldo2792 force-pushed the new-ecs-settings branch from 32dc9f3 to 8f5b7d2 Compare November 3, 2022 19:11

bcressey reviewed Nov 3, 2022

View reviewed changes

arnaldo2792 force-pushed the new-ecs-settings branch from 8f5b7d2 to fe737bc Compare November 4, 2022 01:59

arnaldo2792 force-pushed the new-ecs-settings branch from fe737bc to cf3cbc6 Compare November 4, 2022 02:06

arnaldo2792 requested review from jpmcb, bcressey and etungsten November 4, 2022 02:06

jpmcb reviewed Nov 4, 2022

View reviewed changes

arnaldo2792 force-pushed the new-ecs-settings branch from cf3cbc6 to 7e7b11c Compare November 4, 2022 20:46

bcressey reviewed Nov 4, 2022

View reviewed changes

arnaldo2792 force-pushed the new-ecs-settings branch from 7e7b11c to 6763016 Compare November 4, 2022 21:17

bcressey approved these changes Nov 4, 2022

View reviewed changes

etungsten approved these changes Nov 4, 2022

View reviewed changes

arnaldo2792 merged commit 01d3a63 into bottlerocket-os:develop Nov 4, 2022

webern mentioned this pull request Nov 19, 2022

modeled-types: constrain kubelet log level #2595

Closed

arnaldo2792 deleted the new-ecs-settings branch June 19, 2023 18:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add additional ECS API configurations #2527

Add additional ECS API configurations #2527

arnaldo2792 commented Oct 26, 2022 •

edited

Loading

etungsten Oct 27, 2022

arnaldo2792 Oct 27, 2022 •

edited

Loading

arnaldo2792 commented Oct 27, 2022

arnaldo2792 commented Oct 27, 2022

etungsten Oct 27, 2022 •

edited

Loading

jpmcb left a comment

arnaldo2792 commented Oct 27, 2022

bcressey commented Oct 28, 2022

arnaldo2792 commented Oct 28, 2022

arnaldo2792 commented Nov 3, 2022 •

edited

Loading

bcressey Nov 3, 2022

arnaldo2792 commented Nov 4, 2022

arnaldo2792 commented Nov 4, 2022

jpmcb Nov 4, 2022

jpmcb Nov 4, 2022

arnaldo2792 Nov 4, 2022

bcressey Nov 4, 2022

jpmcb Nov 4, 2022

arnaldo2792 Nov 4, 2022

jpmcb commented Nov 4, 2022

arnaldo2792 commented Nov 4, 2022

bcressey Nov 4, 2022

bcressey Nov 4, 2022

arnaldo2792 commented Nov 4, 2022

	Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(u\|µ){1}s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap();
	Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(?:u\|µ)s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap();

	* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped.
	* `settings.ecs.task-cleanup-wait`: Time to wait before the task's containers are removed after they are stopped.

Add additional ECS API configurations #2527

Add additional ECS API configurations #2527

Conversation

arnaldo2792 commented Oct 26, 2022 • edited Loading

Choose a reason for hiding this comment

arnaldo2792 Oct 27, 2022 • edited Loading

Choose a reason for hiding this comment

arnaldo2792 commented Oct 27, 2022

arnaldo2792 commented Oct 27, 2022

etungsten Oct 27, 2022 • edited Loading

Choose a reason for hiding this comment

jpmcb left a comment

Choose a reason for hiding this comment

arnaldo2792 commented Oct 27, 2022

bcressey commented Oct 28, 2022

arnaldo2792 commented Oct 28, 2022

arnaldo2792 commented Nov 3, 2022 • edited Loading

Choose a reason for hiding this comment

arnaldo2792 commented Nov 4, 2022

arnaldo2792 commented Nov 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpmcb commented Nov 4, 2022

arnaldo2792 commented Nov 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnaldo2792 commented Nov 4, 2022

arnaldo2792 commented Oct 26, 2022 •

edited

Loading

arnaldo2792 Oct 27, 2022 •

edited

Loading

etungsten Oct 27, 2022 •

edited

Loading

arnaldo2792 commented Nov 3, 2022 •

edited

Loading