-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional ECS API configurations #2527
Add additional ECS API configurations #2527
Conversation
|
||
lazy_static! { | ||
pub(crate) static ref ECS_DURATION_VALUE: Regex = Regex::new( | ||
r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ECS documentation for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
reads
(Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", and "h".)
Should we try and cover "ns", "us/µs" as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the ECS agent checks that both duration values are at least 1s: container stop timeout and task cleanup
So I think we can keep seconds as the minimum unit supported and even drop ms
7695f48
to
13f65b6
Compare
Forced push includes:
|
13f65b6
to
68b7bf2
Compare
(Forced push to add missing |
README.md
Outdated
* `settings.ecs.metadata-service-burst-rate`: The burst rate limit of the throttling configurations set for the task metadata service | ||
* `settings.ecs.reserved-memory`: The amount of memory, in MB, reserved for critical system processes | ||
* `settings.ecs.task-cleanup-wait-duration`: Time to wait before the tasks resources are removed after they are stopped. | ||
Valid time units are `s`, `m`, and `h`, i.e: `1h`, `1m1s` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should have caught this in my earlier review.
nit: i.e.
should be e.g.
in this context since you're listing examples and not rephrasing or elaborating the previous clause.
See https://www.merriam-webster.com/words-at-play/ie-vs-eg-abbreviation-meaning-usage-difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great - built an image off this patch and see the rendered environment variables under the ECS config file and in the json file
68b7bf2
to
1f27738
Compare
(forced push replaces "i. e.:" with "e.g.") |
Here is some additional design guidance around API settings to provide more context. The input value should match the output value to the extent possible. Meaning, don't take strings as input for numeric output, and don't take large unsigned integers for small signed integers. This is confusing for users and creates potential security risks. Some types, like strings, need additional constraints (e.g. Beyond that, validation has two goals which are in tension. If the agent ( The drawback with this approach is that on first boot, nodes will fail to come up at all if given this kind of input. That is not a great experience. However, at scale, my bias is toward prioritizing the health of nodes that are in service over the usability of nodes that have never been in service. On the other hand, if the agent does not die with an error when presented with input that doesn't match the constraint, then the preferred approach can be reversed. This is especially true if the agent logs a message saying that the setting is being ignored. The automated fleet-wide command will succeed, and may appear to give the desired results ("all nodes running with the new setting!"), but the logs will tell the tale. The advantage of this approach is that the first boot experience is better. The node will come up as a functioning system, even if it is not configured exactly as expected. There's no risk to nodes in service, so it's OK to prefer usability over exactitude. Globally, I'd also prefer some measure of consistency - e.g. using the same With that in mind, this is why I was unhappy with the rationale you provided:
|
1f27738
to
32dc9f3
Compare
Forced push includes:
|
32dc9f3
to
8f5b7d2
Compare
(Forced push to fix merge conflicts in |
sources/api/migration/migrations/v1.11.0/ecs-additional-configurations/src/main.rs
Outdated
Show resolved
Hide resolved
|
||
lazy_static! { | ||
pub(crate) static ref ECS_DURATION_VALUE: Regex = | ||
Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(u|µ){1}s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: regex101 called the {1}
repetition "useless", and suggested using a non-capturing group for (u|µ)
instead:
Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(u|µ){1}s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap(); | |
Regex::new(r"^(([0-9]+\.)?[0-9]+h)?(([0-9]+\.)?[0-9]+m)?(([0-9]+\.)?[0-9]+s)?(([0-9]+\.)?[0-9]+ms)?(([0-9]+\.)?[0-9]+(?:u|µ)s)?(([0-9]+\.)?[0-9]+ns)?$").unwrap(); |
8f5b7d2
to
fe737bc
Compare
Forced push includes:
|
fe737bc
to
cf3cbc6
Compare
(Forced push to fix merge coonflicts) |
README.md
Outdated
@@ -541,6 +543,11 @@ These settings can be changed at any time. | |||
Bottlerocket enables the `json-file`, `awslogs`, and `none` drivers by default. | |||
* `settings.ecs.loglevel`: The level of verbosity for the ECS agent's logs. | |||
Supported values are `debug`, `info`, `warn`, `error`, and `crit`, and the default is `info`. | |||
* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service | |||
* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the metadata-service-rps
and metadata-service-burst
settings, I think we should add a small snippet of documentation commenting on the correlation between these two settings in the Bottlerocket API and the comma separated string for ECS_TASK_METADATA_RPS_LIMIT
in the agent.
Even something like the following would suffice:
"these two settings will be combined to generate the
ECS_TASK_METADATA_RPS_LIMIT
ECS agent setting"
I don't think it'll be immediate clear to users that these two values are related to the ECS_TASK_METADATA_RPS_LIMIT
agent setting. The ECS agent documentation is clear what this setting does, but a quick scan of setting names doesn't mention anything about "burst" so I see this as a likely place of confusion for users without abit of extra documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering about the names here: both the names metadata-service-rps
and metadata-service-burst
don't strongly correlate to ECS_TASK_METADATA_RPS_LIMIT
What if we renamed
metadata-service-rps
totask-metadata-rps-limit
metadata-service-burst
totask-metadata-burst-limit
so it's clear these two relate to limits on each?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #2527 (comment), it was decided to drop the rate
part of the setting (which was similar to the suggestion you make with limit
) to be consistent with other existing APIs.
I do think that I can make this setting a bit more "clear" by adding the prefix task-
:
task-metadata-service-rps
task-metadata-service-burst
But if folks don't feel strongly about this, I can keep what I have 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer task-metadata-service-*
, but I don't feel too strongly about it.
README.md
Outdated
* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service | ||
* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service | ||
* `settings.ecs.reserved-memory`: The amount of memory, in MiB, reserved for critical system processes | ||
* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this setting is confusing: which "cleanup" setting does it directly correlate to? Looking at the ECS agent documentation, there is ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
, ECS_IMAGE_CLEANUP_INTERVAL
, ECS_IMAGE_MINIMUM_CLEANUP_AGE
, etc.
Reading deeper into the ECS agent docs, it becomes clear this is correlated to the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
setting.
It would make sense to me to rename this setting to
settings.ecs.engine-task-cleanup-wait-duration
so there's an obvious 1:1 relation with it's ECS agent setting.
Again, this is in the spirit of being as clear to users as possible (and therefore provide a better UX)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a clear distinction between ECS_IMAGE_CLEANUP_INTERVAL
/ECS_IMAGE_MINIMUM_CLEANUP_AGE
and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
, the latter is used for tasks (ECS_TASK) whereas the others are used for images (ECS_IMAGE). The name of the new setting (task-cleanup-wait
) is part of the ECS setting name (TASK_CLEANUP_WAIT
), which is unique among the ECS agent's settings.
On re-reading other comments before mine (#2527 (comment)) looks like maybe I'm missing something when it comes to naming our API settings. I don't feel strongly on this and I'm happy to go with whatever prior art there is on naming bottlerocket API endpoints. But I would like to call out the potential confusion users might have when attempting to correlate settings in software we consume (and then expose in settings via our API). In the past, I've preferred to keep these kind of things as close as possible to a 1:1 ratio to avoid confusion. |
cf3cbc6
to
7e7b11c
Compare
Forced push include a note to clarify how the API settings map to the ECS provided settings, and a few missing dots in the documentation for the new settings. |
README.md
Outdated
* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service. | ||
* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service. | ||
* `settings.ecs.reserved-memory`: The amount of memory, in MiB, reserved for critical system processes. | ||
* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `settings.ecs.task-cleanup-wait`: Time to wait before the tasks containers are removed after they are stopped. | |
* `settings.ecs.task-cleanup-wait`: Time to wait before the task's containers are removed after they are stopped. |
README.md
Outdated
@@ -541,6 +543,11 @@ These settings can be changed at any time. | |||
Bottlerocket enables the `json-file`, `awslogs`, and `none` drivers by default. | |||
* `settings.ecs.loglevel`: The level of verbosity for the ECS agent's logs. | |||
Supported values are `debug`, `info`, `warn`, `error`, and `crit`, and the default is `info`. | |||
* `settings.ecs.metadata-service-rps`: The steady state rate limit of the throttling configurations set for the task metadata service | |||
* `settings.ecs.metadata-service-burst`: The burst rate limit of the throttling configurations set for the task metadata service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer task-metadata-service-*
, but I don't feel too strongly about it.
With this, 4 additional configurations for the ECS agent are supported though the API. There are two configuration files used to set up the ECS agent: - /etc/ecs/ecs.config.json - /etc/ecs/ecs.config We favor the former to add new configurations, and we only use the latter on special cases, i.e. when the configurations to be added aren't modeled as part of the struct that represents the agent's configuration, or when special deserialization is used to parse the configurations. The configurations added in this change are as follows: ECS_CONTAINER_STOP_TIMEOUT: supported through the container-stop-timeout API; this configuration is rendered in the /etc/ecs/ecs.config file since this configuration is of type Duration (1m, 1s, 1h). This type must be parsed by calling the time.ParseDuration function which isn't called under the hood by the serialization libraries used in the ECS agent. ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION: supported through the task-cleanup-wait API; this configuration is of the same type as the previous configuration and was rendered following the same reasoning. ECS_RESERVED_MEMORY: supported through the reserved-memory API; this configuration is rendered in /etc/ecs/config.ecs.json since the configuration's type can be deserialized without additional helper functions. ECS_TASK_METADATA_RPS_LIMIT: this configuration represents a comma-separated string with two values used to set the throttling rates in the metadata service exposed by the ECS agent. These values don't have to be set together, since the ECS agent will use default values if either is missing. Thus, this configuration is supported through the metadata-service-rps and metadata-service-burst APIs. Both configurations are rendered in the /etc/ecs/confing.ecs.json file, since the configurations' type can be deserialized without additional helper functions. Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
7e7b11c
to
6763016
Compare
(forced push to fix documentation's typo) |
Issue number:
Closes #2483
Description of changes:
With this, the following configurations will be supported through API settings:
For
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
andECS_CONTAINER_STOP_TIMEOUT
, I added the newECSDurationValue
struct used to represent atime.Duration
string. I decided to use a new struct instead of refactoringKubernetesDurationValue
to keep both duration values decoupled from one another, in case the type of any of them changes in future releases. The two aforementioned configurations are rendered in the/etc/ecs/ecs.config
file, which holds environment variables passed to the ECS agent's daemon. This is because parsing validtime.Duration
strings in golang requires calls totime.ParseDuration
which isn't called by thejson.Unmarshall
function used by the ECS agent to parse the JSON formatted configuration at/etc/ecs/ecs.config.json
. Even though it is possible to deserializetime.Duration
values from a JSON string, I decided against this since the values in the JSON file must be of typeint64
which isn't as user friendly as the duration strings1m
,1s
,1h
, etc.ECS_TASK_METADATA_RPS_LIMIT
contains a comma-separated string with two values used to set up the throttling configurations of the ECS agent's metadata service. The two values are assigned to two fields of the struct type used to represent the agent's configuration. These two values are independent from one another, meaning that a) the agent doesn't perform coupled validations on the values, and b) if any of them is missing or is zero, the agent will use default values. Since both values must be integers, they can be deserialized without any helper functions (unliketime.Duration
). As a result, I decided to keep these two values as separate settings in the API (metadata-service-rps
andmetadata-service-burst
) instead of one single configuration, and render them in/etc/ecs/ecs.config.json
. The official AWS documentation for the agent's configurations clearly states what each value in the tuple represents, so it should be relatively simple for users to migrate from the comma-separated configuration. No validation is required to verify that the new configurations are positive integers since the ECS agent won't fail to start, also a useful message is printed.ECS_RESERVED_MEMORY
represents the amount of memory (in MB) that the agent won't use for the allocated tasks. Since this configuration is of typeunit16
and it can be serialized without any helper functions, I decided to render it in/etc/ecs/ecs.config.json
.TODO:
Testing done:
task-cleanup-wait
: I confirmed that the container resources were cleaned up after the specified timecontainer-stop-timeout
: I created a container with a process that ignoredSIGTERM
signals; the agent sentSIGKILL
after the specified timemetadata-service-rps
,metadata-service-burst
: I used-1
as the values for the settings and confirmed that the ECS agent logged a warningreserved-memory
: I confirmed the ECS agent logged the remaining memoryTerms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.