Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ecs-agent: don't start if not configured #1049

Merged
merged 2 commits into from
Aug 19, 2020

Conversation

samuelkarp
Copy link
Contributor

@samuelkarp samuelkarp commented Aug 18, 2020

Issue number:
#815

Description of changes:
Configuration errors can manifest in one of two ways:

  1. configured.target can fail if settings-applier.service fails (which can happen if user-data is garbage/invalid TOML)
  2. early-boot-config.service can fail if user-data includes validly-structured TOML with invalid fields or data

If configuration fails, the ECS agent should not start automatically.

Testing done:
Built an aws-ecs-1 AMI. Launched three instances:

  1. an instance with well-structured, valid user-data - everything started normally
  2. an instance with well-structured, invalid user-data - default host containers (control container) started. After enabling the admin container, I could verify that systemctl status showed the host in a degraded state and both ecs.service and chronyd.service were marked inactive (dead) with a note that this was due to a failed dependency
  3. an instance with garbage user-data - no host containers started, the console output showed a failure to construct the host container image names, and the console output showed [FAILED] Failed to start Bottlerocket userdata configuration system., [DEPEND] Dependency failed for Amazon Elasti…ntainer Service - container agent., and [DEPEND] Dependency failed for A versatile i…tion of the Network Time Protocol.

Built an aws-dev AMI. Launched three instances:

  1. an instance with well-structured, valid user-data - everything started normally
  2. an instance with well-structured, invalid user-data - default host containers (control container) started. After enabling the admin container, I could verify that systemctl status showed the host in a degraded state and chronyd.service was marked inactive (dead) with a note that this was due to a failed dependency
  3. an instance with garbage user-data - no host containers started, the console output showed a failure to construct the host container image names, and the console output showed [FAILED] Failed to start Bottlerocket userdata configuration system. and [DEPEND] Dependency failed for A versatile i…tion of the Network Time Protocol.

Built an aws-k8s-1.16 AMI. Launched three instances:

  1. an instance with well-structured, valid user-data - everything started normally
  2. an instance with well-structured, invalid user-data - default host containers (control container) started. After enabling the admin container, I could verify that systemctl status showed the host in a degraded state and chronyd.service was marked inactive (dead) with a note that this was due to a failed dependency
  3. an instance with garbage user-data - no host containers started, the console output showed a failure to construct the host container image names, and the console output showed [FAILED] Failed to start Bottlerocket userdata configuration system. and [DEPEND] Dependency failed for A versatile i…tion of the Network Time Protocol.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@samuelkarp samuelkarp requested review from tjkirch and jahkeup August 18, 2020 00:56
@bcressey
Copy link
Contributor

We have the same problem for kubelet and containerd today and need a general solution. I'd very much prefer not to propagate unit dependencies across the "configured" stage.

Ideally we wouldn't reach configured.target if one of the units fails, but if that's not workable then perhaps we need a keystone service (configured.service) on the way to that target, which only succeeds if its dependencies succeed.

@bcressey
Copy link
Contributor

Oh, didn't realize that we were missing configured.target before - does it work if we just include that dependency, and not also early-boot-config.service?

@samuelkarp
Copy link
Contributor Author

We have the same problem for kubelet and containerd today and need a general solution.

My understanding is that for Kubelet, missing configuration will result in the service attempting to start but not being able to do anything useful as configuration is required. In contrast, the ECS agent is designed to be able to run without configuration; leaving it unconfigured will cause it to join the "default" cluster (which is probably not what you want if you've made errors in your user-data).

I initially started looking at this through the lens of modifying configured.target. I then ended up confusing myself about the difference between Wants and Requires; I can test making changes to configured.target, but I'm concerned about it being a larger change. With adding Requires=early-boot-config.service to ecs.service, I at least know that I won't break any of the other variants.

Oh, didn't realize that we were missing configured.target before - does it work if we just include that dependency, and not also early-boot-config.service?

Adding configured.target alone handles one of the two distinct failure cases, the one where complete garbage user-data is provided. configured.target still succeeds if the user-data is validly-structured TOML with invalid values.

If early-boot-config.service fails, configured.target should also fail.
This allows services to require configured.target and conditionally fail
if there are errors applying settings from user-data.
@samuelkarp
Copy link
Contributor Author

@bcressey Updated to make configured.target require early-boot-config.service and tested across multiple variants.

@bcressey
Copy link
Contributor

no host containers started, [and] the console output showed a failure to construct the host container image names

This seems like a separate bug; it doesn't seem like invalid userdata should stop us from bringing up the default host containers.

Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's still some debate to be had about what to do when user data fails (and how we want it to fail) but this seems like a good change given our current understanding.

@samuelkarp samuelkarp merged commit 0722452 into bottlerocket-os:develop Aug 19, 2020
@samuelkarp samuelkarp deleted the ecs-configured branch August 19, 2020 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants