Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ACL Controller Race Conditions #45

Merged
merged 1 commit into from
Nov 5, 2021
Merged

Conversation

erichaberkorn
Copy link
Contributor

@erichaberkorn erichaberkorn commented Oct 28, 2021

Changes proposed in this PR:

Fix race conditions in the ACL controller

It does this by fetching all of the ACL tokens and ECS tasks. It then makes a mapping from inferred service name to the resources that are tied to that resource. The ACL controller modifies tokens in two cases:

  • Tokens are deleted when there are ACL tokens for a service that doesn't have any tasks running
  • Tokens are added when there are tasks running for a service that doesn't have an ACL token

How I've tested this PR:

Writing tests

How I expect reviewers to test this PR:

👀

Checklist:

  • Tests added
  • CHANGELOG entry added

@erichaberkorn erichaberkorn force-pushed the acl-controller-refactor branch from fec252a to 53bbb77 Compare October 28, 2021 18:05
}
}

aclTokens, err := s.fetchACLTokens()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a bit more expensive than the previous implementation. I'm interested in if this causes any issues when we do load testing. If it causes problems there are a number of things we can do such as slowing down the poll interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. We're also fetching all services from consul on every 10sec which we didn't do before.

🤔 do we need to fetch all tokens every time we list? What if we only fetch services from Consul but not tokens? Then if service doesn't exist, the resource.Delete will fetch the tokens and delete the token for the service if it exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass through this and eliminated a ton of queries. Now it just fetches all of the tokens (in a single query) and all of the tasks (the same as before). I was able to eliminate one request for each ACL token and the request for all of the services.

@erichaberkorn erichaberkorn marked this pull request as ready for review October 28, 2021 18:22
@erichaberkorn
Copy link
Contributor Author

Sorry the diff on this one is crazy. Let me know if you all would like to walk through it synchronously.

Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, Eric! I love the refactor 😍 I think it makes it much more clear.

I left some comments and questions inline.

CHANGELOG.md Outdated
@@ -1,5 +1,10 @@
## UNRELEASED

IMPROVEMENTS
* Make the ACL controller stateless. Prior to this, there were a few cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by why it's now stateless. I can see what you mean that we no longer have the internal state for the controller, but that state was in-memory before, and so I think using this term will be confusing since "stateful" implies having some persistent state.

Also, should this be a bug fix?

Comment on lines 59 to 60
id, err := resource.ID()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we no longer are keeping track of resources in the resourceState, we probably no longer need the id?

We should also remove resourceState field in the controller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call out on resourceState. What do you think about keeping id around for logging purposes?

tasks, err := s.fetchECSTasks()

if err != nil {
return resources, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: even though resources should be nil, I think it's more readable to return nil explicitly

Suggested change
return resources, err
return nil, err

// ConsulClient is the Consul client to be used by the TaskLister.
// TaskLister doesn't need to talk to Consul, but it passes this client
// ConsulClient is the Consul client to be used by the ServiceStateLister.
// ServiceStateLister doesn't need to talk to Consul, but it passes this client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is no longer true since it's calling Catalog().Services()

Suggested change
// ServiceStateLister doesn't need to talk to Consul, but it passes this client

}

if _, ok := buildingResources[name]; !ok {
buildingResources[name] = &ServiceInfo{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if the service belongs to this ECS cluster? I'm thinking about a case when you might have multiple ECS clusters or services running outside of ECS. Then we don't want to delete a token that this controller does not manage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's setup this way to be super conservative about token deletion. Given the logic in Reconcile where tokens are only deleted if the service doesn't exist and there are tokens. Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.

Hmm... I think we should limit this controller to managing tokens only for tasks in this cluster.

  • If someone runs two ECS clusters, then (with current architecture) both clusters have an ACL controller. If there are tasks for a service in both clusters, then we should see two tokens created for the service, one per cluster/controller.
  • If there are service instances running elsewhere (outside of ECS), those service instances should create a separate service token rather than reusing a controller-managed token.

From that perspective, controller-managed tokens should not be in use elsewhere, so it should be safe for a controller to delete a token that it manages (once there are no tasks in the local ECS cluster using the token).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to only take tasks and ACL tokens into account. I'm super interested in what you all think

serviceName, err := t.serviceNameForTask()
if err != nil {
return fmt.Errorf("could not determine service name: %w", err)
func (s *ServiceInfo) ID() (ResourceID, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're using this anymore? We can delete it from interface and implementation.

return ResourceID(id), nil
}

// Reconcile inserts or deletes ACL tokens based on tyher ServiceState
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Reconcile inserts or deletes ACL tokens based on tyher ServiceState
// Reconcile inserts or deletes ACL tokens based on their ServiceState.

// Reconcile inserts or deletes ACL tokens based on tyher ServiceState
func (s *ServiceInfo) Reconcile() error {
state := s.ServiceState
if len(state.ACLTokens) == 0 && len(state.ConsulECSTasks) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're only using ConsulECSTasks to know if they exist, should it be a boolean instead? I don't think we use individual tasks.

if _, ok := aclTokens[serviceName]; !ok {
aclTokens[serviceName] = []*api.ACLToken{token}
} else {
aclTokens[serviceName] = append(aclTokens[serviceName], token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case. In theory, we should never have multiple tokens for the same service since we always only create one per service. So if we found multiple tokens for the same service, they might be for a service from somewhere else (i.e. not this cluster).

The question is what should we do in this case. The good thing is that I don't think we'll delete it because we expect that the secret in the AWS secrets manager exists for this token.

Unfortunately, Consul does not allow you to add meta to the token, but maybe we can include cluster name in the token description and then parse it out to determine if the token is for this cluster.

I don't think this is blocking though since the original implementation also didn't handle this case properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added cluster name to the ACL token descriptions to address this

}
}

aclTokens, err := s.fetchACLTokens()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. We're also fetching all services from consul on every 10sec which we didn't do before.

🤔 do we need to fetch all tokens every time we list? What if we only fetch services from Consul but not tokens? Then if service doesn't exist, the resource.Delete will fetch the tokens and delete the token for the service if it exists

CHANGELOG.md Outdated Show resolved Hide resolved
controller/resource.go Outdated Show resolved Hide resolved
controller/resource.go Outdated Show resolved Hide resolved
}

if _, ok := buildingResources[name]; !ok {
buildingResources[name] = &ServiceInfo{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.

Hmm... I think we should limit this controller to managing tokens only for tasks in this cluster.

  • If someone runs two ECS clusters, then (with current architecture) both clusters have an ACL controller. If there are tasks for a service in both clusters, then we should see two tokens created for the service, one per cluster/controller.
  • If there are service instances running elsewhere (outside of ECS), those service instances should create a separate service token rather than reusing a controller-managed token.

From that perspective, controller-managed tokens should not be in use elsewhere, so it should be safe for a controller to delete a token that it manages (once there are no tasks in the local ECS cluster using the token).

@erichaberkorn erichaberkorn force-pushed the acl-controller-refactor branch 3 times, most recently from e2d0315 to 7fe1e89 Compare November 4, 2021 15:12
@erichaberkorn
Copy link
Contributor Author

@ishustava @pglass I just refactored this refactor and made it way more efficient. I'm super interested in what you all think

Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I really like the improvements - thanks so much for making the changes.

CHANGELOG.md Outdated
@@ -20,6 +20,10 @@ FEATURES
* Add a `envoy-entrypoint` subcommand, which can be used as the entrypoint to the Envoy container running in ECS
to support graceful shutdown. [[GH-42](https://github.com/hashicorp/consul-ecs/pull/42)]

BUG FIXES:
* Fix edge cases in the ACL controller where ACL tokens are never get cleaned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Fix edge cases in the ACL controller where ACL tokens are never get cleaned
* Fix edge cases in the ACL controller where ACL tokens are never cleaned


// ServiceState contains all of the information needed to determine if an ACL
// token should be created for a Consul service or if an ACL token should be
// deleted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// deleted
// deleted.

sutServiceName: "service1",
expected: []*api.ACLToken{aclToken1},
},
"Does nothing when a task is running and a token exists for a given service": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also add a case that it does nothing when there are no tasks and tokens?

@erichaberkorn erichaberkorn force-pushed the acl-controller-refactor branch from 7fe1e89 to 575c0e4 Compare November 5, 2021 16:10
Copy link

@pglass pglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! 🎉 I really like that each Resource can be independently reconciled! It makes the overall logic so simple.

Added some nits that are not blocking.

for k := range t.source {
resources = append(resources, testResource{name: string(k), sink: &t.sink})
for _, resource := range t.resources {
resources = append(resources, resource)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can now simplify the loop to return append(resources, t.resources...), nil

continue
}

familyName, err := serviceNameForTask(task)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is really the service name

Suggested change
familyName, err := serviceNameForTask(task)
serviceName, err := serviceNameForTask(task)

// fetchECSTasks retrieves all of the ECS tasks that are managed by consul-ecs
// for the given cluster and returns a mapping that shows if a task is running
// for each service name.
func (s ServiceStateLister) fetchECSTasks() (map[string]bool, error) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is this really returning a set? Could we change this to a map[string]struct{} (per the eng handbook)?

Comment on lines 64 to 72
buildingResources[name] = &ServiceInfo{
SecretsManagerClient: s.SecretsManagerClient,
ConsulClient: s.ConsulClient,
Cluster: s.Cluster,
Log: s.Log,
SecretPrefix: s.SecretPrefix,
ServiceName: name,
ServiceState: ServiceState{ConsulECSTasks: tasksExist},
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a quick helper for constructing the service info for readability,

Suggested change
buildingResources[name] = &ServiceInfo{
SecretsManagerClient: s.SecretsManagerClient,
ConsulClient: s.ConsulClient,
Cluster: s.Cluster,
Log: s.Log,
SecretPrefix: s.SecretPrefix,
ServiceName: name,
ServiceState: ServiceState{ConsulECSTasks: tasksExist},
}
buildingResources[name] = s.newServiceInfo(
name, ServiceState{ConsulECSTasks: tasksExist},
)


// ResourceID represents the ID of the resource.
type ResourceID string
const clusterTag = "consul.hashicorp.com/cluster"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe worth a comment that this is not actually used as an AWS tag (but is grouped with them and name similarly)

Suggested change
const clusterTag = "consul.hashicorp.com/cluster"
// Included in ACL token description.
const clusterTag = "consul.hashicorp.com/cluster"

}

for _, token := range tokenList {
if isInCluster(s.Cluster, token) && len(token.ServiceIdentities) == 1 {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! So we could have checked len(token.ServiceIdentities) from the list response all along? 😲

I guess just wondering if you validated that token.ServiceIdentities is populated in the list response (maybe tests are enough)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tests should be enough along with the documentation implying that the only difference between ACLTokens and ACLTokenListEntrys is the presence of SecretID. If not, I can think through this again.

familyName, err := serviceNameForTask(task)

if err != nil {
s.Log.Error("couldn't get service name from task", "task-arn", task.TaskArn, "tags", task.Tags)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: include the message from err in the Error message here?

@erichaberkorn erichaberkorn force-pushed the acl-controller-refactor branch from 575c0e4 to 17a4fe4 Compare November 5, 2021 19:02
@erichaberkorn erichaberkorn changed the title Make the ACL controller Stateless Fix ACL Controller Race Conditions Nov 5, 2021
It does this  by fetching all of the ACL tokens and ECS tasks. It then makes a
mapping from inferred service name to the resources that are tied to that
resource. The ACL controller modifies tokens in two cases:
* Tokens are deleted when there are ACL tokens for a service that doesn't have
  any tasks running
* Tokens are added when there are tasks running for a service that doesn't have
  an ACL token
@erichaberkorn erichaberkorn force-pushed the acl-controller-refactor branch from 17a4fe4 to cf35d0e Compare November 5, 2021 19:06
@erichaberkorn erichaberkorn merged commit d440880 into main Nov 5, 2021
@erichaberkorn erichaberkorn deleted the acl-controller-refactor branch November 5, 2021 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants