-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ACL Controller Race Conditions #45
Conversation
fec252a
to
53bbb77
Compare
} | ||
} | ||
|
||
aclTokens, err := s.fetchACLTokens() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a bit more expensive than the previous implementation. I'm interested in if this causes any issues when we do load testing. If it causes problems there are a number of things we can do such as slowing down the poll interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. We're also fetching all services from consul on every 10sec which we didn't do before.
🤔 do we need to fetch all tokens every time we list? What if we only fetch services from Consul but not tokens? Then if service doesn't exist, the resource.Delete will fetch the tokens and delete the token for the service if it exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another pass through this and eliminated a ton of queries. Now it just fetches all of the tokens (in a single query) and all of the tasks (the same as before). I was able to eliminate one request for each ACL token and the request for all of the services.
Sorry the diff on this one is crazy. Let me know if you all would like to walk through it synchronously. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, Eric! I love the refactor 😍 I think it makes it much more clear.
I left some comments and questions inline.
CHANGELOG.md
Outdated
@@ -1,5 +1,10 @@ | |||
## UNRELEASED | |||
|
|||
IMPROVEMENTS | |||
* Make the ACL controller stateless. Prior to this, there were a few cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by why it's now stateless. I can see what you mean that we no longer have the internal state for the controller, but that state was in-memory before, and so I think using this term will be confusing since "stateful" implies having some persistent state.
Also, should this be a bug fix?
controller/controller.go
Outdated
id, err := resource.ID() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we no longer are keeping track of resources in the resourceState
, we probably no longer need the id
?
We should also remove resourceState
field in the controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call out on resourceState
. What do you think about keeping id
around for logging purposes?
controller/resource.go
Outdated
tasks, err := s.fetchECSTasks() | ||
|
||
if err != nil { | ||
return resources, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: even though resources
should be nil, I think it's more readable to return nil
explicitly
return resources, err | |
return nil, err |
controller/resource.go
Outdated
// ConsulClient is the Consul client to be used by the TaskLister. | ||
// TaskLister doesn't need to talk to Consul, but it passes this client | ||
// ConsulClient is the Consul client to be used by the ServiceStateLister. | ||
// ServiceStateLister doesn't need to talk to Consul, but it passes this client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is no longer true since it's calling Catalog().Services()
// ServiceStateLister doesn't need to talk to Consul, but it passes this client |
controller/resource.go
Outdated
} | ||
|
||
if _, ok := buildingResources[name]; !ok { | ||
buildingResources[name] = &ServiceInfo{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if the service belongs to this ECS cluster? I'm thinking about a case when you might have multiple ECS clusters or services running outside of ECS. Then we don't want to delete a token that this controller does not manage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's setup this way to be super conservative about token deletion. Given the logic in Reconcile where tokens are only deleted if the service doesn't exist and there are tokens. Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.
Hmm... I think we should limit this controller to managing tokens only for tasks in this cluster.
- If someone runs two ECS clusters, then (with current architecture) both clusters have an ACL controller. If there are tasks for a service in both clusters, then we should see two tokens created for the service, one per cluster/controller.
- If there are service instances running elsewhere (outside of ECS), those service instances should create a separate service token rather than reusing a controller-managed token.
From that perspective, controller-managed tokens should not be in use elsewhere, so it should be safe for a controller to delete a token that it manages (once there are no tasks in the local ECS cluster using the token).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this to only take tasks and ACL tokens into account. I'm super interested in what you all think
controller/resource.go
Outdated
serviceName, err := t.serviceNameForTask() | ||
if err != nil { | ||
return fmt.Errorf("could not determine service name: %w", err) | ||
func (s *ServiceInfo) ID() (ResourceID, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're using this anymore? We can delete it from interface and implementation.
controller/resource.go
Outdated
return ResourceID(id), nil | ||
} | ||
|
||
// Reconcile inserts or deletes ACL tokens based on tyher ServiceState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Reconcile inserts or deletes ACL tokens based on tyher ServiceState | |
// Reconcile inserts or deletes ACL tokens based on their ServiceState. |
controller/resource.go
Outdated
// Reconcile inserts or deletes ACL tokens based on tyher ServiceState | ||
func (s *ServiceInfo) Reconcile() error { | ||
state := s.ServiceState | ||
if len(state.ACLTokens) == 0 && len(state.ConsulECSTasks) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're only using ConsulECSTasks
to know if they exist, should it be a boolean instead? I don't think we use individual tasks.
if _, ok := aclTokens[serviceName]; !ok { | ||
aclTokens[serviceName] = []*api.ACLToken{token} | ||
} else { | ||
aclTokens[serviceName] = append(aclTokens[serviceName], token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting case. In theory, we should never have multiple tokens for the same service since we always only create one per service. So if we found multiple tokens for the same service, they might be for a service from somewhere else (i.e. not this cluster).
The question is what should we do in this case. The good thing is that I don't think we'll delete it because we expect that the secret in the AWS secrets manager exists for this token.
Unfortunately, Consul does not allow you to add meta to the token, but maybe we can include cluster name in the token description and then parse it out to determine if the token is for this cluster.
I don't think this is blocking though since the original implementation also didn't handle this case properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added cluster name to the ACL token descriptions to address this
} | ||
} | ||
|
||
aclTokens, err := s.fetchACLTokens() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. We're also fetching all services from consul on every 10sec which we didn't do before.
🤔 do we need to fetch all tokens every time we list? What if we only fetch services from Consul but not tokens? Then if service doesn't exist, the resource.Delete will fetch the tokens and delete the token for the service if it exists
controller/resource.go
Outdated
} | ||
|
||
if _, ok := buildingResources[name]; !ok { | ||
buildingResources[name] = &ServiceInfo{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excluding services to be running on this specific ECS cluster could lead to a case where tokens are deleted for a service that still exists elsewhere.
Hmm... I think we should limit this controller to managing tokens only for tasks in this cluster.
- If someone runs two ECS clusters, then (with current architecture) both clusters have an ACL controller. If there are tasks for a service in both clusters, then we should see two tokens created for the service, one per cluster/controller.
- If there are service instances running elsewhere (outside of ECS), those service instances should create a separate service token rather than reusing a controller-managed token.
From that perspective, controller-managed tokens should not be in use elsewhere, so it should be safe for a controller to delete a token that it manages (once there are no tasks in the local ECS cluster using the token).
e2d0315
to
7fe1e89
Compare
@ishustava @pglass I just refactored this refactor and made it way more efficient. I'm super interested in what you all think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I really like the improvements - thanks so much for making the changes.
CHANGELOG.md
Outdated
@@ -20,6 +20,10 @@ FEATURES | |||
* Add a `envoy-entrypoint` subcommand, which can be used as the entrypoint to the Envoy container running in ECS | |||
to support graceful shutdown. [[GH-42](https://github.com/hashicorp/consul-ecs/pull/42)] | |||
|
|||
BUG FIXES: | |||
* Fix edge cases in the ACL controller where ACL tokens are never get cleaned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Fix edge cases in the ACL controller where ACL tokens are never get cleaned | |
* Fix edge cases in the ACL controller where ACL tokens are never cleaned |
controller/resource.go
Outdated
|
||
// ServiceState contains all of the information needed to determine if an ACL | ||
// token should be created for a Consul service or if an ACL token should be | ||
// deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// deleted | |
// deleted. |
sutServiceName: "service1", | ||
expected: []*api.ACLToken{aclToken1}, | ||
}, | ||
"Does nothing when a task is running and a token exists for a given service": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also add a case that it does nothing when there are no tasks and tokens?
7fe1e89
to
575c0e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! 🎉 I really like that each Resource can be independently reconciled! It makes the overall logic so simple.
Added some nits that are not blocking.
for k := range t.source { | ||
resources = append(resources, testResource{name: string(k), sink: &t.sink}) | ||
for _, resource := range t.resources { | ||
resources = append(resources, resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can now simplify the loop to return append(resources, t.resources...), nil
controller/resource.go
Outdated
continue | ||
} | ||
|
||
familyName, err := serviceNameForTask(task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is really the service name
familyName, err := serviceNameForTask(task) | |
serviceName, err := serviceNameForTask(task) |
controller/resource.go
Outdated
// fetchECSTasks retrieves all of the ECS tasks that are managed by consul-ecs | ||
// for the given cluster and returns a mapping that shows if a task is running | ||
// for each service name. | ||
func (s ServiceStateLister) fetchECSTasks() (map[string]bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is this really returning a set? Could we change this to a map[string]struct{}
(per the eng handbook)?
controller/resource.go
Outdated
buildingResources[name] = &ServiceInfo{ | ||
SecretsManagerClient: s.SecretsManagerClient, | ||
ConsulClient: s.ConsulClient, | ||
Cluster: s.Cluster, | ||
Log: s.Log, | ||
SecretPrefix: s.SecretPrefix, | ||
ServiceName: name, | ||
ServiceState: ServiceState{ConsulECSTasks: tasksExist}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe a quick helper for constructing the service info for readability,
buildingResources[name] = &ServiceInfo{ | |
SecretsManagerClient: s.SecretsManagerClient, | |
ConsulClient: s.ConsulClient, | |
Cluster: s.Cluster, | |
Log: s.Log, | |
SecretPrefix: s.SecretPrefix, | |
ServiceName: name, | |
ServiceState: ServiceState{ConsulECSTasks: tasksExist}, | |
} | |
buildingResources[name] = s.newServiceInfo( | |
name, ServiceState{ConsulECSTasks: tasksExist}, | |
) |
|
||
// ResourceID represents the ID of the resource. | ||
type ResourceID string | ||
const clusterTag = "consul.hashicorp.com/cluster" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe worth a comment that this is not actually used as an AWS tag (but is grouped with them and name similarly)
const clusterTag = "consul.hashicorp.com/cluster" | |
// Included in ACL token description. | |
const clusterTag = "consul.hashicorp.com/cluster" |
} | ||
|
||
for _, token := range tokenList { | ||
if isInCluster(s.Cluster, token) && len(token.ServiceIdentities) == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! So we could have checked len(token.ServiceIdentities)
from the list response all along? 😲
I guess just wondering if you validated that token.ServiceIdentities
is populated in the list response (maybe tests are enough)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tests should be enough along with the documentation implying that the only difference between ACLToken
s and ACLTokenListEntry
s is the presence of SecretID
. If not, I can think through this again.
controller/resource.go
Outdated
familyName, err := serviceNameForTask(task) | ||
|
||
if err != nil { | ||
s.Log.Error("couldn't get service name from task", "task-arn", task.TaskArn, "tags", task.Tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: include the message from err
in the Error message here?
575c0e4
to
17a4fe4
Compare
It does this by fetching all of the ACL tokens and ECS tasks. It then makes a mapping from inferred service name to the resources that are tied to that resource. The ACL controller modifies tokens in two cases: * Tokens are deleted when there are ACL tokens for a service that doesn't have any tasks running * Tokens are added when there are tasks running for a service that doesn't have an ACL token
17a4fe4
to
cf35d0e
Compare
Changes proposed in this PR:
Fix race conditions in the ACL controller
It does this by fetching all of the ACL tokens and ECS tasks. It then makes a mapping from inferred service name to the resources that are tied to that resource. The ACL controller modifies tokens in two cases:
How I've tested this PR:
Writing tests
How I expect reviewers to test this PR:
👀
Checklist: