-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CFN Tempalte and Lambda to automatically provision and deprovision NAT gateways when Elastio workers are running #89
Conversation
Type: Number | ||
Default: 300 | ||
MinValue: 0 | ||
Description: How long to wait for new EC2 instances to appear before deleting the NAT Gateway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description: How long to wait for new EC2 instances to appear before deleting the NAT Gateway | |
Description: How long to wait for no new EC2 instances to appear before deleting the NAT Gateway |
Type: String | ||
Default: elastio-nat-gateway- | ||
MinLength: 1 | ||
Description: Prefix of the name of the NAT Gateway CFN stack. The name will be <prefix><vpc-id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this deploys the NAT gateway in the subnet where the instance is running, then shouldn't this be <prefix><subnet-id>
? There could be multiple subnets in the VPC...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instance is running in a private subnet, but NAT gateway is deployed in a public one. There may be multiple private subnets forwarding their trafic to the NAT gateway within a vpc/az
, so I think the name should include <vpc-id>/<az>
.
print("Unable to find a subnet in the same availability zone; exiting") | ||
return | ||
|
||
stack_name = f"{NAT_CFN_PREFIX}{instance_vpc_id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are two instances, each spun up in a different subnet, this will break won't it? You should not assume that only one subnet is enabled for the vault deployment. The customer we have initially created this mechanism for typically deploys into two subnets for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If both subnets are in the same AZ they should still use the same NAT. We just need to make sure the code is resilient to concurrent execution and it handles the case when two lambdas try to create the same CFN stack.
For example that is_stack_deployed
check doesn't really save us from trying to deploy a CFN stack if lambdas do it concurrently (TOCTOU), although it makes it highly unlikely. The main thing is that deploy_nat_stack
function must be resilient to "stack-already-exists" error or whatever CFN returns in such case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First part of review. Will submit the lambda review shortly
Type: String | ||
Default: elastio-nat-gateway- | ||
MinLength: 1 | ||
Description: Prefix of the name of the NAT Gateway CFN stack. The name will be <prefix><vpc-id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instance is running in a private subnet, but NAT gateway is deployed in a public one. There may be multiple private subnets forwarding their trafic to the NAT gateway within a vpc/az
, so I think the name should include <vpc-id>/<az>
.
print("Unable to find a subnet in the same availability zone; exiting") | ||
return | ||
|
||
stack_name = f"{NAT_CFN_PREFIX}{instance_vpc_id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If both subnets are in the same AZ they should still use the same NAT. We just need to make sure the code is resilient to concurrent execution and it handles the case when two lambdas try to create the same CFN stack.
For example that is_stack_deployed
check doesn't really save us from trying to deploy a CFN stack if lambdas do it concurrently (TOCTOU), although it makes it highly unlikely. The main thing is that deploy_nat_stack
function must be resilient to "stack-already-exists" error or whatever CFN returns in such case.
Filters=[{'Name': 'tag:elastio:resource', 'Values': ['true']}], | ||
)} | ||
|
||
active_instances_count = sum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here doesn't take into account that elastio vault can be deployed into multiple AZs where there may be multiple NATs one per each AZ. For example, if there are no instances in one AZ but there are some in other AZ I suppose we need to make sure the NAT in the first AZ is deleted.
delete_nat_gateway_stack(stack_name) | ||
|
||
|
||
def pending_cleanups_vpc_ids(elastio_instances, current_instance_id, event_time): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should group by vpc
+ az
, not just vpc
Automatically provision and de-provision NAT gateways when Elastio workers are running.
Closes https://github.com/elastio/elastio/issues/9395