CFN Template and Lambda to automatically provision and deprovision NAT gateways when Elastio workers are running #89

Merged
merged 4 commits into from
Jul 18, 2024

volatilecat

Automatically provision and de-provision NAT gateways when Elastio workers are running.

Closes https://github.com/elastio/elastio/issues/9395

Type: Number
Default: 300
MinValue: 0
Description: How long to wait for new EC2 instances to appear before deleting the NAT Gateway

Suggested change
- Description: How long to wait for new EC2 instances to appear before deleting the NAT Gateway
+ Description: How long to wait for no new EC2 instances to appear before deleting the NAT Gateway
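
To make the intended meaning of this parameter concrete, here is a minimal sketch of a quiet-period check. The function name `quiet_period_elapsed` and its arguments are hypothetical and are not taken from the PR; the 300-second default mirrors the parameter default shown above.

```python
from datetime import datetime, timedelta, timezone


def quiet_period_elapsed(last_launch_time, now, quiet_period_seconds=300):
    """Return True when no new instance has launched within the quiet period.

    Hypothetical helper: `last_launch_time` is the launch time of the most
    recently started instance, `now` is the time of the current event.
    """
    return now - last_launch_time >= timedelta(seconds=quiet_period_seconds)


# Usage: the NAT Gateway is only a deletion candidate once this returns True.
launched = datetime(2024, 7, 16, 12, 0, 0, tzinfo=timezone.utc)
event = launched + timedelta(seconds=120)
should_consider_cleanup = quiet_period_elapsed(launched, event)
```

With a quiet period of 0 the check passes immediately, which matches `MinValue: 0` in the parameter definition.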

Type: String
Default: elastio-nat-gateway-
MinLength: 1
Description: Prefix of the name of the NAT Gateway CFN stack. The name will be <prefix><vpc-id>

If this deploys the NAT gateway in the subnet where the instance is running, then shouldn't this be <prefix><subnet-id>? There could be multiple subnets in the VPC...

The instance is running in a private subnet, but the NAT gateway is deployed in a public one. There may be multiple private subnets forwarding their traffic to the NAT gateway within a VPC/AZ, so I think the name should include <vpc-id>/<az>.
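
A minimal sketch of what such a name could look like. The helper `nat_stack_name` is hypothetical and not part of the PR. It joins the VPC ID and AZ with a hyphen rather than a slash, because CloudFormation stack names may contain only alphanumeric characters and hyphens, must start with a letter, and are limited to 128 characters.

```python
import re

# CloudFormation stack name rule: a letter first, then letters, digits, hyphens.
_STACK_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def nat_stack_name(prefix, vpc_id, availability_zone):
    """Build a per-VPC, per-AZ stack name (hypothetical naming scheme)."""
    name = f"{prefix}{vpc_id}-{availability_zone}"
    if len(name) > 128 or not _STACK_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid CloudFormation stack name: {name!r}")
    return name


# Usage with the default prefix from the template parameter above.
example_name = nat_stack_name("elastio-nat-gateway-", "vpc-0abc123", "us-east-1a")
```

Two private subnets in the same VPC and AZ map to the same name, so they would share one NAT stack.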

print("Unable to find a subnet in the same availability zone; exiting")
return

stack_name = f"{NAT_CFN_PREFIX}{instance_vpc_id}"

If there are two instances, each spun up in a different subnet, this will break, won't it? You should not assume that only one subnet is enabled for the vault deployment. The customer we initially created this mechanism for typically deploys into two subnets, for example.

@Veetaha Veetaha Jul 16, 2024

If both subnets are in the same AZ they should still use the same NAT. We just need to make sure the code is resilient to concurrent execution and it handles the case when two lambdas try to create the same CFN stack.

For example, the is_stack_deployed check doesn't really save us from trying to deploy the same CFN stack twice if Lambdas run concurrently (TOCTOU), although it makes that highly unlikely. The main thing is that the deploy_nat_stack function must be resilient to a "stack-already-exists" error, or whatever CFN returns in that case.
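
A minimal sketch of a deploy function that tolerates this race, assuming `cfn` is a boto3 CloudFormation client. The name `deploy_nat_stack` comes from the comment above, but the signature and body here are assumptions and not the PR's actual code. boto3 exposes the "stack already exists" error as `cfn.exceptions.AlreadyExistsException`.

```python
def deploy_nat_stack(cfn, stack_name, template_body, parameters):
    """Create the NAT stack; return False if it already exists.

    Assumed signature: `cfn` is a boto3 CloudFormation client, e.g.
    boto3.client("cloudformation"); `parameters` is a plain dict.
    """
    try:
        cfn.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Parameters=[
                {"ParameterKey": key, "ParameterValue": value}
                for key, value in parameters.items()
            ],
        )
    except cfn.exceptions.AlreadyExistsException:
        # A concurrent invocation won the race; treat that as success.
        return False
    return True
```

With this shape the is_stack_deployed check becomes an optimization that avoids an API call, and correctness no longer depends on it.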

@Veetaha Veetaha left a comment

First part of review. Will submit the lambda review shortly

Filters=[{'Name': 'tag:elastio:resource', 'Values': ['true']}],
)}

active_instances_count = sum(

@Veetaha Veetaha Jul 16, 2024

The logic here doesn't take into account that an Elastio vault can be deployed into multiple AZs, with one NAT per AZ. For example, if there are no instances in one AZ but there are some in another, I suppose we need to make sure the NAT in the first AZ is deleted.

delete_nat_gateway_stack(stack_name)


def pending_cleanups_vpc_ids(elastio_instances, current_instance_id, event_time):

This should group by vpc + az, not just vpc
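
A minimal sketch of that grouping, assuming instance dicts shaped like the entries returned by EC2 `describe_instances` (`VpcId`, `Placement.AvailabilityZone`, `State.Name`). The function names and the choice of which states count as active are hypothetical, not taken from the PR.

```python
from collections import defaultdict

# Assumption: only these states mean a worker still needs the NAT.
ACTIVE_STATES = {"pending", "running"}


def active_counts_by_vpc_az(instances):
    """Count active instances per (vpc_id, availability_zone) pair."""
    counts = defaultdict(int)
    for inst in instances:
        vpc_id = inst.get("VpcId")
        if vpc_id is None:
            # Some instances (e.g. terminated ones) may not report a VPC.
            continue
        key = (vpc_id, inst["Placement"]["AvailabilityZone"])
        if inst["State"]["Name"] in ACTIVE_STATES:
            counts[key] += 1
        else:
            counts[key] += 0  # make sure the idle group is still listed
    return dict(counts)


def idle_vpc_azs(instances):
    """Return (vpc_id, az) pairs that have no active instances."""
    counts = active_counts_by_vpc_az(instances)
    return sorted(key for key, count in counts.items() if count == 0)
```

Each pair returned by `idle_vpc_azs` would be a candidate for NAT cleanup on its own, so an idle AZ is not kept alive by activity in another AZ of the same VPC. Pairs with no instances at all in the input do not appear, so a real implementation would likely also compare against the existing NAT stacks.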

@volatilecat volatilecat requested review from anelson and Veetaha July 17, 2024 19:02
@Veetaha Veetaha merged commit 840c87a into master Jul 18, 2024
5 checks passed
@Veetaha Veetaha deleted the elastio-nat-provision-lambda branch July 18, 2024 10:17