Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃悰: ec2/byoip: fix EIP leak when creating machine #5039

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mtulio
Copy link
Contributor

@mtulio mtulio commented Jun 27, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

The instance creation flow is creating by default EIP to instances even if the BYO IP flow is set. BYO IPv4 creates and associates the EIP to instance after it is created, preventing paying for additional EIP (amazon-provided) when creating the instance when the BYO IPv4 Pool is defined to be used by the machine.

Furthermore, the fix provides additional checks to prevent duplicated EIP in the BYO IP reconciliation loop. The extra checks include running the EIP association many times, while the EIP is already associated, and failures in the log when running the EIP association prematurely - when the instance isn't ready, Eg ec2 in pending state.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5038

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation (N/A)
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

fix duplicated/leaked EIP when using BYO IPv4 on Machines.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 27, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 27, 2024
@nrb
Copy link
Contributor

nrb commented Jun 27, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2024
pkg/cloud/services/ec2/eip.go Outdated Show resolved Hide resolved
pkg/cloud/services/ec2/eip.go Outdated Show resolved Hide resolved
DeviceIndex: aws.Int64(0),
SubnetId: aws.String(i.SubnetID),
Groups: aws.StringSlice(i.SecurityGroupIDs),
AssociatePublicIpAddress: i.PublicIPOnLaunch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see you already set PublicIPOnLaunch to false if EIPPool is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, preserving the value from machine spec, and forcing to false when EIPPool

// Preserve user-defined PublicIp option.
input.PublicIPOnLaunch = scope.AWSMachine.Spec.PublicIP
// Public address from BYO Public IPv4 Pools need to be associated after launch (main machine
// reconciliate loop) preventing duplicated public IP. The map on launch is explicitly
// disabled in instances with PublicIP defined to true.
if scope.AWSMachine.Spec.ElasticIPPool != nil && scope.AWSMachine.Spec.ElasticIPPool.PublicIpv4Pool != nil {
input.PublicIPOnLaunch = ptr.To(false)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem in this point is leaving the flag unset (original code), the AssociatePublicIpAddress seems to assume True in some point making the EIP being created before BYO IP reconciliation loop is reached.

I may have some scenarios triggered by those tests to review:

--- FAIL: TestCreateInstance (0.01s)
    --- FAIL: TestCreateInstance/when_multiple_subnets_match_filters,_subnets_in_the_cluster_vpc_are_preferred (0.00s)
    --- FAIL: TestCreateInstance/with_a_subnet_outside_the_cluster_vpc (0.00s)
    --- FAIL: TestCreateInstance/with_dedicated_tenancy_cloud-config (0.00s)
    --- FAIL: TestCreateInstance/with_custom_placement_group_cloud-config (0.00s)
    --- FAIL: TestCreateInstance/with_dedicated_tenancy_and_placement_group_ignition (0.00s)
    --- FAIL: TestCreateInstance/with_custom_placement_group_and_partition_number (0.00s)

Need to review those tests if it is expected or also a bug in the expected return values from API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes in pkg/cloud/services/ec2/instances_test.go fixes the RunInstanceWithContext api call to adapt to this change, it seems not impact in the use case described in the tests as it will result in the same.

pkg/cloud/services/ec2/eip.go Outdated Show resolved Hide resolved
pkg/cloud/services/ec2/eip.go Outdated Show resolved Hide resolved
@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from b58e3e0 to ace1bee Compare June 28, 2024 03:30
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2024
The instance creation flow is creating by default EIP to instances even
if the BYO IP flow is set. BYO IPv4 creates and associates the EIP to
instance after it is created, preventing paying for additional EIP
(amazon-provided) when creating the instance when the BYO IPv4 Pool is
defined to be used by the machine.

Furthermore, the fix provides additional checks to prevent duplicated EIP
in the BYO IP reconciliation loop. The extra checks include running the
EIP association many times, while the EIP is already associated, and
failures in the log when running the EIP association prematurely - when
the instance isn't ready, Eg ec2 in pending state.
@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from ace1bee to d5882fa Compare June 28, 2024 04:26
@mtulio
Copy link
Contributor Author

mtulio commented Jun 28, 2024

/test ?

@k8s-ci-robot
Copy link
Contributor

@mtulio: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mtulio
Copy link
Contributor Author

mtulio commented Jun 28, 2024

/test pull-cluster-api-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jun 28, 2024

Premature failure.

/test pull-cluster-api-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jul 3, 2024

/test pull-cluster-api-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jul 4, 2024

/test pull-cluster-api-provider-aws-e2e

Okay, previous test failures were flake. The latest run pass. OpenShift e2e BYOIP test is also passing install:

time="2024-06-28T19:16:32Z" level=debug msg="E0628 19:16:32.239704     331 
awsmachine_controller.go:544] \"failed to reconcile BYO Public IPv4\" 
err=\"unable to reconcile Elastic IP Pool to instance \\\"i-05af21b09d3552d0f\\\" with state: pending\""

[...]

time="2024-06-28T19:16:50Z" level=debug msg="I0628 19:16:50.432073     331 eip.go:44] 
\"machine is already associated with an Elastic IP with custom Public IPv4 pool\" 
controller=\"awsmachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" 
controllerKind=\"AWSMachine\" AWSMachine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" namespace=\"openshift-cluster-api-guests\" 
name=\"ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
reconcileID=\"825b335b-569b-4a16-891d-c6b7fb5a9db6\" 
machine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
cluster=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt\" 
eip=\"eipalloc-0678f0da2c1ecd771\" eip-address=\"157.254.217.22\" 
eip-associationID=\"eipassoc-0ea51b9c68cdfb171\" eip-instance=\"i-05af21b09d3552d0f\""

This PR is ready for review. PTAL?
/assign @Ankitasw @dlipovetsky
cc @r4f4 @nrb

/test pull-cluster-api-provider-aws-e2e-eks

@@ -541,7 +541,7 @@ func (r *AWSMachineReconciler) reconcileNormal(_ context.Context, machineScope *
// a BYOIP without duplication.
if pool := machineScope.GetElasticIPPool(); pool != nil {
if err := ec2svc.ReconcileElasticIPFromPublicPool(pool, instance); err != nil {
machineScope.Error(err, "failed to associate elastic IP address")
machineScope.Error(err, "failed to reconcile BYO Public IPv4")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about what I did with registering the instance with the LB: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5040/files#diff-0b559bbd149f0e6d54d789235423f66fa8906fdc6ee9c99b9e85db912912011eR622-R629. Just returning an error because the instance is not yet running might confuse users looking at the logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r4f4 that's a great idea, specially now it is generating more warnings with this approach, instead of leak and failures. But non error/warn messages for expected states would be much better.

@mtulio
Copy link
Contributor Author

mtulio commented Jul 5, 2024

e2e EKS test is passing.

I am holding this PR to address the @r4f4 's feedback to decrease the warnings in the code, silent re-queuing expected states.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent leaking EIP when creating machines with BYO IPv4 Pool
6 participants