Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci-operator/step-registry/ipi/conf/aws: Drop us-east-1 zone overrides #17576

Merged

Conversation

wking
Copy link
Member

@wking wking commented Apr 8, 2021

This was originally part of avoiding broken zones, see e8921c3 (#3204) and b717933 (#6833). But the installer has had broken-zone avoidence since way back in openshift/installer@71aef620b6 (openshift/installer#1210). I dunno how reliably AWS sets state: impaired and similar; it didn't seem to protect us from e8921c3. But we're getting ready to pivot to using multiple AWS accounts, which creates two issues with hard-coding region names in the step:

  1. References by name are not stable between accounts. From the AWS docs:

    To ensure that resources are distributed across the Availability Zones for a Region, we independently map Availability Zones to names for each AWS account. For example, the Availability Zone us-east-1a for your AWS account might not be the same location as us-east-1a for another AWS account.

    So "aah, us-east-1a is broken, let's use b and c instead" might apply to one account but not the other. And the installer does not currently accept zone IDs.

  2. References by name may not exist in other accounts. From the AWS docs:

    As Availability Zones grow over time, our ability to expand them can become constrained. If this happens, we might restrict you from launching an instance in a constrained Availability Zone unless you already have an instance in that Availability Zone. Eventually, we might also remove the constrained Availability Zone from the list of Availability Zones for new accounts. Therefore, your account might have a different number of available Availability Zones in a Region than another account.

    And it turns out that for some reason they sometimes don't name sequentially, e.g. our new account lacks us-west-1a:

    $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
    us-west-1a usw1-az3 available
    us-west-1b usw1-az1 available
    $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
    us-west-1b usw1-az3 available
    us-west-1c usw1-az1 available

    I have no idea why they decided to do that, but we have to work with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although we are still vulnerable to (2) with the a/b default line. We'll do something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it to try to set up each available zone, and that causes more API contention and resource consumption than we want. Background on that in 51c4a37 (#3285) and d87fffb (#3615), as well as the rejected/rotted-out openshift/installer#1487.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2021
@wking wking changed the title ci-operator/step-registry/ipi/conf: Drop us-east-1 zone overrides ci-operator/step-registry/ipi/conf/aws: Drop us-east-1 zone overrides Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidence since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210) I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
region names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs.

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name
   sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487
@wking wking force-pushed the drop-hard-coded-zone-overrides branch from c5fa9c3 to 05c825a Compare April 8, 2021 18:15
@staebler
Copy link
Contributor

staebler commented Apr 8, 2021

I'm not following how switching from using us-east-1c to us-east-1a is improving anything. I don't follow the logic for how it reduces any exposure to (1). I would prefer to wait until we have improvements in order to make any change to availability-zone selection.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 8, 2021

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem 05c825a link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-redhat-operator-ecosystem-playground-cvp-ocp-4.7-cvp-common-aws 05c825a link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.5/e2e-aws-csi 05c825a link /test pj-rehearse
ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-workers-rhel7 05c825a link /test pj-rehearse
ci/rehearse/openshift/sdn/release-4.9/e2e-aws-multitenant 05c825a link /test pj-rehearse
ci/rehearse/openshift/cluster-network-operator/release-4.9/e2e-ovn-ipsec-step-registry 05c825a link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws-jenkins 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-fips 05c825a link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-ovn-hybrid-step-registry 05c825a link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-aws-ovn 05c825a link /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc 05c825a link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws 05c825a link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-aws-shared-vpc 05c825a link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.2/e2e-cmd 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node 05c825a link /test pj-rehearse
ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-serial 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy 05c825a link /test pj-rehearse
ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-upgrade 05c825a link /test pj-rehearse
ci/rehearse/openshift/windows-machine-config-operator/release-4.9/aws-e2e-operator 05c825a link /test pj-rehearse
ci/rehearse/openshift/windows-machine-config-operator/release-4.9/aws-e2e-upgrade 05c825a link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade 05c825a link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-aws-ovn-windows 05c825a link /test pj-rehearse
ci/prow/pj-rehearse 05c825a link /test pj-rehearse

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@e-tienne
Copy link
Contributor

I'm not following how switching from using us-east-1c to us-east-1a is improving anything. I don't follow the logic for how it reduces any exposure to (1). I would prefer to wait until we have improvements in order to make any change to availability-zone selection.

@wking I've been looking at your PR, but I would like to get your point of view so we can have alignment on this before going further. Thanks!

@wking
Copy link
Member Author

wking commented May 25, 2021

Removing the us-east-1 overrides says we no longer care about being able to route the installer away from broken zones, when AWS has an outage. The remaining issue is just telling the installer to only populate two or three zones with subnets and such, and that's the bit I'm putting on here. But having this PR go in doesn't hurt anything, and means that when we do get to that we won't have broken-zone-avoidance code around to explain.

@e-tienne
Copy link
Contributor

Removing the us-east-1 overrides says we no longer care about being able to route the installer away from broken zones, when AWS has an outage. The remaining issue is just telling the installer to only populate two or three zones with subnets and such, and that's the bit I'm putting on here. But having this PR go in doesn't hurt anything, and means that when we do get to that we won't have broken-zone-avoidance code around to explain.

Sounds good. I do have subsequent work in #18638

Copy link
Contributor

@e-tienne e-tienne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 25, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 25, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: e-tienne, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 51602fa into openshift:master May 25, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 25, 2021

@wking: Updated the step-registry configmap in namespace ci at cluster app.ci using the following files:

  • key ipi-conf-aws-commands.sh using file ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-commands.sh

In response to this:

This was originally part of avoiding broken zones, see e8921c3 (#3204) and b717933 (#6833). But the installer has had broken-zone avoidence since way back in openshift/installer@71aef620b6 (openshift/installer#1210). I dunno how reliably AWS sets state: impaired and similar; it didn't seem to protect us from e8921c3. But we're getting ready to pivot to using multiple AWS accounts, which creates two issues with hard-coding region names in the step:

  1. References by name are not stable between accounts. From the AWS docs:

    To ensure that resources are distributed across the Availability Zones for a Region, we independently map Availability Zones to names for each AWS account. For example, the Availability Zone us-east-1a for your AWS account might not be the same location as us-east-1a for another AWS account.

    So "aah, us-east-1a is broken, let's use b and c instead" might apply to one account but not the other. And the installer does not currently accept zone IDs.

  2. References by name may not exist in other accounts. From the AWS docs:

    As Availability Zones grow over time, our ability to expand them can become constrained. If this happens, we might restrict you from launching an instance in a constrained Availability Zone unless you already have an instance in that Availability Zone. Eventually, we might also remove the constrained Availability Zone from the list of Availability Zones for new accounts. Therefore, your account might have a different number of available Availability Zones in a Region than another account.

    And it turns out that for some reason they sometimes don't name sequentially, e.g. our new account lacks us-west-1a:

    $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
    us-west-1a usw1-az3 available
    us-west-1b usw1-az1 available
    $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
    us-west-1b usw1-az3 available
    us-west-1c usw1-az1 available

    I have no idea why they decided to do that, but we have to work with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although we are still vulnerable to (2) with the a/b default line. We'll do something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it to try to set up each available zone, and that causes more API contention and resource consumption than we want. Background on that in 51c4a37 (#3285) and d87fffb (#3615), as well as the rejected/rotted-out openshift/installer#1487.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the drop-hard-coded-zone-overrides branch May 25, 2021 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants