-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci-operator/step-registry/ipi/conf/aws: Drop us-east-1 zone overrides #17576
ci-operator/step-registry/ipi/conf/aws: Drop us-east-1 zone overrides #17576
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This was originally part of avoiding broken zones, see e8921c3 (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b, 2019-03-22, openshift#3204) and b717933 (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI, 2020-01-23, openshift#6833). But the installer has had broken-zone avoidence since way back in openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return available zones, 2019-02-07, openshift/installer#1210) I dunno how reliably AWS sets 'state: impaired' and similar; it didn't seem to protect us from e8921c3. But we're getting ready to pivot to using multiple AWS accounts, which creates two issues with hard-coding region names in the step: 1. References by name are not stable between accounts. From the AWS docs [1]: To ensure that resources are distributed across the Availability Zones for a Region, we independently map Availability Zones to names for each AWS account. For example, the Availability Zone us-east-1a for your AWS account might not be the same location as us-east-1a for another AWS account. So "aah, us-east-1a is broken, let's use b and c instead" might apply to one account but not the other. And the installer does not currently accept zone IDs. 2. References by name may not exist in other accounts. From the AWS docs [1]: As Availability Zones grow over time, our ability to expand them can become constrained. If this happens, we might restrict you from launching an instance in a constrained Availability Zone unless you already have an instance in that Availability Zone. Eventually, we might also remove the constrained Availability Zone from the list of Availability Zones for new accounts. Therefore, your account might have a different number of available Availability Zones in a Region than another account. And it turns out that for some reason they sometimes don't name sequentially, e.g. our new account lacks us-west-1a: $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort us-west-1a usw1-az3 available us-west-1b usw1-az1 available $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort us-west-1b usw1-az3 available us-west-1c usw1-az1 available I have no idea why they decided to do that, but we have to work with the world as it is ;). Removing the us-east-1 overrides helps reduce our exposure, although we are still vulnerable to (2) with the a/b default line. We'll do something about that in follow-up work. Leaving the "which zones?" decision up to the installer would cause it to try to set up each available zone, and that causes more API contention and resource consumption than we want. Background on that in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS availability zones, 2019-03-28, openshift#3285) and d87fffb (ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615), as well as the rejected/rotted-out [2]. [1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html [2]: openshift/installer#1487
c5fa9c3
to
05c825a
Compare
I'm not following how switching from using us-east-1c to us-east-1a is improving anything. I don't follow the logic for how it reduces any exposure to (1). I would prefer to wait until we have improvements in order to make any change to availability-zone selection. |
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@wking I've been looking at your PR, but I would like to get your point of view so we can have alignment on this before going further. Thanks! |
Removing the us-east-1 overrides says we no longer care about being able to route the installer away from broken zones, when AWS has an outage. The remaining issue is just telling the installer to only populate two or three zones with subnets and such, and that's the bit I'm putting on here. But having this PR go in doesn't hurt anything, and means that when we do get to that we won't have broken-zone-avoidance code around to explain. |
Sounds good. I do have subsequent work in #18638 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: e-tienne, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
@wking: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This was originally part of avoiding broken zones, see e8921c3 (#3204) and b717933 (#6833). But the installer has had broken-zone avoidence since way back in openshift/installer@71aef620b6 (openshift/installer#1210). I dunno how reliably AWS sets
state: impaired
and similar; it didn't seem to protect us from e8921c3. But we're getting ready to pivot to using multiple AWS accounts, which creates two issues with hard-coding region names in the step:References by name are not stable between accounts. From the AWS docs:
So "aah, us-east-1a is broken, let's use b and c instead" might apply to one account but not the other. And the installer does not currently accept zone IDs.
References by name may not exist in other accounts. From the AWS docs:
And it turns out that for some reason they sometimes don't name sequentially, e.g. our new account lacks us-west-1a:
I have no idea why they decided to do that, but we have to work with the world as it is ;).
Removing the us-east-1 overrides helps reduce our exposure, although we are still vulnerable to (2) with the a/b default line. We'll do something about that in follow-up work.
Leaving the "which zones?" decision up to the installer would cause it to try to set up each available zone, and that causes more API contention and resource consumption than we want. Background on that in 51c4a37 (#3285) and d87fffb (#3615), as well as the rejected/rotted-out openshift/installer#1487.