alibaba: implement cluster destroy #5348
Conversation
Hi @bd233. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 3fce36d to e118273, then from e118273 to 13e0ac8.
@patrickdillon I have been testing this destroy code and it does clean up the cluster. I'll go through and verify again.
@kwoodson I have fixed it. Please verify.
@patrickdillon Please review whether this is the desired validation. Alibaba: fix: add metadata server IP validation
@kwoodson I have a question: what should the number of worker nodes in the default configuration (if I do not modify the install-config file) depend on? Is it the number of available zones?
@bd233 The zones are determined here: https://github.com/openshift/installer/blob/master/pkg/asset/machines/worker.go#L239-L249. As you know, this can be determined in any method you desire. I think the current behavior is 1 worker per zone, but the standard node count is generally 3 worker nodes. That provides sufficient space to run the ingress with HA as well as the other cluster resources (registry, console, etc.). It appears that other providers do something similar: AWS looks at subnets or the number of AZs. I think you will always get the number of availability zones + 1, due to this code:
Default cluster should always have 3 compute nodes.
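The one-worker-per-zone behavior being discussed amounts to a round-robin assignment of replicas across availability zones. A minimal sketch, not the installer's actual code; the function and zone names here are hypothetical:

```go
package main

import "fmt"

// spreadReplicas assigns total replicas across zones round-robin, so the
// default 3 replicas land one worker in each of the first three zones.
func spreadReplicas(total int, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for i := 0; i < total; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

func main() {
	// With 3 replicas and only 2 zones, one zone receives 2 workers.
	fmt.Println(spreadReplicas(3, []string{"cn-hangzhou-a", "cn-hangzhou-b"}))
}
```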
@bd233 I've noticed that the SLB that is created by the ingress operator has DeletionProtection enabled. Therefore the ingress controller's SLB does not get removed. Also appears that the OSS bucket that is created for the bootstrap node also does not get deleted.
I just reviewed the metadata server validation. This stands alone as it's own PR so I would suggest moving it to a new one.
@@ -172,6 +173,7 @@ func validatePrivateZoneID(client *Client, ic *types.InstallConfig, path *field.
// ValidateForProvisioning validates if the install config is valid for provisioning the cluster.
func ValidateForProvisioning(client *Client, ic *types.InstallConfig, metadata *Metadata) error {
	allErrs := field.ErrorList{}
	allErrs = append(allErrs, validateMetadataServerIPNotInMachineCIDR(ic.Networking)...)
Because these functions are static validation (that is, they do not require connecting to the alibaba api) they should be moved to pkg/types. Specifically here: https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L465. The validatePlatform function will need to pass in the networking struct, similar to how OpenStack does.
func validateMetadataServerIPNotInMachineCIDR(n *types.Networking) field.ErrorList {
	allErrs := field.ErrorList{}
	fldPath := field.NewPath("networking").Child("machineNetwork")
	matedataServerIP := "100.100.100.200"
matedataServerIP -> metadataServerIP
}

func validateIPNotinMachineCIDR(ip string, n *types.Networking) error {
	for _, network := range n.MachineNetwork {
we should probably check the cluster network and service network as well
func validateIPNotinMachineCIDR(ip string, n *types.Networking) error {
	for _, network := range n.MachineNetwork {
		if network.CIDR.Contains(net.ParseIP(ip)) {
			return fmt.Errorf("the IP must not be in one of the machine networks")
"the IP must not be in one of the machine networks" -> "the machine network contains 100.100.100.200 which is reserved for the metadata service"
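The check under discussion can be sketched as a self-contained snippet. This is only an illustration of the CIDR-containment logic; the helper name and CIDR values below are assumptions, not the installer's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// metadataServerIP is the Alibaba Cloud metadata service address the
// validation must keep out of the machine networks.
const metadataServerIP = "100.100.100.200"

// ipInAnyCIDR reports whether ip falls inside any of the given CIDRs,
// mirroring the machine-network check discussed above.
func ipInAnyCIDR(ip string, cidrs []string) (bool, error) {
	parsed := net.ParseIP(ip)
	for _, c := range cidrs {
		_, network, err := net.ParseCIDR(c)
		if err != nil {
			return false, err
		}
		if network.Contains(parsed) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// A machine network that overlaps the metadata IP should fail validation.
	overlaps, _ := ipInAnyCIDR(metadataServerIP, []string{"100.100.0.0/16"})
	fmt.Println(overlaps)
}
```

As suggested in the review, the same helper could be run against the cluster network and service network CIDRs as well.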
thx, I think I understand the logic here.
@kwoodson I think this has nothing to do with deletion protection. In addition, as you know, the Installer will create two SLB instances; the installer searches for those two SLB instances through tags.
I created the cluster and deleted it, but it did not reproduce the problem. Is there any more detailed information to help me reproduce it?
Force-pushed from a0f6d80 to dfb149e.
@bd233 Let me test the latest code and verify if the SLB gets removed. I'll also look into the OSS buckets.
I think it would be easier to review if you broke this into multiple PRs. One for destroy, one for existing VPC, etc.
Also please squash where possible. I know there is at least one destroy commit that could be squashed.
	for _, arn := range tagResources {
		notDeletedResources = append(notDeletedResources, arn.ResourceARN)
	}
	return errors.New(fmt.Sprintf("There are undeleted cloud resources %q", notDeletedResources))
As staebler pointed out in the previous PR, the destroy code should never stop running if it knows that there are outstanding resources to be deleted. The destroyer should loop, attempting to destroy any resources as long as they exist.
Yes, it is inappropriate to raise an error here.
But I think waitComplete here is redundant. The delete function of each cloud resource already contains the logic to wait for the deletion to be completed, so I should remove waitComplete.
Force-pushed from 67d0892 to 4ddbe60.
/retitle alibaba: implement cluster destroy
/ok-to-test
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: patrickdillon. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Can you fix this one small error to pass the test, and we should be ready to merge?
	return tagResources, nil
}

func (o *ClusterUninstaller) ListTagResources(tags map[string]string) (tagResources []tag.TagResource, err error) {
This is causing the go-lint test to fail:
/go/src/github.com/openshift/installer/pkg/destroy/alibabacloud/alibabacloud.go:987:1: exported method ClusterUninstaller.ListTagResources should have comment or be unexported
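The golint failure above is fixed by giving the exported method a doc comment that starts with its name. A minimal self-contained sketch; the stub type and body here are hypothetical, only the comment convention matters:

```go
package main

import "fmt"

// ClusterUninstaller is a stub standing in for the installer's real type
// (an assumption for this sketch).
type ClusterUninstaller struct{}

// ListTagResources lists resources matching the given tags. A doc comment
// beginning with the exported method's name is what golint requires for
// exported identifiers.
func (o *ClusterUninstaller) ListTagResources(tags map[string]string) []string {
	out := make([]string, 0, len(tags))
	for k, v := range tags {
		out = append(out, k+"="+v)
	}
	return out
}

func main() {
	u := &ClusterUninstaller{}
	fmt.Println(len(u.ListTagResources(map[string]string{"kubernetes.io/cluster/demo": "owned"})))
}
```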
done, please continue
Force-pushed from 4ddbe60 to 4a6ee69.
@bd233 I attempted to run this destroy code today but it continued to run without completing.
	err := wait.PollImmediateInfinite(
		time.Second*10,
		func() (bool, error) {
			ferr := f.execute()
I do not love this solution and I am not proposing it, but this addition was able to skip the resources that are returning an error message like DEBUG Message: The specified domain name does not exist. Refresh the page and try again.:

if ferr != nil {
	if strings.Contains(ferr.Error(), "not exist") {
		return true, nil
	}
	o.Logger.Debugf("%s: %v", f.name, ferr)
	return false, nil
}

I recommend checking the ferr message for objects that do not exist and returning true to continue the deletion.
@bd233 @patrickdillon WDYT?
Is the deleteDNSRecords function incorrectly returning an error when it can't find the domain?
deleteDNSRecords is returning an error that the DNS record does not exist. I think this behavior is expected from the Alibaba API. We need to gracefully handle NoExist errors so that deletion can complete successfully.
@bd233 Can you confirm this behavior?
Another option is to check to see if it exists before trying to delete. That is what I thought this code was doing: https://github.com/openshift/installer/pull/5348/files#diff-5d31e7ab73dc252cc09331a0790fcc8b6cd3add977402effbe19a1b1b24fba39R1249-R1255
@kwoodson @patrickdillon Sorry to reply so late.
I think this is an abnormal scenario. I saw the Start to search DNS records log, but did not see Start to delete DNS records, so the error should come from records, err := o.listRecord(baseDomain). But that should not happen: if the domain name does not exist, it should be returned here:

if len(domains) == 0 {
	return
}

I am trying to reproduce and fix this problem as soon as possible.
I recommend checking the ferr message for objects that do not exist and return true to continue the deletion.
@bd233 @patrickdillon WDYT?
I don't think ferr should be handled here; this error should be resolved in deleteDNSRecords. That solution may cause some resources not to be deleted.
@kwoodson Were there any manual deletion operations during the execution of the installer?
@bd233 Generally speaking there shouldn't be any, but that doesn't stop customers from updating or modifying the cluster objects. The reason this DNS item was modified is that we set up valid DNS and removed the original before removing the cluster. That caused the missing DNS record. I think it is not unreasonable to have components get modified or deleted outside of the cluster. I think if objects return NoExist they should be considered deleted and the destroy code can continue.
@kwoodson @patrickdillon
I think I understand this problem. Querying by domainName is a fuzzy query. If baseDomain is openshift.com, then bd.openshift.com may appear in the result, and len(domains) == 0 is false, so the early return is skipped.
I'll fix it as soon as possible.
I have updated this PR.
I think ListPrivateZones and ListDNSDomain in pkg/asset/installconfig/alibabacloud/client.go also have this problem. I will create a PR to fix it.
Force-pushed from 4a6ee69 to 9553bdf.
Can you explain (briefly) the fix for the dns issue? I am not sure I understand how it is being fixed.
Also it looks like an azure stack commit snuck in during a rebase.
func (o *ClusterUninstaller) deleteDNSRecords() (err error) {
	o.Logger.Debug("Start to search DNS records")

	baseDomain := strings.Join(strings.Split(o.ClusterDomain, ".")[1:], ".")
Is this the fix for the fuzzy domain search error? If so, can you explain how this solves the problem? Also it would be nice to have a comment in the code explaining why we need the string manipulation.
Aha~, about the azure stack commit, it is a rebase mistake; I will update it later.
LIKE is the default search mode of the DescribeDomains API. In this mode, if you search openshift.com, it can match xx.openshift.com and openshift.com.xx, but using EXACT mode, only openshift.com is returned.

Also it would be nice to have a comment in the code explaining why we need the string manipulation

I need the base domain to search for the domain and delete its records, but I cannot get the base domain directly, so I split the cluster domain by . (the format of the cluster domain is <cluster name>.<base domain>).
I will add a comment here.
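The string manipulation in question boils down to dropping the first label of the cluster domain, as in the snippet shown earlier in the diff. A minimal self-contained sketch; the function name is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// baseDomainOf strips the leading cluster-name label from a cluster
// domain of the form <cluster name>.<base domain>, recovering the base
// domain needed for the exact DNS domain search.
func baseDomainOf(clusterDomain string) string {
	return strings.Join(strings.Split(clusterDomain, ".")[1:], ".")
}

func main() {
	fmt.Println(baseDomainOf("mycluster.openshift.example.com"))
}
```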
Adds destroy code for the Alibaba platform.
Alibaba: fix: turn off SLB and ECS protection. Before deleting ECS and SLB, turn off the protection state.
Alibaba: fix: update asynchronous destroy mode. Refer to IBM Cloud to optimize the part of destroying the cluster.
Alibaba: fix: destroy SLB created by ingress operator. Query the SLB created by the ingress operator through its tag, and destroy it.
This commit was produced by running, and all modules verified
Force-pushed from 9553bdf to 6a08c59.
@bd233: The following tests failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I tested this yesterday:
/lgtm
Add code for the openshift-install destroy cluster command for clusters using the alibaba platform.