[wip] Alibaba recommitted #5291

patrickdillon · 2021-10-12T15:08:56Z

The Alibaba PR #5018 up until this point has been divided among dozens of commits; the PR has recently been squashed down into two large commits, one of all code/configuration, the other for all vendoring.

This PR takes the most recent state of #5018, with the two commits e7297fa443e64e842c7e7fa3166bd7f380ab4339 and 8962496f84393e5c6668330d5a054c622a599977, and attempts to reorganize them in a logical manner for easier review. It simply organizes the commits around the code structure of the Installer. There are separate commits for:

types
assets
terraform code and configuration
destroy code
vendoring

I propose that @bd233 and his team take the commits from this PR and either update #5018 with the new organization or open a new PR to replace #5018. Again this PR simply reorganizes the current state of #5018 with the goal of making it easier to review.

Moving forward, changes to the PR would either be rebased into the appropriate commit or added using fixup commits. Let's come to an agreement here before proceeding.

@staebler and @kwoodson thoughts on this plan?

openshift-ci · 2021-10-12T15:19:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from patrickdillon after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

patrickdillon · 2021-10-12T19:50:13Z

pkg/asset/installconfig/alibabacloud/client.go

+	os.Setenv(envAccessKeyID, accessKeyID)
+	os.Setenv(envAccessKeySecret, accessKeySecret)


We should not be setting environment variables. The API should be accessed programmatically.

OK, I'll delete it
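
A minimal sketch of the programmatic approach suggested above, assuming a constructor that takes the key pair as arguments. The Credentials type and this NewClient signature are hypothetical and are not the code in the PR:

```go
package main

import (
	"errors"
)

// Credentials holds the access key pair. The type and its field names are
// hypothetical; they only illustrate passing credentials explicitly.
type Credentials struct {
	AccessKeyID     string
	AccessKeySecret string
}

// Client is a stand-in for the platform client defined in client.go.
type Client struct {
	regionID string
	creds    Credentials
}

// NewClient receives the credentials as arguments, so nothing has to be
// written into the process environment with os.Setenv.
func NewClient(regionID string, creds Credentials) (*Client, error) {
	if creds.AccessKeyID == "" || creds.AccessKeySecret == "" {
		return nil, errors.New("access key ID and secret are required")
	}
	return &Client{regionID: regionID, creds: creds}, nil
}
```

Keeping the key pair on the client also avoids leaking it to child processes that inherit the environment.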

patrickdillon · 2021-10-14T16:01:10Z

pkg/types/alibabacloud/metadata.go

+	// Before deploying the cluster, the user must manually create a resource group.
+	// The parameter ResourceGroupID is required.
+	ResourceGroupID string `json:"resourceGroupID"`


Why do users have to create the resource group rather than the installer?

Alibaba Cloud resource groups have a 7-day buffer period after they are deleted. During this period, a resource group with the same name cannot be created. For that reason, the initial design requires users to create the resource group manually. If this change is not required right now, we plan to support creating new resource groups in a later version.

patrickdillon · 2021-10-18T14:11:04Z

pkg/asset/ignition/node.go

+
+// GenerateIgnitionShim is used to generate an ignition file that contains a user ca bundle
+// in its Security section.
+func GenerateIgnitionShim(bootstrapConfigURL string, userCA string) ([]byte, error) {


this moves the AWS ignition shim into a reusable function

Yes, I separated this function from AWS. Is that OK?

@bd233 yes this is good. Sorry for the confusing comment. I meant that as a note to myself.

pkg/asset/installconfig/alibabacloud/validation.go

patrickdillon · 2021-10-18T14:46:07Z

pkg/asset/installconfig/alibabacloud/metadata.go

+// does not need to be user-supplied (e.g. because it can be retrieved
+// from external APIs).
+type Metadata struct {
+	client     *Client


This stored *Client is not being utilized, instead NewClient is called throughout the code.

Although it may be good to use the stored client rather than recreating the client multiple times.
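
One possible shape for reusing the stored client, sketched under the assumption that Metadata builds the client lazily on first use. The newClient field is a hypothetical injection point that keeps the example self-contained; in the PR it would presumably be the existing NewClient:

```go
package main

import "sync"

// Client is a stand-in for the platform client.
type Client struct{ region string }

// Metadata caches a single client instead of building one per call.
type Metadata struct {
	Region string
	// newClient is a hypothetical constructor hook used by this sketch.
	newClient func(region string) (*Client, error)

	once      sync.Once
	client    *Client
	clientErr error
}

// Client constructs the client on first use and returns the cached value
// (or the cached construction error) on every later call.
func (m *Metadata) Client() (*Client, error) {
	m.once.Do(func() {
		m.client, m.clientErr = m.newClient(m.Region)
	})
	return m.client, m.clientErr
}
```

Callers would then ask Metadata for the client rather than calling the constructor themselves.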

patrickdillon · 2021-10-18T14:49:18Z

pkg/asset/installconfig/alibabacloud/validation.go

+	return allErrs
+}
+
+func validateMachinePool(client *Client, ic *types.InstallConfig, fldPath *field.Path, pool *alibabacloudtypes.MachinePool, replicas *int64) field.ErrorList {


It looks like none of this code is tested.
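
As a rough illustration of how this could be made testable, the sketch below puts the client behind a small interface and drives one check with a table of cases. The interface, the fake, and validateInstanceType are all hypothetical simplifications, not the PR's validateMachinePool:

```go
package main

import (
	"fmt"
	"testing"
)

// instanceTypeLister is a hypothetical interface covering the one client call
// this check needs, so a test can substitute a fake for the real API.
type instanceTypeLister interface {
	AvailableInstanceTypes(zone string) ([]string, error)
}

// fakeClient serves canned instance types per zone.
type fakeClient map[string][]string

func (f fakeClient) AvailableInstanceTypes(zone string) ([]string, error) {
	return f[zone], nil
}

// validateInstanceType is a simplified stand-in for one machine pool check:
// the requested instance type must be offered in the zone.
func validateInstanceType(c instanceTypeLister, zone, instanceType string) error {
	available, err := c.AvailableInstanceTypes(zone)
	if err != nil {
		return err
	}
	for _, name := range available {
		if name == instanceType {
			return nil
		}
	}
	return fmt.Errorf("instance type %q is not available in zone %q", instanceType, zone)
}

// TestValidateInstanceType would live in a _test.go file next to the code.
func TestValidateInstanceType(t *testing.T) {
	client := fakeClient{"us-east-1a": {"ecs.g6.large"}}
	cases := []struct {
		name, zone, instanceType string
		wantErr                  bool
	}{
		{"available type", "us-east-1a", "ecs.g6.large", false},
		{"unknown type", "us-east-1a", "ecs.bogus", true},
		{"unknown zone", "us-east-1z", "ecs.g6.large", true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			err := validateInstanceType(client, tc.zone, tc.instanceType)
			if (err != nil) != tc.wantErr {
				t.Errorf("got error %v, wantErr %v", err, tc.wantErr)
			}
		})
	}
}
```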

pkg/asset/installconfig/platformprovisioncheck.go

data/data/manifests/openshift/cloud-creds-secret.yaml.template

patrickdillon · 2021-10-18T14:55:28Z

pkg/asset/machines/alibabacloud/machines.go

+		UserDataSecret:     &corev1.LocalObjectReference{Name: userDataSecret},
+		CredentialsSecret:  &corev1.LocalObjectReference{Name: "alibabacloud-credentials"},


How does this work with manual credential mode?

Having creds in kube-system is useful for more than just the CCO. Even in manual mode the creds could be used for other purposes. We should, however, document that the creds are needed and what permissions the user must have.

I have created #5325, which is related, and we can have the discussion there. @staebler for the machine spec, shouldn't credentials requests from the machine api operator be used?

No. The credentials for the machine-api-operator are used to create the machine. The credentials in the machine spec are used for kubelet, kube-controller-manager, or the out-of-tree providers of such that run on the machine.

I see. But should this be handled by the cloud controller manager operator's credential request then?

pkg/asset/machines/master.go

pkg/asset/manifests/openshift.go

pkg/asset/manifests/template.go

pkg/tfvars/alibabacloud/alibabacloud.go

pkg/destroy/alibabacloud/alibabacloud.go

patrickdillon · 2021-10-18T16:07:29Z

/uncc @jstuever @rna-afk

patrickdillon

@bd233 @staebler @kwoodson I have just completed an initial review of this PR, which reorganizes #5018. The purpose is to figure out whether we can merge this PR, or further substantial changes are needed. I have taken the liberty of making small changes myself (reorganizing imports, fixing grammar, removing unnecessary code).

Here are the main outstanding issues I see after my review:

There is code to create credentials for the CCO despite this running in manual mode
There is a ValidateForProvisioning check that I think should be moved to the earlier validation stages.
A client is stored in Metadata but it is not used (a new client is created each time).
Authentication credentials for the client are being set programmatically through environment variables. I don't know if this is necessarily a problem, and it is platform specific. But it is unusual.
machineset code is not tested

I would also like to point out two items which I do not think are necessarily problematic but are worth drawing attention to:

I noticed that a user is required to create a resource group (#5291 (comment)). This seems unnecessary to me, but it has been considered already, so I am fine with it, at least for the time being.
This is not a problem, but I wanted to draw attention to the fact that AWS code for ignition shim is extracted and made reusable for Alibaba. This is good because we will need this for ASH.

I did a cursory review of Terraform and destroy but I did not do an in-depth review.

So I would like to discuss what would be the best path forward. I am not sure if it is worth holding the PR for these items, or we should merge and fix in follow-up PRs. If we are going to hold, I would like to come up with a plan for how to integrate the changes.

Adds the Alibaba platform and validation to types package. Also adds supporting files for explain.

Adds preliminary assets for the Alibaba platform: cluster, install config, machines, manifests, quota, rhcos.

Adds Terraform plugin, tfvars and stages for Alibaba.

Adds Terraform configurations for the Alibaba platform.

Adds destroy code for the Alibaba platform.

This commit was produced by running , , and all modules verified. Signed-off-by: sunhui <wb-sh373163@alibaba-inc.com>

bd233 · 2021-10-19T08:53:52Z

@bd233 @staebler @kwoodson I have just completed an initial review of this PR, which reorganizes #5018. The purpose is to figure out whether we can merge this PR, or further substantial changes are needed. I have taken the liberty of making small changes myself (reorganizing imports, fixing grammar, removing unnecessary code).

Here are the main outstanding issues I see after my review:

There is code to create credentials for the CCO despite this running in manual mode

Yes, I think I should remove this code.

There is a ValidateForProvisioning check that I think should be moved to the earlier validation stages.

A client is stored in Metadata but it is not used (a new client is created each time).

Authentication credentials for the client are being set programmatically through environment variables. I don't know if this is necessarily a problem, and it is platform specific. But it is unusual.

machineset code is not tested

I would also like to point out two items which I do not think are necessarily problematic but are worth drawing attention to:

I noticed that a user is required to create a resource group (#5291 (comment)). This seems unnecessary to me, but it has been considered already, so I am fine with it, at least for the time being.

This is not a problem, but I wanted to draw attention to the fact that AWS code for ignition shim is extracted and made reusable for Alibaba. This is good because we will need this for ASH.

I did a cursory review of Terraform and destroy but I did not do an in-depth review.

So I would like to discuss what would be the best path forward. I am not sure if it is worth holding the PR for these items, or we should merge and fix in follow-up PRs. If we are going to hold, I would like to come up with a plan for how to integrate the changes.

Thank you very much for your work.

Based on the changes in this PR, I created a new branch and fixed the above problems one by one. If this is the path you expect, should I use this branch to create a new PR? If there is anything else I need to do, please let me know.

openshift-ci · 2021-10-19T08:53:59Z

@patrickdillon: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

patrickdillon · 2021-10-21T13:44:56Z

Based on the changes in this PR, I created a new branch and fixed the above problems one by one. If this is the path you expect, should I use this branch to create a new PR? If there is anything else I need to do, please let me know.

Sure, a new PR would be fine.

kwoodson · 2021-10-21T20:03:03Z

@bd233 @dongchen126
I think there needs to be some discussion around the control plane and worker machines that get created. The machines are defined as well as the machinesets.

NAMESPACE               NAME                                 PHASE         TYPE           REGION      ZONE         AGE
openshift-machine-api   test-mbdxv-master-0                  Failed                                                42m
openshift-machine-api   test-mbdxv-master-1                  Failed                                                42m
openshift-machine-api   test-mbdxv-master-2                  Failed                                                42m
openshift-machine-api   test-mbdxv-worker-us-east-1a-7fm9f   
openshift-machine-api   test-mbdxv-worker-us-east-1b-q67lc   
openshift-machine-api   test-mbdxv-worker-us-east-1b-tf7fm

These are generated during the openshift-install create manifests stage of the installer. Once the cluster is running, the machine-api-operator, which runs the cluster-api-provider-alibaba, attempts to reconcile these machines and machinesets, which in turn creates the worker instances.

Here are the errors that I see when running the cluster:

  Warning  FailedCreate  9m22s  alibabacloud-controller  InvalidConfiguration: failed to reconcile machine "test-mbdxv-master-2": failed to create instance: error creating ECS instance: SDK.ServerError
ErrorCode: InvalidUserData.NotSupported
Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidUserData.NotSupported&source=PopGw
RequestId: 44B86884-3861-585F-905C-928D099BC053
Message: The specified parameter "UserData" only support the vpc and IoOptimized Instance.

I am able to resolve these for the worker nodes by adding a few fields:

            ioOptimized: true
            securityGroupId: sg-0xi1jns0qync9tw4wvok
            vSwitchId: vsw-0xi35xsixgexag4dbpdqa
            vpcId: vpc-0xi62g2ft1fv45gmih3vk

Since the installation occurs before these variables are set, I'm not sure how to resolve these until after the cluster installation has started. I believe this can be done, but I wanted to report it as an extra step that is required before installation can complete successfully. If we need to merge this PR and fix this afterwards, that should be okay, as the Alibaba team can then reproduce it. I wanted to bring this up so we can begin to think about how we populate these fields during the installation.

staebler · 2021-10-23T00:31:06Z

Since the installation occurs before these variables are set, I'm not sure how to resolve these until after the cluster installation has started. I believe this can be done, but I wanted to report it as an extra step that is required before installation can complete successfully. If we need to merge this PR and fix this afterwards, that should be okay, as the Alibaba team can then reproduce it. I wanted to bring this up so we can begin to think about how we populate these fields during the installation.

I don't think we'll be able to use a VPC ID in the machinesets. As you point out, the actual VPC ID is not known until after the terraform runs. Other platforms use a well-known VPC name instead.
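
To make the idea concrete, here is a small sketch of deriving names from the infrastructure ID, which is already known when the manifests are generated. The "-vpc" and "-vswitch-" naming patterns are purely hypothetical, and whether the Alibaba machine provider can look resources up by name is an assumption that would need confirming:

```go
package main

import "fmt"

// vpcName derives a deterministic VPC name from the infrastructure ID.
// The "-vpc" suffix is a hypothetical convention used only in this sketch.
func vpcName(infraID string) string {
	return fmt.Sprintf("%s-vpc", infraID)
}

// vSwitchName does the same per zone; the pattern is also hypothetical.
func vSwitchName(infraID, zone string) string {
	return fmt.Sprintf("%s-vswitch-%s", infraID, zone)
}
```

With the cluster from the listing above, vpcName("test-mbdxv") yields "test-mbdxv-vpc", so the machinesets could carry that name before Terraform has created the VPC.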

pkg/types/alibabacloud/machinepool.go

pkg/destroy/alibabacloud/alibabacloud.go

staebler · 2021-10-23T02:21:09Z

pkg/destroy/alibabacloud/alibabacloud.go

+
+	// TODO: more appropriate to use asynchronous. It is advisable to optimise in the future
+	for _, execute := range deletedFuncs {
+		err = o.executeDeleteFunction(execute.executeFunc, execute.resourceName)


The destroyer cannot wait for a given delete function to complete successfully before moving on to the next delete functions. Instead of waiting indefinitely on one delete function, the destroyer should instead loop through each delete function, making one attempt at each delete function during each iteration of the loop.
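
A rough sketch of the loop structure described above, using only the standard library. The deleteFunc type and runPass are hypothetical names; each attempt is assumed to make one non-blocking try and report whether that resource type is fully gone:

```go
package main

// deleteFunc makes a single, non-blocking attempt to delete one kind of
// resource and reports whether everything of that kind is gone.
type deleteFunc struct {
	name    string
	attempt func() (done bool, err error)
}

// runPass makes exactly one attempt at every pending delete function and
// returns the ones that still have work left, so a stuck resource type
// cannot block the others.
func runPass(pending []deleteFunc, logf func(format string, args ...interface{})) []deleteFunc {
	var remaining []deleteFunc
	for _, f := range pending {
		done, err := f.attempt()
		if err != nil {
			logf("deleting %s: %v", f.name, err)
		}
		if err != nil || !done {
			remaining = append(remaining, f)
		}
	}
	return remaining
}
```

The caller would invoke runPass repeatedly, feeding each pass the functions left over from the previous one.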

staebler · 2021-10-23T02:23:11Z

pkg/destroy/alibabacloud/alibabacloud.go

+		for _, arn := range tagResources {
+			notDeletedResources = append(notDeletedResources, arn.ResourceARN)
+		}
+		return errors.New(fmt.Sprintf("There are undeleted cloud resources %q", notDeletedResources))


The destroyer must not stop when there are resources that have not been deleted. The destroyer must keep trying to delete the resources until the user stops the destroyer.

go.mod

bd233 · 2021-10-25T12:48:46Z

@patrickdillon @staebler @kwoodson
Thank you for reviewing. I am fixing these problems and will submit a new PR as soon as possible.

openshift-ci · 2021-10-26T21:11:31Z

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-single-node-live-iso	`e1d3c17`	link	false	`/test e2e-metal-single-node-live-iso`
ci/prow/e2e-crc	`e1d3c17`	link	false	`/test e2e-crc`
ci/prow/e2e-aws-workers-rhel7	`e1d3c17`	link	false	`/test e2e-aws-workers-rhel7`
ci/prow/e2e-aws-workers-rhel8	`e1d3c17`	link	false	`/test e2e-aws-workers-rhel8`
ci/prow/okd-unit	`e1d3c17`	link	true	`/test okd-unit`

Full PR test history. Your PR dashboard.

patrickdillon · 2021-10-27T20:57:59Z

/close in favor of #5333

patrickdillon · 2021-10-29T00:11:15Z

/close

openshift-ci · 2021-10-29T00:11:45Z

@patrickdillon: Closed this PR.

In response to this:

/close

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2021

openshift-ci bot requested review from jstuever and rna-afk October 12, 2021 15:19

patrickdillon mentioned this pull request Oct 12, 2021

[WIP] Add Alibaba Cloud platform #5018

Closed
