Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Alibaba recommitted #5291

Closed

Conversation

patrickdillon
Copy link
Contributor

@patrickdillon patrickdillon commented Oct 12, 2021

The Alibaba PR #5018 up until this point has been divided among dozens of commits; the PR has recently been squashed down into two large commits, one of all code/configuration, the other for all vendoring.

This PR, takes the most recent state of the PR with the two commits e7297fa443e64e842c7e7fa3166bd7f380ab4339 and 8962496f84393e5c6668330d5a054c622a599977, attempts to help reorganize them in a logical manner for easier review. This PR simply organizes the commits around the code structure of the Installer. There are separate commits for:

  • types
  • assets
  • terraform code and configuration
  • destroy code
  • vendoring

I propose that @bd233 and his team take the commits from this PR and either update #5018 with the new organization or open a new PR to replace #5018. Again this PR simply reorganizes the current state of #5018 with the goal of making it easier to review.

Moving forward changes to the PR would either be rebased into the appropriate commit or added using FIXUP commits. Let's make an agreement here before proceeding.

@staebler and @kwoodson thoughts on this plan?

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from patrickdillon after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +276 to +277
os.Setenv(envAccessKeyID, accessKeyID)
os.Setenv(envAccessKeySecret, accessKeySecret)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be setting environment variables. The API should be accessed programmatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll delete it

Comment on lines +9 to +11
// Before deploying the cluster, the user must manually create a resource group.
// The parameter ResourceGroupID is required.
ResourceGroupID string `json:"resourceGroupID"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do users have to create the resource group rather than the installer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a 7-day buffer time after alicloud resource groups are deleted. During this period, resource groups with the same name cannot be created. Therefore, in the initial design, users need to create them manually. If this change is not necessary, we plan to support the function of creating new resource groups in later versions.


// GenerateIgnitionShim is used to generate an ignition file that contains a user ca bundle
// in its Security section.
func GenerateIgnitionShim(bootstrapConfigURL string, userCA string) ([]byte, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this moves the AWS ignition shim into a reusable function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I separated this function from AWS. Is that ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bd233 yes this is good. Sorry for the confusing comment. I meant that as a note to myself.

// does not need to be user-supplied (e.g. because it can be retrieved
// from external APIs).
type Metadata struct {
client *Client
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stored *Client is not being utilized, instead NewClient is called throughout the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it may be good to use the stored client rather than recreating the client multiple times.

return allErrs
}

func validateMachinePool(client *Client, ic *types.InstallConfig, fldPath *field.Path, pool *alibabacloudtypes.MachinePool, replicas *int64) field.ErrorList {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like none of this code is tested.

Comment on lines +102 to +104
UserDataSecret: &corev1.LocalObjectReference{Name: userDataSecret},
CredentialsSecret: &corev1.LocalObjectReference{Name: "alibabacloud-credentials"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work with manual credential mode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having creds in kube-system is not useful only for CCO. Even in manual mode the creds could be used for other purposes. We should, however, document that the creds are needed and what permissions the user must have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created #5325 which is related and we can have discussion there. @staebler for the machine spec, shouldn't credentials requests from the machine api operator be used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The credentials for the machine-api-operator are used to create the machine. The credentials in the machine spec are used for kubelet, kube-controller-manager, or the out-of-tree providers of such that run on the machine.

Copy link
Contributor Author

@patrickdillon patrickdillon Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But should this be handled by the cloud controller manager operator's credential request then?

@patrickdillon
Copy link
Contributor Author

/uncc @jstuever @rna-afk

@openshift-ci openshift-ci bot removed request for jstuever and rna-afk October 18, 2021 16:07
Copy link
Contributor Author

@patrickdillon patrickdillon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bd233 @staebler @kwoodson I have just completed an initial review of this PR, which reorganizes #5018. The purpose is to figure out whether we can merge this PR, or further substantial changes are needed. I have taken the liberty of making small changes myself (reorganizing imports, fixing grammar, removing unnecessary code).

Here are the main outstanding issues I see after my review:

I would also like to point out two items which I do no think are necessarily problematic but worth drawing attention to:

I did a cursory review of Terraform and destroy but I did not do an in-depth review.

So I would like to discuss what would be the best path forward. I am not sure if it is worth holding the PR for these items, or we should merge and fix in follow-up PRs. If we are going to hold, I would like to come up with a plan for how to integrate the changes.

bd233 added 5 commits October 18, 2021 14:16
Adds the Alibaba platform and validation to types package. Also
adds supporting files for explain.
Adds preliminary assets for the Alibaba platform: cluster,
install config, machines, manifests, quota, rhcos.
Adds Terraform plugin, tfvars and stages for Alibaba.
Adds Terraform configurations for the Alibaba platform.
Adds destroy code for the Alibaba platform.
bd233 added 2 commits October 18, 2021 14:16
This commit was produced by running , , and
all modules verified.

Signed-off-by: sunhui <wb-sh373163@alibaba-inc.com>
@bd233
Copy link
Contributor

bd233 commented Oct 19, 2021

@bd233 @staebler @kwoodson I have just completed an initial review of this PR, which reorganizes #5018. The purpose is to figure out whether we can merge this PR, or further substantial changes are needed. I have taken the liberty of making small changes myself (reorganizing imports, fixing grammar, removing unnecessary code).

Here are the main outstanding issues I see after my review:

Yes, I think I should removed these codes

I would also like to point out two items which I do no think are necessarily problematic but worth drawing attention to:

I did a cursory review of Terraform and destroy but I did not do an in-depth review.

So I would like to discuss what would be the best path forward. I am not sure if it is worth holding the PR for these items, or we should merge and fix in follow-up PRs. If we are going to hold, I would like to come up with a plan for how to integrate the changes.

Thank you very much for your work.

Based on the modification of this PR, I recreated a new branch and fixed the above problems one by one. If this is the path you expect, then I should use this branch to create a new PR? If there is anything I need to do, please let me know.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 19, 2021

@patrickdillon: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@patrickdillon
Copy link
Contributor Author

Based on the modification of this PR, I recreated a new branch and fixed the above problems one by one. If this is the path you expect, then I should use this branch to create a new PR? If there is anything I need to do, please let me know.

Sure a new PR would be fine.

@kwoodson
Copy link

@bd233 @dongchen126
I think there needs some discussion around the control plane and worker machines that get created. The machines are defined as well as the machinesets.

NAMESPACE               NAME                                 PHASE         TYPE           REGION      ZONE         AGE
openshift-machine-api   test-mbdxv-master-0                  Failed                                                42m
openshift-machine-api   test-mbdxv-master-1                  Failed                                                42m
openshift-machine-api   test-mbdxv-master-2                  Failed                                                42m
openshift-machine-api   test-mbdxv-worker-us-east-1a-7fm9f   
openshift-machine-api   test-mbdxv-worker-us-east-1b-q67lc   
openshift-machine-api   test-mbdxv-worker-us-east-1b-tf7fm   

These are generated during the openshift-install create manifests stage of the installer. Once the cluster is running the machine-api-operator which runs the cluster-api-provider-alibaba attempts to reconcile these machine and machinesets which in turn creates the worker instances.

Here are the errors that I see when running the cluster:

  Warning  FailedCreate  9m22s  alibabacloud-controller  InvalidConfiguration: failed to reconcile machine "test-mbdxv-master-2": failed to create instance: error creating ECS instance: SDK.ServerError
ErrorCode: InvalidUserData.NotSupported
Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidUserData.NotSupported&source=PopGw
RequestId: 44B86884-3861-585F-905C-928D099BC053
Message: TThe specified parameter "UserData" only support the vpc and IoOptimized Instance.

I am able to resolve these for the worker nodes by adding a few fields:

            ioOptimized: true
            securityGroupId: sg-0xi1jns0qync9tw4wvok
            vSwitchId: vsw-0xi35xsixgexag4dbpdqa
            vpcId: vpc-0xi62g2ft1fv45gmih3vk

Since the installation occurs before these variables are set I'm not sure how to resolve these until after the cluster installation has started. I believe this can be done but wanted to report this as an extra step that is required before installation can complete successfully. If we need to merge this PR and fix this afterwards that should be okay as then the Alibaba team can reproduce. I wanted to bring this up and begin to think about how we populate these fields during the installation?

@staebler
Copy link
Contributor

Since the installation occurs before these variables are set I'm not sure how to resolve these until after the cluster installation has started. I believe this can be done but wanted to report this as an extra step that is required before installation can complete successfully. If we need to merge this PR and fix this afterwards that should be okay as then the Alibaba team can reproduce. I wanted to bring this up and begin to think about how we populate these fields during the installation?

I don't think we'll be able to use a VPC ID in the machinesets. As you point out, the actual VPC ID is not known until after the terraform runs. Other platforms use a well-known VPC name instead.


// TODO: more appropriate to use asynchronous. It is advisable to optimise in the future
for _, execute := range deletedFuncs {
err = o.executeDeleteFunction(execute.executeFunc, execute.resourceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The destroyer cannot wait for a given delete function to complete successfully before moving on to the next delete functions. Instead of waiting indefinitely on one delete function, the destroyer should instead loop through each delete function, making one attempt at each delete function during each iteration of the loop.

for _, arn := range tagResources {
notDeletedResources = append(notDeletedResources, arn.ResourceARN)
}
return errors.New(fmt.Sprintf("There are undeleted cloud resources %q", notDeletedResources))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The destroyer must not stop when there are resources that have not been deleted. The destroyer must keep trying to delete the resources until the user stops the destroyer.

@bd233
Copy link
Contributor

bd233 commented Oct 25, 2021

@patrickdillon @staebler @kwoodson
Thank you for reviewing, I am fixing these problems, and I will submit a new PR as soon as possible.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 26, 2021

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-single-node-live-iso e1d3c17 link false /test e2e-metal-single-node-live-iso
ci/prow/e2e-crc e1d3c17 link false /test e2e-crc
ci/prow/e2e-aws-workers-rhel7 e1d3c17 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-workers-rhel8 e1d3c17 link false /test e2e-aws-workers-rhel8
ci/prow/okd-unit e1d3c17 link true /test okd-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@patrickdillon
Copy link
Contributor Author

/close in favor of #5333

@patrickdillon
Copy link
Contributor Author

/close

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 29, 2021

@patrickdillon: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Oct 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants