
[Azure VMs pool] Introducing agentpool client #6685

Conversation

wenxuan0923
Contributor

@wenxuan0923 wenxuan0923 commented Apr 3, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

We will introduce a new vmType: vms to support autoscaling for single-instance VM pools. This new agentpool type will rely on the AKS RP for scaling up and down instead of CRP calls, which requires the agentpool client, the cluster name, and the cluster resource group info.

In this PR, I'm adding the required info to the config and adding the agentpool client to azClient. For now, the construction error is ignored so that no existing behavior is impacted by this change.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is the first step of a series of changes coming to support the vms pool autoscaling.

Does this PR introduce a user-facing change?

No, this change should not impact any existing behavior


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:



linux-foundation-easycla bot commented Apr 3, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 3, 2024
@k8s-ci-robot
Contributor

Welcome @wenxuan0923!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 3, 2024
@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Apr 3, 2024
@wenxuan0923 wenxuan0923 force-pushed the wenx/update-azure-config branch 7 times, most recently from 6ee1b55 to e44f1a1 Compare April 3, 2024 19:45
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 3, 2024
Member

@Bryce-Soghigian Bryce-Soghigian left a comment

Left some initial review, mostly about the arm options and better practices with the shared track 2 sdk.

Let me know what you think

Comment on lines 211 to 233
	&policy.ClientOptions{
		ClientOptions: azurecore_policy.ClientOptions{
			Cloud: cloud.Configuration{
				Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
					cloud.ResourceManager: {
						Endpoint: baseURL,
						Audience: "UNKNOWN",
					},
				},
			},
			Telemetry: azurecore_policy.TelemetryOptions{
				ApplicationID: userAgent,
			},
			Retry: retryOptions,
		},
	})
Member

I love to see usage of the track 2 SDK! I would recommend the shared middleware that Karpenter, Fleet, and some other projects consume via https://github.com/Azure/azure-sdk-for-go-extensions

There are a couple of benefits.

  1. Armbalancer will load balance requests on the client side to prevent ARM instance-level throttling
  2. There is a bug in Go's TCP implementation for HTTP/2 (which track 2 will use) where connections are not effectively closed, and we get TCP timeout errors where the connection isn't closed for 17 minutes until it finally fails. Stephane patched this here

We could keep going, but you get the point: we want this shared middleware used for track 2 implementations so that you don't hit the same bugs and we can all share our throttling lessons. We maintain one core library for this ARM middleware and all share the benefits.

https://github.com/Azure/azure-sdk-for-go-extensions/blob/main/pkg/middleware/arm_client_opts.go#L25 are the ARM client options I would recommend as the base; you can then override anything that doesn't work for you.

Comment on lines +171 to +195
func getAgentpoolClientCredentials(cfg *Config) (azcore.TokenCredential, error) {
	var cred azcore.TokenCredential
	var err error
	if cfg.AuthMethod == authMethodCLI {
		cred, err = azidentity.NewAzureCLICredential(&azidentity.AzureCLICredentialOptions{
			TenantID: cfg.TenantID})
		if err != nil {
			klog.Errorf("NewAzureCLICredential failed: %v", err)
			return nil, err
		}
	} else if cfg.AuthMethod == "" || cfg.AuthMethod == authMethodPrincipal {
		cred, err = azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
		if err != nil {
			klog.Errorf("NewClientSecretCredential failed: %v", err)
			return nil, err
		}
	} else {
		return nil, fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod)
	}
	return cred, nil
}

Member

This kind of code seems to be copy-pasted around quite a bit. I wonder if we can start to share some of it via the azure-sdk-for-go-extensions repo.

Karpenter has a very similar implementation of the same thing; essentially we all want to authenticate with SP, MSI, and CLI credentials, so we could all share this code.
https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/auth/cred.go#L28

Comment on lines 199 to 211
	return azurecore_policy.RetryOptions{
		MaxRetries:    5,
		RetryDelay:    1 * time.Second,
		MaxRetryDelay: 2 * time.Second,
		TryTimeout:    10 * time.Second,
	}
Member

@Bryce-Soghigian Bryce-Soghigian Apr 3, 2024

With retries, it would be great to get a clearer picture of why we want to follow this pattern and what throttling concerns it avoids.

Could you leave some comments here explaining that pattern?

I am trying to understand the following: why did we choose a RetryDelay of 1s, and why is MaxRetries 5? How does this affect subscription-level quota limits? When would we hit too-many-requests (429) throttling errors under this retry pattern?

If you have thorough learnings about retries and want to stick to this pattern because it's better than the DefaultRetryOptions declared in the shared azure-sdk-for-go-extensions, please tell us why and contribute there so we can all benefit.

Comment on lines 148 to 169
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
	Get(ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		options *armcontainerservice.AgentPoolsClientGetOptions) (
		armcontainerservice.AgentPoolsClientGetResponse, error)
	BeginCreateOrUpdate(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		parameters armcontainerservice.AgentPool,
		options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
	BeginDeleteMachines(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		machines armcontainerservice.AgentPoolDeleteMachinesParameter,
		options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}

const userAgent = "AKS-autoscaler"
Member

@gandhipr can maybe tell you where, but I think we already have a userAgent defined somewhere, and CAS should share the same user agent across all of its clients.

I would have to go digging to find it, but for writing Kusto queries against CAS, or for detecting when we are hitting throttling or quota limits, it would be good to have all CAS calls under a single user agent.

Comment on lines +148 to +173
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
	Get(ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		options *armcontainerservice.AgentPoolsClientGetOptions) (
		armcontainerservice.AgentPoolsClientGetResponse, error)
	BeginCreateOrUpdate(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		parameters armcontainerservice.AgentPool,
		options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
	BeginDeleteMachines(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		machines armcontainerservice.AgentPoolDeleteMachinesParameter,
		options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}

Member

Testing against this interface is something I would like to start thinking about; it is important for our current work to improve the testing scenarios for cluster autoscaler.

For Karpenter, we use envtest to simulate the Kubernetes environment, and we have fakes for all of the Azure APIs. https://github.com/Azure/karpenter-provider-azure/tree/main/pkg/fake

I want us to start thinking about the best way to fake the AgentPools client. It doesn't have to be solved in this PR, but it's good to start thinking about.


func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Member

Suggested change
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
// so that the request will be sent to node provisioner endpoint instead of the public ARM endpoint


func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Member

Is it ok to mention the node provisioner publicly?

Comment on lines 259 to 275
	agentPoolsClient, err := armcontainerservice.NewAgentPoolsClient(cfg.SubscriptionID, cred,
		&policy.ClientOptions{
			ClientOptions: azurecore_policy.ClientOptions{
				Cloud: cloud.Configuration{
					Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
						cloud.ResourceManager: {
							Endpoint: env.ResourceManagerEndpoint,
							Audience: env.TokenAudience,
						},
					},
				},
				Telemetry: azurecore_policy.TelemetryOptions{
					ApplicationID: userAgent,
				},
				Retry: retryOptions,
			},
		})
Member

We are initializing the client twice here; can we share one function for initialization and just pass in the right policy.ClientOptions? Also, again it would be good to use the shared client options as a base so that you don't encounter all the TCP bugs, throttling, etc. that we encountered.
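The shape being asked for can be sketched as follows, with hypothetical simplified types standing in for the SDK's option structs and client; the idea is that both the ARMBaseURL path and the public ARM path call the one constructor with different endpoint/audience values:

```go
package main

import "fmt"

// Hypothetical stand-in for policy.ClientOptions; the real code would build
// the full azcore cloud.Configuration from these fields.
type clientOptions struct {
	Endpoint  string
	Audience  string
	UserAgent string
}

func newClientOptions(endpoint, audience, userAgent string) clientOptions {
	return clientOptions{Endpoint: endpoint, Audience: audience, UserAgent: userAgent}
}

// One shared constructor, in the spirit of the PR's newAgentpoolClientWithConfig:
// every call site passes its own options instead of duplicating the client setup.
func newAgentpoolClient(subscriptionID string, opts clientOptions) (string, error) {
	if subscriptionID == "" {
		return "", fmt.Errorf("subscriptionID must not be empty")
	}
	// Stand-in for armcontainerservice.NewAgentPoolsClient(...).
	return fmt.Sprintf("client[%s@%s aud=%s]", subscriptionID, opts.Endpoint, opts.Audience), nil
}

func main() {
	// AKS-managed path (custom base URL) and public ARM path share one constructor.
	c1, _ := newAgentpoolClient("sub-id", newClientOptions("http://custom-base-url", "UNKNOWN", "AKS-autoscaler"))
	c2, _ := newAgentpoolClient("sub-id", newClientOptions("https://management.azure.com/", "arm-audience", "AKS-autoscaler"))
	fmt.Println(c1)
	fmt.Println(c2)
}
```

This mirrors how the PR eventually resolved the comment: a single newAgentpoolClientWithConfig taking subscription, credential, endpoint, audience, and retry options.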

@@ -5,6 +5,10 @@ go 1.21
require (
cloud.google.com/go/compute/metadata v0.2.3
github.com/Azure/azure-sdk-for-go v68.0.0+incompatible
github.com/Azure/azure-sdk-for-go-extensions v0.1.3
Member

github.com/Azure/azure-sdk-for-go-extensions v0.1.6

Try v0.1.6; it's the one we use in Karpenter, and we don't see the telemetry error you are getting.

}
}

return newAgentpoolClientWithConfig(cfg.SubscriptionID, cred, env.ResourceManagerEndpoint, env.TokenAudience, retryOptions)
Member

Wonderful!

@Bryce-Soghigian
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 4, 2024
@wenxuan0923
Contributor Author

/retest


	if cfg.ARMBaseURL != "" {
		klog.V(10).Infof("Using ARMBaseURL to create agent pool client")
		return newAgentpoolClientWithConfig(cfg.SubscriptionID, nil, cfg.ARMBaseURL, "UNKNOWN", retryOptions)


Do we need to check cfg.SubscriptionID? Is it possible that it is empty by mistake?


Why is tokenCredential nil here?

@@ -287,5 +417,6 @@ func newAzClient(cfg *Config, env *azure.Environment) (*azClient, error) {
virtualMachinesClient: virtualMachinesClient,
storageAccountsClient: storageAccountsClient,
skuClient: skuClient,
agentPoolClient: agentPoolClient,


Based on L408, it is possible that agentPoolClient is nil. Be sure to check for nil when referring to it in later code development.
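The nil-guard pattern this comment asks for, as a self-contained sketch; the azClient shape here is a hypothetical minimal stand-in for the real struct, and scaleVMsPool is an illustrative caller, not a function from the PR:

```go
package main

import (
	"errors"
	"fmt"
)

type agentPoolsClient struct{} // stand-in for the real AgentPoolsClient

// Hypothetical minimal azClient: since the PR ignores the construction
// error, agentPoolClient may legitimately be nil at runtime.
type azClient struct {
	agentPoolClient *agentPoolsClient
}

// Any vms-pool code path must guard against the nil client before using it.
func (az *azClient) scaleVMsPool() error {
	if az.agentPoolClient == nil {
		return errors.New("agent pool client not initialized; vms pool scaling unavailable")
	}
	return nil // real code would call BeginCreateOrUpdate etc. here
}

func main() {
	az := &azClient{} // construction error was ignored, client left nil
	fmt.Println(az.scaleVMsPool())
}
```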

@gandhipr
Contributor

gandhipr commented Apr 12, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2024
Comment on lines +153 to +155
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
Contributor

There is an implicit implementation choice here of implementing a new client, rather than using those in cloud-provider-azure (and maybe extending them; it looks like there is an MC client, but not an AP one), which are used throughout the rest of the CAS Azure provider. What are the tradeoffs, and which ones push us in this direction?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2024
@tallaxes
Contributor

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2024
@wenxuan0923
Contributor Author

/retest

@tallaxes
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2024
@tallaxes
Contributor

/approve

@tallaxes
Contributor

/assign @feiskyer

klog.Errorf("NewClientSecretCredential failed: %v", err)
return nil, err
}
} else {
Member

no msi support here?

Member

I think we could get this PR in first and add MSI support in a following PR

Member

@feiskyer feiskyer left a comment

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bryce-Soghigian, feiskyer, gandhipr, tallaxes, wenxuan0923

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2024
@k8s-ci-robot k8s-ci-robot merged commit b36d4fb into kubernetes:cluster-autoscaler-release-1.29 Jun 14, 2024
7 checks passed