[Azure VMs pool] Introducing agentpool client #6685
Conversation
Welcome @wenxuan0923!

Hi @wenxuan0923. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 6ee1b55 to e44f1a1
Left some initial review comments, mostly about the ARM options and better practices with the shared track 2 SDK. Let me know what you think.
	&policy.ClientOptions{
		ClientOptions: azurecore_policy.ClientOptions{
			Cloud: cloud.Configuration{
				Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
					cloud.ResourceManager: {
						Endpoint: baseURL,
						Audience: "UNKNOWN",
					},
				},
			},
			Telemetry: azurecore_policy.TelemetryOptions{
				ApplicationID: userAgent,
			},
			Retry: retryOptions,
		},
	})
I love to see usage of the track 2 SDK! I would recommend the shared middleware that Karpenter, Fleet, and some other projects consume via https://github.com/Azure/azure-sdk-for-go-extensions
There are a couple of benefits:
- Armbalancer load-balances requests on the client side to prevent ARM instance-level throttling.
- There is a bug in Go's TCP implementation for HTTP/2 (which track 2 will use) where connections are not effectively closed; we get TCP timeout errors where a connection stays open for 17 minutes until it finally fails. Stephane patched this here.
We could keep going, but you get the point: we want this shared middleware used for track 2 implementations so that you don't hit the same bugs and we can share all of our throttling lessons. We maintain one core library for these ARM middleware pieces and all share the benefits.
https://github.com/Azure/azure-sdk-for-go-extensions/blob/main/pkg/middleware/arm_client_opts.go#L25 has the ARM client options I would recommend as the base; you can then override anything that doesn't work for you.
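For illustration only, a rough sketch of what consuming the shared middleware could look like. It assumes the extensions middleware package exposes a `DefaultArmOpts()` helper returning `*arm.ClientOptions` (as the linked file suggests); the import paths, SDK version, and helper name are assumptions, not this PR's code:

```go
package azure

import (
	armopts "github.com/Azure/azure-sdk-for-go-extensions/pkg/middleware"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// newAgentpoolClientWithSharedOpts is a hypothetical constructor: it starts from the
// shared ARM client options (armbalancer transport, HTTP/2 connection handling, retry
// and telemetry defaults) and overrides only what this provider needs.
func newAgentpoolClientWithSharedOpts(subscriptionID, userAgent string, cred azcore.TokenCredential) (*armcontainerservice.AgentPoolsClient, error) {
	opts := armopts.DefaultArmOpts()         // shared base options from azure-sdk-for-go-extensions
	opts.Telemetry.ApplicationID = userAgent // CAS-specific override
	return armcontainerservice.NewAgentPoolsClient(subscriptionID, cred, opts)
}
```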
func getAgentpoolClientCredentials(cfg *Config) (azcore.TokenCredential, error) {
	var cred azcore.TokenCredential
	var err error
	if cfg.AuthMethod == authMethodCLI {
		cred, err = azidentity.NewAzureCLICredential(&azidentity.AzureCLICredentialOptions{
			TenantID: cfg.TenantID})
		if err != nil {
			klog.Errorf("NewAzureCLICredential failed: %v", err)
			return nil, err
		}
	} else if cfg.AuthMethod == "" || cfg.AuthMethod == authMethodPrincipal {
		cred, err = azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
		if err != nil {
			klog.Errorf("NewClientSecretCredential failed: %v", err)
			return nil, err
		}
	} else {
		return nil, fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod)
	}
	return cred, nil
}
This kind of code seems to be copy-pasted around quite a bit. I wonder if we can start to share some of it via the azure-sdk-for-go-extensions repo.
Karpenter has a very similar implementation of the same thing; essentially we all want to authenticate with SP, MSI, and CLI credentials, so we could all share this code.
https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/auth/cred.go#L28
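Not something this PR has to solve, but for concreteness, a minimal sketch of what such a shared helper could look like. It reuses the `Config` type and auth-method constants from the diff above; the function name is invented, and Karpenter's actual `cred.go` differs:

```go
package azure

import (
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newTokenCredential is a hypothetical shared helper covering the auth methods the
// comment mentions (CLI and service principal here, MSI as a later addition), so each
// Azure provider does not re-implement the same switch.
func newTokenCredential(cfg *Config) (azcore.TokenCredential, error) {
	switch cfg.AuthMethod {
	case authMethodCLI:
		return azidentity.NewAzureCLICredential(&azidentity.AzureCLICredentialOptions{TenantID: cfg.TenantID})
	case "", authMethodPrincipal:
		return azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
	default:
		return nil, fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod)
	}
}
```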
	return azurecore_policy.RetryOptions{
		MaxRetries:    5,
		RetryDelay:    1 * time.Second,
		MaxRetryDelay: 2 * time.Second,
		TryTimeout:    10 * time.Second,
	}
With retries it would be great to get a clearer picture of why we want to follow this pattern and what throttling concerns it avoids.
Could you leave some comments here explaining that pattern?
I am trying to understand the following: why did we choose a RetryDelay of 1s, and why is MaxRetries 5? How does this affect the subscription-level quota limits? When would we hit too-many-requests (429) throttling errors with this retry pattern?
If you have thorough learnings about retries and want to stick to this pattern because it's better than the DefaultRetryOptions declared in the shared azure-sdk-for-go-extensions, please tell us why, and contribute the change there so we can all benefit.
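As a purely illustrative example, the kind of inline explanation being asked for might look like this; the values are the ones from the diff, not a recommendation, the rationale comments are assumptions for the author to confirm, and the snippet reuses the file's existing `azurecore_policy` and `time` imports:

```go
// newRetryOptions returns the retry policy used by the agent pool client.
func newRetryOptions() azurecore_policy.RetryOptions {
	return azurecore_policy.RetryOptions{
		// MaxRetries bounds how many extra ARM calls one logical request can generate,
		// which is what eventually counts against subscription-level quotas and 429s.
		MaxRetries: 5,
		// azcore applies exponential backoff starting at RetryDelay and capped at
		// MaxRetryDelay, so a failing call adds at most a few seconds of delay overall.
		RetryDelay:    1 * time.Second,
		MaxRetryDelay: 2 * time.Second,
		// TryTimeout limits each individual attempt, not the whole operation.
		TryTimeout: 10 * time.Second,
	}
}
```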
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
	Get(ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		options *armcontainerservice.AgentPoolsClientGetOptions) (
		armcontainerservice.AgentPoolsClientGetResponse, error)
	BeginCreateOrUpdate(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		parameters armcontainerservice.AgentPool,
		options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
	BeginDeleteMachines(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		machines armcontainerservice.AgentPoolDeleteMachinesParameter,
		options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}
const userAgent = "AKS-autoscaler"
@gandhipr can maybe tell you where, but I think we already define a userAgent somewhere, and we should share the same user agent across all of the CAS clients.
I would have to go digging to find it, but for writing Kusto queries against CAS, or detecting when we are hitting throttling or quota limits, it would be good to have all CAS calls under a single user agent.
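Hypothetically, the end state could be as simple as one shared constant reused for every client's telemetry; the constant name and value below are placeholders (not the existing CAS one), and `azurecore_policy` is the alias already used in this file:

```go
// casUserAgent stands in for the user agent CAS already defines elsewhere; the point
// is that every Azure client should set the same ApplicationID.
const casUserAgent = "cluster-autoscaler"

// sharedTelemetryOptions returns the telemetry options every CAS client would reuse.
func sharedTelemetryOptions() azurecore_policy.TelemetryOptions {
	return azurecore_policy.TelemetryOptions{ApplicationID: casUserAgent}
}
```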
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
	Get(ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		options *armcontainerservice.AgentPoolsClientGetOptions) (
		armcontainerservice.AgentPoolsClientGetResponse, error)
	BeginCreateOrUpdate(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		parameters armcontainerservice.AgentPool,
		options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
	BeginDeleteMachines(
		ctx context.Context,
		resourceGroupName, resourceName, agentPoolName string,
		machines armcontainerservice.AgentPoolDeleteMachinesParameter,
		options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
		*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}
Testing against this interface is something I would like to start thinking about, and it is important for our current work to improve the testing scenarios for cluster autoscaler.
For Karpenter, we use envtest to simulate the Kubernetes environment, and we have fakes for all of the Azure APIs: https://github.com/Azure/karpenter-provider-azure/tree/main/pkg/fake
I want us to start thinking about the best way to fake the AgentPools client. It doesn't have to be solved in this PR, but it's good to start thinking about.
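Not for this PR, but to make the idea concrete, a hand-written fake over the interface above could start roughly like this. Type and field names are invented; the long-running Begin* methods are the awkward part, since a realistic fake would need pollers built with `runtime.NewPoller` over a fake transport, which is only noted here:

```go
import (
	"context"
	"errors"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// fakeAgentPoolsClient is a hypothetical in-memory implementation of AgentPoolsClient
// for unit tests, in the spirit of karpenter-provider-azure's pkg/fake package.
type fakeAgentPoolsClient struct {
	getResp armcontainerservice.AgentPoolsClientGetResponse
	getErr  error
}

func (f *fakeAgentPoolsClient) Get(ctx context.Context,
	resourceGroupName, resourceName, agentPoolName string,
	options *armcontainerservice.AgentPoolsClientGetOptions) (
	armcontainerservice.AgentPoolsClientGetResponse, error) {
	return f.getResp, f.getErr
}

func (f *fakeAgentPoolsClient) BeginCreateOrUpdate(ctx context.Context,
	resourceGroupName, resourceName, agentPoolName string,
	parameters armcontainerservice.AgentPool,
	options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
	*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error) {
	// A more complete fake would return a poller built with runtime.NewPoller
	// against a fake transport so callers can exercise PollUntilDone.
	return nil, errors.New("fake: BeginCreateOrUpdate not implemented")
}

func (f *fakeAgentPoolsClient) BeginDeleteMachines(ctx context.Context,
	resourceGroupName, resourceName, agentPoolName string,
	machines armcontainerservice.AgentPoolDeleteMachinesParameter,
	options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
	*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error) {
	return nil, errors.New("fake: BeginDeleteMachines not implemented")
}
```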
func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
	// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
	// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Suggested change:
- // so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
+ // so that the request will be sent to node provisioner endpoint instead of the public ARM endpoint
func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
	// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
	// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Is it OK to mention the node provisioner publicly?
	agentPoolsClient, err := armcontainerservice.NewAgentPoolsClient(cfg.SubscriptionID, cred,
		&policy.ClientOptions{
			ClientOptions: azurecore_policy.ClientOptions{
				Cloud: cloud.Configuration{
					Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
						cloud.ResourceManager: {
							Endpoint: env.ResourceManagerEndpoint,
							Audience: env.TokenAudience,
						},
					},
				},
				Telemetry: azurecore_policy.TelemetryOptions{
					ApplicationID: userAgent,
				},
				Retry: retryOptions,
			},
		})
We are initializing the client twice here; can we share one function for initialization and just pass in the right policy.ClientOptions? Also, again, it would be good to use the shared client options as the base so that you don't encounter all the TCP bugs, throttling, etc. that we ran into.
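Something like the following single constructor would cover both call sites. The later revisions of this PR add a `newAgentpoolClientWithConfig` with a similar shape, so treat this body as an approximation rather than the final code; it assumes the file's existing `policy`, `azurecore_policy`, `cloud`, and `armcontainerservice` imports and the `userAgent` constant:

```go
// newAgentpoolClientWithConfig builds the agent pool client once, with the endpoint,
// audience and retry options supplied by the caller, so the two call sites above only
// differ in the arguments they pass.
func newAgentpoolClientWithConfig(subscriptionID string, cred azcore.TokenCredential,
	endpoint, audience string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
	return armcontainerservice.NewAgentPoolsClient(subscriptionID, cred,
		&policy.ClientOptions{
			ClientOptions: azurecore_policy.ClientOptions{
				Cloud: cloud.Configuration{
					Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
						cloud.ResourceManager: {
							Endpoint: endpoint,
							Audience: audience,
						},
					},
				},
				Telemetry: azurecore_policy.TelemetryOptions{ApplicationID: userAgent},
				Retry:     retryOptions,
			},
		})
}
```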
Force-pushed from 965e9a4 to 4b4c541
cluster-autoscaler/go.mod (Outdated)

@@ -5,6 +5,10 @@ go 1.21
require (
	cloud.google.com/go/compute/metadata v0.2.3
	github.com/Azure/azure-sdk-for-go v68.0.0+incompatible
	github.com/Azure/azure-sdk-for-go-extensions v0.1.3
Suggested change:
github.com/Azure/azure-sdk-for-go-extensions v0.1.6

Try v0.1.6; it's the version we use in Karpenter, and we don't see the telemetry error you are getting.
	}
}

return newAgentpoolClientWithConfig(cfg.SubscriptionID, cred, env.ResourceManagerEndpoint, env.TokenAudience, retryOptions)
Wonderful!
/ok-to-test

/retest
if cfg.ARMBaseURL != "" {
	klog.V(10).Infof("Using ARMBaseURL to create agent pool client")
	return newAgentpoolClientWithConfig(cfg.SubscriptionID, nil, cfg.ARMBaseURL, "UNKNOWN", retryOptions)
Do we need to check cfg.SubscriptionID? Is it possible that it is empty by mistake?
Why is the tokenCredential nil here?
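On the first question, a small guard would be cheap; a sketch is below (the helper name is invented). On the second, the nil credential appears to be intentional for the ARMBaseURL path, since requests go to that endpoint rather than public ARM per the comment in the diff, but a short code comment confirming that assumption would answer the question for future readers.

```go
// validateAgentpoolClientConfig is a hypothetical guard against an accidentally empty
// subscription ID before any agent pool client is constructed.
func validateAgentpoolClientConfig(cfg *Config) error {
	if cfg == nil || cfg.SubscriptionID == "" {
		return fmt.Errorf("subscriptionID must not be empty when creating the agent pool client")
	}
	return nil
}
```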
@@ -287,5 +417,6 @@ func newAzClient(cfg *Config, env *azure.Environment) (*azClient, error) {
		virtualMachinesClient: virtualMachinesClient,
		storageAccountsClient: storageAccountsClient,
		skuClient:             skuClient,
		agentPoolClient:       agentPoolClient,
Based on L408, it is possible that agentPoolClient is nil. Be sure to check for that when referring to it in later development.
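For example, a later caller might need a guard along these lines; the method name is invented for illustration, and the snippet assumes the azClient struct and imports from the surrounding file:

```go
// getAgentPool shows the kind of nil check callers will need while the construction
// error is being swallowed in newAzClient.
func (az *azClient) getAgentPool(ctx context.Context, resourceGroup, clusterName, agentPoolName string) (
	armcontainerservice.AgentPoolsClientGetResponse, error) {
	if az.agentPoolClient == nil {
		return armcontainerservice.AgentPoolsClientGetResponse{}, fmt.Errorf("agent pool client is not initialized")
	}
	return az.agentPoolClient.Get(ctx, resourceGroup, clusterName, agentPoolName, nil)
}
```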
/lgtm
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
There is an implicit implementation choice here of writing a new client, rather than using (and maybe extending) the clients in cloud-provider-azure, which are used throughout the rest of the CAS Azure provider; it looks like there is an MC client there, but not an AP one. What are the tradeoffs, and which ones push us in this direction?
Force-pushed from de0550e to 3b592b0
/retest
Force-pushed from 586ade0 to 73f525c
/retest

/lgtm

/approve

/assign @feiskyer
		klog.Errorf("NewClientSecretCredential failed: %v", err)
		return nil, err
	}
} else {
No MSI support here?
I think we could get this PR in first and add MSI support in a follow-up PR.
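For that follow-up, the MSI branch could plausibly look like the sketch below; the helper name and the user-assigned identity plumbing are assumptions, not part of this PR:

```go
import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newMSICredential is a hypothetical sketch of managed identity support for a
// follow-up PR. An empty clientID means the system-assigned identity is used.
func newMSICredential(clientID string) (azcore.TokenCredential, error) {
	opts := &azidentity.ManagedIdentityCredentialOptions{}
	if clientID != "" {
		opts.ID = azidentity.ClientID(clientID) // user-assigned identity
	}
	return azidentity.NewManagedIdentityCredential(opts)
}
```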
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bryce-Soghigian, feiskyer, gandhipr, tallaxes, wenxuan0923

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Merged commit b36d4fb into kubernetes:cluster-autoscaler-release-1.29
What type of PR is this?

/kind feature

What this PR does / why we need it:

We will introduce a new vmType: vms to support autoscaling for single-instance VM pools. This new agentpool type will rely on the AKS RP for scaling up and down instead of CRP calls, which requires the agentpool client, cluster name, and cluster resource group info. In this PR, I'm adding the required info to the config and adding the agentpool client into azClient. So far, the construction error is ignored so that no existing behavior is impacted by this change.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is the first step in a series of changes to support vms pool autoscaling.

Does this PR introduce a user-facing change?

No, this change should not impact any existing behavior.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: