Add proposal for Azure Service Operator #3113

Merged Apr 17, 2023 (1 commit)

Contributor

@nojnhuh nojnhuh commented Jan 27, 2023

What type of PR is this?
/kind design

What this PR does / why we need it: This PR adds a proposal suggesting the adoption of Azure Service Operator in CAPZ to manage infrastructure in Azure instead of the Azure SDK.

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/design Categorizes issue or PR as related to design. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 27, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 27, 2023
Contributor

@mboersma mboersma left a comment

First draft is looking good! You thought of all the gotchas that I can think of.


- Leverage existing e2e tests
- Add unit tests for new ASO integration
- Run one-off tests against large clusters to catch performance regressions
Contributor

We should talk about telemetry somewhere. Currently we have traces and metrics for every SDK call made in CAPZ: https://capz.sigs.k8s.io/developers/development.html#viewing-telemetry. If we move to ASO, we will lose that. @matthchr does ASO currently emit traces/metrics for SDK calls?

Contributor Author

It looks like ASO exposes azure_successful_requests_total, azure_failed_requests_total, and azure_requests_time_seconds Prometheus metrics, but I don't see any OpenTelemetry integration.

We don't have any OpenTelemetry integration currently. We have Prometheus metrics for every SDK call made, but not traces. As I mentioned in my other comment, this is something we'd be open to improving, although I'm not sure how we'd get distributed tracing to work through CRs (so that you could have a top-level trace that spanned N ASO resource creations, for example).

@nawazkh
Member

nawazkh commented Jan 31, 2023

First draft looks great to me as well, thank you for putting it together!

Contributor Author

@nojnhuh nojnhuh left a comment

Thanks everyone for the feedback so far! I've addressed that for now in the form of bullet points and will start filling those sections out more.

### Graduation Criteria

ASO integration will not be kept behind a feature flag or matriculate through the usual alpha, beta, and stable phases. Instead, the transition will be made one Azure service interface at a time so as to distribute potential impact over time.

Agree that this seems prudent.

Azure or Kubernetes API limits with fewer or smaller workload clusters being managed.
- Management cluster will have to manage many more Kubernetes resources per
workload cluster
- Because ASO has not yet been proven as a mission-critical interface to Azure

Well-phrased. I agree with this as a risk.

I think it makes a good bit of sense to make a shared bet. As you called out, ASO is solving the "2. Interfacing with the Azure platform to manage creating, updating, and deleting that infrastructure" problem, so it should end up reducing the work CAPZ has to do on that stuff, but this is a risk as obviously the Azure Go SDK has much broader adoption and is more mature (GA) than ASO is currently.

used instead of the API or SDK directly
- Conflicting user installations of ASO or ASO resources
- Future breaking changes in ASO
- Lower-fidelity telemetry compared to what CAPZ tracks currently

This is something we'd love to work with you guys on I think. We have some basic telemetry exposed already: https://azure.github.io/azure-service-operator/introduction/metrics/ - if you gave us a list of what exactly you wanted (or were losing in this migration) we could work to expose that data.

Or is the issue here more that you had integrations into the Azure SDK to track aggregate metrics such as "time it takes to fully provision a cluster" that you'd be losing?

Contributor Author

For the little bit I've used CAPZ's tracing, I've found it helpful to have a breakdown of how long each step in a single CAPZ reconciliation takes. Since that includes Azure API calls currently, I think my main concern was losing that kind of association between a CAPZ reconciliation and Azure API calls. I updated this section to mention that I don't think that would really matter though since Azure API calls would be happening in ASO completely out-of-band with CAPZ reconciliations. Or at least recreating that mapping seems like it would be unnecessarily difficult.

Contributor

There have been discussions about tracking resource lifecycle and some related KEP work: https://groups.google.com/g/kubebuilder/c/tNI6ZpQ2loM/m/8rSX6HKVDgAJ. Correlation is going to be difficult. However, we might be able to trace with observed generation and namespace/name to get something close enough.

CAPZ interacts with some Azure services that do not represent infrastructure, and thus cannot be represented in ASO. Resource Health, for example, is "reconciled" by CAPZ currently by getting a resource's health status and reflecting that in the corresponding CAPZ resource, but does not create or update any distinct Azure resources. The new SDK could be used to implement this existing functionality without affecting other service interfaces' use of ASO. Implementing Resource Health in ASO is being tracked in https://github.com/Azure/azure-service-operator/issues/2762.

Also, use of the `clusterctl move` command will require extra manual steps to move ASO resources as documented here: https://azure.github.io/azure-service-operator/introduction/frequently-asked-questions/#what-is-the-best-practice-for-transferring-aso-resources-from-one-cluster-to-another. Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`. The necessary ASO resources can be enumerated by invoking `clusterctl move --dry-run -v 1`. `clusterctl move` will automatically detect and move the ASO resources. Then after `clusterctl move` is complete, the annotation should be changed back to its previous state.
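
The annotation handling described in this paragraph could be sketched roughly as follows. This is only an illustration in Go operating on a plain annotations map: the helper names `pauseReconcile` and `restoreReconcile` and the `savedPolicy` type are hypothetical and are not part of CAPZ, ASO, or clusterctl. Only the annotation key and the `skip` value come from the text above. A real implementation would read and patch the ASO objects through a Kubernetes client.

```go
package main

import "fmt"

// Annotation key and value quoted in the proposal text.
const (
	reconcilePolicyAnnotation = "serviceoperator.azure.com/reconcile-policy"
	reconcilePolicySkip       = "skip"
)

// savedPolicy (hypothetical) records the annotation's state before it was
// overwritten, so that it can be restored after the move.
type savedPolicy struct {
	value   string
	present bool
}

// pauseReconcile (hypothetical) sets the reconcile-policy annotation to
// "skip" and returns the previous state. The map must be non-nil.
func pauseReconcile(annotations map[string]string) savedPolicy {
	prev, ok := annotations[reconcilePolicyAnnotation]
	annotations[reconcilePolicyAnnotation] = reconcilePolicySkip
	return savedPolicy{value: prev, present: ok}
}

// restoreReconcile (hypothetical) puts the annotation back to its previous
// state, removing it entirely if it was not set before.
func restoreReconcile(annotations map[string]string, saved savedPolicy) {
	if saved.present {
		annotations[reconcilePolicyAnnotation] = saved.value
		return
	}
	delete(annotations, reconcilePolicyAnnotation)
}

func main() {
	// Stand-in for the annotations of one ASO resource owned by a Cluster.
	annotations := map[string]string{"example.com/owner": "capz"}

	saved := pauseReconcile(annotations)
	fmt.Println("during move:", annotations[reconcilePolicyAnnotation])

	// `clusterctl move` would run at this point.

	restoreReconcile(annotations, saved)
	_, stillSet := annotations[reconcilePolicyAnnotation]
	fmt.Println("annotation still set after restore:", stillSet)
}
```

Recording whether the annotation was present at all, rather than only its value, lets the restore step distinguish "previously unset" from "previously set to an empty string", which matters for returning each resource to its exact prior state.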
Contributor

> Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`.

that's not a great experience for users. They shouldn't have to care or even know about ASO as it's an implementation detail of CAPZ and not something they opt into. I think it's okay for a user to have to apply the annotation in the context where they are directly using ASO, but in the case where the CAPZ controller is the one "using" ASO to provision resources, the CAPZ controller should be the one applying these annotations. This might be tricky and might require some changes to clusterctl move but we should really try to avoid manual intervention from the user.

@codecov-commenter

Codecov Report

Patch coverage has no change and project coverage change: +11.07 🎉

Comparison is base (4fc2041) 40.42% compared to head (e20a876) 51.50%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3113       +/-   ##
===========================================
+ Coverage   40.42%   51.50%   +11.07%     
===========================================
  Files         241      182       -59     
  Lines       29560    18054    -11506     
===========================================
- Hits        11951     9298     -2653     
+ Misses      16700     8229     -8471     
+ Partials      909      527      -382     

see 109 files with indirect coverage changes


@nojnhuh
Contributor Author

nojnhuh commented Apr 6, 2023

I just pushed a couple of small changes adding updates on `clusterctl move` and the gap in services that CAPZ uses that ASO doesn't support yet. Overall I think the proposal is complete even though there are a few identified gaps. I think those are mostly implementation details that don't affect how feasible it is overall to use ASO, so I'd advocate for starting lazy consensus on this again soon.

cc @dtzar

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0e3340eac8bb9c27333df20f45e2318541b27837

@nojnhuh
Contributor Author

nojnhuh commented Apr 7, 2023

Officially starting lazy consensus on this, ending EOD 14 April (end of next week).

@matthchr matthchr left a comment

LGTM.

I think this summarizes the pros/cons of using ASO quite well.

I will leave the actual decision of if the pros outweigh the cons to you experts as I don't have great visibility into the costs/benefits for CAPZ as a project when comparing ASO to something like the track2 SDKs.

@jackfrancis
Contributor

/lgtm

I can’t add any more than what many others have said before me in these PR threads.

Great work @nojnhuh!

@CecileRobertMichon
Contributor

/approve
/hold for lazy consensus expiration

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2023
@nawazkh
Member

nawazkh commented Apr 14, 2023

Great work! Kudos @nojnhuh! 🚀
/lgtm

@nojnhuh
Contributor Author

nojnhuh commented Apr 17, 2023

Time for slash hold cancel? 🤠

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2023
@CecileRobertMichon
Contributor

/pony

@k8s-ci-robot
Contributor

@CecileRobertMichon: pony image

In response to this:

/pony

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot merged commit 35f837e into kubernetes-sigs:main Apr 17, 2023
@nojnhuh nojnhuh deleted the aso-proposal branch April 17, 2023 15:37
@nojnhuh nojnhuh mentioned this pull request May 18, 2023
4 tasks