
Add AKS platform support #219

Merged

merged 12 commits into from Apr 27, 2020

Conversation

invidian
Member

@invidian commented Mar 26, 2020

Draft with PoC of AKS support

TODO:

  • full worker pool support
  • installing prometheus-operator component currently fails
  • errors.Wrapf -> fmt.Errorf (see the sketch after this list)
  • CI configuration
  • Add labels and taints support for worker pools
  • Tests
  • Validation for all fields + tests for it
  • Ensure that all e2e tests are passing
  • CI periodic cleanup script
  • plug in the CI
  • Clarify and implement how we should handle the secrets when creating the cluster
  • docs
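
A minimal sketch of the errors.Wrapf -> fmt.Errorf conversion mentioned above; the surrounding package and functions are hypothetical:

    package platform

    import "fmt"

    // createCluster is a hypothetical stand-in for whatever call we wrap.
    func createCluster(name string) error { return nil }

    // apply shows the conversion on a single call site.
    func apply(name string) error {
        if err := createCluster(name); err != nil {
            // Before, with github.com/pkg/errors:
            //   return errors.Wrapf(err, "creating cluster %q", name)
            //
            // After, standard library only; the %w verb keeps the cause
            // unwrappable via errors.Is/errors.As:
            return fmt.Errorf("creating cluster %q: %w", name, err)
        }
        return nil
    }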

Follow-up tasks:

To test, the az binary is needed locally, and you need to run az login before running lokoctl.

Refs #215 #216

@invidian force-pushed the invidian/aks-support branch 2 times, most recently from 3b53c05 to 67c1202 on March 26, 2020 16:25
@invidian
Member Author

> installing prometheus-operator component currently fails

Hm, this seems to be a problem with the prometheus-operator Helm chart on AKS 🤔

@invidian
Member Author

It seems that with the current latest version of Kubernetes on AKS (1.16.3), the prometheus-operator installation fails. If one upgrades to 1.17.3, which is currently in preview, things start working just fine. The workaround is to ignore webhook failures for the time being.

@invidian
Member Author

Added labels and taints support for worker pools to the TODO list as a missing feature.

@invidian
Member Author

As part of the CI process, we need to define what information we require from the user to create a cluster. Here are the options:

  • Full automation: the user logs in using az login and has full access to the subscription, including creating applications and granting roles to them (currently implemented). Terraform then creates the service principal and the resource group for running AKS.
  • Require a service principal ID and secret: we would still manage the resource group used for AKS, but the service principal would be provided.
  • Require a service principal ID, secret and resource group: we don't create any resources other than AKS itself; the user must create everything by hand before running lokoctl.
  • Make all options optional: full automation by default, with an available fallback to lower privileges when credentials are provided.

If one tries to create an AKS cluster using the Azure portal, one needs to create a resource group first, but the service principal is created automatically (and if you cancel the AKS creation, it leaves a garbage application registered 😄; it also has no identifier linking it to the attempt to create the AKS cluster. At least it does not create the access secret too).
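
A hedged sketch of how the optional credentials could be modeled on the Go side, assuming gohcl-style struct tags as used elsewhere in lokoctl; all field names here are illustrative, not the ones this PR settles on:

    package aks

    // config models the AKS platform block; every credential field is
    // optional, enabling the fallback behaviour described in the
    // options above.
    type config struct {
        SubscriptionID string `hcl:"subscription_id,optional"`
        TenantID       string `hcl:"tenant_id,optional"`

        // If set, Terraform uses this existing service principal
        // instead of creating one.
        ClientID     string `hcl:"client_id,optional"`
        ClientSecret string `hcl:"client_secret,optional"`

        // If set, the existing resource group is used instead of
        // creating a dedicated one.
        ResourceGroupName string `hcl:"resource_group_name,optional"`
    }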

@invidian
Member Author

> It seems that with the current latest version of Kubernetes on AKS (1.16.3), the prometheus-operator installation fails. If one upgrades to 1.17.3, which is currently in preview, things start working just fine. The workaround is to ignore webhook failures for the time being.

Hm, the workaround worked just fine yesterday and now I'm not able to get it to work anymore :(

@invidian
Member Author

> Hm, the workaround worked just fine yesterday and now I'm not able to get it to work anymore :(

I managed to eventually get it to work. But it seems that just ignoring webhook failures is not sufficient; the webhooks must be disabled entirely.
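
For reference, a hedged sketch of what disabling them could look like as chart value overrides, embedded as a Go string the way lokoctl templates chart values; the value paths follow the stable/prometheus-operator chart of that era and are assumptions to verify against the vendored chart version:

    package prometheusoperator

    // valuesOverride is a hypothetical values snippet; both toggles are
    // assumptions about the chart, not confirmed by this PR.
    const valuesOverride = `
    prometheusOperator:
      admissionWebhooks:
        enabled: false
      tlsProxy:
        enabled: false
    `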

invidian added a commit that referenced this pull request Mar 27, 2020
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian
Member Author

Added today:

  • support for labels and taints for worker pools
  • initial CI configuration

Still missing:

  • documentation
  • tests
  • CI job

@invidian
Member Author

invidian commented Apr 2, 2020

Progress:

  • Added unit tests.
  • Added validation rules for all required fields (see the sketch after the list below).
  • Figured out how to handle both privileged and less privileged users. Now one can create a cluster with just service principal credentials.

Missing:

  • CI job
  • documentation
  • CI cleanup script
  • including component tests in the CI
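
A minimal sketch of what the per-field validation can look like, assuming the hcl.Diagnostics-based pattern used in the codebase; the trimmed-down config and its fields are illustrative:

    package aks

    import "github.com/hashicorp/hcl/v2"

    // config is trimmed down to two fields for illustration.
    type config struct {
        ClusterName       string
        ResourceGroupName string
    }

    // checkValidConfig collects all validation failures at once instead
    // of stopping at the first one.
    func (c *config) checkValidConfig() hcl.Diagnostics {
        var diagnostics hcl.Diagnostics

        if c.ClusterName == "" {
            diagnostics = append(diagnostics, &hcl.Diagnostic{
                Severity: hcl.DiagError,
                Summary:  "'cluster_name' cannot be empty",
            })
        }

        if c.ResourceGroupName == "" {
            diagnostics = append(diagnostics, &hcl.Diagnostic{
                Severity: hcl.DiagError,
                Summary:  "'resource_group_name' cannot be empty",
            })
        }

        return diagnostics
    }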

@invidian
Member Author

invidian commented Apr 2, 2020

Hm, I just realized there is one more distinction we may need to make. There are actually two types of credentials to handle here: one set is used for running Terraform, which creates all the resources, and another set is used to actually run the AKS cluster 😕
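
A hypothetical sketch of keeping the two credential sets apart in the configuration; the block and field names are illustrative only:

    package aks

    // servicePrincipal holds one set of Azure credentials.
    type servicePrincipal struct {
        ClientID     string `hcl:"client_id"`
        ClientSecret string `hcl:"client_secret"`
    }

    // credentials separates the principal Terraform runs as from the
    // principal handed to AKS, so the cluster itself can manage Azure
    // resources (load balancers, disks, ...) with narrower permissions.
    type credentials struct {
        Terraform servicePrincipal `hcl:"terraform,block"`
        Cluster   servicePrincipal `hcl:"cluster,block"`
    }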

invidian added a commit that referenced this pull request Apr 2, 2020
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian
Member Author

invidian commented Apr 2, 2020

Currently, we fail to scrape the following targets:

    TestPrometheus/ComponentMetrics/prometheus-kube-proxy: components_metrics_test.go:98: Running scrape test iteration #19
    TestPrometheus/ComponentMetrics/prometheus-kube-scheduler: components_metrics_test.go:98: Running scrape test iteration #19
    TestPrometheus/ComponentMetrics/prometheus-coredns: components_metrics_test.go:98: Running scrape test iteration #19

Additionally, alerts for missing control plane pods are currently firing; they need to be disabled.

@invidian
Member Author

invidian commented Apr 3, 2020

I resolved the issue with prometheus-operator. As some Kubernetes components are not available for scraping on AKS, I had to disable alerts and scraping for them. CoreDNS required special care, with a customized selector for its pods.

I also have a test CI job prepared, which currently passes all tests.

I also started looking into the CI cleanup script, and it seems there should be a relatively easy way to do the cleanup: select all resources with the owner tag, then remove them based on the updated-at tag (see the sketch below). I need to dive into the Python SDK for Azure.
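
A self-contained sketch of that cleanup policy, independent of any Azure SDK; the owner and updated-at tag names come from the description above, everything else is illustrative:

    package main

    import (
        "fmt"
        "time"
    )

    type resource struct {
        ID   string
        Tags map[string]string
    }

    // staleResources returns resources owned by the CI whose updated-at
    // tag is missing, unparsable or older than ttl.
    func staleResources(all []resource, owner string, ttl time.Duration, now time.Time) []resource {
        var stale []resource

        for _, r := range all {
            if r.Tags["owner"] != owner {
                continue
            }

            updated, err := time.Parse(time.RFC3339, r.Tags["updated-at"])
            if err != nil || now.Sub(updated) > ttl {
                stale = append(stale, r)
            }
        }

        return stale
    }

    func main() {
        now := time.Now()

        all := []resource{
            {ID: "rg-old", Tags: map[string]string{"owner": "ci", "updated-at": now.Add(-3 * time.Hour).Format(time.RFC3339)}},
            {ID: "rg-new", Tags: map[string]string{"owner": "ci", "updated-at": now.Format(time.RFC3339)}},
        }

        for _, r := range staleResources(all, "ci", time.Hour, now) {
            fmt.Println("would delete:", r.ID)
        }
    }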

Left TODO:

  • Convert test CI job to production pipelines
  • Add AKS documentation
  • CI cleanup script
  • Clarify and implement how we should handle the secrets when creating the cluster

@invidian force-pushed the invidian/aks-support branch 6 times, most recently from ff1bcac to 2f40a17 on April 6, 2020 09:41
invidian added a commit that referenced this pull request Apr 6, 2020
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Apr 21, 2020
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian
Member Author

> Feel free to document differences on a different PR.

Will do, as I have other follow-up issues to create anyway.

> LGTM.

Now just to get CI passing...

invidian added a commit that referenced this pull request Apr 24, 2020
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian force-pushed the invidian/aks-support branch 2 times, most recently from 6af2afb to ba48f32 on April 24, 2020 10:55
So adding new properties does not require adding a new interface method and implementing it for all platforms.

This is required because, to add support for managed Kubernetes platforms, we will need to add more fields here.

Refs #215 #216

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
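
A hedged sketch of the shape this refactoring suggests: a single method returning a struct, so a new property is just a new field. The exact names are assumptions, not necessarily those in the commit:

    package platform

    // Meta groups platform properties behind a single accessor.
    type Meta struct {
        AssetDir      string
        ExpectedNodes int

        // Managed distinguishes managed platforms (e.g. AKS) from
        // self-hosted ones; see the next commit message.
        Managed bool
    }

    // Platform needs only one accessor; adding a property no longer
    // changes the interface.
    type Platform interface {
        Meta() Meta
    }
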
This field will be used to distinguish between managed and self-hosted platforms, as the cluster upgrade process will depend on this information.

Refs #215 #216

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
If we deploy on managed Kubernetes, it won't have the control plane installed as Helm releases, so we shouldn't try to upgrade it. The upgrade will most likely be driven by Terraform.

Refs #215 #216

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This is needed to install this component on AKS with Kubernetes 1.16.3.

See #219 for more details.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Refs #215 #216

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Adding -count=1 to the 'go test' command ensures that all tests are executed again instead of being served from the test cache, which is important if e2e tests are run multiple times on a single system.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So it doesn't exceed the default 10-minute timeout.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So the diff between the two is easier to read.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian
Member Author

I tested various scenarios using a separate Azure account and clarified the docs a bit more. I think it's ready to merge.

@invidian requested a review from iaguis on April 27, 2020 06:42
Contributor

@iaguis left a comment


lgtm

@invidian
Member Author

Created follow-up issues: #360, #361, #362.
