feat (cluster): [day2-ops] node update configuration #403

Merged
merged 14 commits into from
Mar 21, 2024
30 changes: 0 additions & 30 deletions 05-bootstrap-prep.md
@@ -57,36 +57,6 @@ In addition to Azure Container Registry being deployed to support bootstrapping,
# Get your ACR instance name
export ACR_NAME_AKS_BASELINE=$(az deployment group show -g rg-bu0001a0008 -n acr-stamp --query properties.outputs.containerRegistryName.value -o tsv)
echo ACR_NAME_AKS_BASELINE: $ACR_NAME_AKS_BASELINE

# Import core image(s) hosted in public container registries to be used during bootstrapping
az acr import --source ghcr.io/kubereboot/kured:1.15.0 -n $ACR_NAME_AKS_BASELINE
```

> In this walkthrough, there is only one image that is included in the bootstrapping process. It's included as a reference for this process. Your choice to use Kubernetes Reboot Daemon (Kured) or any other images, including Helm charts, as part of your bootstrapping is yours to make.

1. Update bootstrapping manifests to pull from your Azure Container Registry. *Optional. Fork required.*

> Your cluster will immediately begin processing the manifests in [`cluster-manifests/`](./cluster-manifests/) due to the bootstrapping configuration that will be applied to it. So, before you deploy the cluster, now is the right time to push the following changes to your fork so that it uses your files instead of the files in the original mspnp repo, which point to public container registries:
>
> - update the one `image:` value in [`kured.yaml`](./cluster-manifests/cluster-baseline-settings/kured.yaml) to use your container registry instead of a public container registry. See the comment in the file for instructions (or you can simply run the following command).

:warning: Without updating these files and using your own fork, you will be deploying your cluster such that it takes dependencies on public container registries. This is generally okay for exploratory/testing, but not suitable for production. Before going to production, ensure *all* image references you bring to your cluster are from *your* container registry (link imported in the prior step) or another that you feel confident relying on.

```bash
sed -i "s:ghcr.io:${ACR_NAME_AKS_BASELINE}.azurecr.io:" ./cluster-manifests/cluster-baseline-settings/kured.yaml
```

Note that if you are on macOS, you might need to use the following command instead:

```bash
sed -i '' 's:ghcr.io:'"${ACR_NAME_AKS_BASELINE}"'.azurecr.io:g' ./cluster-manifests/cluster-baseline-settings/kured.yaml
```

Now commit the changes to your repository.

```bash
git commit -a -m "Update image source to use my ACR instance instead of a public container registry."
git push
```

### Save your work in-progress
15 changes: 13 additions & 2 deletions 07-bootstrap-validation.md
@@ -26,6 +26,19 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their
echo AKS_CLUSTER_NAME: $AKS_CLUSTER_NAME
```

1. Validate that there are no available image upgrades. Because this AKS cluster was recently deployed, only a race condition between the publication of a new image and the image fetch during deployment could result in a different state.

```bash
az aks nodepool get-upgrades -n npuser01 --cluster-name $AKS_CLUSTER_NAME -g rg-bu0001a0008 && \
az aks nodepool show -n npuser01 --cluster-name $AKS_CLUSTER_NAME -g rg-bu0001a0008 --query nodeImageVersion
```

> Typically, base node images don't contain a suffix with a date (e.g., `AKSUbuntu-2204gen2containerd`). If the `nodeImageVersion` value looks like `AKSUbuntu-2204gen2containerd-202402.26.0`, a SecurityPatch or NodeImage upgrade has been applied to the AKS node.
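
As a rough illustration of the note above, the date suffix can be checked with a small Bash helper. The function name `node_image_state` and the suffix pattern (six digits followed by dot-separated numbers) are assumptions inferred from the single example value shown here, not an official AKS version format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a nodeImageVersion string by whether it ends in
# a date-like suffix such as "-202402.26.0". The pattern is an assumption based
# on the example value in this walkthrough.
node_image_state() {
  local version="$1"
  local suffix_re='-[0-9]{6}\.[0-9]+\.[0-9]+$'
  if [[ "$version" =~ $suffix_re ]]; then
    echo "upgraded"
  else
    echo "base"
  fi
}

# Example usage (requires the Azure CLI and a deployed cluster):
#   node_image_state "$(az aks nodepool show -n npuser01 \
#     --cluster-name "$AKS_CLUSTER_NAME" -g rg-bu0001a0008 \
#     --query nodeImageVersion -o tsv)"
```

The helper only inspects the string, so it can be tried against sample values without access to a cluster.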

> The AKS nodes are configured to automatically receive weekly updates, which include security patches, kernel updates, and node image updates. The AKS cluster version won't be updated automatically, since production clusters should be updated manually after testing in lower environments.

> Node image updates are shipped on a weekly cadence by default. This AKS cluster is configured to have its maintenance window for node image updates every Tuesday at 9PM. If a node image is released outside of this maintenance window, the nodes will be updated on the next scheduled occurrence. For AKS nodes that require more frequent updates, consider changing the auto-upgrade channel to `SecurityPatch` and configuring a daily maintenance window.
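
To see the configuration these notes describe, the following sketch wraps two Azure CLI queries in Bash functions. The function names are hypothetical. The maintenance configuration name `aksManagedNodeOSUpgradeSchedule` comes from the `cluster-stamp.bicep` change in this PR. The sketch assumes an Azure CLI version that includes the `az aks maintenanceconfiguration` command group.

```bash
#!/usr/bin/env bash
# Hypothetical wrappers; both take the resource group and the cluster name.

# Show the node OS upgrade maintenance window (weekly, Tuesday, 21:00 in this PR).
show_node_os_upgrade_schedule() {
  az aks maintenanceconfiguration show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name aksManagedNodeOSUpgradeSchedule
}

# Show the auto-upgrade profile (cluster upgrade channel and node OS upgrade channel).
show_auto_upgrade_profile() {
  az aks show \
    --resource-group "$1" \
    --name "$2" \
    --query autoUpgradeProfile \
    --output json
}

# Example usage (requires the Azure CLI and a deployed cluster):
#   show_node_os_upgrade_schedule rg-bu0001a0008 "$AKS_CLUSTER_NAME"
#   show_auto_upgrade_profile rg-bu0001a0008 "$AKS_CLUSTER_NAME"
```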

1. Get AKS `kubectl` credentials.

> In the [Microsoft Entra ID Integration](03-microsoft-entra-id.md) step, we placed our cluster under Microsoft Entra group-backed RBAC. This is the first time we are seeing this used. `az aks get-credentials` sets your `kubectl` context so that you can issue commands against your cluster. Even when you have enabled Microsoft Entra ID integration with your AKS cluster, an Azure user who has sufficient permissions on the cluster resource can still access your AKS cluster by using the `--admin` switch with this command. Using this switch *bypasses* Microsoft Entra ID and uses client certificate authentication instead; that isn't what we want to happen. So, to prevent that practice, local account access (such as `clusterAdmin` or `clusterMonitoringUser`) is expressly disabled.
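
One way to confirm that local accounts are disabled is to read the `disableLocalAccounts` property of the cluster resource. This is a minimal sketch, and the function name `local_accounts_disabled` is hypothetical. It requests JSON output so that the boolean is printed as `true` or `false`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the cluster reports that local
# accounts are disabled. Takes the resource group and the cluster name.
local_accounts_disabled() {
  local value
  value="$(az aks show \
    --resource-group "$1" \
    --name "$2" \
    --query disableLocalAccounts \
    --output json)"
  [ "$value" = "true" ]
}

# Example usage (requires the Azure CLI and a deployed cluster):
#   if local_accounts_disabled rg-bu0001a0008 "$AKS_CLUSTER_NAME"; then
#     echo "Local accounts are disabled; --admin credentials are unavailable."
#   fi
```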
@@ -52,11 +65,9 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their
The bootstrapping process that already happened due to the usage of the Flux extension for AKS has set up the following, among other things:

- the workload's namespace named `a0008`
- installed kured

```bash
kubectl get namespaces
kubectl get all -n cluster-baseline-settings
```

These commands will show you results that were due to the automatic bootstrapping process your cluster experienced due to the Flux GitOps extension. This content mirrors the content found in [`cluster-manifests`](./cluster-manifests), and commits made there will reflect in your cluster within minutes of making the change.
5 changes: 0 additions & 5 deletions cluster-manifests/README.md
@@ -8,14 +8,9 @@ This is the root of the GitOps configuration directory. These Kubernetes object

- Default Namespaces
- Kubernetes RBAC Role Assignments (cluster and namespace) to Microsoft Entra groups. *Optional*
- [Kured](#kured)
- Ingress Network Policy
- Azure Monitor Prometheus Scraping

### Kured

Kured is included as a solution to handle occasional required reboots from daily OS patching. This open-source software component is only needed if you require a managed rebooting solution between weekly [node image upgrades](https://learn.microsoft.com/azure/aks/node-image-upgrade). Building a process around deploying node image upgrades [every week](https://github.com/Azure/AKS/releases) satisfies most organizational weekly patching cadence requirements. Combined with the fact that most security patches on Linux rarely require a reboot, this leaves your cluster in a well-supported state. If weekly node image upgrades satisfy your business requirements, then remove Kured from this solution by deleting [`kured.yaml`](./cluster-baseline-settings/kured.yaml). If, however, weekly patching using node image upgrades is not sufficient and you need to respond to daily security updates that mandate a reboot ASAP, then a solution like Kured will help you achieve that objective. **Kured is not supported by Microsoft Support.**

## Private bootstrapping repository

Typically, your bootstrapping repository wouldn't be a public-facing repository like this one, but instead a private GitHub or Azure DevOps repo. The Flux operator deployed with the cluster supports private Git repositories as your bootstrapping source. In addition to requiring network line of sight to the repository from your cluster's nodes, you'll also need to ensure that you've provided the necessary credentials. This can come, typically, in the form of certificate-based SSH or personal access tokens (PAT), both ideally scoped as read-only to the repo with no additional permissions.
183 changes: 0 additions & 183 deletions cluster-manifests/cluster-baseline-settings/kured.yaml

This file was deleted.

54 changes: 22 additions & 32 deletions cluster-stamp.bicep
@@ -263,23 +263,6 @@ resource qPrometheusAll 'Microsoft.OperationalInsights/queryPacks/queries@2019-0
}
}

// Example query that shows the usage of a specific Prometheus metric emitted by Kured
resource qNodeReboots 'Microsoft.OperationalInsights/queryPacks/queries@2019-09-01' = {
parent: qpBaselineQueryPack
name: guid(resourceGroup().id, 'KuredNodeReboot', clusterName)
properties: {
displayName: 'Kubenertes node reboot requested'
description: 'Which Kubernetes nodes are flagged for reboot (based on Prometheus metrics).'
body: 'InsightsMetrics | where Namespace == "prometheus" and Name == "kured_reboot_required" | where Val > 0'
related: {
categories: [
'container'
'management'
]
}
}
}

resource sci 'Microsoft.OperationsManagement/solutions@2015-11-01-preview' = {
name: 'ContainerInsights(${la.name})'
location: location
@@ -961,15 +944,6 @@ resource paAKSLinuxRestrictive 'Microsoft.Authorization/policyAssignments@2021-0
'azure-arc'
'flux-system'

// Known violations
// K8sAzureAllowedSeccomp
// - Kured, no profile defined
// K8sAzureContainerNoPrivilege
// - Kured, requires privileged to perform reboot
// K8sAzureBlockHostNamespaceV2
// - Kured, shared host namespace
// K8sAzureAllowedUsersGroups
// - Kured, no runAsNonRoot, no runAsGroup, no supplementalGroups, no fsGroup
'cluster-baseline-settings'

// Known violations
@@ -1054,7 +1028,6 @@ resource paRoRootFilesystem 'Microsoft.Authorization/policyAssignments@2021-06-0
}
excludedContainers: {
value: [
'kured' // Kured
'aspnet-webapp-sample' // ASP.NET Core does not support read-only root
]
}
@@ -1078,10 +1051,10 @@ resource paEnforceResourceLimits 'Microsoft.Authorization/policyAssignments@2021
policyDefinitionId: pdEnforceResourceLimitsId
parameters: {
cpuLimit: {
value: '500m' // Kured = 500m, traefik-ingress-controller = 200m, aspnet-webapp-sample = 100m
value: '500m' // traefik-ingress-controller = 200m, aspnet-webapp-sample = 100m
}
memoryLimit: {
value: '256Mi' // aspnet-webapp-sample = 256Mi, traefik-ingress-controller = 128Mi, Kured = 48Mi
value: '256Mi' // aspnet-webapp-sample = 256Mi, traefik-ingress-controller = 128Mi
}
excludedNamespaces: {
value: [
@@ -1111,7 +1084,7 @@ resource paEnforceImageSource 'Microsoft.Authorization/policyAssignments@2021-06
parameters: {
allowedContainerImagesRegex: {
// If all images are pulled into your ACR instance as described in these instructions you can remove the docker.io & ghcr.io entries.
value: '${acr.name}\\.azurecr\\.io/.+$|mcr\\.microsoft\\.com/.+$|ghcr\\.io/kubereboot/kured.+$|docker\\.io/library/.+$'
value: '${acr.name}\\.azurecr\\.io/.+$|mcr\\.microsoft\\.com/.+$|docker\\.io/library/.+$'
}
excludedNamespaces: {
value: [
@@ -1640,7 +1613,7 @@ resource pdzAksIngress 'Microsoft.Network/privateDnsZones@2020-06-01' = {
}
}

resource mc 'Microsoft.ContainerService/managedClusters@2023-02-02-preview' = {
resource mc 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' = {
name: clusterName
location: location
tags: {
@@ -1800,7 +1773,8 @@ resource mc 'Microsoft.ContainerService/managedClusters@2023-02-02-preview' = {
enabled: false // Using Microsoft Entra Workload IDs for pod identities.
}
autoUpgradeProfile: {
upgradeChannel: 'stable'
nodeOSUpgradeChannel: 'NodeImage'
upgradeChannel: 'none'
}
azureMonitorProfile: {
metrics: {
@@ -1907,6 +1881,22 @@ resource mc 'Microsoft.ContainerService/managedClusters@2023-02-02-preview' = {
kvPodMiIngressControllerKeyVaultReader_roleAssignment
kvPodMiIngressControllerSecretsUserRole_roleAssignment
]

resource os_maintenanceConfigurations 'maintenanceConfigurations' = {
name: 'aksManagedNodeOSUpgradeSchedule'
properties: {
maintenanceWindow: {
durationHours: 12
schedule: {
weekly: {
dayOfWeek: 'Tuesday'
intervalWeeks: 1
}
}
startTime: '21:00'
}
}
}
}

resource acrKubeletAcrPullRole_roleAssignment 'Microsoft.Authorization/roleAssignments@2020-10-01-preview' = {
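
The updated `allowedContainerImagesRegex` value in `paEnforceImageSource` no longer lists `ghcr.io/kubereboot/kured`. The sketch below tries image references against the same pattern locally in Bash. The function name `image_allowed` and the registry name `myacr` are hypothetical, and Bash's unanchored extended-regex matching is assumed to approximate how the policy evaluates the pattern, which may differ in detail.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when an image reference matches the allowed
# image regex from cluster-stamp.bicep after this change. Takes the ACR name
# (standing in for ${acr.name}) and the image reference.
image_allowed() {
  local acr_name="$1"
  local image="$2"
  # Same alternatives as the Bicep value, with Bicep's doubled backslashes unescaped.
  local allowed_re="${acr_name}"'\.azurecr\.io/.+$|mcr\.microsoft\.com/.+$|docker\.io/library/.+$'
  [[ "$image" =~ $allowed_re ]]
}

# Example usage:
#   image_allowed myacr 'mcr.microsoft.com/dotnet/aspnet:8.0' && echo allowed
#   image_allowed myacr 'ghcr.io/kubereboot/kured:1.15.0' || echo denied
```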