---
title: Establishing Reliability-As-Code via Sonobuoy
image: /img/sonobuoy.svg
excerpt: The Reliability Scanner, a Sonobuoy plugin, allows you to assess reliability risks using an extensible set of policies.
author_name: Kalai Wei
author_url: https://github.com/madamkiwi
categories: [kubernetes, sonobuoy, plugins]
# Tag should match author to drive author pages
tags: ['Sonobuoy Team', 'VMware Tanzu CRE', 'Kalai Wei']
date: 2021-03-18
slug: reliability-scanner-kubernetes-clusters
---

We are happy to introduce the [Reliability Scanner for Kubernetes as a Sonobuoy Plugin](https://github.com/vmware-tanzu/sonobuoy-plugins/tree/master/reliability-scanner)! It is a lightweight Go program that includes an extensible set of reliability assessments, or checks, performed against various components of a cluster, such as Pods, Namespaces, and Services. The Reliability Scanner runs as a container that serves as a Sonobuoy plugin and uses a configuration file to define customized checks. Kubernetes cluster operators can then configure appropriate constraints to run the scanner's checks against their clusters. This project is based upon the efforts and practices of [VMware's Customer Reliability Engineering (CRE) team](https://tanzu.vmware.com/content/blog/hello-world-meet-vmware-cre).

This article provides a walk-through of the components used by the initial set of checks recommended by VMware CRE. You can create your own customized checks or extend the existing ones.

## Probes

Probes are periodic checks against the containers that run our services; they tell the kubelet whether a container is alive and ready to accept traffic. Probes help Kubernetes make more informed decisions about the current status of the Pods behind a Service.

There are three kinds of probes:

- `startupProbe`: confirms the application within the container has started
- `livenessProbe`: confirms the container is in a running state
- `readinessProbe`: confirms the container is ready to respond to requests

The reliability scanner checks allow Kubernetes cluster operators to report, as part of the Sonobuoy report, any Pods that are missing liveness and readiness probes. Currently only the liveness and readiness checks are recommended by the VMware CRE team; however, this check could be extended to include the `startupProbe` if necessary for your clusters. Additionally, a potential improvement in future releases would be a labeling capability that lets operators specify which Pods to skip.
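
For illustration, here is a minimal Pod spec that would satisfy this check; the probe endpoints and timings are placeholders you would tune for your own service:

```
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo            # illustrative name
spec:
  containers:
  - name: web
    image: nginx
    livenessProbe:            # kubelet restarts the container if this fails
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
    readinessProbe:           # Service traffic is withheld until this passes
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
```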

### Owner annotations

Owner annotations provide the ability to replicate patterns of deployment across multiple physical hosts. In a multi-tenant Kubernetes deployment, owner annotations support incident management by making it easy to identify the parties who should be notified. The scanner checks the user-provided annotation keys and allowed domains against the Services running in the cluster, and reports back if the desired values are not set appropriately.
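
As a sketch, a Service that would pass such a check might look like the following; the `owner` annotation key and the `example.com` domain are made-up stand-ins for whatever keys and domains you configure:

```
apiVersion: v1
kind: Service
metadata:
  name: payments
  annotations:
    owner: payments-team@example.com   # hypothetical key and domain
spec:
  selector:
    app: payments
  ports:
  - port: 80
```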

### Minimum desired quality of service

Quality of service (QoS) in Kubernetes is used by the scheduler to make decisions around scheduling and evicting Pods on a node. There are three classes of QoS in Kubernetes: **BestEffort**, **Burstable**, and **Guaranteed**.

The upfront reservation made on the scheduled Node depends on how the application's resource requirements are defined. One reliability practice that allows applications to operate more efficiently is to ensure that the Node has, or reserves, enough resources when a container is scheduled. The reliability scanner will report any Pod that does not meet the minimum desired QoS class defined in the constraint.
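
The class that Kubernetes assigns to a Pod is recorded in its status, so you can inspect it directly. For example, for a Pod named `test`:

```
$ kubectl get pod test -o jsonpath='{.status.qosClass}'
Guaranteed
```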

### Getting Started

With the Reliability Scanner, you can add new reliability checks easily. Please refer to the [README](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/README.md) for more details.

First, make sure that the [Sonobuoy CLI](https://github.com/vmware-tanzu/sonobuoy), Docker, make, ytt, and kubectl are installed on your local machine. There's a helpful [Makefile](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/Makefile) that handles the setup and templating needed to run the customized reliability check YAML.
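
With those tools in place, a first scan is only a few commands away (assuming you clone the canonical GitHub repository and use the Makefile targets described in this post):

```
$ git clone https://github.com/vmware-tanzu/sonobuoy-plugins.git
$ cd sonobuoy-plugins/reliability-scanner
$ make run
```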

### Adding a new QoS reliability check

The reliability scanner is ready to scan your cluster as-is. However, a cluster operator who is interested in a customized reliability check will need to make modifications in two places. First, the new check must be defined by name, description, kind, and spec in a [YAML configuration file](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/plugin/reliability-scanner-custom-values.lib.yml), as shown below:

```
name: "Pod QOS check"
description: Checks each pod to see if the minimum desired QOS is defined
kind: v1alpha1/pod/qos # location of the check logic
spec: # configurable by cluster operators
  minimum_desired_qos_class: Guaranteed # defines minimum expected QoS
  include_detail: true # includes the actual Pod QoS in the report
```

Next, some Go code must be written that actually performs the check. As a starting point, the check has to satisfy the [Querier](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/api/v1alpha1/pod/qos/qos.go) Go interface (the project comes with a default [QoS implementation](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/api/v1alpha1/pod/qos/qos.go) of the Querier interface), and a mapping must be added in the [scanner.go](https://github.com/vmware-tanzu/sonobuoy-plugins/blob/master/reliability-scanner/cmd/reliability-scanner/scanner.go) file.
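
To give a feel for the shape involved, here is a rough sketch; the names and signatures are invented for illustration rather than taken from the plugin's actual API, so see the linked files for the real definitions:

```
// Illustrative sketch only: the real Querier interface and check registration
// live in the qos.go and scanner.go files linked above; the names and
// signatures below are invented for this example.
package main

import "fmt"

// CheckResult stands in for whatever a check reports back to Sonobuoy.
type CheckResult struct {
	Name   string // e.g. "default/test"
	Status string // "passed" or "failed"
}

// QuerierSketch mimics the idea of a check: configured from the YAML spec,
// it emits one result per object it inspects when started.
type QuerierSketch interface {
	Start(results chan<- CheckResult) // hypothetical signature
}

// podQOSCheck is a toy implementation driven by the YAML spec fields.
type podQOSCheck struct {
	MinimumDesiredQOSClass string
	IncludeDetail          bool
}

// Start would, in a real check, list Pods via client-go and compare each
// Pod's QoS class against the minimum; here it emits a placeholder result.
func (c podQOSCheck) Start(results chan<- CheckResult) {
	results <- CheckResult{Name: "default/test", Status: "passed"}
}

func main() {
	ch := make(chan CheckResult, 1)
	var q QuerierSketch = podQOSCheck{MinimumDesiredQOSClass: "Guaranteed", IncludeDetail: true}
	q.Start(ch)
	fmt.Println(<-ch) // prints: {default/test passed}
}
```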

### Testing the new check

The QoS check defined above lets the scanner look across the cluster and report the current state of each Pod, surfacing workloads that do not define our minimum [QoS class](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/).

Let's create a Pod with a Guaranteed QoS class and run our scan; we should see it reflected in the report:

```
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "100m"
EOF
```

Because the Pod specifies only limits, Kubernetes defaults its requests to match them, which is what earns it the Guaranteed class. We can expect the following response from `kubectl`:

```
pod/test created
```

To run the reliability scanner, use the `make run` command (run `make clean` first to clean up any previous run). Once it's complete, you can use `make results` to view the output. The scanner sends its results to the Sonobuoy aggregator, where they are collected with the `sonobuoy results` command.
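
Putting those targets together, a typical cycle looks like this; `sonobuoy status` is the standard CLI way to check on a run in flight:

```
$ make clean       # clear out any previous run
$ make run         # template and launch the scanner plugin via Sonobuoy
$ sonobuoy status  # optional: watch for the plugin to complete
$ make results     # retrieve and display the report
```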

Based on our defined check, we have two configuration options: `minimum_desired_qos_class` and `include_detail`. The first tells the report to fail any Pod that does not meet the minimum QoS class defined here; the second tells the report runner to include the current QoS class of each Pod being assessed.

Let's review an excerpt from our Sonobuoy-generated report to see how our scan went.

```
$ make results
# ...
- name: qos
  status: failed # overall test status
  meta:
    file: ""
    type: ""
  items:
  - name: default/test
    status: passed # subtest status
    details:
      qos_class: Guaranteed
# ...
```

We can see that, although our report shows an overall failed status (other Pods in the cluster do not meet the minimum desired QoS class), the Guaranteed Pod we created earlier (`default/test`) has passed the check.

Using this Reliability Scanner within a cluster is an easy way for cluster operators to identify and report any workloads or configurations that do not meet requirements.

For the probes and owner annotations checks, please refer to the [Reliability Scanner repo](https://github.com/vmware-tanzu/sonobuoy-plugins/tree/master/reliability-scanner).

We hope that in time we are able to build sets of checks for multiple concerns, and we would love any feedback about the Reliability Scanner. Additionally, the VMware CRE team would love to hear from the broader community about good practices for operating workloads on Kubernetes.

## Community Shoutout

This plugin contribution is thanks to site reliability engineers Peter Grant (Twitter @peterjamesgrant) and Kalai Wei of the [VMware CRE](https://tanzu.vmware.com/content/blog/hello-world-meet-vmware-cre) team, who work together with customers and partner teams to learn and apply reliability engineering practices using the Tanzu portfolio.

## Collaborate with the Sonobuoy Community!

- Get updates on Twitter at [@projectsonobuoy](https://twitter.com/projectsonobuoy)
- Chat with us on Slack at [#sonobuoy](https://kubernetes.slack.com/messages/sonobuoy) on the Kubernetes Slack
- Collaborate with us on GitHub: [github.com/vmware-tanzu/sonobuoy](https://github.com/vmware-tanzu/sonobuoy)