Add metrics and alerts tutorial to the docs
Signed-off-by: assaf-admi <aadmi@redhat.com>
assafad committed Feb 28, 2023
1 parent d828db2 commit c657fac
Showing 1 changed file with 202 additions and 2 deletions.
204 changes: 202 additions & 2 deletions website/content/en/docs/building-operators/golang/advanced-topics.md
@@ -113,10 +113,193 @@ func init() {
* After adding new import paths to your operator project, run `go mod vendor` if a `vendor/` directory is present in the root of your project directory to fulfill these dependencies.
* Your 3rd party resource needs to be added before adding the controller in `"Setup all Controllers"`.
### Metrics
### Monitoring and Observability
This section explains how to expose an operator's first custom metrics and alerts. It focuses on the technical aspects of adding metrics and alerts, and demonstrates them using the sample [memcached-operator]. For more information about operator monitoring, see [observability-best-practices].
To learn about how metrics work in the Operator SDK read the [metrics section][metrics_doc] of the Kubebuilder documentation.
#### Prerequisites
- Deploy [kube-prometheus] on your cluster. This sets up a Prometheus instance on the cluster and enables the Prometheus UI, in which we will inspect the metrics, [alerts] and [recording rules].
- Make sure Prometheus has access to the operator's namespace by setting the corresponding RBAC rules. You may find the [prometheus_role.yaml] and [prometheus_role_binding.yaml] examples helpful.
- Create a new `monitoring` directory to hold the metrics and alerts implementation. A dedicated directory keeps the monitoring code separate from the core operator logic, which makes both easier to maintain.
- Add the following instruction to the `Dockerfile`:
```dockerfile
COPY monitoring/ monitoring/
```
#### Publishing Custom Metrics
To publish custom metrics for your operator, use the global registry from `controller-runtime/pkg/metrics`.
One way to achieve this is to declare your collectors as global variables, register them in a `RegisterMetrics()` function, and call that function from the `init()` function in `main.go`.
Example custom metric: [MemcachedDeploymentSizeUndesiredCountTotal].
```go
package monitoring

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	MemcachedDeploymentSizeUndesiredCountTotal = prometheus.NewCounter(
		prometheus.CounterOpts{
			Name: "memcached_deployment_size_undesired_count_total",
			Help: "Total number of times the deployment size was not as desired.",
		},
	)
)

// RegisterMetrics will register metrics with the global prometheus registry
func RegisterMetrics() {
	metrics.Registry.MustRegister(MemcachedDeploymentSizeUndesiredCountTotal)
}
```
- The above example creates a new counter metric. For other metric types, see the [Prometheus Documentation].
- For more information about metric naming conventions, see [observability-best-practices].
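As a rough illustration of what a naming check could look like, the sketch below validates a counter name against the character pattern from the Prometheus data model and the common `_total` suffix convention for counters. The `checkCounterName` helper and the `memcached_` prefix rule are assumptions made for this example; the linked best-practices page remains the authoritative source for the conventions to follow.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validName matches the metric name pattern from the Prometheus data model.
var validName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// checkCounterName is a hypothetical helper that lists problems with a
// counter's name. The prefix is an assumed project convention.
func checkCounterName(name, prefix string) []string {
	var problems []string
	if !validName.MatchString(name) {
		problems = append(problems, "name contains characters Prometheus does not allow")
	}
	if !strings.HasPrefix(name, prefix) {
		problems = append(problems, "name should start with "+prefix)
	}
	if !strings.HasSuffix(name, "_total") {
		problems = append(problems, "counter name should end with _total")
	}
	return problems
}

func main() {
	// The metric from the example above passes all three checks.
	fmt.Println(checkCounterName("memcached_deployment_size_undesired_count_total", "memcached_"))
	// A name with a hyphen, no prefix and no suffix fails all three.
	fmt.Println(checkCounterName("deployment-size", "memcached_"))
}
```

A check like this could run in a unit test over all registered metric names, so that naming drift is caught before a release.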
[init() function example]:
```go
package main

import (
	...
	"github.com/example/memcached-operator/monitoring"
)

func init() {
	...
	monitoring.RegisterMetrics()
	...
}
```
The next step is to add the controller logic that updates the metric's value. In this case the new metric is a counter, so a valid update operation is to increment its value.
[Metric update example]:
```go
...
size := memcached.Spec.Size
if *found.Spec.Replicas != size {
	// Increment MemcachedDeploymentSizeUndesiredCountTotal metric by 1
	monitoring.MemcachedDeploymentSizeUndesiredCountTotal.Inc()
}
...
```
Different metric types have different valid operations. See the [Prometheus Golang client] documentation for more information.
#### Publishing Custom Alerts and Recording Rules
To add alerts and recording rules that are specific to the operator's needs, we'll create a dedicated PrometheusRule object using the [prometheus-operator API].
[PrometheusRule example]:
```go
package monitoring

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// NewPrometheusRule creates new PrometheusRule(CR) for the operator to have alerts and recording rules
func NewPrometheusRule(namespace string) *monitoringv1.PrometheusRule {
	return &monitoringv1.PrometheusRule{
		TypeMeta: metav1.TypeMeta{
			APIVersion: monitoringv1.SchemeGroupVersion.String(),
			Kind:       "PrometheusRule",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name: "memcached-operator-rules",
			// Use the namespace passed by the caller, e.g. "memcached-operator-system"
			Namespace: namespace,
		},
		Spec: *NewPrometheusRuleSpec(),
	}
}

// NewPrometheusRuleSpec creates PrometheusRuleSpec for alerts and recording rules
func NewPrometheusRuleSpec() *monitoringv1.PrometheusRuleSpec {
	return &monitoringv1.PrometheusRuleSpec{
		Groups: []monitoringv1.RuleGroup{{
			Name: "memcached.rules",
			Rules: []monitoringv1.Rule{
				createOperatorUpTotalRecordingRule(),
				createOperatorDownAlertRule(),
			},
		}},
	}
}

// createOperatorUpTotalRecordingRule creates memcached_operator_up_total recording rule
func createOperatorUpTotalRecordingRule() monitoringv1.Rule {
	return monitoringv1.Rule{
		Record: "memcached_operator_up_total",
		Expr:   intstr.FromString("sum(up{pod=~'memcached-operator-controller-manager-.*'} or vector(0))"),
	}
}

// createOperatorDownAlertRule creates MemcachedOperatorDown alert rule
func createOperatorDownAlertRule() monitoringv1.Rule {
	return monitoringv1.Rule{
		Alert: "MemcachedOperatorDown",
		Expr:  intstr.FromString("memcached_operator_up_total == 0"),
		Annotations: map[string]string{
			"description": "No running memcached-operator pods were detected in the last 5 min.",
		},
		For: "5m",
		Labels: map[string]string{
			"severity": "critical",
		},
	}
}
```
Next, make sure that the new PrometheusRule is created and reconciled. One way to achieve this is to extend the existing `Reconcile()` function.
[PrometheusRule reconciliation example]:
```go
func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	...
	...
	// Check if prometheus rule already exists, if not create a new one
	foundRule := &monitoringv1.PrometheusRule{}
	err := r.Get(ctx, types.NamespacedName{Name: ruleName, Namespace: namespace}, foundRule)
	if err != nil && apierrors.IsNotFound(err) {
		// Define a new prometheus rule
		prometheusRule := monitoring.NewPrometheusRule(namespace)
		if err := r.Create(ctx, prometheusRule); err != nil {
			log.Error(err, "Failed to create prometheus rule")
			// Return the error so the request is requeued
			return ctrl.Result{}, err
		}
	}
	if err == nil {
		// Check if prometheus rule spec was changed, if so set as desired
		desiredRuleSpec := monitoring.NewPrometheusRuleSpec()
		if !reflect.DeepEqual(foundRule.Spec.DeepCopy(), desiredRuleSpec) {
			desiredRuleSpec.DeepCopyInto(&foundRule.Spec)
			if err := r.Update(ctx, foundRule); err != nil {
				log.Error(err, "Failed to update prometheus rule")
				// Return the error so the request is requeued
				return ctrl.Result{}, err
			}
		}
	}
	...
	...
}
```
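The decision made in the reconciliation snippet (create when the rule is missing, update when its spec has drifted, otherwise do nothing) can be isolated as a pure function, which makes it easy to test without a cluster. This is a minimal sketch under stated assumptions: `ruleSpec` is a simplified, hypothetical stand-in for `monitoringv1.PrometheusRuleSpec`, and `decide` and the `action` constants are illustrative names that do not exist in the memcached-operator sample.

```go
package main

import (
	"fmt"
	"reflect"
)

// ruleSpec is a hypothetical, simplified stand-in for
// monitoringv1.PrometheusRuleSpec.
type ruleSpec struct {
	Groups []string
}

// action describes what the reconciler should do with the PrometheusRule.
type action string

const (
	actionCreate action = "create"
	actionUpdate action = "update"
	actionNone   action = "none"
)

// decide returns the action to take. found is nil when the lookup
// reported that the rule does not exist.
func decide(found *ruleSpec, desired ruleSpec) action {
	if found == nil {
		return actionCreate
	}
	if !reflect.DeepEqual(*found, desired) {
		return actionUpdate
	}
	return actionNone
}

func main() {
	desired := ruleSpec{Groups: []string{"memcached.rules"}}
	drifted := ruleSpec{Groups: []string{"edited-by-hand"}}

	fmt.Println(decide(nil, desired))      // rule missing
	fmt.Println(decide(&drifted, desired)) // spec differs from desired
	fmt.Println(decide(&desired, desired)) // spec already as desired
}
```

Keeping the comparison separate from the client calls is a design choice rather than a requirement; the inline version in `Reconcile()` behaves the same way.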
- For more information about best practices for alerts and recording rules, see [observability-best-practices].
- It is highly recommended to implement unit tests for Prometheus rules. For more information, see the Prometheus [unit testing documentation]. For a reference implementation of unit tests in a Golang operator, see the memcached-operator [alerts unit tests].
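Alongside rule unit tests, a small structural check in Go can catch malformed rule definitions early. The sketch below is not a replacement for the Prometheus unit testing workflow linked above. It uses a hypothetical local `rule` struct that mirrors only a few fields of `monitoringv1.Rule`, and the requirement that every alert carries a `severity` label and a `description` annotation is an assumption made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// rule is a hypothetical stand-in mirroring a few fields of monitoringv1.Rule.
type rule struct {
	Record      string
	Alert       string
	Expr        string
	Labels      map[string]string
	Annotations map[string]string
}

// validateRule checks that a rule is either a recording rule or an alert,
// has an expression, and (for alerts) carries the assumed metadata.
func validateRule(r rule) error {
	if (r.Record == "") == (r.Alert == "") {
		return errors.New("rule must set exactly one of Record or Alert")
	}
	if r.Expr == "" {
		return errors.New("rule must have an expression")
	}
	if r.Alert != "" {
		if r.Labels["severity"] == "" {
			return fmt.Errorf("alert %q is missing a severity label", r.Alert)
		}
		if r.Annotations["description"] == "" {
			return fmt.Errorf("alert %q is missing a description annotation", r.Alert)
		}
	}
	return nil
}

func main() {
	down := rule{
		Alert:       "MemcachedOperatorDown",
		Expr:        "memcached_operator_up_total == 0",
		Labels:      map[string]string{"severity": "critical"},
		Annotations: map[string]string{"description": "No running memcached-operator pods were detected in the last 5 min."},
	}
	fmt.Println(validateRule(down))
	fmt.Println(validateRule(rule{Alert: "NoSeverity", Expr: "up == 0"}))
}
```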
Finally, to inspect the exposed metrics and alerts, we need to forward port `9090` of the Prometheus service. This can be done with the following command:
```bash
$ kubectl -n monitoring port-forward svc/prometheus-k8s 9090
```
Now you can access the Prometheus UI at [http://localhost:9090](http://localhost:9090).
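While the port-forward is running, the same data can also be read through the Prometheus HTTP API instead of the UI. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and extracts the first sample value from a response body. The helper names `instantQueryURL` and `firstValue` are hypothetical, and the JSON body in `main` is a hand-written illustration of the response shape, not output captured from a real cluster.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
)

// queryResponse models only the parts of the /api/v1/query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			// Value holds [timestamp, "sample value as string"].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// instantQueryURL builds the instant-query URL for a PromQL expression.
func instantQueryURL(base, expr string) string {
	return base + "/api/v1/query?" + url.Values{"query": {expr}}.Encode()
}

// firstValue returns the value of the first sample in the response body.
func firstValue(body []byte) (string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Status != "success" {
		return "", fmt.Errorf("query failed with status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return "", errors.New("query returned no samples")
	}
	value, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return "", errors.New("unexpected sample value type")
	}
	return value, nil
}

func main() {
	fmt.Println(instantQueryURL("http://localhost:9090", "memcached_operator_up_total"))

	// Illustrative body only; a real one would come from an HTTP GET on the URL above.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"memcached_operator_up_total"},"value":[1677580000,"1"]}]}}`)
	fmt.Println(firstValue(sample))
}
```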
### Handle Cleanup on Deletion
@@ -311,3 +494,20 @@ Authors may decide to distribute their bundles for various architectures: x86_64
[apimachinery_condition]: https://github.com/kubernetes/apimachinery/blob/d4f471b82f0a17cda946aeba446770563f92114d/pkg/apis/meta/v1/types.go#L1368
[helpers-conditions]: https://github.com/kubernetes/apimachinery/blob/master/pkg/api/meta/conditions.go
[multi_arch]:/docs/advanced-topics/multi-arch
[observability-best-practices]:https://sdk.operatorframework.io/docs/best-practices/observability-best-practices/
[alerts]:https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[recording rules]:https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
[prometheus_role.yaml]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/config/rbac/prometheus_role.yaml
[prometheus_role_binding.yaml]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/config/rbac/prometheus_role_binding.yaml
[MemcachedDeploymentSizeUndesiredCountTotal]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/metrics.go
[init() function example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/main.go
[Metric update example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/controllers/memcached_controller.go
[Prometheus Documentation]:https://prometheus.io/docs/concepts/metric_types/
[Prometheus Golang client]:https://pkg.go.dev/github.com/prometheus/client_golang/prometheus
[kube-prometheus]:https://github.com/prometheus-operator/kube-prometheus
[memcached-operator]:https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v4-alpha/monitoring/memcached-operator
[prometheus-operator API]:https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md
[PrometheusRule example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/alerts.go
[PrometheusRule reconciliation example]:https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v4-alpha/monitoring/memcached-operator/controllers/memcached_controller.go
[unit testing documentation]:https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
[alerts unit tests]:https://github.com/operator-framework/operator-sdk/tree/master/testdata/go/v4-alpha/monitoring/memcached-operator/monitoring/prom-rule-ci
