From d1a6a39244d8f1de6a19c42ca997b52c652626fb Mon Sep 17 00:00:00 2001 From: Robert Fratto Date: Tue, 15 Jun 2021 11:23:55 -0400 Subject: [PATCH] docs for the Grafana Agent Operator (#651) * docs for the Grafana Agent Operator * fix identation of nested lists * Update docs/operator/README.md Co-authored-by: Mario * more detail in README * describe why CRDs * mirror docs/operator/README.md intro to cmd/agent-operator/README.md Co-authored-by: Mario --- cmd/agent-operator/README.md | 32 +- docs/getting-started.md | 4 + docs/operator/README.md | 30 ++ docs/operator/architecture.md | 98 ++++++ docs/operator/faq.md | 9 + docs/operator/getting-started.md | 282 ++++++++++++++++++ .../operator/maintainers-guide.md | 144 ++++----- 7 files changed, 519 insertions(+), 80 deletions(-) create mode 100644 docs/operator/README.md create mode 100644 docs/operator/architecture.md create mode 100644 docs/operator/faq.md create mode 100644 docs/operator/getting-started.md rename cmd/agent-operator/DEVELOPMENT.md => docs/operator/maintainers-guide.md (93%) diff --git a/cmd/agent-operator/README.md b/cmd/agent-operator/README.md index d12ba4c12d26..25237909ef63 100644 --- a/cmd/agent-operator/README.md +++ b/cmd/agent-operator/README.md @@ -1,11 +1,22 @@ # Grafana Agent Operator The Grafana Agent Operator is a Kubernetes operator that makes it easier to -deploy Grafana Agent and easier to discover targets for metric collection. - -It is based on the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) -and aims to be compatible the official ServiceMonitor, PodMonitor, and Probe -CRDs that Prometheus Operator useres are used to. +deploy the Grafana Agent and easier to collect telemetry data from your pods. 
+ +It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) +that specify how you would like to collect telemetry data from your Kubernetes +cluster and where you would like to send it. They abstract Kubernetes-specific +configuration that is more tedious to perform manually. The Grafana Agent +Operator manages corresponding Grafana Agent deployments in your cluster by +watching for changes against the custom resources. + +Metric collection is based on the [Prometheus +Operator](https://github.com/prometheus-operator/prometheus-operator) and +supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the +project. These custom resources represent abstractions for monitoring services, +pods, and ingresses. They are especially useful for Helm users, where manually +writing a generic SD to match all your charts can be difficult (or impossible!) +or where manually writing a specific SD for each chart can be tedious. ## Roadmap @@ -14,12 +25,13 @@ CRDs that Prometheus Operator useres are used to. - [ ] Traces support - [ ] Integrations support -## Installing +## Documentation -TODO. Stay tuned! +Refer to the project's [documentation](../../docs/operator) for how to install +and get started with the Grafana Agent Operator. ## Developer Reference -The [Developer's Guide](./DEVELOPMENT.md) includes basic information to help you -understand how the code works. This can be very useful if you are planning on -working on the Operator. +The [Maintainer's Guide](../../docs/operator/maintainers-guide.md) includes +basic information to help you understand how the code works. This can be very +useful if you are planning on working on the operator. 
diff --git a/docs/getting-started.md b/docs/getting-started.md index 090ecc303983..7235dc20d033 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,5 +1,9 @@ # Getting Started +This guide helps users get started with the Grafana Agent. For getting started +with the Grafana Agent Operator, please refer to the Operator-specific +[documentation](./operator). + ## Docker-Compose Example The quickest way to try out the Agent with a full Cortex, Grafana, and Agent diff --git a/docs/operator/README.md b/docs/operator/README.md new file mode 100644 index 000000000000..9b8b2619aa00 --- /dev/null +++ b/docs/operator/README.md @@ -0,0 +1,30 @@ +# Grafana Agent Operator + +The Grafana Agent Operator is a Kubernetes operator that makes it easier to +deploy the Grafana Agent and easier to collect telemetry data from your pods. + +It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) +that specify how you would like to collect telemetry data from your Kubernetes +cluster and where you would like to send it. They abstract Kubernetes-specific +configuration that is more tedious to perform manually. The Grafana Agent +Operator manages corresponding Grafana Agent deployments in your cluster by +watching for changes against the custom resources. + +Metric collection is based on the [Prometheus +Operator](https://github.com/prometheus-operator/prometheus-operator) and +supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the +project. These custom resources represent abstractions for monitoring services, +pods, and ingresses. They are especially useful for Helm users, where manually +writing a generic SD to match all your charts can be difficult (or impossible!) +or where manually writing a specific SD for each chart can be tedious. + +## Table of Contents + +1. [Getting Started](./getting-started.md) + 1. 
[Deploying CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions) + 2. [Installing on Kubernetes](./getting-started.md#installing-on-kubernetes) + 3. [Running locally](./getting-started.md#running-locally) + 4. [Deploying GrafanaAgent](./getting-started.md#deploying-grafanaagent) +2. [FAQ](./faq.md) +3. [Architecture](./architecture.md) +4. [Maintainer's Guide](./maintainers-guide.md) diff --git a/docs/operator/architecture.md b/docs/operator/architecture.md new file mode 100644 index 000000000000..61ed750484b7 --- /dev/null +++ b/docs/operator/architecture.md @@ -0,0 +1,98 @@ +# Architecture + +This guide gives a high-level overview of how the Grafana Agent Operator +works. Refer to the [maintainer's guide](./maintainers-guide.md) for +detailed lower-level information targeted at maintainers. + +The Grafana Agent Operator works in two phases: + +1. Discover a hierarchy of custom resources +2. Reconcile that hierarchy into a Grafana Agent deployment + +## Custom Resource Hierarchy + +The root of the custom resource hierarchy is the `GrafanaAgent` resource. It is +the primary resource the Operator looks for, and is called the "root" because it +discovers many other sub-resources. + +The full hierarchy of custom resources is as follows: + +1. `GrafanaAgent` + 1. `PrometheusInstance` + 1. `PodMonitor` + 2. `Probe` + 3. `ServiceMonitor` + +Most of the resources above can reference a ConfigMap or a +Secret. All referenced ConfigMaps and Secrets are added into the resource +hierarchy. + +When a hierarchy is established, each item in it is watched for changes. Any changed +item causes a reconcile of the root GrafanaAgent resource, either +creating, modifying, or deleting the corresponding Grafana Agent deployment. + +A single resource can belong to multiple hierarchies. For example, if two +GrafanaAgents use the same Probe, modifying that Probe will cause both +GrafanaAgents to be reconciled.
+ +## Reconcile + +When a resource hierarchy is created, updated, or deleted, a reconcile occurs. +When a GrafanaAgent resource is deleted, the corresponding Grafana Agent +deployment is also deleted. + +Reconciling creates a few cluster resources: + +1. A Secret is generated holding the + [configuration](../configuration-reference.md) of the Grafana Agent. +2. Another Secret is created holding all referenced Secrets or ConfigMaps from + the resource hierarchy. This ensures that Secrets referenced from a custom + resource in another namespace can still be read. +3. A Service is created to govern the created StatefulSets. +4. One StatefulSet per Prometheus shard is created. + +PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs +which all use Kubernetes SD. + +## Sharding and Replication + +The GrafanaAgent resource can specify a number of shards. Each shard results in +the creation of a StatefulSet with a hashmod + keep relabel_config per job: + +```yaml +- source_labels: [__address__] + target_label: __tmp_hash + modulus: NUM_SHARDS + action: hashmod +- source_labels: [__tmp_hash] + regex: CURRENT_STATEFULSET_SHARD + action: keep +``` + +This allows for horizontal scaling, where each shard +handles roughly 1/N of the total scrape load. Note that this does not use +consistent hashing, which means changing the number of shards can cause +anywhere from 1/N of the targets up to all of them to reshuffle. + +The sharding mechanism is borrowed from the Prometheus Operator. + +The number of replicas can be defined similarly to the number of shards. Each +replica duplicates every shard. This must be paired with a remote_write system that +can perform HA deduplication. Grafana Cloud and Cortex provide this out of the +box, and the Grafana Agent Operator defaults support these two systems. + +The total number of created metrics pods is the product of `numShards * +numReplicas`.
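The hashmod + keep pair can be understood as a pure function from a target's `__address__` to a shard number. The Go sketch below illustrates the idea; note that the digest folding here is an assumption made for illustration — Prometheus' internal hash folding may combine the bytes differently, so treat this as an approximation rather than a byte-for-byte reimplementation:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardFor mirrors the spirit of the hashmod relabel action: hash the
// target's __address__ and take the result modulo the number of shards.
// Each StatefulSet shard then keeps only the targets whose result matches
// its own shard number.
func shardFor(address string, numShards uint64) uint64 {
	sum := md5.Sum([]byte(address))
	// Fold the 16-byte digest into a uint64. This folding is illustrative;
	// Prometheus' actual implementation may differ.
	h := binary.BigEndian.Uint64(sum[:8]) ^ binary.BigEndian.Uint64(sum[8:])
	return h % numShards
}

func main() {
	const numShards = 3
	for _, addr := range []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9090"} {
		fmt.Printf("%s -> shard %d of %d\n", addr, shardFor(addr, numShards), numShards)
	}
}
```

Because the assignment is plain modulo rather than consistent hashing, evaluating the same address with a different shard count generally yields a different shard, which is why resharding reshuffles targets.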
+ +## Labels + +Two labels are added by default to every metric: + +- `cluster`, representing the `GrafanaAgent` deployment. Holds the value of + `<namespace>/<name>`. +- `__replica__`, representing the replica number of the Agent. This label works + out of the box with Grafana Cloud and Cortex's [HA + deduplication](https://cortexmetrics.io/docs/guides/ha-pair-handling/). + +The shard number is not added as a label, as sharding is designed to be +transparent on the receiver end. diff --git a/docs/operator/faq.md b/docs/operator/faq.md new file mode 100644 index 000000000000..b89b478dd3dc --- /dev/null +++ b/docs/operator/faq.md @@ -0,0 +1,9 @@ +# FAQ + +## Where do I find information on the supported values for the CustomResourceDefinitions? + +Once you've [deployed the CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions) +to your Kubernetes cluster, use `kubectl explain <resource>` to get access to +the documentation for each resource. For example, `kubectl explain GrafanaAgent` +will describe the GrafanaAgent CRD, and `kubectl explain GrafanaAgent.spec` will +give you information on its spec field. diff --git a/docs/operator/getting-started.md b/docs/operator/getting-started.md new file mode 100644 index 000000000000..56b156ec4317 --- /dev/null +++ b/docs/operator/getting-started.md @@ -0,0 +1,282 @@ +# Getting Started + +An official Helm chart is planned to make it easy to deploy the Grafana Agent +Operator on Kubernetes. For now, things must be done a little manually. + +## Deploying CustomResourceDefinitions + +Before you can write custom resources to describe a Grafana Agent deployment, +you _must_ deploy the +[CustomResourceDefinitions](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) +to the cluster first. These definitions describe the schema that the custom +resources will conform to.
This is also required for the operator to run; it +will fail if it can't find the custom resource definitions of objects it is +looking to use. + +The current set of CustomResourceDefinitions can be found in +[production/operator/crds](../../production/operator/crds). Apply them from the +root of this repository using: + +``` +kubectl apply -f production/operator/crds +``` + +This step *must* be done before installing the Operator, as the Operator will +fail to start if the CRDs do not exist. + +## Installing on Kubernetes + +Use the following Deployment to run the Operator, changing values as desired: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: grafana-agent-operator + namespace: default + labels: + app: grafana-agent-operator +spec: + replicas: 1 + selector: + matchLabels: + app: grafana-agent-operator + template: + metadata: + labels: + app: grafana-agent-operator + spec: + serviceAccountName: grafana-agent-operator + containers: + - name: operator + image: grafana/agent-operator:v0.15.0 +--- + +apiVersion: v1 +kind: ServiceAccount +metadata: + name: grafana-agent-operator + namespace: default + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: grafana-agent-operator +rules: +- apiGroups: [monitoring.grafana.com] + resources: + - grafana-agents + - prometheus-instances + verbs: [get, list, watch] +- apiGroups: [monitoring.coreos.com] + resources: + - podmonitors + - probes + - servicemonitors + verbs: [get, list, watch] +- apiGroups: [""] + resources: + - namespaces + verbs: [get, list, watch] +- apiGroups: [""] + resources: + - secrets + - services + verbs: [get, list, watch, create, update, patch, delete] +- apiGroups: ["apps"] + resources: + - statefulsets + verbs: [get, list, watch, create, update, patch, delete] + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: grafana-agent-operator +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole 
+ name: grafana-agent-operator +subjects: +- kind: ServiceAccount + name: grafana-agent-operator + namespace: default +``` + +## Running locally + +Before running locally, **make sure your kubectl context is correct!** +Running locally uses your current kubectl context, and you probably don't want +to accidentally deploy a new Grafana Agent to prod. + +CRDs should be installed on the cluster prior to running locally. If you haven't +done this yet, follow [deploying CustomResourceDefinitions](#deploying-customresourcedefinitions) +first. + +Afterwards, you can run the operator using `go run`: + +``` +go run ./cmd/agent-operator +``` + +## Deploying GrafanaAgent + +Now that the Operator is running, you can create a deployment of the +Grafana Agent. The first step is to create a GrafanaAgent resource. This +resource will discover a set of PrometheusInstance resources. You can use +this example, which creates a GrafanaAgent and the appropriate ServiceAccount +for you: + +```yaml +apiVersion: monitoring.grafana.com/v1alpha1 +kind: GrafanaAgent +metadata: + name: grafana-agent + namespace: default + labels: + app: grafana-agent +spec: + image: grafana/agent:v0.15.0 + logLevel: info + serviceAccountName: grafana-agent + prometheus: + instanceSelector: + matchLabels: + agent: grafana-agent + +--- + +apiVersion: v1 +kind: ServiceAccount +metadata: + name: grafana-agent + namespace: default + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: grafana-agent +rules: +- apiGroups: + - "" + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: + - get + - list + - watch +- nonResourceURLs: + - /metrics + verbs: + - get + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: grafana-agent +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: grafana-agent +subjects: +- kind: ServiceAccount + name: grafana-agent + namespace: default +``` + +Note that this 
searches for PrometheusInstances in the same namespace with the +label matching `agent: grafana-agent`. A PrometheusInstance is a custom resource +that describes where to write collected metrics. Use this one as an example: + +```yaml +apiVersion: monitoring.grafana.com/v1alpha1 +kind: PrometheusInstance +metadata: + name: primary + namespace: default + labels: + agent: grafana-agent +spec: + remoteWrite: + - url: https://prometheus-us-central1.grafana.net/api/prom/push + basicAuth: + username: + name: primary-credentials + key: username + password: + name: primary-credentials + key: password + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + serviceMonitorNamespaceSelector: {} + serviceMonitorSelector: + matchLabels: + instance: primary + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + podMonitorNamespaceSelector: {} + podMonitorSelector: + matchLabels: + instance: primary + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + probeNamespaceSelector: {} + probeSelector: + matchLabels: + instance: primary +``` + +Replace the remoteWrite URL to match your vendor. If your vendor doesn't need +credentials, you may remove the `basicAuth` section. 
Otherwise, configure a +secret with the base64-encoded values of the username and password: + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: primary-credentials + namespace: default +data: + username: BASE64_ENCODED_USERNAME + password: BASE64_ENCODED_PASSWORD +``` + +The above configuration of PrometheusInstance will discover all +[PodMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitor), +[Probes](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#probe), +and [ServiceMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) +with a label matching `instance: primary`. Create resources as appropriate for +your environment. + +As an example, here is a ServiceMonitor that can collect metrics from +`kube-dns`: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: kube-dns + namespace: kube-system + labels: + instance: primary +spec: + selector: + matchLabels: + k8s-app: kube-dns + endpoints: + - port: metrics +``` diff --git a/cmd/agent-operator/DEVELOPMENT.md b/docs/operator/maintainers-guide.md similarity index 93% rename from cmd/agent-operator/DEVELOPMENT.md rename to docs/operator/maintainers-guide.md index f0df0d023a30..835ff1d1ae48 100644 --- a/cmd/agent-operator/DEVELOPMENT.md +++ b/docs/operator/maintainers-guide.md @@ -1,10 +1,13 @@ -# Developing the Agent Operator +# Maintainer's Guide + +This document contains maintainer-specific information. Table of Contents: 1. [Introduction](#introduction) -2. [Architecture](#architecture) -3. [Local testing environment](#local-testing-environment) +2. [Updating CRDs](#updating-crds) +3. [Testing Locally](#testing-locally) +4. 
[Development Architecture](#development-architecture) ## Introduction @@ -25,7 +28,74 @@ The public [Grafana Agent Operator design doc](https://docs.google.com/document/d/1nlwhJLspTkkm8vLgrExJgf02b9GCAWv_Ci_a9DliI_s) goes into more detail about the context and design decisions being made. -## Architecture +## Updating CRDs + +The `make crds` command at the root of this repository will generate CRDs and +other code used by the operator. This calls the [generate-crds +script](../../tools/generate-crds.bash) in a container. If you wish to call this +script manually, you must also install `controller-gen`: + +``` +go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest +``` + +## Testing Locally + +Create a k3d cluster (depending on k3d v4.x): + +``` +k3d cluster create agent-operator \ + --port 30080:80@loadbalancer \ + --api-port 50043 \ + --kubeconfig-update-default=true \ + --kubeconfig-switch-context=true \ + --wait +``` + +### Deploy Prometheus + +An example Prometheus server is provided in `./example-prometheus.yaml`. Deploy +it with the following, from the root of the repository: + +``` +kubectl apply -f ./cmd/agent-operator/example-prometheus.yaml +``` + +You can view it at http://prometheus.k3d.localhost:30080 once the k3d cluster is +running. + +### Apply the CRDs + +Generated CRDs used by the operator can be found in [the Production +folder](../../production/operator/crds). Deploy them from the root of the +repository with: + +``` +kubectl apply -f production/operator/crds +``` + +### Run the Operator + +Now that the CRDs are applied, you can run the operator from the root of the +repository: + +``` +go run ./cmd/agent-operator +``` + +### Apply a GrafanaAgent custom resource + +Finally, you can apply an example GrafanaAgent custom resource. One is [provided +for you](./agent-example-config.yaml). 
From the root of the repository, run: + +``` +kubectl apply -f ./cmd/agent-operator/agent-example-config.yaml +``` + +If you are running the operator, you should see it pick up the change and start +mutating the cluster. + +## Development Architecture This project makes heavy use of the [Kubernetes SIG Controller Runtime](https://pkg.go.dev/sigs.k8s.io/controller-runtime) project. That @@ -109,70 +179,4 @@ CR: When `default/agent` gets deleted, all `EnqueueRequestForSelector` event handlers get notified to stop sending events for `default/agent`. -## Generating CRDs - -The `make crds` command at the root of this repository will generate CRDs and -other code used by the operator. This calls the [generate-crds -script](../../tools/generate-crds.bash) in a container. If you wish to call this -script manually, you must also install `controller-gen`: - -``` -go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest -``` - -## Local testing environment - -Create a k3d cluster (depending on k3d v4.x): - -``` -k3d cluster create agent-operator \ - --port 30080:80@loadbalancer \ - --api-port 50043 \ - --kubeconfig-update-default=true \ - --kubeconfig-switch-context=true \ - --wait -``` - -### Deploy Prometheus - -An example Prometheus server is provided in `./example-prometheus.yaml`. Deploy -it with the following: - -``` -kubectl apply -f ./cmd/agent-operator/example-prometheus.yaml -``` - -You can view it at http://prometheus.k3d.localhost:30080 once the k3d cluster is -running. - -### Apply the CRDs - -Generated CRDs used by the operator can be found in [the Production -folder](../../production/operator/crds). Deploy them from the root of the -repository with: - -``` -kubectl apply -f production/operator/crds -``` - -### Run the Operator - -Now that the CRDs are applied, you can run the operator: - -``` -go run ./cmd/agent-operator -``` - -### Apply a GrafanaAgent custom resource - -Finally, you can apply an example GrafanaAgent custom resource. 
One is [provided -for you](./agent-example-config.yaml). From the root of the repository, run: - -``` -kubectl apply -f ./cmd/agent-operator/agent-example-config.yaml -``` - -If you are running the operator, you should see it pick up the change and start -mutating the cluster. -