From d1a6a39244d8f1de6a19c42ca997b52c652626fb Mon Sep 17 00:00:00 2001 From: Robert Fratto Date: Tue, 15 Jun 2021 11:23:55 -0400 Subject: [PATCH] docs for the Grafana Agent Operator (#651) * docs for the Grafana Agent Operator * fix identation of nested lists * Update docs/operator/README.md Co-authored-by: Mario * more detail in README * describe why CRDs * mirror docs/operator/README.md intro to cmd/agent-operator/README.md Co-authored-by: Mario --- cmd/agent-operator/README.md | 32 +- docs/getting-started.md | 4 + docs/operator/README.md | 30 ++ docs/operator/architecture.md | 98 ++++++ docs/operator/faq.md | 9 + docs/operator/getting-started.md | 282 ++++++++++++++++++ .../operator/maintainers-guide.md | 144 ++++----- 7 files changed, 519 insertions(+), 80 deletions(-) create mode 100644 docs/operator/README.md create mode 100644 docs/operator/architecture.md create mode 100644 docs/operator/faq.md create mode 100644 docs/operator/getting-started.md rename cmd/agent-operator/DEVELOPMENT.md => docs/operator/maintainers-guide.md (93%) diff --git a/cmd/agent-operator/README.md b/cmd/agent-operator/README.md index d12ba4c12d26..25237909ef63 100644 --- a/cmd/agent-operator/README.md +++ b/cmd/agent-operator/README.md @@ -1,11 +1,22 @@ # Grafana Agent Operator The Grafana Agent Operator is a Kubernetes operator that makes it easier to -deploy Grafana Agent and easier to discover targets for metric collection. - -It is based on the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) -and aims to be compatible the official ServiceMonitor, PodMonitor, and Probe -CRDs that Prometheus Operator useres are used to. +deploy the Grafana Agent and easier to collect telemetry data from your pods. 
+ +It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) +that specify how you would like to collect telemetry data from your Kubernetes +cluster and where you would like to send it. They abstract Kubernetes-specific +configuration that is more tedious to perform manually. The Grafana Agent +Operator manages corresponding Grafana Agent deployments in your cluster by +watching for changes against the custom resources. + +Metric collection is based on the [Prometheus +Operator](https://github.com/prometheus-operator/prometheus-operator) and +supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the +project. These custom resources represent abstractions for monitoring services, +pods, and ingresses. They are especially useful for Helm users, where manually +writing a generic SD to match all your charts can be difficult (or impossible!) +or where manually writing a specific SD for each chart can be tedious. ## Roadmap @@ -14,12 +25,13 @@ CRDs that Prometheus Operator useres are used to. - [ ] Traces support - [ ] Integrations support -## Installing +## Documentation -TODO. Stay tuned! +Refer to the project's [documentation](../../docs/operator) for how to install +and get started with the Grafana Agent Operator. ## Developer Reference -The [Developer's Guide](./DEVELOPMENT.md) includes basic information to help you -understand how the code works. This can be very useful if you are planning on -working on the Operator. +The [Maintainer's Guide](../../docs/operator/maintainers-guide.md) includes +basic information to help you understand how the code works. This can be very +useful if you are planning on working on the operator. 
diff --git a/docs/getting-started.md b/docs/getting-started.md index 090ecc303983..7235dc20d033 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,5 +1,9 @@ # Getting Started +This guide helps users get started with the Grafana Agent. For getting started +with the Grafana Agent Operator, please refer to the Operator-specific +[documentation](./operator). + ## Docker-Compose Example The quickest way to try out the Agent with a full Cortex, Grafana, and Agent diff --git a/docs/operator/README.md b/docs/operator/README.md new file mode 100644 index 000000000000..9b8b2619aa00 --- /dev/null +++ b/docs/operator/README.md @@ -0,0 +1,30 @@ +# Grafana Agent Operator + +The Grafana Agent Operator is a Kubernetes operator that makes it easier to +deploy the Grafana Agent and easier to collect telemetry data from your pods. + +It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) +that specify how you would like to collect telemetry data from your Kubernetes +cluster and where you would like to send it. They abstract Kubernetes-specific +configuration that is more tedious to perform manually. The Grafana Agent +Operator manages corresponding Grafana Agent deployments in your cluster by +watching for changes against the custom resources. + +Metric collection is based on the [Prometheus +Operator](https://github.com/prometheus-operator/prometheus-operator) and +supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the +project. These custom resources represent abstractions for monitoring services, +pods, and ingresses. They are especially useful for Helm users, where manually +writing a generic SD to match all your charts can be difficult (or impossible!) +or where manually writing a specific SD for each chart can be tedious. + +## Table of Contents + +1. [Getting Started](./getting-started.md) + 1. 
[Deploying CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions) + 2. [Installing on Kubernetes](./getting-started.md#installing-on-kubernetes) + 3. [Running locally](./getting-started.md#running-locally) + 4. [Deploying GrafanaAgent](./getting-started.md#deploying-grafanaagent) +2. [FAQ](./faq.md) +3. [Architecture](./architecture.md) +4. [Maintainer's Guide](./maintainers-guide.md) diff --git a/docs/operator/architecture.md b/docs/operator/architecture.md new file mode 100644 index 000000000000..61ed750484b7 --- /dev/null +++ b/docs/operator/architecture.md @@ -0,0 +1,98 @@ +# Architecture + +This guide gives a high-level overview of how the Grafana Agent Operator +works. Refer to the [maintainer's guide](./maintainers-guide.md) for +detailed lower-level information targeted at maintainers. + +The Grafana Agent Operator works in two phases: + +1. Discover a hierarchy of custom resources +2. Reconcile that hierarchy into a Grafana Agent deployment + +## Custom Resource Hierarchy + +The root of the custom resource hierarchy is the `GrafanaAgent` resource. It is +the primary resource the Operator looks for, and is called the "root" because it +discovers many other sub-resources. + +The full hierarchy of custom resources is as follows: + +1. `GrafanaAgent` + 1. `PrometheusInstance` + 1. `PodMonitor` + 2. `Probe` + 3. `ServiceMonitor` + +Most of the resources above can reference a ConfigMap or a +Secret. All referenced ConfigMaps and Secrets are added into the resource +hierarchy. + +When a hierarchy is established, each item in it is watched for changes. Any changed +item causes a reconcile of the root GrafanaAgent resource, either +creating, modifying, or deleting the corresponding Grafana Agent deployment. + +A single resource can belong to multiple hierarchies. For example, if two +GrafanaAgents use the same Probe, modifying that Probe will cause both +GrafanaAgents to be reconciled.
+ +## Reconcile + +When a resource hierarchy is created, updated, or deleted, a reconcile occurs. +When a GrafanaAgent resource is deleted, the corresponding Grafana Agent +deployment is also deleted. + +Reconciling creates a few cluster resources: + +1. A Secret is generated holding the + [configuration](../configuration-reference.md) of the Grafana Agent. +2. Another Secret is created holding all referenced Secrets or ConfigMaps from + the resource hierarchy. This ensures that Secrets referenced from a custom + resource in another namespace can still be read. +3. A Service is created to govern the created StatefulSets. +4. One StatefulSet per Prometheus shard is created. + +PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs +which all use Kubernetes SD. + +## Sharding and Replication + +The GrafanaAgent resource can specify a number of shards. Each shard results in +the creation of a StatefulSet with a hashmod + keep relabel_config per job: + +```yaml +- source_labels: [__address__] + target_label: __tmp_hash + modulus: NUM_SHARDS + action: hashmod +- source_labels: [__tmp_hash] + regex: CURRENT_STATEFULSET_SHARD + action: keep +``` + +This allows for horizontal scaling, where each shard +handles roughly 1/N of the total scrape load. Note that this does not use +consistent hashing, which means changing the number of shards can cause +anywhere from 1/N of the targets up to all of them to reshuffle. + +The sharding mechanism is borrowed from the Prometheus Operator. + +The number of replicas can be defined similarly to the number of shards. Each +replica duplicates every shard. This must be paired with a remote_write system that +can perform HA deduplication. Grafana Cloud and Cortex provide this out of the +box, and the Grafana Agent Operator defaults support these two systems. + +The total number of created metrics pods is the product of `numShards * +numReplicas`.
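The hashmod + keep pair can be understood as a pure function from a target's `__address__` to a shard number. The Go sketch below illustrates the idea; note that the digest folding here is an assumption made for illustration — Prometheus' internal hash folding may combine the bytes differently, so treat this as an approximation rather than a byte-for-byte reimplementation:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardFor mirrors the spirit of the hashmod relabel action: hash the
// target's __address__ and take the result modulo the number of shards.
// Each StatefulSet shard then keeps only the targets whose result matches
// its own shard number.
func shardFor(address string, numShards uint64) uint64 {
	sum := md5.Sum([]byte(address))
	// Fold the 16-byte digest into a uint64. This folding is illustrative;
	// Prometheus' actual implementation may differ.
	h := binary.BigEndian.Uint64(sum[:8]) ^ binary.BigEndian.Uint64(sum[8:])
	return h % numShards
}

func main() {
	const numShards = 3
	for _, addr := range []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9090"} {
		fmt.Printf("%s -> shard %d of %d\n", addr, shardFor(addr, numShards), numShards)
	}
}
```

Because the assignment is plain modulo rather than consistent hashing, evaluating the same address with a different shard count generally yields a different shard, which is why resharding reshuffles targets.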
+ +## Labels + +Two labels are added by default to every metric: + +- `cluster`, representing the `GrafanaAgent` deployment. Holds the value of + `<namespace>/<name>`. +- `__replica__`, representing the replica number of the Agent. This label works + out of the box with Grafana Cloud and Cortex's [HA + deduplication](https://cortexmetrics.io/docs/guides/ha-pair-handling/). + +The shard number is not added as a label, as sharding is designed to be +transparent on the receiver end. diff --git a/docs/operator/faq.md b/docs/operator/faq.md new file mode 100644 index 000000000000..b89b478dd3dc --- /dev/null +++ b/docs/operator/faq.md @@ -0,0 +1,9 @@ +# FAQ + +## Where do I find information on the supported values for the CustomResourceDefinitions? + +Once you've [deployed the CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions) +to your Kubernetes cluster, use `kubectl explain <resource>` to get access to +the documentation for each resource. For example, `kubectl explain GrafanaAgent` +will describe the GrafanaAgent CRD, and `kubectl explain GrafanaAgent.spec` will +give you information on its spec field. diff --git a/docs/operator/getting-started.md b/docs/operator/getting-started.md new file mode 100644 index 000000000000..56b156ec4317 --- /dev/null +++ b/docs/operator/getting-started.md @@ -0,0 +1,282 @@ +# Getting Started + +An official Helm chart is planned to make it easy to deploy the Grafana Agent +Operator on Kubernetes. For now, things must be done a little manually. + +## Deploying CustomResourceDefinitions + +Before you can write custom resources to describe a Grafana Agent deployment, +you _must_ deploy the +[CustomResourceDefinitions](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) +to the cluster first. These definitions describe the schema that the custom +resources will conform to.
This is also required for the operator to run; it +will fail if it can't find the custom resource definitions of objects it is +looking to use. + +The current set of CustomResourceDefinitions can be found in +[production/operator/crds](../../production/operator/crds). Apply them from the +root of this repository using: + +``` +kubectl apply -f production/operator/crds +``` + +This step *must* be done before installing the Operator, as the Operator will +fail to start if the CRDs do not exist. + +## Installing on Kubernetes + +Use the following Deployment to run the Operator, changing values as desired: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: grafana-agent-operator + namespace: default + labels: + app: grafana-agent-operator +spec: + replicas: 1 + selector: + matchLabels: + app: grafana-agent-operator + template: + metadata: + labels: + app: grafana-agent-operator + spec: + serviceAccountName: grafana-agent-operator + containers: + - name: operator + image: grafana/agent-operator:v0.15.0 +--- + +apiVersion: v1 +kind: ServiceAccount +metadata: + name: grafana-agent-operator + namespace: default + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: grafana-agent-operator +rules: +- apiGroups: [monitoring.grafana.com] + resources: + - grafana-agents + - prometheus-instances + verbs: [get, list, watch] +- apiGroups: [monitoring.coreos.com] + resources: + - podmonitors + - probes + - servicemonitors + verbs: [get, list, watch] +- apiGroups: [""] + resources: + - namespaces + verbs: [get, list, watch] +- apiGroups: [""] + resources: + - secrets + - services + verbs: [get, list, watch, create, update, patch, delete] +- apiGroups: ["apps"] + resources: + - statefulsets + verbs: [get, list, watch, create, update, patch, delete] + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: grafana-agent-operator +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole 
+ name: grafana-agent-operator +subjects: +- kind: ServiceAccount + name: grafana-agent-operator + namespace: default +``` + +## Running locally + +Before running locally, **make sure your kubectl context is correct!** +Running locally uses your current kubectl context, and you probably don't want +to accidentally deploy a new Grafana Agent to prod. + +CRDs should be installed on the cluster prior to running locally. If you haven't +done this yet, follow [deploying CustomResourceDefinitions](#deploying-customresourcedefinitions) +first. + +Afterwards, you can run the operator using `go run`: + +``` +go run ./cmd/agent-operator +``` + +## Deploying GrafanaAgent + +Now that the Operator is running, you can create a deployment of the +Grafana Agent. The first step is to create a GrafanaAgent resource. This +resource will discover a set of PrometheusInstance resources. You can use +this example, which creates a GrafanaAgent and the appropriate ServiceAccount +for you: + +```yaml +apiVersion: monitoring.grafana.com/v1alpha1 +kind: GrafanaAgent +metadata: + name: grafana-agent + namespace: default + labels: + app: grafana-agent +spec: + image: grafana/agent:v0.15.0 + logLevel: info + serviceAccountName: grafana-agent + prometheus: + instanceSelector: + matchLabels: + agent: grafana-agent + +--- + +apiVersion: v1 +kind: ServiceAccount +metadata: + name: grafana-agent + namespace: default + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: grafana-agent +rules: +- apiGroups: + - "" + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: + - get + - list + - watch +- nonResourceURLs: + - /metrics + verbs: + - get + +--- + +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: grafana-agent +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: grafana-agent +subjects: +- kind: ServiceAccount + name: grafana-agent + namespace: default +``` + +Note that this 
searches for PrometheusInstances in the same namespace with the +label matching `agent: grafana-agent`. A PrometheusInstance is a custom resource +that describes where to write collected metrics. Use this one as an example: + +```yaml +apiVersion: monitoring.grafana.com/v1alpha1 +kind: PrometheusInstance +metadata: + name: primary + namespace: default + labels: + agent: grafana-agent +spec: + remoteWrite: + - url: https://prometheus-us-central1.grafana.net/api/prom/push + basicAuth: + username: + name: primary-credentials + key: username + password: + name: primary-credentials + key: password + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + serviceMonitorNamespaceSelector: {} + serviceMonitorSelector: + matchLabels: + instance: primary + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + podMonitorNamespaceSelector: {} + podMonitorSelector: + matchLabels: + instance: primary + + # Supply an empty namespace selector to look in all namespaces. Remove + # this to only look in the same namespace. + probeNamespaceSelector: {} + probeSelector: + matchLabels: + instance: primary +``` + +Replace the remoteWrite URL to match your vendor. If your vendor doesn't need +credentials, you may remove the `basicAuth` section. 
Otherwise, configure a +secret with the base64-encoded values of the username and password: + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: primary-credentials + namespace: default +data: + username: BASE64_ENCODED_USERNAME + password: BASE64_ENCODED_PASSWORD +``` + +The above configuration of PrometheusInstance will discover all +[PodMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitor), +[Probes](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#probe), +and [ServiceMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) +with a label matching `instance: primary`. Create resources as appropriate for +your environment. + +As an example, here is a ServiceMonitor that can collect metrics from +`kube-dns`: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: kube-dns + namespace: kube-system + labels: + instance: primary +spec: + selector: + matchLabels: + k8s-app: kube-dns + endpoints: + - port: metrics +``` diff --git a/cmd/agent-operator/DEVELOPMENT.md b/docs/operator/maintainers-guide.md similarity index 93% rename from cmd/agent-operator/DEVELOPMENT.md rename to docs/operator/maintainers-guide.md index f0df0d023a30..835ff1d1ae48 100644 --- a/cmd/agent-operator/DEVELOPMENT.md +++ b/docs/operator/maintainers-guide.md @@ -1,10 +1,13 @@ -# Developing the Agent Operator +# Maintainer's Guide + +This document contains maintainer-specific information. Table of Contents: 1. [Introduction](#introduction) -2. [Architecture](#architecture) -3. [Local testing environment](#local-testing-environment) +2. [Updating CRDs](#updating-crds) +3. [Testing Locally](#testing-locally) +4. 
[Development Architecture](#development-architecture) ## Introduction @@ -25,7 +28,74 @@ The public [Grafana Agent Operator design doc](https://docs.google.com/document/d/1nlwhJLspTkkm8vLgrExJgf02b9GCAWv_Ci_a9DliI_s) goes into more detail about the context and design decisions being made. -## Architecture +## Updating CRDs + +The `make crds` command at the root of this repository will generate CRDs and +other code used by the operator. This calls the [generate-crds +script](../../tools/generate-crds.bash) in a container. If you wish to call this +script manually, you must also install `controller-gen`: + +``` +go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest +``` + +## Testing Locally + +Create a k3d cluster (depending on k3d v4.x): + +``` +k3d cluster create agent-operator \ + --port 30080:80@loadbalancer \ + --api-port 50043 \ + --kubeconfig-update-default=true \ + --kubeconfig-switch-context=true \ + --wait +``` + +### Deploy Prometheus + +An example Prometheus server is provided in `./example-prometheus.yaml`. Deploy +it with the following, from the root of the repository: + +``` +kubectl apply -f ./cmd/agent-operator/example-prometheus.yaml +``` + +You can view it at http://prometheus.k3d.localhost:30080 once the k3d cluster is +running. + +### Apply the CRDs + +Generated CRDs used by the operator can be found in [the Production +folder](../../production/operator/crds). Deploy them from the root of the +repository with: + +``` +kubectl apply -f production/operator/crds +``` + +### Run the Operator + +Now that the CRDs are applied, you can run the operator from the root of the +repository: + +``` +go run ./cmd/agent-operator +``` + +### Apply a GrafanaAgent custom resource + +Finally, you can apply an example GrafanaAgent custom resource. One is [provided +for you](./agent-example-config.yaml). 
From the root of the repository, run: + +``` +kubectl apply -f ./cmd/agent-operator/agent-example-config.yaml +``` + +If you are running the operator, you should see it pick up the change and start +mutating the cluster. + +## Development Architecture This project makes heavy use of the [Kubernetes SIG Controller Runtime](https://pkg.go.dev/sigs.k8s.io/controller-runtime) project. That @@ -109,70 +179,4 @@ CR: When `default/agent` gets deleted, all `EnqueueRequestForSelector` event handlers get notified to stop sending events for `default/agent`. -## Generating CRDs - -The `make crds` command at the root of this repository will generate CRDs and -other code used by the operator. This calls the [generate-crds -script](../../tools/generate-crds.bash) in a container. If you wish to call this -script manually, you must also install `controller-gen`: - -``` -go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest -``` - -## Local testing environment - -Create a k3d cluster (depending on k3d v4.x): - -``` -k3d cluster create agent-operator \ - --port 30080:80@loadbalancer \ - --api-port 50043 \ - --kubeconfig-update-default=true \ - --kubeconfig-switch-context=true \ - --wait -``` - -### Deploy Prometheus - -An example Prometheus server is provided in `./example-prometheus.yaml`. Deploy -it with the following: - -``` -kubectl apply -f ./cmd/agent-operator/example-prometheus.yaml -``` - -You can view it at http://prometheus.k3d.localhost:30080 once the k3d cluster is -running. - -### Apply the CRDs - -Generated CRDs used by the operator can be found in [the Production -folder](../../production/operator/crds). Deploy them from the root of the -repository with: - -``` -kubectl apply -f production/operator/crds -``` - -### Run the Operator - -Now that the CRDs are applied, you can run the operator: - -``` -go run ./cmd/agent-operator -``` - -### Apply a GrafanaAgent custom resource - -Finally, you can apply an example GrafanaAgent custom resource. 
One is [provided -for you](./agent-example-config.yaml). From the root of the repository, run: - -``` -kubectl apply -f ./cmd/agent-operator/agent-example-config.yaml -``` - -If you are running the operator, you should see it pick up the change and start -mutating the cluster. -