docs for the Grafana Agent Operator (#651)

* docs for the Grafana Agent Operator
* fix indentation of nested lists
* Update docs/operator/README.md

  Co-authored-by: Mario <mariorvinas@gmail.com>
* more detail in README
* describe why CRDs
* mirror docs/operator/README.md intro to cmd/agent-operator/README.md

Co-authored-by: Mario <mariorvinas@gmail.com>
grafana · Jun 15, 2021 · d1a6a39 · d1a6a39
1 parent e474e24
commit d1a6a39
Showing 7 changed files with 519 additions and 80 deletions.
diff --git a/cmd/agent-operator/README.md b/cmd/agent-operator/README.md
@@ -1,11 +1,22 @@
 # Grafana Agent Operator
 
 The Grafana Agent Operator is a Kubernetes operator that makes it easier to
-deploy Grafana Agent and easier to discover targets for metric collection.
-
-It is based on the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)
-and aims to be compatible the official ServiceMonitor, PodMonitor, and Probe
-CRDs that Prometheus Operator useres are used to.
+deploy the Grafana Agent and easier to collect telemetry data from your pods.
+
+It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+that specify how you would like to collect telemetry data from your Kubernetes
+cluster and where you would like to send it. They abstract Kubernetes-specific
+configuration that is more tedious to perform manually. The Grafana Agent
+Operator manages corresponding Grafana Agent deployments in your cluster by
+watching for changes against the custom resources.
+
+Metric collection is based on the [Prometheus
+Operator](https://github.com/prometheus-operator/prometheus-operator) and
+supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the
+project. These custom resources represent abstractions for monitoring services,
+pods, and ingresses. They are especially useful for Helm users, where manually
+writing a generic SD to match all your charts can be difficult (or impossible!)
+or where manually writing a specific SD for each chart can be tedious.
 
 ## Roadmap
 
@@ -14,12 +25,13 @@ CRDs that Prometheus Operator useres are used to.
 - [ ] Traces support
 - [ ] Integrations support
 
-## Installing
+## Documentation
 
-TODO. Stay tuned!
+Refer to the project's [documentation](../../docs/operator) for how to install
+and get started with the Grafana Agent Operator.
 
 ## Developer Reference
 
-The [Developer's Guide](./DEVELOPMENT.md) includes basic information to help you
-understand how the code works. This can be very useful if you are planning on
-working on the Operator.
+The [Maintainer's Guide](../../docs/operator/maintainers-guide.md) includes
+basic information to help you understand how the code works. This can be very
+useful if you are planning on working on the operator.
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -1,5 +1,9 @@
 # Getting Started
 
+This guide helps users get started with the Grafana Agent. For getting started
+with the Grafana Agent Operator, please refer to the Operator-specific
+[documentation](./operator).
+
 ## Docker-Compose Example
 
 The quickest way to try out the Agent with a full Cortex, Grafana, and Agent

diff --git a/docs/operator/README.md b/docs/operator/README.md
@@ -0,0 +1,30 @@
+# Grafana Agent Operator
+
+The Grafana Agent Operator is a Kubernetes operator that makes it easier to
+deploy the Grafana Agent and easier to collect telemetry data from your pods.
+
+It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+that specify how you would like to collect telemetry data from your Kubernetes
+cluster and where you would like to send it. They abstract Kubernetes-specific
+configuration that is more tedious to perform manually. The Grafana Agent
+Operator manages corresponding Grafana Agent deployments in your cluster by
+watching for changes against the custom resources.
+
+Metric collection is based on the [Prometheus
+Operator](https://github.com/prometheus-operator/prometheus-operator) and
+supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the
+project. These custom resources represent abstractions for monitoring services,
+pods, and ingresses. They are especially useful for Helm users, where manually
+writing a generic SD to match all your charts can be difficult (or impossible!)
+or where manually writing a specific SD for each chart can be tedious.
+
+## Table of Contents
+
+1. [Getting Started](./getting-started.md)
+    1. [Deploying CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions)
+    2. [Installing on Kubernetes](./getting-started.md#installing-on-kubernetes)
+    3. [Running locally](./getting-started.md#running-locally)
+    4. [Deploying GrafanaAgent](./getting-started.md#deploying-grafanaagent)
+2. [FAQ](./faq.md)
+3. [Architecture](./architecture.md)
+4. [Maintainers Guide](./maintainers-guide.md)
diff --git a/docs/operator/architecture.md b/docs/operator/architecture.md
@@ -0,0 +1,98 @@
+# Architecture
+
+This guide gives a high-level overview of how the Grafana Agent Operator
+works. Refer to the [maintainer's guide](./maintainers-guide.md) for
+detailed lower-level information targeted at maintainers.
+
+The Grafana Agent Operator works in two phases:
+
+1. Discover a hierarchy of custom resources
+2. Reconcile that hierarchy into a Grafana Agent deployment
+
+## Custom Resource Hierarchy
+
+The root of the custom resource hierarchy is the `GrafanaAgent` resource. It is
+the primary resource the Operator looks for, and is called the "root" because
+many other sub-resources are discovered through it.
+
+The full hierarchy of custom resources is as follows:
+
+1. `GrafanaAgent`
+    1. `PrometheusInstance`
+        1. `PodMonitor`
+        2. `Probe`
+        3. `ServiceMonitor`
+
+Most of the resources above have the ability to reference a ConfigMap or a
+Secret. All referenced ConfigMaps or Secrets are added into the resource
+hierarchy.
+
+When a hierarchy is established, each item is watched for changes. Any changed
+item will cause a reconcile of the root GrafanaAgent resource, either
+creating, modifying, or deleting the corresponding Grafana Agent deployment.
+
+A single resource can belong to multiple hierarchies. For example, if two
+GrafanaAgents use the same Probe, modifying that Probe will cause both
+GrafanaAgents to be reconciled.
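
The shared-resource behavior described above can be pictured as a reverse index from each discovered resource to every root that references it. The following is a minimal sketch of that idea, not the Operator's actual implementation: the resource keys, the `build_index` and `roots_to_reconcile` helpers, and the sample hierarchies are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchies: each root GrafanaAgent (namespace/name) maps to
# the resources discovered beneath it. Both roots share the same Probe.
hierarchies = {
    "monitoring/agent-a": {"PrometheusInstance/primary", "Probe/blackbox"},
    "monitoring/agent-b": {"PrometheusInstance/secondary", "Probe/blackbox"},
}


def build_index(hierarchies):
    """Invert root -> members into member -> set of roots."""
    index = defaultdict(set)
    for root, members in hierarchies.items():
        for member in members:
            index[member].add(root)
    return index


def roots_to_reconcile(index, changed):
    """Return every root whose hierarchy contains the changed resource."""
    return sorted(index.get(changed, set()))


index = build_index(hierarchies)
print(roots_to_reconcile(index, "Probe/blackbox"))
# ['monitoring/agent-a', 'monitoring/agent-b']
```

A change to the shared Probe maps to both roots, while a change to either PrometheusInstance maps to only one.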
+
+## Reconcile
+
+When a resource hierarchy is created, updated, or deleted, a reconcile occurs.
+When a GrafanaAgent resource is deleted, the corresponding Grafana Agent
+deployment will also be deleted.
+
+Reconciling creates a few cluster resources:
+
+1. A Secret is generated holding the
+   [configuration](../configuration-reference.md) of the Grafana Agent.
+2. Another Secret is created holding all referenced Secrets or ConfigMaps from
+   the resource hierarchy. This ensures that Secrets referenced from a custom
+   resource in another namespace can still be read.
+3. A Service is created to govern the created StatefulSets.
+4. One StatefulSet per Prometheus shard is created.
+
+PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs
+which all use Kubernetes SD.
+
+## Sharding and Replication
+
+The GrafanaAgent resource can specify a number of shards. Each shard results in
+the creation of a StatefulSet with a hashmod + keep relabel_config per job:
+
+```yaml
+- source_labels: [__address__]
+  target_label: __tmp_hash
+  modulus: NUM_SHARDS
+  action: hashmod
+- source_labels: [__tmp_hash]
+  regex: CURRENT_STATEFULSET_SHARD
+  action: keep
+```
+
+This allows for horizontal scaling, where each shard handles roughly 1/N of
+the total scrape load. Note that this does not use consistent hashing, which
+means changing the number of shards will cause anywhere between 1/N and N
+targets to reshuffle.
+
+The sharding mechanism is borrowed from the Prometheus Operator.
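
To make the hashmod + keep pair above more concrete, here is a rough sketch of how a target could be assigned to a shard. It assumes that `hashmod` takes an MD5 digest of the source label value and reduces its low 64 bits modulo `NUM_SHARDS`; treat that detail as an assumption rather than a specification. The target addresses and the `hashmod` and `assign` helpers are hypothetical.

```python
import hashlib


def hashmod(address: str, modulus: int) -> int:
    """Approximate a hashmod relabel step for a single source label value."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Assumption: the low 8 bytes of the digest are read as an unsigned
    # big-endian integer before taking the modulus.
    return int.from_bytes(digest[8:], "big") % modulus


def assign(targets, num_shards):
    """Map every target address to the single shard that would keep it."""
    return {target: hashmod(target, num_shards) for target in targets}


# Hypothetical target addresses, for illustration only.
targets = [f"10.0.0.{i}:8080" for i in range(1, 201)]

before = assign(targets, 3)
after = assign(targets, 4)
moved = sum(1 for t in targets if before[t] != after[t])
print(f"{moved} of {len(targets)} targets changed shard going from 3 to 4 shards")
```

Because the shard index is a plain modulus of the hash, changing the shard count changes the result for many targets at once, which is the reshuffling that the note on consistent hashing refers to.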
+
+The number of replicas can be defined, similarly to the number of shards. This
+creates duplicate shards. This must be paired with a remote_write system that
+can perform HA deduplication. Grafana Cloud and Cortex provide this out of the
+box, and the Grafana Agent Operator's defaults support these two systems.
+
+The total number of created metrics pods will be the product
+`numShards * numReplicas`.
+
+## Labels
+
+Two labels are added by default to every metric:
+
+- `cluster`, representing the `GrafanaAgent` deployment. Holds the value of
+  `<GrafanaAgent.metadata.namespace>/<GrafanaAgent.metadata.name>`.
+- `__replica__`, representing the replica number of the Agent. This label works
+  out of the box with Grafana Cloud and Cortex's [HA
+  deduplication](https://cortexmetrics.io/docs/guides/ha-pair-handling/).
+
+The shard number is not added as a label, as sharding is designed to be
+transparent on the receiver end.
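
As a rough illustration of how shards, replicas, and the default labels fit together, the sketch below enumerates the pods a single GrafanaAgent would produce. It is not taken from the Operator's code: the `planned_pods` helper, the sample namespace and name, and the exact string format of the `__replica__` value are assumptions made for the example.

```python
from itertools import product


def planned_pods(namespace, name, num_shards, num_replicas):
    """List one entry per (shard, replica) pair with its default labels."""
    cluster = f"{namespace}/{name}"
    return [
        {
            # The shard number is tracked here for illustration only; it is
            # deliberately not part of the labels attached to metrics.
            "shard": shard,
            # Assumption: the replica number is rendered as a plain string.
            "labels": {"cluster": cluster, "__replica__": str(replica)},
        }
        for shard, replica in product(range(num_shards), range(num_replicas))
    ]


pods = planned_pods("monitoring", "agent-a", num_shards=3, num_replicas=2)
print(len(pods))  # 6
```

Every pod carries the same `cluster` value, and pods that differ only in `__replica__` scrape the same targets, which is what lets the receiving system deduplicate them.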
diff --git a/docs/operator/faq.md b/docs/operator/faq.md
@@ -0,0 +1,9 @@
+# FAQ
+
+## Where do I find information on the supported values for the CustomResourceDefinitions?
+
+Once you've [deployed the CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions)
+to your Kubernetes cluster, use `kubectl explain <resource>` to get access to
+the documentation for each resource. For example, `kubectl explain GrafanaAgent`
+will describe the GrafanaAgent CRD, and `kubectl explain GrafanaAgent.spec` will
+give you information on its spec field.