docs for the Grafana Agent Operator (#651)
* docs for the Grafana Agent Operator

* fix identation of nested lists

* Update docs/operator/README.md

Co-authored-by: Mario <mariorvinas@gmail.com>

* more detail in README

* describe why CRDs

* mirror docs/operator/README.md intro to cmd/agent-operator/README.md

Co-authored-by: Mario <mariorvinas@gmail.com>
rfratto and mapno authored Jun 15, 2021
1 parent e474e24 commit d1a6a39
Showing 7 changed files with 519 additions and 80 deletions.
32 changes: 22 additions & 10 deletions cmd/agent-operator/README.md
@@ -1,11 +1,22 @@
# Grafana Agent Operator

The Grafana Agent Operator is a Kubernetes operator that makes it easier to
deploy Grafana Agent and easier to discover targets for metric collection.

It is based on the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)
and aims to be compatible the official ServiceMonitor, PodMonitor, and Probe
CRDs that Prometheus Operator useres are used to.
deploy the Grafana Agent and easier to collect telemetry data from your pods.

It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
that specify how you would like to collect telemetry data from your Kubernetes
cluster and where you would like to send it. They abstract Kubernetes-specific
configuration that is more tedious to perform manually. The Grafana Agent
Operator manages corresponding Grafana Agent deployments in your cluster by
watching for changes against the custom resources.

Metric collection is based on the [Prometheus
Operator](https://github.com/prometheus-operator/prometheus-operator) and
supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the
project. These custom resources represent abstractions for monitoring services,
pods, and ingresses. They are especially useful for Helm users, where manually
writing a generic SD to match all your charts can be difficult (or impossible!)
or where manually writing a specific SD for each chart can be tedious.

## Roadmap

@@ -14,12 +25,13 @@ CRDs that Prometheus Operator useres are used to.
- [ ] Traces support
- [ ] Integrations support

## Installing
## Documentation

TODO. Stay tuned!
Refer to the project's [documentation](../../docs/operator) for how to install
and get started with the Grafana Agent Operator.

## Developer Reference

The [Developer's Guide](./DEVELOPMENT.md) includes basic information to help you
understand how the code works. This can be very useful if you are planning on
working on the Operator.
The [Maintainer's Guide](../../docs/operator/maintainers-guide.md) includes
basic information to help you understand how the code works. This can be very
useful if you are planning on working on the operator.
4 changes: 4 additions & 0 deletions docs/getting-started.md
@@ -1,5 +1,9 @@
# Getting Started

This guide helps users get started with the Grafana Agent. For getting started
with the Grafana Agent Operator, please refer to the Operator-specific
[documentation](./operator).

## Docker-Compose Example

The quickest way to try out the Agent with a full Cortex, Grafana, and Agent
30 changes: 30 additions & 0 deletions docs/operator/README.md
@@ -0,0 +1,30 @@
# Grafana Agent Operator

The Grafana Agent Operator is a Kubernetes operator that makes it easier to
deploy the Grafana Agent and easier to collect telemetry data from your pods.

It works by watching for [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
that specify how you would like to collect telemetry data from your Kubernetes
cluster and where you would like to send it. They abstract Kubernetes-specific
configuration that is more tedious to perform manually. The Grafana Agent
Operator manages corresponding Grafana Agent deployments in your cluster by
watching for changes against the custom resources.

Metric collection is based on the [Prometheus
Operator](https://github.com/prometheus-operator/prometheus-operator) and
supports the official v1 ServiceMonitor, PodMonitor, and Probe CRDs from the
project. These custom resources represent abstractions for monitoring services,
pods, and ingresses. They are especially useful for Helm users, where manually
writing a generic SD to match all your charts can be difficult (or impossible!)
or where manually writing a specific SD for each chart can be tedious.

## Table of Contents

1. [Getting Started](./getting-started.md)
    1. [Deploying CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions)
    2. [Installing on Kubernetes](./getting-started.md#installing-on-kubernetes)
    3. [Running locally](./getting-started.md#running-locally)
    4. [Deploying GrafanaAgent](./getting-started.md#deploying-grafanagent)
2. [FAQ](./faq.md)
3. [Architecture](./architecture.md)
4. [Maintainers Guide](./maintainers-guide.md)
98 changes: 98 additions & 0 deletions docs/operator/architecture.md
@@ -0,0 +1,98 @@
# Architecture

This guide gives a high-level overview of how the Grafana Agent Operator
works. Refer to the [maintainer's guide](./maintainers-guide.md) for
detailed lower-level information targeted at maintainers.

The Grafana Agent Operator works in two phases:

1. Discover a hierarchy of custom resources
2. Reconcile that hierarchy into a Grafana Agent deployment

## Custom Resource Hierarchy

The root of the custom resource hierarchy is the `GrafanaAgent` resource. It is
the primary resource the Operator looks for, and is called the "root" because
it discovers many other sub-resources.

The full hierarchy of custom resources is as follows:

1. `GrafanaAgent`
    1. `PrometheusInstance`
        1. `PodMonitor`
        2. `Probe`
        3. `ServiceMonitor`

Most of the resources above have the ability to reference a ConfigMap or a
Secret. All referenced ConfigMaps or Secrets are added into the resource
hierarchy.

When a hierarchy is established, each item is watched for changes. Any changed
item will cause a reconcile of the root GrafanaAgent resource, either
creating, modifying, or deleting the corresponding Grafana Agent deployment.

A single resource can belong to multiple hierarchies. For example, if two
GrafanaAgents use the same Probe, modifying that Probe will cause both
GrafanaAgents to be reconciled.
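
The fan-out from a changed resource to its roots can be pictured with a small
reverse index. This is only an illustrative sketch, not the Operator's actual
code: the `Kind/namespace/name` keys, the resource names, and both helper
functions are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical in-memory model: each root GrafanaAgent maps to the set of
# resources discovered beneath it. All names are invented for illustration.
hierarchies = {
    "GrafanaAgent/monitoring/agent-a": {
        "PrometheusInstance/monitoring/primary",
        "Probe/default/blackbox",
        "Secret/default/probe-credentials",
    },
    "GrafanaAgent/monitoring/agent-b": {
        "PrometheusInstance/monitoring/secondary",
        "Probe/default/blackbox",
    },
}


def build_reverse_index(hierarchies):
    """Map every watched resource to the roots whose hierarchy contains it."""
    index = defaultdict(set)
    for root, members in hierarchies.items():
        index[root].add(root)  # a change to the root reconciles the root itself
        for member in members:
            index[member].add(root)
    return index


def roots_to_reconcile(index, changed):
    """Return the GrafanaAgent roots to reconcile after `changed` is modified."""
    return sorted(index.get(changed, set()))


index = build_reverse_index(hierarchies)
# The shared Probe belongs to both hierarchies, so both roots are returned.
print(roots_to_reconcile(index, "Probe/default/blackbox"))
```

Keeping the index keyed by the changed resource means a single lookup answers
"which GrafanaAgents need reconciling?", however many hierarchies share that
resource.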

## Reconcile

When a resource hierarchy is created, updated, or deleted, a reconcile occurs.
When a GrafanaAgent resource is deleted, the corresponding Grafana Agent
deployment will also be deleted.

Reconciling creates a few cluster resources:

1. A Secret is generated holding the
   [configuration](../configuration-reference.md) of the Grafana Agent.
2. Another Secret is created holding all referenced Secrets or ConfigMaps from
   the resource hierarchy. This ensures that Secrets referenced from a custom
   resource in another namespace can still be read.
3. A Service is created to govern the created StatefulSets.
4. One StatefulSet per Prometheus shard is created.

PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs
which all use Kubernetes SD.
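
As a rough sketch of the list above, the set of objects produced by one
reconcile could be enumerated as follows. The function and every object name
suffix (`-config`, `-assets`, `-operated`, `-shard-N`) are assumptions made up
for this example; the names the Operator really generates may differ.

```python
def plan_reconcile(namespace, name, num_shards):
    """List the (kind, namespaced name) pairs one reconcile would create.

    All name suffixes below are hypothetical placeholders.
    """
    resources = [
        ("Secret", f"{name}-config"),    # generated Grafana Agent configuration
        ("Secret", f"{name}-assets"),    # copies of referenced Secrets/ConfigMaps
        ("Service", f"{name}-operated"), # governing Service for the StatefulSets
    ]
    # One StatefulSet per shard.
    resources += [("StatefulSet", f"{name}-shard-{i}") for i in range(num_shards)]
    return [(kind, f"{namespace}/{obj}") for kind, obj in resources]


for kind, obj in plan_reconcile("monitoring", "agent-a", num_shards=3):
    print(kind, obj)
```

Only the number of StatefulSets varies with the configuration; the two Secrets
and the Service are created once per GrafanaAgent.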

## Sharding and Replication

The GrafanaAgent resource can specify a number of shards. Each shard results in
the creation of a StatefulSet with a hashmod + keep relabel_config per job:

```yaml
- source_labels: [__address__]
  target_label: __tmp_hash
  modulus: NUM_SHARDS
  action: hashmod
- source_labels: [__tmp_hash]
  regex: CURRENT_STATEFULSET_SHARD
  action: keep
```

This allows for horizontal scaling, where each shard handles roughly 1/N of the
total scrape load. Note that this does not use consistent hashing, which means
that changing the number of shards can cause anywhere from 1/N of the targets
to all of them to move to a different shard.

The sharding mechanism is borrowed from the Prometheus Operator.

The number of replicas can be defined, similarly to the number of shards. This
creates duplicate shards. Replication must be paired with a remote_write system
that can perform HA deduplication. Grafana Cloud and Cortex provide this out of
the box, and the Grafana Agent Operator's defaults support these two systems.

The total number of created metrics pods will be `numShards * numReplicas`.
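
The effect of the hashmod + keep rules can be approximated in a few lines of
Python. Treat this as a sketch under assumptions rather than the real
implementation: it assumes that `hashmod` takes an MD5 digest of the label
value and reduces its low 64 bits modulo `NUM_SHARDS`, and the target addresses
and helper names are invented for the example.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Approximate the hashmod relabel action for a single label value.

    Assumption: the value is hashed with MD5 and the low 64 bits of the
    digest are reduced modulo `modulus`.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def kept_targets(addresses, shard, num_shards):
    """Apply hashmod + keep: return the addresses that `shard` scrapes."""
    return [a for a in addresses if hashmod(a, num_shards) == shard]


def total_pods(num_shards, num_replicas):
    """Every shard is duplicated once per replica."""
    return num_shards * num_replicas


# Hypothetical target addresses, invented for this example.
targets = [f"10.0.0.{i}:9090" for i in range(1, 201)]

for shard in range(3):
    print(f"shard {shard} keeps {len(kept_targets(targets, shard, 3))} targets")

# Without consistent hashing, changing the modulus reassigns many targets.
moved = sum(1 for t in targets if hashmod(t, 3) != hashmod(t, 4))
print(f"{moved} of {len(targets)} targets change shard going from 3 to 4 shards")
print(f"3 shards x 2 replicas = {total_pods(3, 2)} pods")
```

Each target lands on exactly one shard because the keep rule matches a single
residue, which is why the shards together cover every target without overlap.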

## Labels

Two labels are added by default to every metric:

- `cluster`, representing the `GrafanaAgent` deployment. Holds the value of
  `<GrafanaAgent.metadata.namespace>/<GrafanaAgent.metadata.name>`.
- `__replica__`, representing the replica number of the Agent. This label works
  out of the box with Grafana Cloud and Cortex's [HA
  deduplication](https://cortexmetrics.io/docs/guides/ha-pair-handling/).

The shard number is not added as a label, as sharding is designed to be
transparent on the receiver end.
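
To make the role of the two labels concrete, the sketch below builds them for a
pair of replicas and then applies a heavily simplified form of receiver-side HA
deduplication. The exact format of the `__replica__` value is an assumption,
and the `deduplicate` function is a toy model invented for this example: a real
receiver such as Cortex also handles timeouts and failover between replicas.

```python
def default_labels(namespace, name, replica):
    """Build the two default labels for one Agent replica.

    Assumption: the replica number is rendered as a plain string.
    """
    return {"cluster": f"{namespace}/{name}", "__replica__": str(replica)}


def deduplicate(samples):
    """Toy HA deduplication: accept one replica per cluster, drop the rest.

    The first replica seen for a cluster is treated as elected. Accepted
    samples have the `__replica__` label removed.
    """
    elected = {}
    accepted = []
    for labels, value in samples:
        leader = elected.setdefault(labels["cluster"], labels["__replica__"])
        if labels["__replica__"] != leader:
            continue  # duplicate sample from a non-elected replica
        kept = {k: v for k, v in labels.items() if k != "__replica__"}
        accepted.append((kept, value))
    return accepted


# Two replicas of the same GrafanaAgent report the same sample.
samples = [
    (default_labels("monitoring", "agent-a", 0), 42.0),
    (default_labels("monitoring", "agent-a", 1), 42.0),
]
print(deduplicate(samples))
```

Because the shard number is not part of the label set, samples from different
shards of the same GrafanaAgent share one `cluster` value and look like a single
source to the receiver.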
9 changes: 9 additions & 0 deletions docs/operator/faq.md
@@ -0,0 +1,9 @@
# FAQ

## Where do I find information on the supported values for the CustomResourceDefinitions?

Once you've [deployed the CustomResourceDefinitions](./getting-started.md#deploying-customresourcedefinitions)
to your Kubernetes cluster, use `kubectl explain <resource>` to get access to
the documentation for each resource. For example, `kubectl explain GrafanaAgent`
will describe the GrafanaAgent CRD, and `kubectl explain GrafanaAgent.spec` will
give you information on its spec field.