Update Agent docs for lb (#830)
* Update Agent docs for lb

* Improve docs

* Update docs/tempo/website/grafana-agent/tail-based-sampling.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Koenraad Verheyden <koenraad.verheyden@posteo.net>

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
Co-authored-by: Koenraad Verheyden <koenraad.verheyden@posteo.net>
3 people committed Aug 5, 2021
1 parent a0975ac commit 64fd2bf
Showing 1 changed file with 21 additions and 11 deletions: docs/tempo/website/grafana-agent/tail-based-sampling.md
…such as runtime or egress traffic related costs.
Probabilistic sampling strategies are easy to implement,
but also run the risk of discarding relevant data that you'll later want.

In tail-based sampling, sampling decisions are made at the end of the workflow, allowing for a more accurate sampling decision.
The Grafana Agent groups spans by trace ID and checks the trace's data to see
if it meets one of the defined policies (for example, latency or status_code).
For instance, a policy can check whether a trace contains an error or whether it took
longer than a certain duration.

A trace will be sampled if it meets at least one policy.

To group spans by trace ID, the Agent buffers spans for a configurable amount of time,
after which it will consider the trace complete.
A longer buffer period increases the chance that all spans of a trace are captured.
However, waiting longer times will increase the memory overhead of buffering.
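
The buffer time is set in the Agent's `tail_sampling` configuration block (shown in full in the example further below). A minimal sketch, assuming the Agent exposes the `decision_wait` option of the underlying OpenTelemetry tail sampling processor; the 10s value is purely illustrative:

```yaml
tempo:
  ...
  tail_sampling:
    # How long to buffer the spans of a trace before evaluating policies.
    # A longer wait catches late-arriving spans but increases memory usage.
    decision_wait: 10s
```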

One particular challenge of grouping trace data is multi-instance Agent deployments,
where spans that belong to the same trace can arrive at different Agents.
To solve that, you can configure the Agent to load balance traces across agent instances
by exporting spans belonging to the same trace to the same instance.

This is achieved by redistributing spans by trace ID once they arrive from the application.
The Agent must be able to discover and connect to other Agent instances where spans for the same trace can arrive.
For Kubernetes users, that can be done with a [headless service](https://kubernetes.io/docs/concepts/services-networking/service/#headless-services).
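
For illustration, a minimal headless Service sketch, assuming the Agents run in an `agent` namespace with the pod label `name: grafana-agent` and receive spans over OTLP/gRPC on port 4317 (all of these names and the port are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-agent
  namespace: agent
spec:
  clusterIP: None        # headless: DNS resolves to the individual pod IPs
  selector:
    name: grafana-agent  # hypothetical label carried by the Agent pods
  ports:
    - name: otlp-grpc
      port: 4317         # hypothetical port the Agents listen on
```

The `load_balancing` resolver shown below can then point at this service's DNS name, which resolves to every Agent instance.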

Redistributing spans by trace ID means that spans are sent and received twice,
which can cause a significant increase in CPU usage.
This overhead will increase with the number of Agent instances that share the same traces.

An example configuration:

```yaml
tempo:
  ...
  tail_sampling:
    policies:
      # sample traces that have a total duration longer than 100ms
      - latency:
          threshold_ms: 100
      # sample traces that contain at least one span with status code ERROR
      - status_code:
          status_codes:
            - "ERROR"
  load_balancing:
    resolver:
      dns:
        hostname: host.namespace.svc.cluster.local
```
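
If DNS-based discovery is not available, the peers can also be listed explicitly. A hedged sketch, assuming the `static` resolver of the OpenTelemetry load-balancing exporter that this feature builds on; the hostnames are hypothetical:

```yaml
tempo:
  ...
  load_balancing:
    resolver:
      static:
        # Hypothetical DNS names of the individual Agent instances.
        hostnames:
          - grafana-agent-0.grafana-agent.agent.svc.cluster.local
          - grafana-agent-1.grafana-agent.agent.svc.cluster.local
```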
