Update Agent docs for lb #830

Merged
merged 4 commits into from Aug 5, 2021
Changes from 2 commits
30 changes: 20 additions & 10 deletions docs/tempo/website/grafana-agent/tail-based-sampling.md
Probabilistic sampling strategies are easy to implement,
but also run the risk of discarding relevant data that you'll later want.

In tail-based sampling, sampling decisions are made at the end of the workflow.
The Grafana Agent groups spans by trace ID and checks its data to see
if it meets one of the defined policies (e.g. latency or status_code).
For instance, a policy can check if a trace contains an error, or if it took
longer than a certain duration.

A trace gets sampled if it meets one policy's condition, even if it doesn't meet others.
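As an illustration, a minimal sketch combining two of the policy types shown on this page (the values are placeholders, not recommendations); a trace that matches either policy is kept:

```yaml
tail_sampling:
  policies:
    # Keep traces while the overall rate stays under 50 spans per second...
    - rate_limiting:
        spans_per_second: 50
    # ...and also keep any trace that took 100ms or longer,
    # even if the rate-limiting policy would not have selected it.
    - latency:
        threshold_ms: 100
```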

To group spans by trace ID, the Agent buffers spans for a configurable amount of time,
after which it will consider the trace complete.
However, waiting longer times will increase the memory overhead of buffering.
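How long spans are buffered is set in the Agent's tail_sampling block; a minimal sketch, assuming the setting is named decision_wait as in the OpenTelemetry tail sampling processor that the Agent vendors:

```yaml
tempo:
  ...
  tail_sampling:
    # Time to buffer the spans of a trace before evaluating the policies.
    # Larger values catch more late-arriving spans but hold more data in memory.
    decision_wait: 5s
    policies:
      ...
```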

One particular challenge of grouping trace data arises in multi-instance Agent deployments,
where spans that belong to the same trace can arrive at different Agents.
To solve that, the Agent can distribute traces across Agent instances
by consistently exporting spans belonging to the same trace to the same Agent.

This is achieved by redistributing spans by trace ID once they arrive from the application.
The Agent must be able to discover and connect to other Agent instances where spans for the same trace can arrive.
For Kubernetes users, that can be done with a [headless service](https://kubernetes.io/docs/concepts/services-networking/service/#headless-services).
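A minimal sketch of such a headless Service; the name, namespace, label, and port below are illustrative assumptions and must match your Agent deployment and its load-balancing settings:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-agent-traces
  namespace: monitoring
spec:
  # Headless: no cluster IP is allocated, so a DNS lookup on the Service name
  # returns the IPs of all Agent pods, letting each Agent discover its peers.
  clusterIP: None
  selector:
    name: grafana-agent-traces
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

The Service's DNS name (for example, grafana-agent-traces.monitoring.svc.cluster.local) is what the load_balancing resolver points at, like the hostname in the configuration below.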

Redistributing spans by trace ID means that spans are sent and received twice,
which can cause a significant increase in CPU usage.
This overhead will increase with the number of Agent instances that share the same traces.

Member:

  • have we revendored otel to add additional features? can we add examples of tail based configs that use the additional features?
  • Can we include some details about how you only need the load balancing if you have more than one collector?

Member Author:

> have we revendored otel to add additional features? can we add examples of tail based configs that use the additional features?

Yes, it's on v0.30 now. Added examples for the two new policies.

> Can we include some details about how you only need the load balancing if you have more than one collector?

I think it's explained in the description, is anything missing in particular?

Member:

> Can we include some details about how you only need the load balancing if you have more than one collector?

Maybe we should make a separate section for load balancing? This will make it easier to distinguish between how tail sampling itself works and how you get it working with multiple agent instances.

Member Author:

Maybe? I don't have a strong opinion about it, so I'm fine with whatever the team thinks is better in this case.
```diff
 tempo:
   ...
   tail_sampling:
     policies:
-      - rate_limiting:
-          spans_per_second: 50
+      # It will sample traces that lasted for 100ms or more
+      - latency:
+          threshold_ms: 100
+      # It will sample traces whose status code is ERROR
+      - status_code:
+          status_codes:
+            - "ERROR"
   load_balancing:
     resolver:
       dns:
         hostname: host.namespace.svc.cluster.local
```