Update Agent docs for lb #830

Merged
merged 4 commits on Aug 5, 2021
32 changes: 21 additions & 11 deletions docs/tempo/website/grafana-agent/tail-based-sampling.md
@@ -10,9 +10,13 @@ such as runtime or egress traffic related costs.
Probabilistic sampling strategies are easy to implement,
but also run the risk of discarding relevant data that you'll later want.

In tail-based sampling, sampling decisions are made at the end of the workflow.
The Grafana Agent groups spans by trace ID and makes a sampling decision based on the data contained in the trace.
For instance, inspecting if a trace contains an error.
In tail-based sampling, sampling decisions are made at the end of the workflow, allowing for a more accurate sampling decision.
The Grafana Agent groups spans by trace ID and checks its data to see
if it meets one of the defined policies (for example, latency or status_code).
For instance, a policy can check if a trace contains an error or if it took
longer than a certain duration.

A trace will be sampled if it meets at least one policy.
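
As a minimal sketch, two such policies could be declared like this (it simply mirrors the fuller configuration example shown later in this diff):

```yaml
tail_sampling:
  policies:
  # sample traces slower than 100ms
  - latency:
      threshold_ms: 100
  # sample traces that contain at least one span with status code ERROR
  - status_code:
      status_codes:
      - "ERROR"
```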

To group spans by trace ID, the Agent buffers spans for a configurable amount of time,
after which it will consider the trace complete.
@@ -21,13 +25,14 @@ However, waiting longer times will increase the memory overhead of buffering.
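
For illustration, a hedged sketch of how that buffering window could be tuned, assuming the Agent exposes the underlying tail-sampling processor's `decision_wait` setting (the value shown is only an example):

```yaml
tempo:
  ...
  tail_sampling:
    # how long to buffer spans before the trace is considered complete
    # (assumed setting name; verify against the Agent configuration reference)
    decision_wait: 5s
    policies:
    ...
```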

One particular challenge of grouping trace data is for multi-instance Agent deployments,
where spans that belong to the same trace can arrive at different Agents.
To solve that, in the Agent is possible to distribute traces across agent instances by consistently exporting spans belonging to the same trace to the same agent.
To solve that, you can configure the Agent to load balance traces across agent instances
by exporting spans belonging to the same trace to the same instance.

This is achieved by redistributing spans by trace ID once they arrive from the application.
The Agent must be able to discover and connect to other Agent instances where spans for the same trace can arrive.
For Kubernetes users, that can be done with a [headless service](https://kubernetes.io/docs/concepts/services-networking/service/#headless-services).
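
As a rough illustration, such a headless Service could look like the sketch below; the name, label, and port are placeholders, not values taken from this PR:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-agent                        # hypothetical Service name
spec:
  clusterIP: None                            # headless: DNS returns the individual pod IPs
  selector:
    app.kubernetes.io/name: grafana-agent    # assumed pod label
  ports:
  - name: traces
    port: 4318                               # assumed port used to exchange spans between Agents
```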

Redistributing spans by trace ID means that spans are sent and receive twice,
Redistributing spans by trace ID means that spans are sent and received twice,
which can cause a significant increase in CPU usage.
This overhead will increase with the number of Agent instances that share the same traces.

@@ -47,10 +52,15 @@ tempo:
  ...
  tail_sampling:
    policies:
Member:

  • have we revendored otel to add additional features? can we add examples of tail based configs that use the additional features?
  • Can we include some details about how you only need the load balancing if you have more than one collector?

Member Author:

> have we revendored otel to add additional features? can we add examples of tail based configs that use the additional features?

Yes, it's on v0.30 now. Added examples for the two new policies.

> Can we include some details about how you only need the load balancing if you have more than one collector?

I think it's explained in the description, is anything missing in particular?

Member:

> Can we include some details about how you only need the load balancing if you have more than one collector?

Maybe we should make a separate section for load balancing? This will make it easier to distinguish between how tail sampling itself works and how you get it working with multiple agent instances.

Member Author:

Maybe? I don't have a strong opinion about it, so I'm fine with whatever the team thinks is better in this case.

yvrhdn marked this conversation as resolved.
    - rate_limiting:
        spans_per_second: 50
  load_balancing:
    resolver:
      dns:
        hostname: host.namespace.svc.cluster.local
    # sample traces that have a total duration longer than 100ms
    - latency:
        threshold_ms: 100
    # sample traces that contain at least one span with status code ERROR
    - status_code:
        status_codes:
        - "ERROR"
  load_balancing:
    resolver:
      dns:
        hostname: host.namespace.svc.cluster.local
```
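
Note that the `load_balancing` block is only needed when more than one Agent instance receives spans; a single instance already sees every span of a trace. If DNS-based discovery is not available, the underlying load-balancing exporter also offers a static resolver; a hedged sketch, assuming the Agent exposes it under the same name:

```yaml
load_balancing:
  resolver:
    static:
      # assumed instance addresses; replace with your own Agent endpoints
      hostnames:
      - agent-0.agent.namespace.svc.cluster.local
      - agent-1.agent.namespace.svc.cluster.local
```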