docs: Add sampling docs and improve distributed tracing (#4475)
1 parent bad65d7, commit d18c559
Showing 12 changed files with 784 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,122 @@ | ||
[[distributed-tracing]] | ||
=== Distributed tracing | ||
|
||
Together, <<transactions,`Transactions`>> and <<transaction-spans,`Spans`>> form a `Trace`. | ||
Traces are not events, but group together events that have a common root. | ||
// Make tab-widgets work | ||
include::../tab-widgets/code.asciidoc[] | ||
|
||
Elastic APM supports distributed tracing. | ||
Distributed tracing enables you to analyze performance throughout your microservices architecture all in one view. | ||
This is accomplished by tracing all of the requests - from the initial web request to your front-end service - to queries made to your back-end services. | ||
This makes finding possible bottlenecks throughout your application much easier and faster. | ||
Best of all, there's no additional configuration needed for distributed tracing, just ensure you're using the latest version of the applicable {apm-agents-ref}/index.html[agent]. | ||
A `trace` is a group of <<transactions,transactions>> and <<transaction-spans,spans>> with a common root. | ||
Each `trace` tracks the entirety of a single request. | ||
When a `trace` travels through multiple services, as is common in a microservice architecture, | ||
it is known as a distributed trace. | ||
|
||
The APM app in Kibana also supports distributed tracing. | ||
The Timeline visualization has been redesigned to show all of the transactions from individual services that are connected in a trace: | ||
[float] | ||
=== Why is distributed tracing important? | ||
|
||
Distributed tracing enables you to analyze performance throughout your microservice architecture | ||
by tracing the entirety of a request -- from the initial web request on your front-end service | ||
all the way to database queries made on your back-end services. | ||
|
||
Tracking requests as they propagate through your services provides an end-to-end picture of | ||
where your application is spending time, where errors are occurring, and where bottlenecks are forming. | ||
Distributed tracing eliminates individual services' data silos and reveals what's happening outside of | ||
service borders. | ||
|
||
For supported technologies, distributed tracing works out-of-the-box, with no additional configuration required. | ||
|
||
[float] | ||
=== How distributed tracing works | ||
|
||
Distributed tracing works by injecting a custom `traceparent` HTTP header into outgoing requests. | ||
This header includes information, like `trace-id`, which is used to identify the current trace, | ||
and `parent-id`, which is used to identify the parent of the current span on incoming requests | ||
or the current span on an outgoing request. | ||
|
||
When a service is working on a request, it checks for the existence of this HTTP header. | ||
If it's missing, the service starts a new trace. | ||
If it exists, the service ensures the current action is added as a child of the existing trace, | ||
and continues to propagate the trace. | ||
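The check described above can be sketched in a few lines of Python. This is a minimal illustration, not how any Elastic agent is implemented: it assumes the W3C Trace Context layout `version-traceid-parentid-flags`, a headers dictionary with lowercase keys, and the hypothetical helper names `parse_traceparent` and `handle_request`. It also assumes a newly started trace is marked as sampled.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return (trace_id, parent_id, flags), or None if the header is unusable."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, int(flags, 16)


def handle_request(headers):
    """Continue the caller's trace if possible; otherwise start a new one."""
    parsed = parse_traceparent(headers.get("traceparent", ""))
    span_id = secrets.token_hex(8)  # ID for the work done in this service
    if parsed is None:
        # Header missing or invalid: start a new trace (assumed sampled).
        trace_id, parent_id, flags = secrets.token_hex(16), None, 1
    else:
        # Header present: the current work becomes a child of the caller.
        trace_id, parent_id, flags = parsed
    # Header for outgoing requests: this service's span is now the parent.
    outgoing = f"00-{trace_id}-{span_id}-{flags:02x}"
    return {"trace_id": trace_id, "parent_id": parent_id, "outgoing": outgoing}
```

Calling `handle_request` with an empty dictionary yields a fresh `trace_id` and no parent, while passing a valid `traceparent` value keeps the caller's `trace_id` and only swaps in a new `parent-id` for the next hop.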
|
||
[float] | ||
==== Trace propagation examples | ||
|
||
In this example, Elastic's Ruby agent communicates with Elastic's Java agent. | ||
Both support the `traceparent` header, and trace data is successfully propagated. | ||
|
||
image::images/dt-trace-ex1.png[How traceparent propagation works] | ||
|
||
In this example, Elastic's Ruby agent communicates with OpenTelemetry's Java agent. | ||
Both support the `traceparent` header, and trace data is successfully propagated. | ||
|
||
image::images/dt-trace-ex2.png[How traceparent propagation works] | ||
|
||
In this example, the trace meets a piece of middleware that doesn't propagate the `traceparent` header. | ||
The distributed trace ends and any further communication will result in a new trace. | ||
|
||
image::images/dt-trace-ex3.png[How traceparent propagation works] | ||
|
||
|
||
[float] | ||
[[w3c-tracecontext]] | ||
==== W3C Trace Context spec | ||
|
||
All Elastic agents now support the official W3C Trace Context spec and `traceparent` header. | ||
See the table below for the minimum required agent version: | ||
|
||
[options="header"] | ||
|==== | ||
|Agent name |Agent Version | ||
|**Go Agent**| ≥`1.6` | ||
|**Java Agent**| ≥`1.14` | ||
|**.NET Agent**| ≥`1.3` | ||
|**Node.js Agent**| ≥`3.4` | ||
|**Python Agent**| ≥`5.4` | ||
|**Ruby Agent**| ≥`3.5` | ||
|**RUM Agent**| ≥`5.0` | ||
|==== | ||
|
||
NOTE: Older Elastic agents use a unique `elastic-apm-traceparent` header. | ||
For backward-compatibility purposes, new versions of Elastic agents still support this header. | ||
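As a rough sketch of what supporting both headers can look like on the receiving side, the function below looks for `traceparent` first and falls back to `elastic-apm-traceparent`. The function name `find_trace_header` and the preference order are assumptions made for illustration; the agents' actual lookup logic is not described here.

```python
def find_trace_header(headers):
    """Prefer the W3C header; fall back to the legacy Elastic one."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in ("traceparent", "elastic-apm-traceparent"):
        if name in lowered:
            return name, lowered[name]
    return None  # neither header present: the caller would start a new trace
```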
|
||
[float] | ||
=== Visualize distributed tracing | ||
|
||
The APM app's timeline visualization provides a visual deep-dive into each of your application's traces: | ||
|
||
[role="screenshot"] | ||
image::images/apm-distributed-tracing.png[Distributed tracing in the APM UI] | ||
|
||
[float] | ||
=== Manual distributed tracing | ||
|
||
Elastic agents automatically propagate distributed tracing context for supported technologies. | ||
If your service communicates over a different, unsupported protocol, | ||
you can manually propagate distributed tracing context from a sending service to a receiving service | ||
with each agent's API. | ||
|
||
[float] | ||
==== Add the `traceparent` header to outgoing requests | ||
|
||
Sending services must add the `traceparent` header to outgoing requests. | ||
|
||
-- | ||
include::../tab-widgets/distributed-trace-send-widget.asciidoc[] | ||
-- | ||
|
||
[float] | ||
==== Parse the `traceparent` header on incoming requests | ||
|
||
Receiving services must parse the incoming `traceparent` header, | ||
and start a new transaction or span as a child of the received context. | ||
|
||
-- | ||
include::../tab-widgets/distributed-trace-receive-widget.asciidoc[] | ||
-- | ||
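Independent of any particular agent, the transport side of manual propagation might look like the sketch below, which carries the `traceparent` value inside a JSON message envelope instead of an HTTP header. The envelope shape and the helper names `wrap_message` and `unwrap_message` are hypothetical. Obtaining the `traceparent` string on the sending side, and starting a child transaction from it on the receiving side, is done with each agent's API as shown in the tabs above.

```python
import json


def wrap_message(payload, traceparent):
    """Sender side: carry the trace context next to the payload."""
    return json.dumps({"meta": {"traceparent": traceparent}, "body": payload})


def unwrap_message(raw):
    """Receiver side: return (payload, traceparent), with None if absent."""
    envelope = json.loads(raw)
    return envelope["body"], envelope.get("meta", {}).get("traceparent")
```

If `unwrap_message` returns `None` for the trace context, the receiving service has nothing to continue and would start a new trace.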
|
||
[float] | ||
=== Distributed tracing with RUM | ||
|
||
Some additional setup may be required to correlate requests correctly with the Real User Monitoring (RUM) agent. | ||
|
||
See the {apm-rum-ref}/distributed-tracing-guide.html[RUM distributed tracing guide] | ||
for information on enabling cross-origin requests, setting up server configuration, | ||
and working with dynamically-generated HTML. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
[[trace-sampling]] | ||
=== Transaction sampling | ||
|
||
Elastic APM supports head-based, probability sampling. | ||
_Head-based_ means the sampling decision for each trace is made when that trace is initiated. | ||
_Probability sampling_ means that each trace has a defined and equal probability of being sampled. | ||
|
||
For example, a sampling value of `.2` indicates a transaction sample rate of `20%`. | ||
This means that only `20%` of traces will send and retain all of their associated information. | ||
The remaining traces will drop contextual information to reduce the transfer and storage size of the trace. | ||
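To make the arithmetic concrete, here is a small simulation of head-based probability sampling. It is only a sketch: `sample_traces` is a hypothetical helper, and the fixed seed exists solely to make the run repeatable.

```python
import random


def sample_traces(n_traces, sample_rate, seed=0):
    """Make one head-based decision per trace; return how many were sampled."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    # Each trace gets the same, independent chance of being sampled.
    return sum(rng.random() < sample_rate for _ in range(n_traces))
```

With a rate of `.2`, `sample_traces(10_000, 0.2)` lands close to 2,000 sampled traces; the exact count varies with the random draws.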
|
||
[float] | ||
==== Why sample? | ||
|
||
Distributed tracing can generate a substantial amount of data, | ||
and storage can be a concern for users running `100%` sampling -- especially as they scale. | ||
|
||
The goal of probability sampling is to provide you with a representative set of data that allows | ||
you to make statistical inferences about the entire group of data. | ||
In other words, in most cases, you can still find anomalous patterns in your applications, detect outages, track errors, | ||
and lower MTTR, even when sampling at less than `100%`. | ||
|
||
[float] | ||
==== What data is sampled? | ||
|
||
A sampled trace retains all data associated with it. | ||
|
||
Non-sampled traces drop <<transaction-spans,`span`>> data. | ||
Spans contain more granular information about what is happening within a transaction, | ||
like external requests or database calls. | ||
Spans also contain contextual information and labels. | ||
|
||
Regardless of the sampling decision, all traces retain transaction and error data. | ||
This means the following data will always accurately reflect *all* of your application's requests, regardless of the configured sampling rate: | ||
|
||
* Transaction duration and transactions per minute | ||
* Transaction breakdown metrics | ||
* Errors, error occurrence, and error rate | ||
|
||
// To turn off the sending of all data, including transaction and error data, set `active` to `false`. | ||
|
||
[float] | ||
==== Sample rates | ||
|
||
What's the best sampling rate? Unfortunately, there isn't one. | ||
Sampling is dependent on your data, the throughput of your application, data retention policies, and other factors. | ||
Sampling rates anywhere from `.1%` to `100%` are considered normal. | ||
You may even decide to have a unique sample rate per service -- for example, if a certain service | ||
experiences considerably more or less traffic than another. | ||
|
||
// Regardless, cost conscious customers are likely to be fine with a lower sample rate. | ||
|
||
[float] | ||
==== Sampling with distributed tracing | ||
|
||
The initiating service makes the sampling decision in a distributed trace, | ||
and all downstream services respect that decision. | ||
|
||
In each example below, `Service A` initiates four transactions. | ||
In the first example, `Service A` samples at `.5` (`50%`). In the second, `Service A` samples at `1` (`100%`). | ||
Each subsequent service respects the initial sampling decision, regardless of its own configured sample rate. | ||
The result is a sampling percentage that matches the initiating service: | ||
|
||
image::images/dt-sampling-example.png[How sampling impacts distributed tracing] | ||
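The same behavior can be modeled in code. In this sketch, `simulate` is a hypothetical helper, and the service names and rates are illustrative. Only the first service in the mapping consults its own sample rate; every downstream service inherits the decision.

```python
import random


def simulate(rates, n_traces, seed=0):
    """Count sampled transactions per service when only the head decides.

    `rates` maps service name to its configured sample rate, in call order;
    the first entry is the initiating service.
    """
    rng = random.Random(seed)
    services = list(rates)
    counts = dict.fromkeys(services, 0)
    for _ in range(n_traces):
        # Head-based: one decision, made by the initiating service.
        sampled = rng.random() < rates[services[0]]
        for name in services:
            # Downstream services inherit the decision; their own rate is unused.
            counts[name] += sampled
    return counts
```

Whatever rates the downstream services are configured with, the per-service counts returned by `simulate` are always identical, because the decision is made once per trace.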
|
||
[float] | ||
==== APM app implications | ||
|
||
Because the transaction sample rate is respected by downstream services, | ||
the APM app always knows which transactions have and haven't been sampled. | ||
This prevents the app from showing broken traces. | ||
In addition, because transaction and error data is retained regardless of the sampling decision, | ||
you can always expect metrics and errors to be accurately reflected in the APM app. | ||
|
||
*Service maps* | ||
|
||
Service maps rely on distributed traces to draw connections between services. | ||
Service maps require a minimum version of each APM agent. | ||
See {kibana-ref}/service-maps.html[Service maps] for more information. | ||
|
||
// Follow-up: Add link from https://www.elastic.co/guide/en/kibana/current/service-maps.html#service-maps-how | ||
// to this page. | ||
|
||
[float] | ||
==== Adjust the sample rate | ||
|
||
There are three ways to adjust the transaction sample rate of your APM agents: | ||
|
||
Dynamic:: | ||
The transaction sample rate can be changed dynamically (no redeployment necessary) on a per-service and per-environment | ||
basis with {kibana-ref}/agent-configuration.html[APM Agent Configuration] in Kibana. | ||
|
||
Kibana API:: | ||
APM Agent configuration exposes an API that can be used to programmatically change | ||
your agents' sampling rate. | ||
An example is provided in the {kibana-ref}/agent-config-api.html[Agent configuration API reference]. | ||
|
||
Configuration:: | ||
Each agent provides a configuration value used to set the transaction sample rate. | ||
See the relevant agent's documentation for more details: | ||
|
||
* Go: {apm-go-ref-v}/configuration.html#config-transaction-sample-rate[`ELASTIC_APM_TRANSACTION_SAMPLE_RATE`] | ||
* Java: {apm-java-ref-v}/config-core.html#config-transaction-sample-rate[`transaction_sample_rate`] | ||
* .NET: {apm-dotnet-ref-v}/config-core.html#config-transaction-sample-rate[`TransactionSampleRate`] | ||
* Node.js: {apm-node-ref-v}/configuration.html#transaction-sample-rate[`transactionSampleRate`] | ||
* Python: {apm-py-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`] | ||
* Ruby: {apm-ruby-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`] |
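As one illustration of the Kibana API option above, the sketch below builds (but does not send) an HTTP request that sets the sample rate for a single service and environment. The endpoint path, body shape, and `kbn-xsrf` header are assumptions based on the Agent configuration API reference linked above and should be checked against it for your Kibana version. The function name `build_sample_rate_request` is hypothetical, and authentication is left out.

```python
import json
import urllib.request


def build_sample_rate_request(kibana_url, service, environment, rate):
    """Build (but do not send) a request that sets the sample rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("sample rate must be between 0 and 1")
    body = {
        "service": {"name": service, "environment": environment},
        # Assumption: setting values are sent as strings.
        "settings": {"transaction_sample_rate": str(rate)},
    }
    return urllib.request.Request(
        # Assumption: endpoint path taken from the agent configuration API docs.
        url=kibana_url.rstrip("/") + "/api/apm/settings/agent-configuration",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
    )
```

The returned object can be passed to `urllib.request.urlopen` once credentials have been added; building the request separately makes it easy to inspect the payload first.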
114 changes: 114 additions & 0 deletions
docs/tab-widgets/distributed-trace-receive-widget.asciidoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
// The Java agent defaults to visible. | ||
// Change with `aria-selected="false"` and `hidden=""` | ||
++++ | ||
<div class="tabs" data-tab-group="apm-agent-distributed-trace"> | ||
<div role="tablist" aria-label="dt"> | ||
<button role="tab" | ||
aria-selected="false" | ||
aria-controls="go-tab-dt-r" | ||
id="go-dt-r" | ||
tabindex="-1"> | ||
Go | ||
</button> | ||
<button role="tab" | ||
aria-selected="true" | ||
aria-controls="java-tab-dt-r" | ||
id="java-dt-r"> | ||
Java | ||
</button> | ||
<button role="tab" | ||
aria-selected="false" | ||
aria-controls="net-tab-dt-r" | ||
id="net-dt-r" | ||
tabindex="-1"> | ||
.NET | ||
</button> | ||
<button role="tab" | ||
aria-selected="false" | ||
aria-controls="node-tab-dt-r" | ||
id="node-dt-r" | ||
tabindex="-1"> | ||
Node.js | ||
</button> | ||
<button role="tab" | ||
aria-selected="false" | ||
aria-controls="python-tab-dt-r" | ||
id="python-dt-r" | ||
tabindex="-1"> | ||
Python | ||
</button> | ||
<button role="tab" | ||
aria-selected="false" | ||
aria-controls="ruby-tab-dt-r" | ||
id="ruby-dt-r" | ||
tabindex="-1"> | ||
Ruby | ||
</button> | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="go-tab-dt-r" | ||
aria-labelledby="go-dt-r" | ||
hidden=""> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=go] | ||
|
||
++++ | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="java-tab-dt-r" | ||
aria-labelledby="java-dt-r"> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=java] | ||
|
||
++++ | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="net-tab-dt-r" | ||
aria-labelledby="net-dt-r" | ||
hidden=""> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=net] | ||
|
||
++++ | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="node-tab-dt-r" | ||
aria-labelledby="node-dt-r" | ||
hidden=""> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=node] | ||
|
||
++++ | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="python-tab-dt-r" | ||
aria-labelledby="python-dt-r" | ||
hidden=""> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=python] | ||
|
||
++++ | ||
</div> | ||
<div tabindex="0" | ||
role="tabpanel" | ||
id="ruby-tab-dt-r" | ||
aria-labelledby="ruby-dt-r" | ||
hidden=""> | ||
++++ | ||
|
||
include::distributed-trace-receive.asciidoc[tag=ruby] | ||
|
||
++++ | ||
</div> | ||
</div> | ||
++++ |