docs: Add sampling docs and improve distributed tracing (#4475)
bmorelli25 authored Dec 16, 2020
1 parent bad65d7 commit d18c559
Showing 12 changed files with 784 additions and 9 deletions.
123 changes: 114 additions & 9 deletions docs/guide/distributed-tracing.asciidoc
@@ -1,17 +1,122 @@
[[distributed-tracing]]
=== Distributed tracing

Together, <<transactions,`Transactions`>> and <<transaction-spans,`Spans`>> form a `Trace`.
Traces are not events, but group together events that have a common root.
// Make tab-widgets work
include::../tab-widgets/code.asciidoc[]

Elastic APM supports distributed tracing.
Distributed tracing enables you to analyze performance throughout your microservices architecture all in one view.
This is accomplished by tracing every request, from the initial web request to your front-end service through to the queries made to your back-end services.
This makes finding possible bottlenecks throughout your application much easier and faster.
Best of all, distributed tracing needs no additional configuration; just ensure you're using the latest version of the applicable {apm-agents-ref}/index.html[agent].
A `trace` is a group of <<transactions,transactions>> and <<transaction-spans,spans>> with a common root.
Each `trace` tracks the entirety of a single request.
When a `trace` travels through multiple services, as is common in a microservice architecture,
it is known as a distributed trace.

The APM app in Kibana also supports distributed tracing.
The Timeline visualization has been redesigned to show all of the transactions from individual services that are connected in a trace:
[float]
=== Why is distributed tracing important?

Distributed tracing enables you to analyze performance throughout your microservice architecture
by tracing the entirety of a request -- from the initial web request on your front-end service
all the way to database queries made on your back-end services.

Tracking requests as they propagate through your services provides an end-to-end picture of
where your application is spending time, where errors are occurring, and where bottlenecks are forming.
Distributed tracing eliminates individual services' data silos and reveals what's happening outside of
service borders.

For supported technologies, distributed tracing works out-of-the-box, with no additional configuration required.

[float]
=== How distributed tracing works

Distributed tracing works by injecting a custom `traceparent` HTTP header into outgoing requests.
This header includes information, like `trace-id`, which is used to identify the current trace,
and `parent-id`, which is used to identify the parent of the current span on incoming requests
or the current span on an outgoing request.

When a service is working on a request, it checks for the existence of this HTTP header.
If it's missing, the service starts a new trace.
If it exists, the service ensures the current action is added as a child of the existing trace,
and continues to propagate the trace.
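
The check described above can be sketched in a few lines of Python.
This is an illustration only, not how any Elastic agent is implemented: the names `parse_traceparent`, `continue_or_start_trace`, `outgoing_traceparent`, and `TraceContext` are hypothetical, header names are assumed to be lowercase already, only the four-field `version-traceid-parentid-flags` layout is accepted, and a new trace is always marked as sampled for simplicity.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Four-field traceparent layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str              # identifies the whole trace
    parent_id: Optional[str]   # upstream span id; None when this service is the root
    span_id: str               # id for the work done by this service (hypothetical)
    sampled: bool


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is unusable."""
    if not header:
        return None
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid values.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def continue_or_start_trace(headers):
    """Join the incoming trace if a valid header exists, otherwise start a new one."""
    parsed = parse_traceparent(headers.get("traceparent"))
    span_id = secrets.token_hex(8)
    if parsed is None:
        # Header missing or invalid: start a new trace.
        # Assumption: always sampled; a real agent makes a sampling decision here.
        return TraceContext(secrets.token_hex(16), None, span_id, True)
    trace_id, parent_id, sampled = parsed
    return TraceContext(trace_id, parent_id, span_id, sampled)


def outgoing_traceparent(ctx):
    """Header value for outgoing requests: the current span becomes the parent."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
```

Calling `continue_or_start_trace(request_headers)` at the start of a request and sending `outgoing_traceparent(ctx)` with each downstream call keeps the same `trace-id` across services, while `parent-id` changes at every hop.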

[float]
==== Trace propagation examples

In this example, Elastic's Ruby agent communicates with Elastic's Java agent.
Both support the `traceparent` header, and trace data is successfully propagated.

image::images/dt-trace-ex1.png[How traceparent propagation works]

In this example, Elastic's Ruby agent communicates with OpenTelemetry's Java agent.
Both support the `traceparent` header, and trace data is successfully propagated.

image::images/dt-trace-ex2.png[How traceparent propagation works]

In this example, the trace meets a piece of middleware that doesn't propagate the `traceparent` header.
The distributed trace ends and any further communication will result in a new trace.

image::images/dt-trace-ex3.png[How traceparent propagation works]


[float]
[[w3c-tracecontext]]
==== W3C Trace Context spec

All Elastic agents now support the official W3C Trace Context specification and its `traceparent` header.
See the table below for the minimum required agent version:

[options="header"]
|====
|Agent name |Agent Version
|**Go Agent**| ≥`1.6`
|**Java Agent**| ≥`1.14`
|**.NET Agent**| ≥`1.3`
|**Node.js Agent**| ≥`3.4`
|**Python Agent**| ≥`5.4`
|**Ruby Agent**| ≥`3.5`
|**RUM Agent**| ≥`5.0`
|====

NOTE: Older Elastic agents use a unique `elastic-apm-traceparent` header.
For backward-compatibility purposes, new versions of Elastic agents still support this header.
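
As a rough illustration of what supporting both header names can look like on the receiving side, the sketch below looks up the W3C header first and falls back to the legacy one.
The function name `find_traceparent` and the lookup order are assumptions made for this example, not documented agent behavior.

```python
def find_traceparent(headers):
    """Return the trace context header value, preferring the W3C header name."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    for name in ("traceparent", "elastic-apm-traceparent"):
        value = normalized.get(name)
        if value:
            return value
    # Neither header is present: the caller would start a new trace.
    return None
```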

[float]
=== Visualize distributed tracing

The APM app's timeline visualization provides a visual deep-dive into each of your application's traces:

[role="screenshot"]
image::images/apm-distributed-tracing.png[Distributed tracing in the APM UI]

[float]
=== Manual distributed tracing

Elastic agents automatically propagate distributed tracing context for supported technologies.
If your service communicates over a different, unsupported protocol,
you can manually propagate distributed tracing context from a sending service to a receiving service
with each agent's API.

[float]
==== Add the `traceparent` header to outgoing requests

Sending services must add the `traceparent` header to outgoing requests.

--
include::../tab-widgets/distributed-trace-send-widget.asciidoc[]
--

[float]
==== Parse the `traceparent` header on incoming requests

Receiving services must parse the incoming `traceparent` header,
and start a new transaction or span as a child of the received context.

--
include::../tab-widgets/distributed-trace-receive-widget.asciidoc[]
--
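
For a protocol without HTTP headers, the trace context has to travel inside the message itself.
The sketch below shows only that transport step, using a JSON envelope; the envelope layout and the names `wrap_message` and `unwrap_message` are hypothetical.
Obtaining the `traceparent` string on the sending side, and starting a child transaction from it on the receiving side, are done with each agent's API as shown in the tabs above.

```python
import json


def wrap_message(payload, traceparent):
    """Sending side: attach the current trace context to an outgoing message."""
    return json.dumps({"meta": {"traceparent": traceparent}, "body": payload})


def unwrap_message(raw):
    """Receiving side: return (traceparent or None, body)."""
    message = json.loads(raw)
    # A missing value means the sender did not propagate any trace context.
    traceparent = message.get("meta", {}).get("traceparent")
    return traceparent, message["body"]
```

If `unwrap_message` returns `None` for the context, the receiving service has nothing to continue and starts a new trace.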

[float]
=== Distributed tracing with RUM

Some additional setup may be required to correlate requests correctly with the Real User Monitoring (RUM) agent.

See the {apm-rum-ref}/distributed-tracing-guide.html[RUM distributed tracing guide]
for information on enabling cross-origin requests, setting up server configuration,
and working with dynamically-generated HTML.
10 changes: 10 additions & 0 deletions docs/guide/features.asciidoc
@@ -5,10 +5,20 @@
<titleabbrev>Features</titleabbrev>
++++

* <<distributed-tracing>>
* <<rum>>
* <<trace-sampling>>
* <<opentracing>>
* <<open-telemetry-elastic>>
* <<observability-integrations>>
* <<apm-cross-cluster-search>>

include::./distributed-tracing.asciidoc[]

include::./rum.asciidoc[]

include::./trace-sampling.asciidoc[]

include::./opentracing.asciidoc[]

include::./opentelemetry-elastic.asciidoc[]
Binary file modified docs/guide/images/apm-distributed-tracing.png
Binary file added docs/guide/images/dt-sampling-example.png
Binary file added docs/guide/images/dt-trace-ex1.png
Binary file added docs/guide/images/dt-trace-ex2.png
Binary file added docs/guide/images/dt-trace-ex3.png
107 changes: 107 additions & 0 deletions docs/guide/trace-sampling.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
[[trace-sampling]]
=== Transaction sampling

Elastic APM supports head-based, probability sampling.
_Head-based_ means the sampling decision for each trace is made when that trace is initiated.
_Probability sampling_ means that each trace has a defined and equal probability of being sampled.

For example, a sampling value of `.2` indicates a transaction sample rate of `20%`.
This means that only `20%` of traces will send and retain all of their associated information.
The remaining traces will drop contextual information to reduce the transfer and storage size of the trace.
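
A head-based probability decision can be sketched as a single random draw made when the trace starts.
The function name `make_sampling_decision` and the injectable `rng` parameter are illustrative assumptions, not part of any agent API.

```python
import random


def make_sampling_decision(sample_rate, rng=random.random):
    """Head-based: called once, when the root transaction of a trace starts."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # rng() returns a float in [0, 1), so 1.0 always samples and 0.0 never does.
    return rng() < sample_rate
```

Over many traces, a rate of `.2` yields roughly one sampled trace in five; any single trace is either fully sampled or not.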

[float]
==== Why sample?

Distributed tracing can generate a substantial amount of data,
and storage can be a concern for users running `100%` sampling -- especially as they scale.

The goal of probability sampling is to provide you with a representative set of data that allows
you to make statistical inferences about the entire group of data.
In other words, in most cases, you can still find anomalous patterns in your applications, detect outages, track errors,
and lower MTTR, even when sampling at less than `100%`.

[float]
==== What data is sampled?

A sampled trace retains all data associated with it.

Non-sampled traces drop <<transaction-spans,`span`>> data.
Spans contain more granular information about what is happening within a transaction,
like external requests or database calls.
Spans also contain contextual information and labels.

Regardless of the sampling decision, all traces retain transaction and error data.
This means the following data will always accurately reflect *all* of your application's requests, regardless of the configured sampling rate:

* Transaction duration and transactions per minute
* Transaction breakdown metrics
* Errors, error occurrence, and error rate

// To turn off the sending of all data, including transaction and error data, set `active` to `false`.

[float]
==== Sample rates

What's the best sampling rate? Unfortunately, there isn't one.
Sampling depends on your data, the throughput of your application, data retention policies, and other factors.
Sample rates anywhere from `.1%` to `100%` are considered normal.
You may even decide to have a unique sample rate per service -- for example, if a certain service
experiences considerably more or less traffic than another.

// Regardless, cost conscious customers are likely to be fine with a lower sample rate.

[float]
==== Sampling with distributed tracing

The initiating service makes the sampling decision in a distributed trace,
and all downstream services respect that decision.

In each example below, `Service A` initiates four transactions.
In the first example, `Service A` samples at `.5` (`50%`). In the second, `Service A` samples at `1` (`100%`).
Each subsequent service respects the initial sampling decision, regardless of its own configured sample rate.
The result is a sampling percentage that matches the initiating service:

image::images/dt-sampling-example.png[How sampling impacts distributed tracing]
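
The same behavior can be sketched in code: a service applies its own sample rate only when it initiates the trace, and otherwise reuses the decision it received.
The function name `decide_sampled` and its parameters are hypothetical and exist only for this illustration.

```python
import random


def decide_sampled(own_sample_rate, incoming_sampled=None, rng=random.random):
    """Return the sampling decision a service applies to a transaction.

    incoming_sampled is None when this service initiates the trace.
    """
    if incoming_sampled is not None:
        # Downstream service: its own configured rate is ignored.
        return incoming_sampled
    # Initiating service: make the head-based decision.
    return rng() < own_sample_rate


# Service A samples at .5; services B and C are configured differently
# but follow whatever A decided for each trace.
for _ in range(4):
    a = decide_sampled(0.5)
    b = decide_sampled(1.0, incoming_sampled=a)
    c = decide_sampled(0.1, incoming_sampled=b)
    assert a == b == c
```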

[float]
==== APM app implications

Because the transaction sample rate is respected by downstream services,
the APM app always knows which transactions have and haven't been sampled.
This prevents the app from showing broken traces.
In addition, because transaction and error data is never sampled,
you can always expect metrics and errors to be accurately reflected in the APM app.

*Service maps*

Service maps rely on distributed traces to draw connections between services.
Service maps require a minimum version of the APM agents.
See {kibana-ref}/service-maps.html[Service maps] for more information.

// Follow-up: Add link from https://www.elastic.co/guide/en/kibana/current/service-maps.html#service-maps-how
// to this page.

[float]
==== Adjust the sample rate

There are three ways to adjust the transaction sample rate of your APM agents:

Dynamic::
The transaction sample rate can be changed dynamically (no redeployment necessary) on a per-service and per-environment
basis with {kibana-ref}/agent-configuration.html[APM Agent Configuration] in Kibana.

Kibana API::
APM Agent configuration exposes an API that can be used to programmatically change
your agents' sampling rate.
An example is provided in the {kibana-ref}/agent-config-api.html[Agent configuration API reference].

Configuration::
Each agent provides a configuration value used to set the transaction sample rate.
See the relevant agent's documentation for more details:

* Go: {apm-go-ref-v}/configuration.html#config-transaction-sample-rate[`ELASTIC_APM_TRANSACTION_SAMPLE_RATE`]
* Java: {apm-java-ref-v}/config-core.html#config-transaction-sample-rate[`transaction_sample_rate`]
* .NET: {apm-dotnet-ref-v}/config-core.html#config-transaction-sample-rate[`TransactionSampleRate`]
* Node.js: {apm-node-ref-v}/configuration.html#transaction-sample-rate[`transactionSampleRate`]
* Python: {apm-py-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
* Ruby: {apm-ruby-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
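
As an illustration of the Kibana API option, the sketch below builds, but does not send, a request that sets `transaction_sample_rate` for one service and environment.
The endpoint path and body shape are assumptions based on the Agent configuration API reference linked above, so verify them against your Kibana version. `KIBANA_URL` and `build_sample_rate_request` are hypothetical names, and authentication is omitted.

```python
import json
import urllib.request

# Hypothetical Kibana address; replace with your own deployment's URL.
KIBANA_URL = "http://localhost:5601"


def build_sample_rate_request(service_name, environment, sample_rate):
    """Build (but do not send) a request that sets transaction_sample_rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    body = {
        "service": {"name": service_name, "environment": environment},
        # Assumption: setting values are sent as strings.
        "settings": {"transaction_sample_rate": str(sample_rate)},
    }
    return urllib.request.Request(
        url=f"{KIBANA_URL}/api/apm/settings/agent-configuration",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
        method="PUT",
    )
```

After adding credentials for your deployment, the returned request could be sent with `urllib.request.urlopen`.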
114 changes: 114 additions & 0 deletions docs/tab-widgets/distributed-trace-receive-widget.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
// The Java agent defaults to visible.
// Change with `aria-selected="false"` and `hidden=""`
++++
<div class="tabs" data-tab-group="apm-agent-distributed-trace">
<div role="tablist" aria-label="dt">
<button role="tab"
aria-selected="false"
aria-controls="go-tab-dt-r"
id="go-dt-r">
Go
</button>
<button role="tab"
aria-selected="true"
aria-controls="java-tab-dt-r"
id="java-dt-r"
tabindex="-1">
Java
</button>
<button role="tab"
aria-selected="false"
aria-controls="net-tab-dt-r"
id="net-dt-r"
tabindex="-1">
.NET
</button>
<button role="tab"
aria-selected="false"
aria-controls="node-tab-dt-r"
id="node-dt-r"
tabindex="-1">
Node.js
</button>
<button role="tab"
aria-selected="false"
aria-controls="python-tab-dt-r"
id="python-dt-r"
tabindex="-1">
Python
</button>
<button role="tab"
aria-selected="false"
aria-controls="ruby-tab-dt-r"
id="ruby-dt-r"
tabindex="-1">
Ruby
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="go-tab-dt-r"
aria-labelledby="go-dt-r"
hidden="">
++++

include::distributed-trace-receive.asciidoc[tag=go]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="java-tab-dt-r"
aria-labelledby="java-dt-r">
++++

include::distributed-trace-receive.asciidoc[tag=java]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="net-tab-dt-r"
aria-labelledby="net-dt-r"
hidden="">
++++

include::distributed-trace-receive.asciidoc[tag=net]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="node-tab-dt-r"
aria-labelledby="node-dt-r"
hidden="">
++++

include::distributed-trace-receive.asciidoc[tag=node]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="python-tab-dt-r"
aria-labelledby="python-dt-r"
hidden="">
++++

include::distributed-trace-receive.asciidoc[tag=python]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="ruby-tab-dt-r"
aria-labelledby="ruby-dt-r"
hidden="">
++++

include::distributed-trace-receive.asciidoc[tag=ruby]

++++
</div>
</div>
++++