docs: Add sampling docs and improve distributed tracing (#4475)

elastic · Dec 16, 2020 · d18c559
1 parent bad65d7
Showing 12 changed files with 784 additions and 9 deletions.
diff --git a/docs/guide/distributed-tracing.asciidoc b/docs/guide/distributed-tracing.asciidoc
@@ -1,17 +1,122 @@
 [[distributed-tracing]]
 === Distributed tracing
 
-Together, <<transactions,`Transactions`>> and <<transaction-spans,`Spans`>> form a `Trace`.
-Traces are not events, but group together events that have a common root.
+// Make tab-widgets work
+include::../tab-widgets/code.asciidoc[]
 
-Elastic APM supports distributed tracing.
-Distributed tracing enables you to analyze performance throughout your microservices architecture all in one view.
-This is accomplished by tracing all of the requests - from the initial web request to your front-end service - to queries made to your back-end services.
-This makes finding possible bottlenecks throughout your application much easier and faster.
-Best of all, there's no additional configuration needed for distributed tracing, just ensure you're using the latest version of the applicable {apm-agents-ref}/index.html[agent].
+A `trace` is a group of <<transactions,transactions>> and <<transaction-spans,spans>> with a common root.
+Each `trace` tracks the entirety of a single request.
+When a `trace` travels through multiple services, as is common in a microservice architecture,
+it is known as a distributed trace.
 
-The APM app in Kibana also supports distributed tracing.
-The Timeline visualization has been redesigned to show all of the transactions from individual services that are connected in a trace:
+[float]
+=== Why is distributed tracing important?
+
+Distributed tracing enables you to analyze performance throughout your microservice architecture
+by tracing the entirety of a request -- from the initial web request on your front-end service
+all the way to database queries made on your back-end services.
+
+Tracking requests as they propagate through your services provides an end-to-end picture of
+where your application is spending time, where errors are occurring, and where bottlenecks are forming.
+Distributed tracing eliminates individual services' data silos and reveals what's happening outside of
+service borders.
+
+For supported technologies, distributed tracing works out-of-the-box, with no additional configuration required.
+
+[float]
+=== How distributed tracing works
+
+Distributed tracing works by injecting a custom `traceparent` HTTP header into outgoing requests.
+This header includes information, like `trace-id`, which is used to identify the current trace,
+and `parent-id`, which is used to identify the parent of the current span on incoming requests
+or the current span on an outgoing request.
+
+When a service is working on a request, it checks for the existence of this HTTP header.
+If it's missing, the service starts a new trace.
+If it exists, the service ensures the current action is added as a child of the existing trace,
+and continues to propagate the trace.
+
+[float]
+==== Trace propagation examples
+
+In this example, Elastic's Ruby agent communicates with Elastic's Java agent.
+Both support the `traceparent` header, and trace data is successfully propagated.
+
+image::images/dt-trace-ex1.png[How traceparent propagation works]
+
+In this example, Elastic's Ruby agent communicates with OpenTelemetry's Java agent.
+Both support the `traceparent` header, and trace data is successfully propagated.
+
+image::images/dt-trace-ex2.png[How traceparent propagation works]
+
+In this example, the trace meets a piece of middleware that doesn't propagate the `traceparent` header.
+The distributed trace ends and any further communication will result in a new trace.
+
+image::images/dt-trace-ex3.png[How traceparent propagation works]
+
+
+[float]
+[[w3c-tracecontext]]
+==== W3C Trace Context spec
+
+All Elastic agents now support the official W3C Trace Context spec and `traceparent` header.
+See the table below for the minimum required agent version:
+
+[options="header"]
+|====
+|Agent name |Agent Version
+|**Go Agent**| ≥`1.6`
+|**Java Agent**| ≥`1.14`
+|**.NET Agent**| ≥`1.3`
+|**Node.js Agent**| ≥`3.4`
+|**Python Agent**| ≥`5.4`
+|**Ruby Agent**| ≥`3.5`
+|**RUM Agent**| ≥`5.0`
+|====
+
+NOTE: Older Elastic agents use a unique `elastic-apm-traceparent` header.
+For backward-compatibility purposes, new versions of Elastic agents still support this header.
+
+[float]
+=== Visualize distributed tracing
+
+The APM app's timeline visualization provides a visual deep-dive into each of your application's traces:
 
 [role="screenshot"]
 image::images/apm-distributed-tracing.png[Distributed tracing in the APM UI]
+
+[float]
+=== Manual distributed tracing
+
+Elastic agents automatically propagate distributed tracing context for supported technologies.
+If your service communicates over a different, unsupported protocol,
+you can manually propagate distributed tracing context from a sending service to a receiving service
+with each agent's API.
+
+[float]
+==== Add the `traceparent` header to outgoing requests
+
+Sending services must add the `traceparent` header to outgoing requests.
+
+--
+include::../tab-widgets/distributed-trace-send-widget.asciidoc[]
+--
+
+[float]
+==== Parse the `traceparent` header on incoming requests
+
+Receiving services must parse the incoming `traceparent` header,
+and start a new transaction or span as a child of the received context.
+
+--
+include::../tab-widgets/distributed-trace-receive-widget.asciidoc[]
+--
+
+[float]
+=== Distributed tracing with RUM
+
+Some additional setup may be required to correlate requests correctly with the Real User Monitoring (RUM) agent.
+
+See the {apm-rum-ref}/distributed-tracing-guide.html[RUM distributed tracing guide]
+for information on enabling cross-origin requests, setting up server configuration,
+and working with dynamically-generated HTML.
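
Annotation on the "How distributed tracing works" and "Manual distributed tracing" text above: a minimal Python sketch of the check-then-propagate flow, assuming the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`). The names `TraceContext`, `parse_traceparent`, `continue_or_start_trace`, and `format_traceparent` are hypothetical and not part of any Elastic agent API. Sampling is fixed at 100% for new traces to keep the sketch short.

```python
import re
import secrets
from typing import Mapping, NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    """Hypothetical container for the fields carried by `traceparent`."""
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def continue_or_start_trace(headers: Mapping[str, str]) -> TraceContext:
    """Join the incoming trace if the header exists; otherwise start a new one.

    Assumes header names were already lower-cased by the caller.
    """
    incoming = parse_traceparent(headers.get("traceparent"))
    span_id = secrets.token_hex(8)  # id of the span created by this service
    if incoming is None:
        # No usable header: this service becomes the root of a new trace.
        return TraceContext(secrets.token_hex(16), span_id, True)
    # Same trace id and sampling decision; this span is the new parent
    # for any outgoing request.
    return TraceContext(incoming.trace_id, span_id, incoming.sampled)


def format_traceparent(ctx: TraceContext) -> str:
    """Build the header value to attach to an outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{'01' if ctx.sampled else '00'}"
```

A receiving service would call `continue_or_start_trace(request_headers)` and a sending service would attach `format_traceparent(ctx)` to its outgoing request. For real services, use each agent's documented API as the text above describes.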
diff --git a/docs/guide/features.asciidoc b/docs/guide/features.asciidoc
@@ -5,10 +5,20 @@
 <titleabbrev>Features</titleabbrev>
 ++++
 
+* <<distributed-tracing>>
+* <<rum>>
+* <<trace-sampling>>
+* <<opentracing>>
+* <<open-telemetry-elastic>>
+* <<observability-integrations>>
+* <<apm-cross-cluster-search>>
+
 include::./distributed-tracing.asciidoc[]
 
 include::./rum.asciidoc[]
 
+include::./trace-sampling.asciidoc[]
+
 include::./opentracing.asciidoc[]
 
 include::./opentelemetry-elastic.asciidoc[]

diff --git a/docs/guide/images/apm-distributed-tracing.png b/docs/guide/images/apm-distributed-tracing.png
diff --git a/docs/guide/images/dt-sampling-example.png b/docs/guide/images/dt-sampling-example.png
diff --git a/docs/guide/images/dt-trace-ex1.png b/docs/guide/images/dt-trace-ex1.png
diff --git a/docs/guide/images/dt-trace-ex2.png b/docs/guide/images/dt-trace-ex2.png
diff --git a/docs/guide/images/dt-trace-ex3.png b/docs/guide/images/dt-trace-ex3.png
diff --git a/docs/guide/trace-sampling.asciidoc b/docs/guide/trace-sampling.asciidoc
@@ -0,0 +1,107 @@
+[[trace-sampling]]
+=== Transaction sampling
+
+Elastic APM supports head-based, probability sampling.
+_Head-based_ means the sampling decision for each trace is made when that trace is initiated.
+_Probability sampling_ means that each trace has a defined and equal probability of being sampled.
+
+For example, a sampling value of `.2` indicates a transaction sample rate of `20%`.
+This means that only `20%` of traces will send and retain all of their associated information.
+The remaining traces will drop contextual information to reduce the transfer and storage size of the trace.
+
+[float]
+==== Why sample?
+
+Distributed tracing can generate a substantial amount of data,
+and storage can be a concern for users running `100%` sampling -- especially as they scale.
+
+The goal of probability sampling is to provide you with a representative set of data that allows
+you to make statistical inferences about the entire group of data.
+In other words, in most cases, you can still find anomalous patterns in your applications, detect outages, track errors,
+and lower MTTR, even when sampling at less than `100%`.
+
+[float]
+==== What data is sampled?
+
+A sampled trace retains all data associated with it.
+
+Non-sampled traces drop <<transaction-spans,`span`>> data.
+Spans contain more granular information about what is happening within a transaction,
+like external requests or database calls.
+Spans also contain contextual information and labels.
+
+Regardless of the sampling decision, all traces retain transaction and error data.
+This means the following data will always accurately reflect *all* of your application's requests, regardless of the configured sampling rate:
+
+* Transaction duration and transactions per minute
+* Transaction breakdown metrics
+* Errors, error occurrence, and error rate
+
+// To turn off the sending of all data, including transaction and error data, set `active` to `false`.
+
+[float]
+==== Sample rates
+
+What's the best sampling rate? Unfortunately, there isn't one.
+Sampling is dependent on your data, the throughput of your application, data retention policies, and other factors.
+Any sampling rate from `0.1%` to `100%` would be considered normal.
+You may even decide to have a unique sample rate per service -- for example, if a certain service
+experiences considerably more or less traffic than another.
+
+// Regardless, cost conscious customers are likely to be fine with a lower sample rate.
+
+[float]
+==== Sampling with distributed tracing
+
+The initiating service makes the sampling decision in a distributed trace,
+and all downstream services respect that decision.
+
+In each example below, `Service A` initiates four transactions.
+In the first example, `Service A` samples at `.5` (`50%`). In the second, `Service A` samples at `1` (`100%`).
+Each subsequent service respects the initial sampling decision, regardless of its own configured sample rate.
+The result is a sampling percentage that matches the initiating service:
+
+image::images/dt-sampling-example.png[How sampling impacts distributed tracing]
+
+[float]
+==== APM app implications
+
+Because the transaction sample rate is respected by downstream services,
+the APM app always knows which transactions have and haven't been sampled.
+This prevents the app from showing broken traces.
+In addition, because transaction and error data is never sampled,
+you can always expect metrics and errors to be accurately reflected in the APM app.
+
+*Service maps*
+
+Service maps rely on distributed traces to draw connections between services.
+Service maps require a minimum version of the APM agents.
+See {kibana-ref}/service-maps.html[Service maps] for more information.
+
+// Follow-up: Add link from https://www.elastic.co/guide/en/kibana/current/service-maps.html#service-maps-how
+// to this page.
+
+[float]
+==== Adjust the sample rate
+
+There are three ways to adjust the transaction sample rate of your APM agents:
+
+Dynamic::
+The transaction sample rate can be changed dynamically (no redeployment necessary) on a per-service and per-environment
+basis with {kibana-ref}/agent-configuration.html[APM Agent Configuration] in Kibana.
+
+Kibana API::
+APM Agent configuration exposes an API that can be used to programmatically change
+your agents' sampling rate.
+An example is provided in the {kibana-ref}/agent-config-api.html[Agent configuration API reference].
+
+Configuration::
+Each agent provides a configuration value used to set the transaction sample rate.
+See the relevant agent's documentation for more details:
+
+* Go: {apm-go-ref-v}/configuration.html#config-transaction-sample-rate[`ELASTIC_APM_TRANSACTION_SAMPLE_RATE`]
+* Java: {apm-java-ref-v}/config-core.html#config-transaction-sample-rate[`transaction_sample_rate`]
+* .NET: {apm-dotnet-ref-v}/config-core.html#config-transaction-sample-rate[`TransactionSampleRate`]
+* Node.js: {apm-node-ref-v}/configuration.html#transaction-sample-rate[`transactionSampleRate`]
+* Python: {apm-py-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
+* Ruby: {apm-ruby-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
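
Annotation on the "Transaction sampling" text above: a minimal Python sketch of head-based probability sampling, in which the initiating service draws once against its configured rate and downstream services reuse that decision. The function name `head_sampling_decision` and its parameters are hypothetical; the sketch illustrates the concept and is not how any Elastic agent is implemented.

```python
import random
from typing import Callable, Optional


def head_sampling_decision(
    sample_rate: float,
    upstream_sampled: Optional[bool] = None,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a trace is sampled.

    sample_rate:      this service's configured rate, from 0.0 to 1.0.
    upstream_sampled: the decision received from the calling service,
                      or None if this service initiates the trace.
    rng:              returns a float in [0.0, 1.0); injectable for testing.
    """
    if upstream_sampled is not None:
        # Downstream services respect the initial decision and ignore
        # their own configured rate.
        return upstream_sampled
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Head-based: the decision is made once, when the trace starts.
    return rng() < sample_rate


# Service A initiates the trace at 50%; Service B is configured for 10%
# but inherits whatever Service A decided.
decision_a = head_sampling_decision(0.5)
decision_b = head_sampling_decision(0.1, upstream_sampled=decision_a)
```

Because `decision_b` always equals `decision_a`, the effective sampling percentage of the whole distributed trace matches the initiating service, as the text above describes.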
diff --git a/docs/tab-widgets/distributed-trace-receive-widget.asciidoc b/docs/tab-widgets/distributed-trace-receive-widget.asciidoc
@@ -0,0 +1,114 @@
+// The Java agent defaults to visible.
+// Change with `aria-selected="false"` and `hidden=""`
+++++
+<div class="tabs" data-tab-group="apm-agent-distributed-trace">
+  <div role="tablist" aria-label="dt">
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="go-tab-dt-r"
+            id="go-dt-r"
+            tabindex="-1">
+      Go
+    </button>
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="java-tab-dt-r"
+            id="java-dt-r">
+      Java
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="net-tab-dt-r"
+            id="net-dt-r"
+            tabindex="-1">
+      .NET
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="node-tab-dt-r"
+            id="node-dt-r"
+            tabindex="-1">
+      Node.js
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="python-tab-dt-r"
+            id="python-dt-r"
+            tabindex="-1">
+      Python
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="ruby-tab-dt-r"
+            id="ruby-dt-r"
+            tabindex="-1">
+      Ruby
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="go-tab-dt-r"
+       aria-labelledby="go-dt-r"
+       hidden="">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=go]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="java-tab-dt-r"
+       aria-labelledby="java-dt-r">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=java]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="net-tab-dt-r"
+       aria-labelledby="net-dt-r"
+       hidden="">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=net]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="node-tab-dt-r"
+       aria-labelledby="node-dt-r"
+       hidden="">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=node]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="python-tab-dt-r"
+       aria-labelledby="python-dt-r"
+       hidden="">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=python]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="ruby-tab-dt-r"
+       aria-labelledby="ruby-dt-r"
+       hidden="">
+++++
+
+include::distributed-trace-receive.asciidoc[tag=ruby]
+
+++++
+  </div>
+</div>
+++++