Elasticsearch is instrumented using the OpenTelemetry API, which allows ES developers to gather traces and analyze what Elasticsearch is doing.
The Elasticsearch server code contains a tracing package, which is an abstraction over the OpenTelemetry API. All locations in the code that perform instrumentation and tracing must use these abstractions.
Separately, there is the apm module, which works with the OpenTelemetry API directly to record trace data. Underneath the OTel API, we use Elastic's APM agent for Java, which attaches at runtime to the Elasticsearch JVM and removes the need for Elasticsearch to hard-code the use of an OTel implementation. Note that while it is possible to programmatically start the APM agent, the Security Manager permissions required make this essentially impossible.
You must supply configuration and credentials for the APM server (see below).
In your `elasticsearch.yml` add the following configuration:

```yaml
telemetry.tracing.enabled: true
telemetry.agent.server_url: https://<your-apm-server>:443
```
When using a secret token to authenticate with the APM server, you must add it to the Elasticsearch keystore under `telemetry.secret_token`. For example, execute:

```shell
bin/elasticsearch-keystore add telemetry.secret_token
```

then enter the token when prompted. If you are using API keys, change the keystore key name to `telemetry.api_key`.
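If you do use an API key, storing it follows the same prompt-based flow, for example:

```shell
bin/elasticsearch-keystore add telemetry.api_key
```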
All APM settings live under `telemetry`. Tracing-related settings go under `telemetry.tracing`, and settings related to the Java agent go under `telemetry.agent`. Anything you set under `telemetry.agent` will be propagated to the agent.
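For example, an agent option can be set at startup by prefixing it with `telemetry.agent.` in `elasticsearch.yml`; the value below is only an illustration:

```yaml
telemetry.agent.transaction_sample_rate: "0.2"
```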
For agent settings that can be changed dynamically, you can use the cluster settings REST API. For example, to change the sampling rate:

```shell
curl -XPUT \
  -H "Content-Type: application/json" \
  -u "$USERNAME:$PASSWORD" \
  -d '{ "persistent": { "telemetry.agent.transaction_sample_rate": "0.75" } }' \
  https://localhost:9200/_cluster/settings
```
For context, the APM agent pulls configuration from multiple sources, with a hierarchy that means, for example, that options set in the config file cannot be overridden via system properties.
Now, in order to send tracing data to the APM server, ES needs to be configured with either a `secret_token` or an `api_key`. We could configure these in the agent via system properties, but then their values would be available to any Java code in Elasticsearch that can read system properties.

Instead, when Elasticsearch bootstraps itself, it compiles all APM settings together, including any `secret_token` or `api_key` values from the ES keystore, and writes out a temporary APM config file containing all static configuration (i.e. values that cannot change after the agent starts). This file is deleted as soon as possible after ES starts up. Settings that are not sensitive and can be changed dynamically are configured via system properties. Calls to the ES settings REST API are translated into system property writes, which the agent later picks up and applies.
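To illustrate that last step, the translation might conceptually look like the sketch below. The `elastic.apm.` property prefix is what the Elastic APM Java agent reads, but the class and method here are hypothetical, not the actual Elasticsearch implementation:

```java
// Hypothetical sketch, not the actual Elasticsearch implementation.
class ApmAgentSettingBridge {
    // Apply a dynamic telemetry.agent.* cluster setting by writing the
    // corresponding "elastic.apm.*" system property, which the attached
    // APM agent later picks up and applies.
    static void applyDynamicAgentSetting(String settingName, String value) {
        // e.g. "telemetry.agent.transaction_sample_rate" -> "transaction_sample_rate"
        String agentOption = settingName.substring("telemetry.agent.".length());
        System.setProperty("elastic.apm." + agentOption, value);
    }
}
```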
You need to have an APM server running somewhere. For example, you can create a deployment in Elastic Cloud with Elastic's APM integration.
We primarily trace "tasks". The tasks framework in Elasticsearch allows work to be scheduled for execution, cancelled, executed in a different thread pool, and so on. Tracing a task results in a "span", which represents the execution of the task in the tracing system. We also instrument REST requests, which are not (at present) modelled by tasks.
A span can be associated with a parent span, which allows all spans in, for example, a REST request to be grouped together. Spans can track work across different Elasticsearch nodes.
Elasticsearch also supports distributed tracing via W3C Trace Context headers. If clients of Elasticsearch send these headers with their requests, then that data will be forwarded to the APM server in order to yield a trace across systems.
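For example, a client that is already inside a trace might pass its context along with a request like this (the IDs below are the example values from the W3C specification, not real ones):

```shell
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
  -u "$USERNAME:$PASSWORD" \
  "https://localhost:9200/_search"
```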
In rare circumstances, it is possible to avoid tracing a task using `TaskManager#register(String, String, TaskAwareRequest, boolean)`. For example, Machine Learning uses tasks to record which models are loaded on each node. Such tasks are long-lived and are not suitable candidates for APM tracing.
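A sketch of how that overload might be used at a call site; `taskManager` and `request` come from the surrounding Elasticsearch code, the string values are illustrative, and the final boolean is assumed to control whether an APM span is created for the task:

```java
// Hypothetical call site: register a long-lived bookkeeping task without tracing it.
Task task = taskManager.register(
    "persistent",              // task type (illustrative)
    "xpack/ml/loaded_models",  // action name (illustrative)
    request,                   // a TaskAwareRequest describing the work
    false                      // false: do not create an APM span for this task
);
```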
When a span is started, Elasticsearch tracks information about that span in the current thread context. If a new thread context is created, then the current span information must not be propagated but instead renamed, so that (1) it doesn't interfere when new trace information is set in the context, and (2) the previous trace information is available to establish a parent / child span relationship. This is done with `ThreadContext#newTraceContext()`.
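For instance, a caller about to start a child span might open a new trace context first. This is a sketch, assuming `newTraceContext()` returns a `ThreadContext.StoredContext` that restores the previous context when closed, as `ThreadContext#stashContext()` does:

```java
// Open a fresh trace context: the current span's identifiers are preserved
// under separate keys, so a span started inside this block can be linked to
// it as a child without overwriting it.
try (ThreadContext.StoredContext ignored = threadContext.newTraceContext()) {
    // ... start the child span and dispatch the work here ...
}
```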
Sometimes we need to detach new spans from their parent. For example, creating an index starts some related background tasks, but these shouldn't be associated with the REST request, otherwise all the background task spans will be associated with the REST request for as long as Elasticsearch is running. `ThreadContext` provides the `clearTraceContext()` method for this purpose.
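A sketch of the intended usage, where `threadPool` and `backgroundWork` are purely illustrative names:

```java
// Drop the current trace context before scheduling long-running background
// work, so that its spans are not parented to the triggering REST request.
// (Illustrative sketch; clearTraceContext() is called for its side effect.)
threadContext.clearTraceContext();
threadPool.generic().execute(backgroundWork);
```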
If you want to trace your own code, first work out whether you can turn that work into a task. No, really.
If you can't do that, you'll need to ensure that your class can get access to a `Tracer` instance (this is available to inject, or you'll need to pass it when your class is created). Then you need to call the appropriate methods on the tracer when a span should start and end. You'll also need to manage the creation of new trace contexts when child spans need to be created.
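As a rough sketch, tracing a unit of work directly might look something like the class below. The `Tracer` method names (`startTrace` / `stopTrace`), the span name, and the wiring are assumptions rather than the definitive API, and imports are omitted:

```java
// Sketch of a class that traces its own work.
public class MyTracedComponent {
    private final Tracer tracer;
    private final ThreadContext threadContext;

    // The Tracer and ThreadContext are assumed to be injected or passed in
    // when the component is created.
    public MyTracedComponent(Tracer tracer, ThreadContext threadContext) {
        this.tracer = tracer;
        this.threadContext = threadContext;
    }

    void doTracedWork(Traceable work, Runnable action) {
        // Open a new trace context so this span is recorded as a child of
        // whatever span is currently in the thread context.
        try (ThreadContext.StoredContext ignored = threadContext.newTraceContext()) {
            tracer.startTrace(threadContext, work, "my-traced-component", Map.of()); // names illustrative
            try {
                action.run();
            } finally {
                tracer.stopTrace(work); // illustrative
            }
        }
    }
}
```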