Content for a workshop on Distributed Tracing sponsored by Datadog
- Install
docker
anddocker-compose
on your system. Please, follow the instructions available in the Docker website For Linux environment, it should go like$ sudo curl -sSL https://get.docker.com/ | sh $ sudo curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose`` $ sudo chmod +x /usr/local/bin/docker-compose
- Get the workshop material ready to play with
$ git clone https://github.com/DataDog/dd-py-tracing-workshop.git $ cd dd-py-tracing-workshop
- Create a Datadog Account and get an
API_KEY
for that account (you can create from the Datadog API page). Remember to not share this key with anyone.
Here's an app that does a simple thing. It tells you what donut to pair with your craft brew. While it is contrived in its purpose, it probably has something in common with the apps you work on:
- It's a web application that exposes HTTP endpoints.
- To do its job, it must talk to datastores and external services.
- It may need performance improvements.
The application runs in many Docker containers that you can launch using the following command:
$ sudo DD_API_KEY=<add_your_API_KEY_here> docker-compose up -V
Each Python application runs a Flask server with live-reload so you can update your code without restarting any container. After executing the command above, you should have running:
- A Flask app
cafe
, accepting HTTP requests - A micro-service implemented by a smaller Flask app
taster
, also accepting HTTP requests - Redis, the backing datastore
- Datadog agent, a process that listens for, samples and aggregates traces
You can run the following command to verify these are running properly. You might have to use sudo
according to your permissions.
$ docker-compose ps
If all containers are running properly, you should see the following:
Name Command State Ports
-----------------------------------------------------------------------------------------------------------------------------
ddpytracingworkshop_agent_1 /entrypoint.sh supervisord ... Up 7777/tcp, 8125/udp, 0.0.0.0:8126->8126/tcp, 9001/tcp
ddpytracingworkshop_redis_1 docker-entrypoint.sh redis ... Up 6379/tcp
ddpytracingworkshop_taster_1 python taster.py Up 0.0.0.0:5001->5001/tcp
ddpytracingworkshop_cafe_1 python cafe.py Up 0.0.0.0:5000->5000/tcp
Now, let's poke through the app and see how it works.
-
Vital Business Info about Beers and Donuts lives in a SQL database.
-
Some information about Donuts changes rapidly, with the waves of baker opinion, so we store this time-sensitive information in a Redis-backed datastore called DonutStats.
-
The
DonutStats
class abstracts away some of the gory details and provides a simple API
-
We can list the beers we have available
curl -XGET "localhost:5000/beers"
-
The donuts we have available
curl -XGET "localhost:5000/donuts"
-
We can grab a beer by name
curl -XGET "localhost:5000/beers/ipa"
-
We can grab a donut by name
curl -XGET "localhost:5000/donuts/jelly"
So far so good.
Things feel pretty speedy. But what happens when we try to find a donut that pairs well with our favorite beer?
curl -XGET "localhost:5000/pair/beer?name=ipa"
It feels slow! Slow enough that people might complain about it. Let's try to understand why
In this first step, we'll use basic manual instrumentation to trace one single function from our application.
First, we configure the agent to make it receive traces.
# docker-compose.yaml
agent:
image: "datadog/agent:latest"
environment:
- DD_API_KEY
- DD_APM_ENABLED=true
- DD_APM_NON_LOCAL_TRAFFIC=TRUE
ports:
- "8126:8126"
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- /proc/:/host/proc/:ro
- /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
Then, let's instrument the code. The first thing to do is to import and configure tracing capabilities:
# cafe.py
from ddtrace import tracer, config, patch_all;
tracer.configure(hostname='agent', port=8126)
After what, we instrument the beers()
function by adding the tracer decorator.
# cafe.py
@app.route('/ping')
@tracer.wrap(service='beers')
def ping():
Now, when you call your webapp for beers curl -XGET "localhost:5000/beers"
, you see the traces of the underlying beers()
function in Datadog Trace List Datadog Trace List
When you click (View Trace ->) on your newly tracked service, you see the "details" of the trace. For now, the details are limited to the single span of the beers
method you just instrumented. One valuable information you find here is the duration of the span.
If you access the Beer Service Statistics page, you also find statistics about all the occurences of calls to that service (try curl -XGET "localhost:5000/beers"
ten times in a row to populate that statistics.
This is useful, but you'll need more to observe what's happening in your application and eventually fix or optimize things.
A good tracing client will unpack, for instance, some of the layers of indirection in ORMs, and give you a true view of the SQL being executed. This lets us marry the nice APIs of ORMS with visibility into what exactly is being executed and how performant it is.
We'll use Datadog's monkey patcher, a tool for safely adding tracing to packages in the import space for both apps cafe.py
and taster.py
. This monkey patcher enables OOTB instrumentation for Flask applications, along with other common frameworks in Python.
# cafe.py
from ddtrace import patch_all;
patch_all(flask=True)
# taster.py
from ddtrace import tracer, config, patch_all
tracer.configure(hostname='agent', port=8126)
config.flask['service_name'] = 'taster'
patch_all(flask=True)
from flask import Flask, request, jsonify
Don't forget to remove the tracer.wrap()
decorator from beers()
function, which we added in Step 1 but which is useless now.
The middleware is operating by monkey patching the flask integration to ensure it is:
- Timing requests
- Collecting request-scoped metadata
- Pinning some information to the global request context to allow causal relationships to be registered
Now, if we hit our app, we can see that Datadog has begun to display some information for us. Meaning, you should be able to see some data in the APM portion of the Datadog application.
At the moment I'm writing this workshop, the requests
package is not yet auto-instrumented by the monkey patcher (but it's about to come :) ). As a consequence, the traces from cafe
web application and taster
micro-service are not associated with each other.
Good pretext to explore what's happening behind the scene with distributed tracing by doing it manually. Indeed, traces can cross boundaries thanks to trace context transiting through HTTP calls headers.
Here is the corresponding code in the cafe.py
to inject the trace context in the call to taster
micro-service:
# cafe.py
from ddtrace.propagation.http import HTTPPropagator
def best_match(beer):
with tracer.trace("best_match") as span:
# [...]
for candidate in candidates:
headers = {}
propagator = HTTPPropagator()
propagator.inject(span.context,headers)
try:
resp = requests.get("http://taster:5001/taste",
params={"beer": beer.name, "donut": candidate},
timeout=2,headers=headers)
# [...]
The reverse operation of getting the context back from the header in the HTTP call received by taster
micro-service is done automatically, since the Flask.request
package is patched along with any other flask
component.
Traces are useful material, but sometimes troubleshoot starts with a line of log. Datadog magically enables correlation of traces and logs thanks to a trace_id
. It's a unique identifier of every single trace, that you can easily report in any log written in that trace.
Let's append our logger format to inherit metadata from trace: trace_id
and span_id
.
# cafe.py
patch_all(logging=True)
import logging
FORMAT = ('%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] '
'[dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] '
'- %(message)s')
# taster.py
patch_all(logging=True)
import logging
FORMAT = ('%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] '
'[dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] '
'- %(message)s')
We need to configure the agent to collect logs from the docker socket - refer to the agent documentation.
# docker-compose.yml
agent:
environment:
- DD_LOGS_ENABLED=true
- DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
- DD_AC_EXCLUDE=name:agent
volumes:
- /opt/datadog-agent/run:/opt/datadog-agent/run:rw
cafe:
labels:
com.datadoghq.ad.logs: '[{"source": "custom_python", "service": "cafe"}]'
taster:
labels:
com.datadoghq.ad.logs: '[{"source": "custom_python", "service": "taster"}]'
And finally update the Log pipelines to process these custom-format python logs. Note that there is no need to do it for Agent and Redis Logs, they are automatically recongnized and processes as such.
Create a new pipeline which custom filter is source:custom_python
. Within that pipeline:
-
Create a Grok Parser witht the following parsing rule
custom_python_trace %{date("yyyy-MM-dd HH:mm:ss,SSS"):timestamp} %{word:levelname} \[%{word}\] \[%{word}.%{word}:%{integer}] \[dd.trace_id=%{numberStr:dd.trace_id} dd.span_id=%{numberStr:dd.span_id}\] - %{data:message}
, -
Create a Date Remapper with the
timestamp
attribute, -
Create a Status remapper with the
levelname
attribute, -
Ceatet a Trace ID with the
dd.trace_id
attribute.
With that setup, you should now be able:
-
to access logs from a trace - see the Log Panel in the Trace Panel. See Documentation
-
to access a trace from a log - see the "Go to Trace" buttom in the Log Panel. See Documentation
Trace search deactivated by default in Datadog. You should explictely enable it, service by service.
Adding this environment variable in the datadog agent docker configures the agent to send trace events for the flask service.
# docker-compose.yml
agent:
image: "datadog/agent:latest"
environment:
- DD_APM_ANALYZED_SPANS=cafe|flask.request=1,taster|flask.request=1
After this, you can now search for specific traces in the Trace Search, and access advanced Analytics capabilities as well.
As we can see from that trace, we're spending a lot of time in the requests library, especially relative to the amount of work being done in the taster application. This looks like a place we can optimize.
If we look at the work being done in the best_match()
and taste()
, we see
that we can probably move work from best_match()
to taste()
and eliminate
the number of requests we are making.
Let's do this and see if we can cut down the average latency of this code!
First, we'll refactor best_match
to include all of the candidates in its request,
rather than making multiple requests.
# cafe.py
def best_match(beer):
# ...
resp = requests.get(
"http://taster:5001/taste",
params={"beer": beer.name, "donuts": candidates},
timeout=2,
headers=headers,
)
Then, we'll refactor the taste()
function to accept this and return the donut
with the highest taste score.
# taster.py
@app.route("/taste")
def taste():
# ...
# send the remaining candidates to our taster and pick the best
matches = []
beer = request.args.get("beer")
candidates = request.args.getlist("donuts")
for candidate in candidates:
score = random.randint(1, 10)
matches.append((score, candidate))
best_match = max(matches)
return jsonify(candidate=best_match[1], score=best_match[0])
Once we've done this, we can take a look at another trace and notice that we've cut down the total time significantly, or about ~60 ms to ~20 ms:
Datadog's tracing client gives you a lot of tools to extract meaningful insights from your Python apps.
We support an explicit approach, with tracing constructs embedded in your code, as well as more implicit ones, with tracing auto-patched into common libraries or even triggered via our command line entrypoint like so:
$ ddtrace-run python my_app.py
We hope you can find an approach that fits for your app. More details at:
Happy Tracing!