Content for a workshop on Distributed Tracing sponsored by Datadog at Pycon 2017
- docker
- docker-compose
- python
- A Datadog Account
Here's an app that does a simple thing. It tells you what donut to pair with your craft brew. While it is contrived in its purpose, it probably has something in common with the apps you work on:
- It speaks HTTP
- To do its job, it must talk to datastores and external services.
- It has performance issues
Set your Datadog API key in the docker-compose.yml file
Now start up the sample app
$ docker-compose up
Now you should have running:
- A Flask app, web, accepting HTTP requests
- A smaller Flask app, taster, also accepting HTTP requests
- Redis, the backing datastore
- The Datadog agent, a process that listens for, samples and aggregates traces
You can run the following command to verify these are running properly.
$ docker-compose ps
If all containers are running properly, you should see the following:
Name Command State Ports
-----------------------------------------------------------------------------------------------------------------------------
pycontracingworkshop_agent_1 /entrypoint.sh supervisord ... Up 7777/tcp, 8125/udp, 0.0.0.0:8126->8126/tcp, 9001/tcp
pycontracingworkshop_redis_1 docker-entrypoint.sh redis ... Up 6379/tcp
pycontracingworkshop_taster_1 python taster.py Up 0.0.0.0:5001->5001/tcp
pycontracingworkshop_web_1 python app.py Up 0.0.0.0:5000->5000/tcp
A few useful commands for debugging. You'll want these handy:
# Tail the logs for the trace-agent
docker exec -it pycontracingworkshop_agent_1 tail -f /var/log/datadog/trace-agent.log
# Tail the logs for web container
docker-compose logs -f web
Let's poke through the app and see how it works.
Vital Business Info about Beers and Donuts lives in a SQL database.
Some information about Donuts changes rapidly, with the waves of baker opinion. We store this time-sensitive information in a Redis-backed datastore called DonutDB.
The DonutDB class abstracts away some of the gory details and provides a simple API.
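The exact shape of that API doesn't matter much for this workshop, but a thin Redis-backed wrapper like DonutDB might look roughly like the following hypothetical sketch (the method and key names here are illustrative, not the workshop's actual code):
import redis

class DonutDB(object):
    """Hypothetical sketch of a thin wrapper around Redis for volatile donut data."""

    def __init__(self, host="redis", port=6379):
        self._client = redis.StrictRedis(host=host, port=port)

    def set_hotness(self, donut_name, score):
        # store the latest "baker opinion" score for a donut
        self._client.set("donut:hotness:%s" % donut_name, score)

    def get_hotness(self, donut_name):
        value = self._client.get("donut:hotness:%s" % donut_name)
        return float(value) if value is not None else None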
Now let's look at the HTTP interface.
We can list the beers we have available
curl -XGET localhost:5000/beers
And the donuts we have available
curl -XGET localhost:5000/donuts
We can grab a beer by name
curl -XGET localhost:5000/beers/ipa
and a donut by name
curl -XGET localhost:5000/donuts/jelly
So far so good.
Things feel pretty speedy. But what happens when we try to find a donut that pairs well with our favorite beer?
curl -XGET localhost:5000/pair/beer?name=ipa
It feels slow! Slow enough that people might complain about it. Let's try to understand why.
Anyone ever had to time a Python function before? There are several ways to do it, and they all involve some kind of timestamp math.
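The most basic version is just two timestamps around a call, something like this (do_slow_thing is a stand-in for whatever you want to measure):
import time

start = time.time()
do_slow_thing()  # stand-in for the code being measured
end = time.time()
print("that took %.2f seconds" % (end - start))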
With a decorator:
import time
from functools import wraps

def timing_decorator(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        start = time.time()
        try:
            ret = func(*args, **kwargs)
        finally:
            end = time.time()
            print("function %s took %.2f seconds" % (func.__name__, end - start))
        return ret
    return wrapped
With a context manager:
import logging
import time

# this logger is configured in timing.py to write to timing.log
log = logging.getLogger(__name__)

class TimingContextManager(object):
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, exc_type, exc_value, traceback):
        end = time.time()
        log.info("operation %s took %.2f seconds", self.name, end - self.start)
This code lives for you in timing.py and logs to timing.log. Let's wire these into the app, app.py.
from timing import timing_decorator, TimingContextManager
...
@app.route('/pair/beer')
@timing_decorator
def pair():
...
Now, when our slow route gets hit, it dumps some helpful debug information to the log, timing.log.
You can start to see log lines such as:
function pair took 0.053 seconds
The information about our bad route is still rather one-dimensional. The pair route does some fairly complex things, and I'm still not entirely sure where it spends its time.
Let me use the timing context manager to drill down further.
@app.route('/pair/beer')
@timing_decorator
def pair():
    name = request.args.get('name')
    with TimingContextManager("beer.query"):
        beer = Beer.query.filter_by(name=name).first()
    with TimingContextManager("donuts.query"):
        donuts = Donut.query.all()
    with TimingContextManager("match"):
        match = best_match(beer)
    return jsonify(match=match)
We now have a more granular view of where time is being spent
operation beer.query took 0.020 seconds
operation donuts.query took 0.005 seconds
operation match took 0.011 seconds
function pair took 0.041 seconds
But after several requests, this log becomes increasingly hard to scan! Let's add an identifier to make sure we can trace the path of a request in its entirety.
You may have the notion of a "correlation ID" in your infrastructure already. The goal of this identifier is almost always to inspect the lifecycle of a single request, especially one that moves through several parts of code and several services.
Let's put a correlation ID into our timed routes! Hopefully this will make the log easier to parse
Add it to our decorator
# timing.py
import uuid
...

def timing_decorator(func):
    def wrapped(*args, **kwargs):
        req_id = uuid.uuid1().int >> 64  # a (roughly) unique 64-bit id for this request
        from flask import g
        g.req_id = req_id
        ...
        log.info("req: %s, function %s took %.3f seconds", req_id, func.__name__, end - start)
... and to our context manager
# timing.py
class TimingContextManager(object):
    def __init__(self, name, req_id):
        self.name = name
        self.req_id = req_id

    def __exit__(self, exc_type, exc_value, traceback):
        ...
        log.info("req: %s, operation %s took %.3f seconds", self.req_id, self.name, end - self.start)

# app.py
with TimingContextManager('beer.query', g.req_id):
    ...
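Putting these pieces together, the full decorator in timing.py might look roughly like this (a sketch; the file in the workshop repo may differ in small details):
# timing.py (sketch)
import logging
import time
import uuid
from functools import wraps

from flask import g

# in the workshop repo this logger writes to timing.log
log = logging.getLogger(__name__)

def timing_decorator(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        # tag the current request with a (roughly) unique 64-bit id
        req_id = uuid.uuid1().int >> 64
        g.req_id = req_id
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            end = time.time()
            log.info("req: %s, function %s took %.3f seconds",
                     req_id, func.__name__, end - start)
    return wrapped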
Now we see output like
req: 10743597937325899402, operation beer.query took 0.023 seconds
req: 10743597937325899402, operation donuts.query took 0.006 seconds
req: 10743597937325899402, operation match took 0.013 seconds
req: 10743597937325899402, function pair took 0.047 seconds
Let's think about what we've done so far. We've taken an app that was not particularly observable and made it incrementally more so.
Our app now generates events:
- that are request-scoped
- that suggest a causal relationship
Remember our glossary: we're well on our way to having real traces!
One thing to note is that this didn't come for free:
- instrumentation adds non-zero overhead
- instrumentation logic can leak into business logic in unsavory ways
What a good tracing client does for you is minimize the impact of both of these, while still emitting rich, structured information.
Datadog's tracing client integrates with several commonly used python libraries.
Instrumentation can be explicit or implicit, and it uses whatever standard hooks for telemetry a library exposes. For most web frameworks this means middleware. Let's add trace middleware to our Flask app:
# app.py
from ddtrace import tracer; tracer.debug_logging = True
tracer.configure(hostname="agent") # point to agent container
from ddtrace.contrib.flask import TraceMiddleware
app = Flask(__name__)
traced_app = TraceMiddleware(app, tracer, service="matchmaker")
The middleware is doing something very similar to the code you just wrote. It is:
- Timing requests
- Collecting request-scoped metadata
- Pinning some information to the global request context to allow causal relationships to be registered
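Conceptually, it is not far from wiring our timing decorator into Flask's request hooks. A rough sketch of the idea (an illustration only, not the actual ddtrace middleware) might look like:
import time

from flask import Flask, g, request

app = Flask(__name__)

# illustration only: the real ddtrace middleware creates proper spans,
# attaches them to the request context, and reports them to the agent
@app.before_request
def start_timer():
    g.request_start = time.time()

@app.after_request
def emit_request_event(response):
    duration = time.time() - g.request_start
    print("request %s %s took %.3f seconds, status %s"
          % (request.method, request.path, duration, response.status_code))
    return response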
Now that Datadog is doing this work for us at the middleware layer, let's drop the @timing_decorator and each TimingContextManager from our app.py file.
If we hit our app a few more times, we can see that Datadog has begun to display some information for us. Let's walk through what you're seeing (segue to demo).
Datadog's tracing client configures your application to emit Spans. A span is a chunk of computation time: an operation that you care about, which takes some amount of time in the process of serving a request.
Let's look at what a span consists of.
name flask.request
id 7245111199352082055
trace_id 1044159332829592407
parent_id None
service matchmaker
resource ping
type http
start 1495094155.75
end 1495094155.92
duration 0.17s
error 0
tags
http.status_code: 200
- name is the name of the operation being traced
- service is the name of a set of processes that work together to provide a feature set
- resource is a particular query to a service; for web apps this is usually the route or handler function
- id is the unique ID of the current span
- trace_id is the unique ID of the request containing this span
- parent_id is the unique ID of the span that was the immediate causal predecessor of this span
Remember the significance of trace_id and parent_id. We'll need them later as we wire up tracing across service boundaries.
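In code, most of these fields are set when a span is created via tracer.trace, and tags can be attached to any active span. A minimal illustration (the span name, tag, and do_the_glazing helper below are made up for this example):
from ddtrace import tracer

# open a span named "donut.glaze" in the "matchmaker" service;
# the resource defaults to the span name unless we set one
with tracer.trace("donut.glaze", service="matchmaker", resource="glaze") as span:
    span.set_tag("donut.flavor", "jelly")  # tags carry arbitrary metadata
    do_the_glazing()                       # stand-in for the work being timed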
While expressive, a Span by itself is not incredibly useful. Let's add some context around it.
Our app involves multiple services. You'll notice our service list is a little bare. That's because right now, Datadog only knows about the one high-level flask service. Let's do some work so that it knows about the other services and datastores we communicate with.
Datadog's tracing client provides a version of the TimingContextManager that produces well-formatted spans. It also accepts as parameters the service, name and resource identifiers we just talked about.
Let's do a subtle rewrite of our context managers
# app.py
@app.route('/pair/beer')
def pair():
    name = request.args.get('name')
    with tracer.trace("beer.query", service="db"):
        beer = Beer.query.filter_by(name=name).first()
    with tracer.trace("donuts.query", service="db"):
        donuts = Donut.query.all()

    # leaving the service blank implies that we inherit the service from our parent,
    # i.e. the active span at the time `tracer.trace` is invoked
    with tracer.trace("match"):
        match = best_match(beer)
    return jsonify(match=match)
If we hit the /pair/beer route a few more times, we should see a trace like this one:
https://cl.ly/11010t3H2g3N
A good tracing client will unpack some of the layers of indirection in ORMs, and give you a true view of the SQL being executed. This lets us marry the nice APIs of ORMs with visibility into what exactly is being executed and how performant it is.
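As an aside, SQLAlchemy itself will render the generated SQL if you stringify a query, which is a handy way to sanity-check what the trace reports (the query below uses the app's Beer model; the exact column list is illustrative):
# str() on a SQLAlchemy query shows the SQL it will execute
query = Beer.query.filter_by(name="ipa")
print(str(query))
# roughly: SELECT beer.id, beer.name, ... FROM beer WHERE beer.name = :name_1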
Let's see what Datadog's sqlalchemy and redis integrations can do to help de-mystify some of the abstractions we've built in our app. We'll use Datadog's monkey patcher, a tool for safely adding tracing to packages in the import space:
# app.py
from ddtrace import monkey; monkey.patch(sqlalchemy=True, redis=True)
Ping our favorite route a few more times, and Datadog should show you a trace like this: https://cl.ly/0U2z3E2X2V07
Most of the hard problems we have to solve in our systems won't involve just one application. Even in our toy app, the best_match function crosses a distinct service boundary, making an HTTP call to the "taster" service.
For traditional metrics and logging, crossing a service boundary is often a full stop for whatever telemetry you have in place. But traces can cross these boundaries, which is what makes them so useful.
Here's how to make this happen in the Datadog client. First, we configure the service that behaves as the client to propagate information about the active trace via HTTP headers:
# app.py
def best_match(beer):
    ...
    for candidate in candidates:
        try:
            # Propagate context between the two services
            headers = {
                # Here's what transaction I am a part of
                "X-Datadog-Trace-Id": str(g.flask_datadog_span.trace_id),
                # Here's the span that is the immediate parent of the server-side span
                "X-Datadog-Parent-Span-Id": str(g.flask_datadog_span.span_id),
            }
            resp = requests.get(
                "http://taster:5001/taste",
                params={"beer": beer.name, "donut": candidate},
                timeout=2,
                headers=headers,
            )
            ...
We set up tracing on the server-side app ( "taster" ):
# taster.py
from ddtrace import tracer; tracer.debug_logging = True
tracer.configure(hostname="agent") # point to agent container
from ddtrace.contrib.flask import TraceMiddleware
app = Flask(__name__)
traced_app = TraceMiddleware(app, tracer, service="taster")
Then we configure the server side of this equation to extract this information from the HTTP headers and continue the trace:
# taster.py
@app.route("/taste")
def taste():
tid = request.headers.get("X-Datadog-Trace-Id")
pid = request.headers.get("X-Datadog-Parent-Span-Id")
if tid and pid:
g.flask_datadog_span.trace_id = int(tid)
g.flask_datadog_span.parent_id = int(pid)
beer = request.args.get("beer")
...
Let's hit our pairing route a few more times now, and see what Datadog turns up:
curl -XGET localhost:5000/pair/beer?name=ipa
If everything went well we should see a distributed trace! https://cl.ly/3J2E092U2w2b
Datadog's tracing client gives you a lot of tools to extract meaningful insights from your Python apps.
We support an explicit approach, with tracing constructs embedded in your code, as well as more implicit ones, with tracing auto-patched into common libraries or even triggered via our command line entrypoint like so:
ddtrace-run python my_app.py
We hope you can find an approach that fits your app. More details at: https://pypi.datadoghq.com/trace/docs and https://github.com/DataDog/dd-trace-py
Happy Tracing!