- Understand how Istio supports distributed tracing
- Find distributed tracing info in Kiali
- Introduction to Jaeger
This exercise introduces some network delays and a slightly more complex deployment of the sentences application to show you another type of telemetry Istio generates: distributed trace spans.
It will also introduce you to one of the distributed tracing backends Istio integrates with: Jaeger.
Istio supports distributed tracing through the Envoy proxy sidecar. The proxies automatically generate trace spans on behalf of the applications they proxy and send the tracing information directly to the tracing backend, so the application developer does not need to know about or configure a distributed tracing backend.
However, Istio does rely on the application to propagate some headers on subsequent outgoing requests so it can stitch together a complete view of the traffic. See More Istio Distributed Tracing below for a list of the required headers.
More Istio Distributed Tracing
Some forms of delay can be observed with the metrics that Istio collects.
Metrics are aggregated and not specific to a single request, i.e. we can only observe statistical data such as sums and averages.
This is quite useful but fairly limited in a more complex service-based architecture. If the delay is caused by something more complicated, it can be difficult to diagnose purely from metrics due to their statistical nature. For example, the misbehaving application might not be the immediate one from which you are observing a delay. In fact, it might be deep in the application tree.
Distributed traces with spans provide a view of the life of a request as it travels across multiple hosts and services.
The “span” is the primary building block of a distributed trace, representing an individual unit of work done in a distributed system. Each component of the distributed system contributes a span - a named, timed operation representing a piece of the workflow.
Spans can (and generally do) contain “References” to other spans, which allows multiple Spans to be assembled into one complete Trace - a visualization of the life of a request as it moves through a distributed system.
In order for Istio to stitch together the spans and provide this view of the life of a request, it requires the following B3 trace headers to be propagated across the services (a sketch of how an application might do this follows the list).
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- b3
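As an illustration, here is a minimal sketch of header propagation in a Python service using Flask and requests. The frameworks, the port, and the name service URL are assumptions for the example, not necessarily what the sentences application uses.

from flask import Flask, request
import requests

app = Flask(__name__)

# The B3 trace headers Istio needs propagated to correlate spans into one trace.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
]

def incoming_trace_headers(req):
    """Copy the trace headers from the incoming request, if present."""
    return {h: req.headers[h] for h in TRACE_HEADERS if h in req.headers}

@app.route("/")
def sentence():
    # Forward the trace headers on every outgoing call so the Envoy
    # sidecars can stitch the resulting spans into a single trace.
    headers = incoming_trace_headers(request)
    # Hypothetical upstream service name and port, for illustration only.
    name = requests.get("http://name:5000/", headers=headers).text
    return f"Hello, {name}"

The important part is simply that the headers received on the incoming request are copied onto every outgoing request the service makes.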
💡 If you have not completed exercise 00-setup-introduction you need to label your namespace with istio-injection=enabled.
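If needed, the label can be applied with a command like the following (replace <your-namespace> with your own namespace):

kubectl label namespace <your-namespace> istio-injection=enabled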
You are going to deploy a slightly more complex version of the sentences application with a (simulated) bug that causes large delays on the combined service.
Then you are going to see how Istio's distributed tracing telemetry is leveraged by Kiali and Jaeger to help you identify where the delay is happening.
A general overview of what you will be doing in the Step By Step section:
- Deploy sentences application services
- Observe the traffic flow with Kiali
- Observe the distributed tracing telemetry in Kiali
- Observe the distributed tracing telemetry in Jaeger
- Route traffic through the ingress gateway
- Observe the distributed tracing telemetry in Jaeger
Expand the Tasks section below to do the exercise.
Tasks
kubectl apply -f 08-distributed-tracing/start/
In another shell, run the following to continuously query the sentences service through the NodePort.
scripts/loop-query.sh
Go to the Graph menu item and select the Versioned app graph from the drop-down menu.
If we select to display 'response time' we can see that traffic is flowing with relatively low delay on responses.
kubectl apply -f 08-distributed-tracing/start/sentences-v2/
Go to the Graph menu item and select the Versioned app graph from the drop-down menu.
If we select to display 'response time' we can see that there is a significant delay introduced by v2 of the sentences service. However, from the Kiali graph it may seem like the delay is affecting both v1 and v2:
This is just a simulated bug and is easy to locate. But in a real-world scenario the bug may be introduced by the interaction of services deeper in the application tree. To do a proper investigation you may need to trace the traffic flow of the request through this tree.
Kiali leverages Istio's distributed tracing telemetry and can be used to help in this type of scenario.
Browse to Workloads in the left-hand menu and select the sentences-v2 workload. Then select the Traces tab.
Here you can see that there are outlier spans well over 1 second. These are the spans generated by Istio.
Select one of the spans and Kiali will give you some trace details.
Select the Span Details tab and you can see the different spans generated by the envoy proxy. Expanding the different entries will let you see details about where the request was sent and the response status.
The colors on the span and trace details are controlled by Kiali to make problems easier to spot. They are based on comparing each span's duration against the metrics for the same source/destination services. See this blog for a more detailed dive into how Kiali does this.
Jaeger also leverages Istio's distributed tracing and can also be used to identify scenarios like this.
It can be argued that Jaeger gives an easier to understand and more logical view of the traffic flow of a request.
Browse to Jaeger, select the options as shown below and hit Find Traces.
💡 Select the sentences service corresponding to your namespace, e.g. sentences.student1, sentences.student2, etc.
You should see a trace taking longer than 1 second in the graph and the list of traces (if there is no trace longer than 1s in the graph, increase the 'Limit Results' value or click 'Find Traces' again to get the most recent traces).
Select the trace, either from the graph or the list of traces.
On the left side we see the distributed trace - a kind of 'call graph'. We can read this as: the sentences service calls the name service, which calls the random service. The random service can be seen as the root cause of the long delay.
Next, select the first entry in the flow and expand the Tags section.
From the details you can see that the Envoy proxy provided the trace. You can also see that the version of the sentences service is v2.
The traffic flow in our sentences application is pretty simple. But in a much more complex system with a much more complicated traffic flow and many more services, the ability of the Envoy proxy to provide traces without requiring changes at the application level is quite powerful.
As an example, you will create a Gateway and a VirtualService to route external traffic through the ingress gateway to the sentences service.
First create a file called sentences-ingress-gw.yaml in the directory 08-distributed-tracing/start/.
💡 Edit the hosts field with your namespace. (This is done with envsubst in the commands below.)
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: sentences
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "$STUDENT_NS.sentences.$TRAINING_NAME.eficode.academy"
Then create a file called sentences-ingress-vs.yaml in the directory 08-distributed-tracing/start/.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sentences
spec:
  hosts:
  - "$STUDENT_NS.sentences.$TRAINING_NAME.eficode.academy"
  gateways:
  - sentences
  http:
  - route:
    - destination:
        host: sentences
Substitute the placeholders with environment variable(s) and apply with kubectl.
envsubst < 08-distributed-tracing/start/sentences-ingress-gw.yaml | kubectl apply -f -
envsubst < 08-distributed-tracing/start/sentences-ingress-vs.yaml | kubectl apply -f -
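If you want to verify that the resources were created, something like the following should work (the fully qualified resource names are used to avoid clashing with other Gateway CRDs):

kubectl get gateways.networking.istio.io sentences
kubectl get virtualservices.networking.istio.io sentences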
Now, instead of hitting the NodePort of the sentences service, use ./scripts/loop-query.sh with the -g option and the entry point of the gateway you just created.
./scripts/loop-query.sh -g $STUDENT_NS.sentences.$TRAINING_NAME.eficode.academy
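If you prefer to send a single request instead of a continuous loop, a plain curl against the same hostname should also work, assuming the hostname resolves to the ingress gateway (this is an alternative for illustration, not part of the provided scripts):

curl http://$STUDENT_NS.sentences.$TRAINING_NAME.eficode.academy/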
Traffic will now be routed through the ingress gateway and towards the sentences service.
Browse to Jaeger, select the options as shown below and hit Find Traces.
You should be able to see the request flowing through the ingress gateway now.
NB: it might take a minute or two for the traces to show up, so don't get worried if you can't see them right away!
💡 For demonstration purposes, the Jaeger and Istio deployed during a training have been configured to collect all traces; the default setting is to keep only between 0.1% and 1% of traces.
For performance: in a production system with thousands or even millions of requests each second, collecting everything would be infeasible, but the loop-query script only sends a couple of requests each second.
For practicality: storing everything means we won't have to "get lucky" (in terms of this exercise) when the system chooses which traces to keep or discard. A sketch of how the sampling percentage can be raised is shown below.
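One way this can be done is through the mesh configuration, for example in an IstioOperator resource. This is a sketch only; the exact mechanism depends on the Istio version and how tracing was enabled in your installation.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 100.0   # percentage of requests to sample; Istio's default is 1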
If you select one of the traces, either from the graph or the list of traces, you should be able to see the ingress gateway as part of the traffic flow details.
In this exercise you have seen how Istio's distributed tracing telemetry can be leveraged to provide a less intrusive and more cohesive approach to distributed tracing.
The main takeaways are:
- Istio's Envoy proxies generate the distributed trace spans for the services.
- If a service has no proxy sidecar, distributed trace telemetry will not be generated.
- Istio's Envoy proxies provide the distributed trace telemetry to the supported backends integrated with the mesh.
- The only requirement for the service is to propagate the required B3 trace headers.
kubectl delete -f 08-distributed-tracing/start/sentences-v2/
kubectl delete -f 08-distributed-tracing/start/