This article was originally published on Amit Saha's personal site. With their kind permission, we’re sharing it here for Codeship readers.
Distributed tracing is the idea of tracing a network request as it travels through your services, as it would in a microservices-based architecture. The primary reason you may want to do this is to troubleshoot or monitor the latency of a request as it travels through the different services.
In this post, we will see a demo of how we can introduce distributed tracing into a Python network stack communicating via HTTP. We have a service, demo, that's a Flask application which listens on /. The handler for / calls another service, service1, via HTTP. We want to be able to see how much time a request spends in each service by introducing distributed tracing.
Before we get to the code, let's talk briefly about a few concepts.
Distributed Tracing Concepts
Roughly, a call to an "external service" starts a span. We can have a span nested within another span in a tree-like fashion. All the spans in the context of a single request would form a trace.
Something like the following would perhaps explain it better in the context of our demo and service1 network application stack:
<-------------------------------- Trace -------------------------------->

                    Start Root Span        Start a nested span
External Request -> Demo HTTP app    --->  Service 1 HTTP app  ---> Process
The span that is started from service1 is designated as a child of the root span, which was started from the demo application. In the context of Python, we can think of a span as a context manager, with one context manager living within another context manager. And all these "contexts" together form a trace.
From the above, it is somewhat clear (or not) that the start of each span starts a "timer," which is then used at the end of the span (on the request's way back) to calculate how long the span lasted. So we need to have something (or things) that can:
Emit this data
Receive this data
Allow us to collate the data together and make it available to us for each trace or request
This brings us to our next section.
Zipkin
Zipkin is a distributed tracing system that gives us the last two of the above requirements. How we emit this data from our application (the first point in the list above) depends on the language the application is written in and the distributed tracing system we choose for the last two requirements. In our case, py_zipkin solves our problem.
First, we will start zipkin with elasticsearch as the backend, both running as Docker containers, so you need to have Docker installed. To persist the data in elasticsearch, we will first create a data container as follows:
$ docker create --name esdata openzipkin/zipkin-elasticsearch
Then, download my code from here and:
$ wget ..
$ unzip ..
$ cd tracing/http_collector
$ ./start_zipkin.sh
..
..
zipkin | 2017-03-28 03:48:00.936 INFO 9 --- [ main] zipkin.server.ZipkinServer Started ZipkinServer in 7.36 seconds (JVM running for 8.595)
If you now go to http://localhost:9411/
in your browser, you will see the Zipkin web UI.
Creating Traces
Now, let's install the two libraries we need from requirements.txt via pip install -r requirements.txt. Let's then start our two services. First, the "external"-facing demo service:
$ python demo.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger pin code: 961-605-579
Then, the "internal" service1
:
$ python service1.py
 * Running on http://127.0.0.1:6000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger pin code: 961-605-579
Now, let's make a couple of requests to the demo service by running curl localhost:5000 twice. If we go back to the Zipkin web UI and click on Find Traces, we will see something like this:
If we click on one of the traces, we will see something like this:
As we can see, four spans were created (two in each service), with the second, third, and fourth spans nested inside the first. Where the time reported for each span comes from will become clear next.
Application Code
Let's look at the demo.py file first:
import time

import requests
from flask import Flask
from py_zipkin.zipkin import zipkin_span, create_http_headers_for_new_span

# app = Flask(__name__) and the http_transport() helper are defined
# elsewhere in demo.py; only the traced handlers are shown here.

@zipkin_span(service_name='webapp', span_name='do_stuff')
def do_stuff():
    time.sleep(5)
    # Pass the current span's identifiers downstream so service1
    # can attach its own spans to the same trace.
    headers = create_http_headers_for_new_span()
    requests.get('http://localhost:6000/service1/', headers=headers)
    return 'OK'

@app.route('/')
def index():
    with zipkin_span(
        service_name='webapp',
        span_name='index',
        transport_handler=http_transport,
        port=5000,
        sample_rate=100, #0.05, # Value between 0.0 and 100.0
    ):
        do_stuff()
        time.sleep(10)
    return 'OK', 200
We create the first span inside the / handler function index() via the zipkin_span() context manager. We specify sample_rate=100, meaning every request will be traced (only for the demo). The transport_handler specifies how the emitted traces are transported to the Zipkin "collector". Here we use the http_transport provided as an example by the py_zipkin project.

This handler function calls the do_stuff() function, where we create another span; since it lives in the same service, we specify the same service_name and decorate the function with the zipkin_span decorator. We have an artificial time delay of 5s before we make an HTTP call to the service1 service. Since we want to continue the current span, we pass the span data along as HTTP headers. These headers are created via the helper function create_http_headers_for_new_span() provided by py_zipkin.
Let's look at the service1.py file next:
import time

from flask import Flask, request
from py_zipkin.zipkin import zipkin_span, ZipkinAttrs

# As in demo.py, app = Flask(__name__) and http_transport() are defined
# elsewhere in service1.py; only the traced handlers are shown here.

@zipkin_span(service_name='service1', span_name='service1_do_stuff')
def do_stuff():
    time.sleep(5)
    return 'OK'

@app.route('/service1/')
def index():
    with zipkin_span(
        service_name='service1',
        # Join the trace started by the demo service, using the B3
        # headers it forwarded with the request.
        zipkin_attrs=ZipkinAttrs(
            trace_id=request.headers['X-B3-TraceID'],
            span_id=request.headers['X-B3-SpanID'],
            parent_span_id=request.headers['X-B3-ParentSpanID'],
            flags=request.headers['X-B3-Flags'],
            is_sampled=request.headers['X-B3-Sampled'],
        ),
        span_name='index_service1',
        transport_handler=http_transport,
        port=6000,
        sample_rate=100, #0.05, # Value between 0.0 and 100.0
    ):
        do_stuff()
    return 'OK', 200
This is almost the same as our demo service above, but note how we set the zipkin_attrs by making use of the headers passed to us from the demo service. This makes sure that the span of service1 is nested within the span of demo. Note once again how we introduce artificial delays here to make the time spent in each service show up more clearly in the trace.
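If more services join this stack, the header-to-ZipkinAttrs plumbing repeats itself, so it can be handy to pull it into a small helper. The helper below is hypothetical (it is not part of the article's code); it just mirrors what the index() handler of service1 does above, so any downstream Flask service can join the trace the same way:

from py_zipkin.zipkin import ZipkinAttrs

def zipkin_attrs_from_request(request):
    # Build ZipkinAttrs from the B3 headers emitted by
    # create_http_headers_for_new_span() in the calling service.
    return ZipkinAttrs(
        trace_id=request.headers.get('X-B3-TraceID'),
        span_id=request.headers.get('X-B3-SpanID'),
        parent_span_id=request.headers.get('X-B3-ParentSpanID'),
        flags=request.headers.get('X-B3-Flags'),
        is_sampled=request.headers.get('X-B3-Sampled'),
    )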
Ending Notes
Hopefully this post has given you a starting point for how you might go about implementing distributed tracing. The following links have more: