This article was originally published on Amit Saha's personal site. With their kind permission, we’re sharing it here for Codeship readers.
Distributed tracing is the idea of tracing a network request as it travels through your services, as it would be in a microservices-based architecture. The primary reason you may want to do this is to troubleshoot or monitor the latency of a request as it travels through the different services.
In this post, we will see a demo of how we can introduce distributed tracing into a Python network stack communicating via HTTP. We have a service
demo that's a Flask application, which listens on
/. The handler for
/ calls another service,
service1, via HTTP. We want to be able to see how much time a request spends in each service by introducing distributed tracing.
Before we get to the code, let's talk briefly about a few concepts.
Distributed Tracing Concepts
Roughly, a call to an "external service" starts a span. We can have a span nested within another span in a tree-like fashion. All the spans in the context of a single request would form a trace.
Something like the following would perhaps explain it better in the context of our
service network application stack:
<-------------------- Trace ------------------------------------ > Start Root Span Start a nested span External Request -> Demo HTTP app ---> Service 1 HTTP app ---> Process
The span that is started from
service1 is designated as a child of the
root span, which was started from the
demo application. In the context of Python, we can think of a span as a context manager and one context manager living within another context manager. And all these "contexts" together form a trace.
From the above, it is somewhat clear (or not) that the start of each span initiates a "timer," which then on the request's way back (or end of the span) is used to calculate the time the span lasted for. So we need to have something (or things) which has to:
Emit this data
Receive this data
Allow us to collate the data together and make it available to us for each trace or request
This brings us to our next section.
Zipkin is a distributed tracing system that gives us the last two of the above requirements. How we emit this data from our application (the first point in the bulleted list above) is dependent on the language we have written the application in and the distributed tracing system we chose for the last two requirements. In our case, py_zipkin solves our problem.
First, we will start
elasticsearch as the backend as Docker containers. So, you need to have Docker installed. To get the data in
elasticsearch persisted, we will first create a data container as follows:
$ docker create --name esdata openzipkin/zipkin-elasticsearch
Then, download my code from here and:
$ wget .. $ unzip .. $ cd tracing/http_collector $ ./start_zipkin.sh .. .. zipkin | 2017-03-28 03:48:00.936 INFO 9 --- [ main] zipkin.server.ZipkinServer Started ZipkinServer in 7.36 seconds (JVM running for 8.595)
If you now go to
http://localhost:9411/ in your browser, you will see the Zipkin web UI.
Now, let's install the two libraries we need from the
pip install -r requirements.txt. Let's then start our two services. First, the "external" facing
$ python demo.py * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit) * Restarting with stat * Debugger is active! * Debugger pin code: 961-605-579
Then, the "internal"
$ python service1.py * Running on http://127.0.0.1:6000/ (Press CTRL+C to quit) * Restarting with stat * Debugger is active! * Debugger pin code: 961-605-579
Now, let's make a couple of requests to the
demo service using
$ curl localhost:5000 twice. If we go back to the Zipkin web UI and click on Find Traces, we will see something like this:
If we click on one of the traces, we will see something like this:
As we can see, four spans were created (two spans in each service) with the second, third and fourth spans nested inside the first span. The time reported to be spent in each span will become clear next.
Let's look at the
demo.py file first:
@zipkin_span(service_name='webapp', span_name='do_stuff') def do_stuff(): time.sleep(5) headers = create_http_headers_for_new_span() requests.get('http://localhost:6000/service1/', headers=headers) return 'OK' @app.route('/') def index(): with zipkin_span( service_name='webapp', span_name='index', transport_handler=http_transport, port=5000, sample_rate=100, #0.05, # Value between 0.0 and 100.0 ): do_stuff() time.sleep(10) return 'OK', 200
We create the first span inside the
/ handler function
index() via the
zipkin_span() context manager. We specify the
sample_rate=100, meaning it will trace every request (only for
transport_handler specifies "how" the emitted traces are transported to the Zipkin "collector". Here we use the
http_transport provided as an example by the
This handler function calls the
do_stuff() function where we create another span, but since it is in the same service, we specify the same
service_name and decorate it with the
zipkin_span decorator. We have an artificial time delay of 5s before we make an HTTP call to the
service1 service. Since we want to continue the current span, we pass in the span data as HTTP headers. These headers are created via the helper function
create_http_headers_for_new_span() provided via
Let's look at the
service1.py file next:
@zipkin_span(service_name='service1', span_name='service1_do_stuff') def do_stuff(): time.sleep(5) return 'OK' @app.route('/service1/') def index(): with zipkin_span( service_name='service1', zipkin_attrs=ZipkinAttrs( trace_id=request.headers['X-B3-TraceID'], span_id=request.headers['X-B3-SpanID'], parent_span_id=request.headers['X-B3-ParentSpanID'], flags=request.headers['X-B3-Flags'], is_sampled=request.headers['X-B3-Sampled'], ), span_name='index_service1', transport_handler=http_transport, port=6000, sample_rate=100, #0.05, # Value between 0.0 and 100.0 ): do_stuff() return 'OK', 200
This is almost the same as our
demo service above, but note how we set the
zipkin_attrs by making using of the headers we were passed from the
demo service above. This makes sure that the span of
service1 is nested within the span of
demo. Note once again how we introduce artificial delays here to make the trace show the time spent in each service more clearly.
Hopefully this post has given you a starting point of how you may go about implement distributed tracing. The following links has more: