Introducing Distributed Tracing in Your Python Application via Zipkin

Written by: Amit Saha

This article was originally published on Amit Saha's personal site. With their kind permission, we’re sharing it here for Codeship readers.

Distributed tracing is the practice of following a network request as it travels through your services, as in a microservices-based architecture. The primary reason to do this is to troubleshoot or monitor the latency of a request as it crosses the different services.

In this post, we will see a demo of how to introduce distributed tracing into a Python application stack communicating via HTTP. We have a service, demo, which is a Flask application listening on /. The handler for / calls another service, service1, over HTTP. By introducing distributed tracing, we want to be able to see how much time a request spends in each service.

Before we get to the code, let's talk briefly about a few concepts.

Distributed Tracing Concepts

Roughly, a call to an "external service" starts a span. We can have a span nested within another span in a tree-like fashion. All the spans in the context of a single request would form a trace.

Something like the following would perhaps explain it better in the context of our demo and service network application stack:

                 <---------------------------- Trace ---------------------------->
                    Start root span                      Start a nested span
External request --> demo HTTP app     --->     service1 HTTP app     --->     Process

The span that is started from service1 is designated a child of the root span, which was started from the demo application. In Python terms, we can think of a span as a context manager, and of a nested span as one context manager living inside another. All these "contexts" together form a trace.
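As a minimal sketch of that nesting (using the py_zipkin library we will meet shortly; the no-op transport here is just a stand-in so the snippet runs on its own):

import time

from py_zipkin.zipkin import zipkin_span


def noop_transport(encoded_span):
    pass  # a real transport would ship the encoded span to a collector


# The outer span is the root of the trace; the inner span nests inside it.
with zipkin_span(service_name='demo', span_name='root',
                 transport_handler=noop_transport, sample_rate=100.0):
    with zipkin_span(service_name='demo', span_name='nested'):
        time.sleep(0.1)  # work done here is timed as the nested span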

From the above, it is somewhat clear (or not) that starting a span starts a "timer," which is used at the end of the span (on the request's way back) to calculate how long the span lasted. So we need something (or some things) that can:

  • Emit this data

  • Receive this data

  • Allow us to collate the data together and make it available to us for each trace or request

This brings us to our next section.

Zipkin

Zipkin is a distributed tracing system that gives us the last two of the above requirements. How we emit this data from our application (the first point in the list above) depends on the language the application is written in and on the distributed tracing system we choose for the last two requirements. In our case, py_zipkin solves our problem.

First, we will start Zipkin with Elasticsearch as the backend, both running as Docker containers, so you need to have Docker installed. To persist the data in Elasticsearch, we will first create a data container as follows:

$ docker create --name esdata openzipkin/zipkin-elasticsearch

Then, download my code from here and:

$ wget ..
$ unzip ..
$ cd tracing/http_collector
$ ./start_zipkin.sh
..
..
zipkin          | 2017-03-28 03:48:00.936  INFO 9 --- [           main] zipkin.server.ZipkinServer : Started ZipkinServer in 7.36 seconds (JVM running for 8.595)

If you now go to http://localhost:9411/ in your browser, you will see the Zipkin web UI.

Creating Traces

Now, let's install the two libraries we need from requirements.txt via pip install -r requirements.txt. Then let's start our two services. First, the "external"-facing demo service:

$ python demo.py
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger pin code: 961-605-579

Then, the "internal" service1:

$ python service1.py
* Running on http://127.0.0.1:6000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger pin code: 961-605-579

Now, let's make a couple of requests to the demo service by running $ curl localhost:5000 twice. If we go back to the Zipkin web UI and click Find Traces, we will see the traces for those requests.

If we click on one of the traces, we will see the individual spans that make it up.

As we can see, four spans were created (two in each service), with the second, third, and fourth spans nested inside the first. The time reported for each span will make sense once we look at the code next.

Application Code

Let's look at the demo.py file first:

import time

import requests
from flask import Flask
from py_zipkin.zipkin import zipkin_span, create_http_headers_for_new_span

app = Flask(__name__)

# http_transport (defined in the next snippet) posts encoded spans to Zipkin.

@zipkin_span(service_name='webapp', span_name='do_stuff')
def do_stuff():
    time.sleep(5)
    headers = create_http_headers_for_new_span()
    requests.get('http://localhost:6000/service1/', headers=headers)
    return 'OK'

@app.route('/')
def index():
    with zipkin_span(
        service_name='webapp',
        span_name='index',
        transport_handler=http_transport,
        port=5000,
        sample_rate=100.0,  # value between 0.0 and 100.0; 100 traces everything
    ):
        do_stuff()
        time.sleep(10)
    return 'OK', 200

if __name__ == '__main__':
    app.run(port=5000, debug=True)

We create the first span inside the / handler function index() via the zipkin_span() context manager. We specify sample_rate=100, meaning every request will be traced (suitable only for a demo). The transport_handler specifies how the emitted spans are transported to the Zipkin collector. Here we use the http_transport function provided as an example by the py_zipkin project.
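The transport function itself isn't shown in the snippet above. A minimal sketch, modeled on the example transport in the py_zipkin documentation (assuming the Zipkin collector's v1 thrift endpoint on localhost), would look like this; both demo.py and service1.py need it in scope:

import requests


def http_transport(encoded_span):
    # POST the thrift-encoded span to the local Zipkin collector.
    requests.post(
        'http://localhost:9411/api/v1/spans',
        data=encoded_span,
        headers={'Content-Type': 'application/x-thrift'},
    )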

This handler function calls the do_stuff() function, where we create another span. Since it lives in the same service, we specify the same service_name and simply decorate the function with the zipkin_span decorator. We add an artificial delay of 5 seconds before making the HTTP call to the service1 service. Since we want service1's span to continue the current trace, we pass the span data along as HTTP headers, created via the helper function create_http_headers_for_new_span() provided by py_zipkin.
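For reference, the dict returned by create_http_headers_for_new_span() carries the standard B3 propagation headers, roughly of this shape (the id values below are made-up placeholders):

headers = {
    'X-B3-TraceId': 'e7f4c8aa21b1a8f2',       # id shared by all spans in the trace
    'X-B3-SpanId': '9c2f31b7e5d44a01',        # id of the new child span
    'X-B3-ParentSpanId': '5a1e8f03c6b27d94',  # id of the span we are in now
    'X-B3-Flags': '0',
    'X-B3-Sampled': '1',                      # whether this trace is sampled
}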

Let's look at the service1.py file next:

import time

from flask import Flask, request
from py_zipkin.zipkin import ZipkinAttrs, zipkin_span

app = Flask(__name__)

# http_transport is the same collector transport shown with demo.py.

@zipkin_span(service_name='service1', span_name='service1_do_stuff')
def do_stuff():
    time.sleep(5)
    return 'OK'

@app.route('/service1/')
def index():
    with zipkin_span(
        service_name='service1',
        zipkin_attrs=ZipkinAttrs(
            trace_id=request.headers['X-B3-TraceID'],
            span_id=request.headers['X-B3-SpanID'],
            parent_span_id=request.headers['X-B3-ParentSpanID'],
            flags=request.headers['X-B3-Flags'],
            is_sampled=request.headers['X-B3-Sampled'],
        ),
        span_name='index_service1',
        transport_handler=http_transport,
        port=6000,
        sample_rate=100.0,  # value between 0.0 and 100.0; 100 traces everything
    ):
        do_stuff()
    return 'OK', 200

if __name__ == '__main__':
    app.run(port=6000, debug=True)

This is almost the same as our demo service above, but note how we set the zipkin_attrs by making use of the headers passed to us from the demo service. This ensures that the span from service1 is nested within the span from demo. Note once again how we introduce artificial delays to make the time spent in each service show up more clearly in the trace.
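If more internal services join the stack, this header extraction could be factored into a small helper; here's one sketch (extract_zipkin_attrs is a name I made up, not part of py_zipkin):

from flask import request
from py_zipkin.zipkin import ZipkinAttrs


def extract_zipkin_attrs():
    # Rebuild ZipkinAttrs from the incoming B3 headers so this service's
    # span joins the caller's trace instead of starting a new one.
    return ZipkinAttrs(
        trace_id=request.headers['X-B3-TraceID'],
        span_id=request.headers['X-B3-SpanID'],
        parent_span_id=request.headers['X-B3-ParentSpanID'],
        flags=request.headers['X-B3-Flags'],
        is_sampled=request.headers['X-B3-Sampled'],
    )

Each downstream handler would then pass zipkin_attrs=extract_zipkin_attrs() to its zipkin_span().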

Ending Notes

Hopefully this post has given you a starting point for how you might go about implementing distributed tracing in your own applications.
