WebHook Events and the Party Problem


There was a dark time in the history of the web. The time before services had APIs. Thankfully, that time is past now. Everyone recognizes the need to provide their customers with API access to their service.

If your service only has a few customers using your API, they probably don’t generate too much load, and you probably do not enforce API limits. The smart businesses will have at least some basic mechanism in place to enable enforcing API limits in the event of a runaway overnight success.

Hopefully your service grows to a level of success where you need API limits. Now you start to get requests from customers (or maybe you had these requests before). Before API limits, customers could just repeatedly poll your API to see what changed. Clients that poll repeatedly are rude, but they are also really simple to write and debug. The solution seems obvious: invert the responsibility, so that your API calls the customer when things change. WebHooks, events, callbacks, etc.: it doesn’t matter what we call them, they are the pattern we use to remove the repeat pollers and reduce the API load on our services.

If we were to define a Maturity Model for service APIs, it would probably look something like this:

  1. We have an API
  2. You can do everything with our API
  3. You need to know about our limits
  4. We have WebHooks

But I don’t think we should stop there. I think there is at least one further level, and unfortunately, I don’t see many service APIs that are at this level yet. An analogy may help.

The party problem

We’re hosting a big party - imagine one of those Oscar after-parties. Guests are coming and going.


We offer an API that tells you who is at the party. Deals are made at these parties, and there’s no point in being at the party to pitch Will Smith if he isn’t even there. Heck, there’s no point staying at the party to try and bend Scarlett Johansson’s ear on your script if she has left already. So we offer /api/v1/attending/{name} to let you check if the person is there. This is level 1. We have an API that meets the needs of a few big paying customers.

There are other customers of our API who want to know who’s hanging with whom. Ok so the gossip rags are a bit of a shady business, but they’re paying customers… and the celebrities actually need the attention. We cannot have them continuously running:

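# Poll the attendance endpoint once per name in celebs.txt
for name in $(cat celebs.txt); do
  # --fail makes curl exit non-zero on an HTTP error status, so (assuming the
  # API answers with an error for absent guests) only attendees are recorded
  curl --fail --silent "http://bigparty.example.org/api/v1/attending/$name" && echo "$name" >> attendees
done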

Never mind that it drowns our servers, their celebs.txt file is missing the hot new blood that draws the big names to our party. So we offer /api/v1/attendees that returns a list of everyone in the party. This is level 2.

Did you know that we have a bunch of high-frequency traders using our API? Seemingly they are using machine learning to correlate the party attendees and try and predict what movie deals will be made in order to drive investment decisions in the various movie studios! Who knew? But yeah, we’re going to have to put rate limits on our API because they are absolutely hammering it. This is level 3.

Nobody is happy with the API limits. The media moguls who are waiting to time their entrance immediately after Jennifer Lawrence arrives at the party had been used to running

while ! curl --fail --silent http://bigparty.example.org/api/v1/attending/Jennifer+Lawrence ; do sleep 1 ; done

And everyone else has been used to basically hitting http://bigparty.example.org/api/v1/attendees as often and as fast as they can in order to get the edge on their competition.

So we introduce a WebHook API. The media moguls can just subscribe to the one or two people they are interested in receiving notifications on, and we’ll call them as soon as they arrive or leave. Everyone else can subscribe to the full event stream to be notified as each person arrives and leaves. This is level 4.

So now, as a user of the API, if we want to know the changes in the attendees, we have a choice:

  • We can periodically get the list of all attendees and manually correlate the delta since the last poll; or
  • We can subscribe to the WebHook.

But what if we want to know who is in the party at any point in time?

If WebHooks were reliable, if our WebHook receiver had 100% uptime, and if the network between the two were always available, then we could work from a known starting point and just keep applying the deltas to our list.

But these things are not 100% true. They are probably mostly true, but that 1% or 0.1% or 0.01% happens annoyingly often enough that we need to put in place some way to correct for it.
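To make the failure mode concrete, here is a minimal sketch of a receiver that applies deltas to a local set. The event shape (`type` and `name` keys with `arrived` and `left` values) is a hypothetical payload invented for illustration, not something the party API defines.

```python
def apply_event(attendees, event):
    """Apply one arrival or departure event to a local set of names."""
    if event["type"] == "arrived":
        attendees.add(event["name"])
    elif event["type"] == "left":
        attendees.discard(event["name"])


def replay(start, events):
    """Start from a known list and apply each delivered event in order."""
    state = set(start)
    for event in events:
        apply_event(state, event)
    return state


# Hypothetical event stream sent by the service.
events = [
    {"type": "arrived", "name": "Tom Cruise"},
    {"type": "left", "name": "Emma Watson"},
]
start = {"Emma Watson", "Will Smith"}

complete = replay(start, events)    # every event was delivered
lossy = replay(start, events[:1])   # the "left" event never reached us

print(sorted(complete))
print(sorted(lossy))
```

With one undelivered event the local list still shows Emma Watson, and nothing in the event stream alone will ever correct that.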

The naïve solution is to just periodically call the list of attendees API. But what happens if there are WebHook events coincident with our request?

  1. Ask for list of current attendees
  2. Tom Cruise arrives
  3. Emma Watson leaves
  4. We get the list of attendees

Should we add Tom Cruise to our list of attendees? Or should we remove him? What about the event about Emma Watson leaving? What if she wasn’t even in our original list of attendees? And we haven’t even described the case where the list of attendees we are given includes Emma Watson because she was still present when the service backend started the transaction that built the list.

Periodically calling the list of attendees API is not making things easier.

Maybe repetition is the key. We decide the rate of change is too fast to be sure at any point in time that our list is 100% accurate, but we want to ensure that we heal the event gap in some reasonable time.

Suppose we call the attendees API twice in succession and compare the two lists. Anyone who is present in both lists most certainly was present at some time between making the first request and receiving the second response. Anyone who is absent from both lists most certainly was not present at some time in that window. This gives us two lists of corrections that we can apply.
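A minimal sketch of that double-poll comparison, assuming the two responses have already been fetched and parsed into collections of names (the function and variable names here are made up for illustration):

```python
def corrections(first, second, local):
    """Derive corrections for `local` from two successive attendee lists.

    Returns (to_add, to_remove, undecided):
      to_add    - in both lists but missing from our local state
      to_remove - in our local state but in neither list
      undecided - in only one list, so they changed during the window
    """
    first, second, local = set(first), set(second), set(local)
    in_both = first & second
    in_either = first | second
    to_add = in_both - local
    to_remove = local - in_either
    undecided = first ^ second
    return to_add, to_remove, undecided


# Hypothetical data: our local view has drifted from the service.
local = {"Emma Watson", "Jennifer Lawrence"}
first = {"Will Smith", "Emma Watson"}
second = {"Will Smith", "Tom Cruise"}

to_add, to_remove, undecided = corrections(first, second, local)
print(sorted(to_add), sorted(to_remove), sorted(undecided))
```

Names that appear in only one of the two lists are deliberately left alone here: they changed during the window, so the event stream (or the next pair of polls) is a better authority on them.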

What would be much better is if the party attendees API included some mechanism to assist clients to reconcile their internal state between the two API paths. This is level 5.

The Service API Maturity Model (SAMM)

  • Level 1: “We have an API” The API is characterized by functionality gaps and probably has been driven by the needs of one or two big customers who have effectively designed the API through feature requests.
  • Level 2: “You can do everything with our API” The API is now sufficiently powerful that you can avoid using the service UI and just interact with the API. If there are usage limits, the limits are sufficiently high that most customers can effectively ignore them.
  • Level 3: “You need to know about our limits” The service and API are generally successful and, as a result, the limits have been lowered to the point that they affect most customers.
  • Level 4: “We have WebHooks” Look at our shiny events API that will call you when things change. Please stop polling our API as it is causing the operations team to lose sleep.
  • Level 5: “Reconciliation” There is a mechanism that allows WebHook receivers to reconcile the events they receive with the point-in-time queries from the query API.

What can Level 5 look like?

There are many ways this reconciliation can be achieved, depending on the kind of back-end the service is built on.

  • If your service has a global monotonic counter that gets incremented with each change and transactional isolation, you could include that counter value in every event payload and query API response. This lets the consumer decide which events can be ignored because they have a newer query API response.
  • If your service has a global monotonic counter but cannot provide transactional isolation for queries, you can return the starting and ending values of your counter with query responses. Clients can then divide events into three buckets: those from before the query started, those from after it finished, and those concurrent with it. Only the items touched by the concurrent events need to be re-queried.
  • If you cannot provide a global monotonic counter, a timestamp can be used to provide a weaker form of reconciliation.
  • If you cannot provide a global monotonic counter, perhaps you can provide a monotonic counter per logical context. In our party API example, suppose our party was a hat swapping party where guests were encouraged to swap hats. We might expose an API of what hat each attendee was wearing. We could have one monotonic counter for the list of attendees and a monotonic counter for each attendee to track changes in their current hat.
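As a rough sketch of the counter-based options in the list above, the following classifies events against a query response that reports the counter value at the start and end of the query. The field names (`version`, `type`, `name`) and the parameters `query_start` and `query_end` are assumptions made for this example. With transactional isolation the two counter values are equal and the concurrent bucket is always empty.

```python
def classify(event_version, query_start, query_end):
    """Bucket an event relative to the counter window of a query response."""
    if event_version <= query_start:
        return "stale"        # already reflected in the query response
    if event_version > query_end:
        return "newer"        # happened after the query, so apply it
    return "concurrent"       # overlapped the query, outcome unknown


def reconcile(snapshot, query_start, query_end, events):
    """Merge a query snapshot with received events.

    Returns (state, recheck) where `recheck` holds names whose only
    evidence is a concurrent event and which should be queried again.
    """
    state = set(snapshot)
    recheck = set()
    for event in sorted(events, key=lambda e: e["version"]):
        bucket = classify(event["version"], query_start, query_end)
        name = event["name"]
        if bucket == "newer":
            if event["type"] == "arrived":
                state.add(name)
            else:
                state.discard(name)
            recheck.discard(name)   # a newer event settles the question
        elif bucket == "concurrent":
            recheck.add(name)
    return state, recheck


# Hypothetical query response taken while the counter moved from 10 to 12.
snapshot = {"Will Smith", "Emma Watson"}
events = [
    {"version": 9, "type": "arrived", "name": "Will Smith"},
    {"version": 11, "type": "left", "name": "Emma Watson"},
    {"version": 13, "type": "arrived", "name": "Tom Cruise"},
]
state, recheck = reconcile(snapshot, 10, 12, events)
print(sorted(state), sorted(recheck))
```

In this run only Emma Watson needs a follow-up query, because her departure overlapped the query and the snapshot may or may not reflect it.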

There is one approach that I don’t see in use that I think would be quite useful for some APIs. Bloom filters. The basic idea is like this:

  • The replicating side picks a random seed, and based on the estimate for the number of elements and the desired accuracy, sizes a bloom filter.
  • The bloom filter is populated using hash(seed + item)
  • Send the seed and the bloom filter to the authoritative side.
  • The authoritative side runs through the same bloom filter population to build its filter and sends events for any missing items.
  • The authoritative side completes by sending the difference between its bloom filter and the replicating side’s bloom filter.
  • The replicating side runs through all the items again removing those on the final filter.

NOTE: The use of a client chosen random seed for every round is to ensure that hash collisions affect different elements each round and thus items mapping to the same bit of the Bloom Filter in one round will be on different bits in subsequent rounds.
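The round described above might be sketched as follows. This is an in-process simulation under several assumptions: both sides are plain Python sets, the "network" is a function call, and the filter size, hash count and use of SHA-256 are arbitrary choices for illustration rather than anything a real service prescribes.

```python
import hashlib
import secrets


class Bloom:
    """Tiny Bloom filter that stores its bits in a Python int."""

    def __init__(self, seed, size_bits, hashes=3):
        self.seed, self.size, self.hashes = seed, size_bits, hashes
        self.bits = 0

    def positions(self, item):
        """Bit positions for an item, derived from hash(seed + item)."""
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{self.seed}:{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self.positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self.positions(item))


def authority_round(items, seed, size, replica_bits):
    """Authoritative side: find items the replica lacks, plus stale bits."""
    replica = Bloom(seed, size)
    replica.bits = replica_bits
    own = Bloom(seed, size)
    missing = []
    for item in items:
        own.add(item)
        if not replica.might_contain(item):
            missing.append(item)            # would be sent as an event
    stale_bits = replica_bits & ~own.bits   # set by the replica only
    return missing, stale_bits


def replica_round(local_items, authoritative_items, size=1024):
    """Replicating side: one full reconciliation round with a fresh seed."""
    seed = secrets.token_hex(8)
    mine = Bloom(seed, size)
    for item in local_items:
        mine.add(item)
    missing, stale_bits = authority_round(authoritative_items, seed, size, mine.bits)
    # Drop local items that touch a bit the authoritative side never set.
    kept = {
        item for item in local_items
        if not any((stale_bits >> p) & 1 for p in mine.positions(item))
    }
    return kept | set(missing)


local = {"Will Smith", "Emma Watson"}
authoritative = {"Will Smith", "Tom Cruise"}
print(sorted(replica_round(local, authoritative)))
```

A single round can miss a correction when a hash collision hides it, which is why each round picks a new seed: repeating the round makes it increasingly unlikely that the same item stays hidden.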

What I like about this model is that there is effectively no difference between bootstrapping and reconciliation. As a replicator, when we first start up we can start processing events immediately. Our initial bloom filter will be all 0’s and the authoritative side can send us the initial list of items at a rate of its choosing.

This kind of reconciliation API can be implemented either by sending the reconciliation data through the WebHooks or by returning it as a Query API response.

Because it cannot give you perfect reconciliation, it encourages consumers to recognize the reality of eventual reconciliation.

Because the Bloom Filter requires minimal state to compute and doesn’t require any ordering in the processing of items, it does not impose any transactional requirements on either side.

If we let the client specify what additional state to include in the hash we can also provide updates for internal state changes. For example, if JIRA offered this kind of API then the client could say hash(seed + ticket_id + ticket_title + ticket_status) so that it can discover not just when tickets are created or renamed, but when the display title changes or when the ticket state changes. Other clients might only ask for hash(seed + ticket_id) because they just want to know that the ticket exists.
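A small sketch of that idea, using a made-up ticket dictionary rather than any real JIRA API: the client chooses which fields feed the hash, and a change to any chosen field produces a different key, so the item looks "missing" to the other side's filter.

```python
import hashlib


def item_key(seed, ticket, fields):
    """Hash the seed together with the client-selected ticket fields.

    A separator keeps ("AB", "C") and ("A", "BC") from hashing alike,
    which plain string concatenation would not guarantee.
    """
    parts = [seed] + [str(ticket[field]) for field in fields]
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()


# Hypothetical ticket before and after a status change.
before = {"ticket_id": "PARTY-42", "ticket_title": "Order more ice", "ticket_status": "Open"}
after = dict(before, ticket_status="Closed")

seed = "round-1"
existence_only = ("ticket_id",)
with_state = ("ticket_id", "ticket_title", "ticket_status")

# A client tracking only existence sees no change...
print(item_key(seed, before, existence_only) == item_key(seed, after, existence_only))
# ...while a client that includes the status sees a different key.
print(item_key(seed, before, with_state) == item_key(seed, after, with_state))
```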

If you cannot provide a global monotonic counter (and I suspect most really good services cannot) please consider offering something like a Bloom Filter to allow clients to reconcile your events API with your query API.

Stephen Connolly has over 25 years of experience in software development. He is involved in a number of open source projects, including Jenkins. Stephen was one of the first non-Sun committers to the Jenkins project and developed the weather icons. Stephen lives in Dublin, Ireland - where the weather icons are particularly useful. Follow Stephen on Twitter, GitHub and on his blog.