Graph databases are one of a plethora of NoSQL databases now found in the wild. What sets them apart from their sibling database types is their internal data structure: graph databases make the relationships within the data a first-class citizen on the same level as the data itself.
Consequently, they are most popular in social networking, big data, and science and research communities. These domains have datasets that are as much about the relationships they encode as they are about the data. When everything's interconnected -- say, like on the web -- then the real power of this model starts to become apparent.
So, rather than focusing on pairing data values to meaning (column heading, key name, etc.), graph databases express things using:
Nodes -- the subject of the relationship
Edges -- the relationships between Nodes and Properties
Properties -- the information related to the Node
Here's what a graph database might look like:
BigBlueHat writesFor Codeship
Reads like a sentence, right? In fact, in "triplestores" (a common graph database implementation type), these terms map to the following, more grammatically friendly names:
Subject -- the thing being talked about (node)
Predicate -- what's being said about the subject relative to the object (edge)
Object -- the thing related to the subject (property)
Simpler, right? Here are some more:
BigBlueHat contributesTo LevelGraph
BrightBall writesFor Codeship
BigBlueHat postsCodeAt https://github.com/bigbluehat
Two of the statements so far use the same "writesFor" predicate, so finding out "who writes for Codeship" would be trivial. But what if we wanted to ask "Where do authors at Codeship post code?" If you squint, you'll see two sets of potential algebraic-looking sentences in there:
X writesFor Codeship
X postsCodeAt Y
This ability to stack and relate other relationships into cascading related queries is one of the core powers of graph databases.
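To make the cascading idea concrete, here's a rough sketch that answers the two-part question over a plain JavaScript array. The `match` helper and the array itself are made up for illustration; a real graph database would not scan a list like this.

```javascript
// Illustrative only: a plain array of triples, not how a real graph database stores data.
const triples = [
  { subject: "BigBlueHat", predicate: "writesFor", object: "Codeship" },
  { subject: "BigBlueHat", predicate: "contributesTo", object: "LevelGraph" },
  { subject: "BrightBall", predicate: "writesFor", object: "Codeship" },
  { subject: "BigBlueHat", predicate: "postsCodeAt", object: "https://github.com/bigbluehat" }
];

// Hypothetical helper: return every triple whose fields equal the ones given in `pattern`.
function match(pattern) {
  return triples.filter(function (t) {
    return Object.keys(pattern).every(function (field) {
      return t[field] === pattern[field];
    });
  });
}

// Step 1: X writesFor Codeship
const authors = match({ predicate: "writesFor", object: "Codeship" })
  .map(function (t) { return t.subject; });

// Step 2: X postsCodeAt Y, for each X found in step 1
const answers = [];
authors.forEach(function (x) {
  match({ subject: x, predicate: "postsCodeAt" }).forEach(function (t) {
    answers.push({ x: x, y: t.object });
  });
});

console.log(answers);
```

The second step only runs for subjects the first step found, which is the "cascade" in a nutshell.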
Mind blown yet? Let's back up.
How to Build a Hexastore
LevelGraph is an implementation of a "hexastore" which is built upon LevelUp and LevelDB. Hexastores combine the consistent nature of triplestores with the boring but beneficial power of materialized indexes.
Given the triples we used above, you could imagine a key/value-based controller index with all the triples (subject, predicate, object) stored as the key with null or 1 or anything stored as the value. Doing that would give one the power to find all the relationships of a single subject.
For instance, you can imagine the statements above stored in Redis (or whatever) as these keys -- each with a value of 1:
BigBlueHat|writesFor|Codeship
BigBlueHat|contributesTo|LevelGraph
BrightBall|writesFor|Codeship
BigBlueHat|postsCodeAt|https://github.com/bigbluehat
Given the above, we could answer simple queries like "who does BigBlueHat writesFor?" or "what do we know about BrightBall?" But we could not ask the more complex query mentioned above or answer something more common like "who writesFor Codeship?"
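Here's a small sketch of why that is. It assumes the store can only read keys by prefix, and the `scan` helper is a hypothetical stand-in for that kind of read (it filters an in-memory array rather than seeking into a real index).

```javascript
// The four keys from above, sorted the way an ordered key/value store would keep them.
const keys = [
  "BigBlueHat|writesFor|Codeship",
  "BigBlueHat|contributesTo|LevelGraph",
  "BrightBall|writesFor|Codeship",
  "BigBlueHat|postsCodeAt|https://github.com/bigbluehat"
].sort();

// Hypothetical helper: a stand-in for a range read, returning keys that start with `prefix`.
function scan(prefix) {
  return keys.filter(function (key) { return key.indexOf(prefix) === 0; });
}

// Works: the subject (and optionally the predicate) leads the key.
const whoDoesBigBlueHatWriteFor = scan("BigBlueHat|writesFor|");
const aboutBrightBall = scan("BrightBall|");

// Finds nothing: every key starts with a subject, so a predicate + object prefix never matches.
const whoWritesForCodeship = scan("writesFor|Codeship|");

console.log(whoDoesBigBlueHatWriteFor, aboutBrightBall, whoWritesForCodeship);
```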
Hexastores use this same basic "key-heavy" storage style, but they store the same data six different ways. Seriously. Let's have a look.
We'll store S P O as our triple (for simplicity). A hexastore would store these six keys:
S|P|O
S|O|P
P|O|S
P|S|O
O|S|P
O|P|S
Seem crazy? Maybe. But it's also crazy efficient for lookups.
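As a rough sketch, generating those six keys for a triple takes only a few lines. The `hexastoreKeys` name and the "|" separator just mirror the notation used here; LevelGraph's actual key format may well differ.

```javascript
// Hypothetical helper: build the six key orderings for one triple.
function hexastoreKeys(triple) {
  const s = triple.subject;
  const p = triple.predicate;
  const o = triple.object;
  return [
    [s, p, o],
    [s, o, p],
    [p, o, s],
    [p, s, o],
    [o, s, p],
    [o, p, s]
  ].map(function (parts) { return parts.join("|"); });
}

const sixKeys = hexastoreKeys({
  subject: "BigBlueHat",
  predicate: "writesFor",
  object: "Codeship"
});

console.log(sixKeys);
```

Every write costs six keys, but any combination of known subject, predicate, and object can then be turned into a key prefix.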
Let's look at that query from earlier:
X writesFor Codeship
X postsCodeAt Y
Given that query, we'd use the keys that have a P|O|S structure first (to answer for "X") and, once we've solved for X, we'd check the S|P|O keys where X = S.
Remember, these are range-based, boring-old, materialized lookups. They're not doing table scans. They're jumping straight to that point in the index and returning what they find. Afterward, there's some combination -- if needed.
For the above query, then, we'd check this range first:
writesFor|Codeship|*
which would return these keys from our little dataset:
writesFor|Codeship|BigBlueHat
writesFor|Codeship|BrightBall
Then, using those results, we'd hit the index again, looking for this range:
BigBlueHat|postsCodeAt|*
Which would result in this key:
BigBlueHat|postsCodeAt|https://github.com/bigbluehat
We now have the answers to our "complex" little question. Pretty rad, right?
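The two-step lookup just described can be sketched in plain JavaScript. This assumes two sorted arrays standing in for the P|O|S and S|P|O orderings, and a hypothetical `range` helper that fakes a range read by filtering on a prefix; a real store would seek straight to that spot in its index.

```javascript
const triples = [
  ["BigBlueHat", "writesFor", "Codeship"],
  ["BigBlueHat", "contributesTo", "LevelGraph"],
  ["BrightBall", "writesFor", "Codeship"],
  ["BigBlueHat", "postsCodeAt", "https://github.com/bigbluehat"]
];

// Only the two orderings this query needs; a hexastore would keep all six.
const spo = triples.map(function (t) { return [t[0], t[1], t[2]].join("|"); }).sort();
const pos = triples.map(function (t) { return [t[1], t[2], t[0]].join("|"); }).sort();

// Hypothetical helper: stand-in for a range read.
// Returns whatever follows `prefix` in each matching key.
function range(index, prefix) {
  return index
    .filter(function (key) { return key.indexOf(prefix) === 0; })
    .map(function (key) { return key.slice(prefix.length); });
}

// X writesFor Codeship -> read the P|O|S range
const xs = range(pos, "writesFor|Codeship|");

// X postsCodeAt Y -> read one S|P|O range per X
const results = [];
xs.forEach(function (x) {
  range(spo, x + "|postsCodeAt|").forEach(function (y) {
    results.push({ x: x, y: y });
  });
});

console.log(results);
```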
Onward.
Meet LevelGraph
LevelGraph is (as mentioned) an implementation of a hexastore. It uses LevelUp and LevelDB as a foundational key/value store. In the browser, it uses level.js to store the same sort of hexastore key-based controller index (as seen above) into an IndexedDB database.
LevelGraph can be installed via browserify. Installation instructions can be found in the levelgraph GitHub repo.
Putting data into LevelGraph looks like:
db.put({ subject: "BigBlueHat", predicate: "writesFor", object: "Codeship" }, function(err) {
  if (err) throw err;
});
In this case, if there's an error, it will get thrown. Otherwise, it's safe to assume the triple went in verbatim.
Once the data has been added to LevelGraph, the underlying LevelDB (or IndexedDB if you're in a browser) will show keys looking rather similar to the ones above.
Getting data out of LevelGraph takes one of three forms:
db.get() -- for all triples containing any combination of a subject, predicate, or object
db.search() -- for doing queries like the one we did before + filtering + materialization
db.nav.* -- chained queries using methods similar to the Gremlin query system from Apache TinkerPop and friends
Let's take a look at each of those in turn.
First, a basic db.get() could be used to answer the "who writesFor Codeship?" question above:
db.get({ predicate: "writesFor", object: "Codeship" }, function(err, rv) {
  if (err) throw err;
  console.log('rv', rv);
});
The output is an array of triples (subject, predicate, object) that contain the "writesFor" predicate and the "Codeship" object values. We can use the callback to extract just the subjects from those objects, giving an array of answers to our question:
db.get({ predicate: "writesFor", object: "Codeship" }, function(err, rv) {
  if (err) throw err;
  var writers = rv.map(function(triple) {
    return triple.subject;
  });
  console.log('writers', writers);
  // writers = ["BigBlueHat", "BrightBall"]
});
Next, let's take a look at that more complex query using the db.search() API.
// X writesFor Codeship
// X postsCodeAt Y
db.search([
  { subject: db.v('x'), predicate: "writesFor", object: "Codeship" },
  { subject: db.v('x'), predicate: "postsCodeAt", object: db.v('y') }
],
{}, // we'll talk about this shortly
function(err, rv) {
  if (err) throw err;
  console.log('rv', rv);
});
The output is an array of results mapped to their db.v() value holders.
// output of the above `db.search`
rv = [{x: "BigBlueHat", y: "https://github.com/bigbluehat"}]
It's also possible to include a filter function alongside any triple used in the search queries. This can be helpful for further limiting the output as the queries cascade into each other. For instance, if we only wanted code posted on GitHub, we could make the second search query object read:
{
  subject: db.v('x'),
  predicate: "postsCodeAt",
  object: db.v('y'),
  filter: function(triple) {
    // keep only triples whose object mentions github.com
    return triple.object.indexOf("github.com") > -1;
  }
}
Once the results are in, we can also provide a final filter in that second parameter we saw earlier (currently set to {}). However, those filter functions take a slightly different form, using a callback function to confirm or reject the result (aka the solution to the query):
function filter(solution, callback) {
  // we're only looking for things starting with `B`
  if ('x' in solution && solution.x.search(/^B/) > -1) {
    // confirm/keep the solution/result
    callback(null, solution);
  } else {
    // reject/skip
    callback(null);
  }
}
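To see the confirm/reject convention in action without a database, here's a small sketch. The `applyFilter` function is a hypothetical stand-in for whatever the search does with the filter internally (it is not LevelGraph's own code), and the sample solutions are made up.

```javascript
// Same shape as the filter above: call back with the solution to keep it,
// or with nothing to drop it.
function startsWithB(solution, callback) {
  if ('x' in solution && /^B/.test(solution.x)) {
    callback(null, solution);
  } else {
    callback(null);
  }
}

// Hypothetical stand-in for how a search might run a filter over its solutions.
function applyFilter(solutions, filter) {
  const kept = [];
  solutions.forEach(function (solution) {
    filter(solution, function (err, result) {
      if (err) throw err;
      if (result) kept.push(result);
    });
  });
  return kept;
}

// Made-up sample solutions for illustration.
const kept = applyFilter([
  { x: "BigBlueHat", y: "https://github.com/bigbluehat" },
  { x: "Codeship" },
  { y: "no x here" }
], startsWithB);

console.log(kept);
```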
Additionally, if we wanted to reshape the output back into triples, we can use the materialized object. This is most useful when creating new triples within the graph based on results found from a search query.
For example:
{
  materialized: {
    subject: "Codeship",
    predicate: "authorCodeRepository",
    object: db.v('y')
  }
}
The result in this case being a new triple of:
Codeship authorCodeRepository https://github.com/bigbluehat
This can be super handy when you need to change the shape of the graph or make new statements into other graphs. Additionally, you can use the materialized output to do additional db.get() and db.search() requests using the materialized output as the query options in the next request or perhaps even removing any remaining statements from the graph that used a different predicate.
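Conceptually, materializing is just filling a triple-shaped template with values from a solution. Here's a minimal sketch of that idea; the `v` and `materialize` functions are hypothetical stand-ins and say nothing about how LevelGraph implements db.v() or the materialized option.

```javascript
// Hypothetical stand-in for db.v(): marks a slot to be filled from a solution.
function v(name) {
  return { variable: name };
}

// Hypothetical helper: copy the template, replacing each variable
// with the matching value from the solution.
function materialize(template, solution) {
  const triple = {};
  Object.keys(template).forEach(function (field) {
    const value = template[field];
    const isVariable = value !== null && typeof value === "object" && "variable" in value;
    triple[field] = isVariable ? solution[value.variable] : value;
  });
  return triple;
}

const template = { subject: "Codeship", predicate: "authorCodeRepository", object: v("y") };
const solution = { x: "BigBlueHat", y: "https://github.com/bigbluehat" };

const newTriple = materialize(template, solution);
console.log(newTriple);
```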
Removing triples looks rather like putting them, only the method is db.del():
db.del(
  { subject: "BigBlueHat", predicate: "postsCodeAt", object: "https://github.com/bigbluehat" },
  function(err) {
    if (err) throw err;
  }
);
LevelGraph looks up the six variations of the triple-based keys in the hexastore and deletes them. If it can't find the key(s), it throws an error.
Now. Editing is going to feel a bit odd...
Because these are key names and not values, we can't just alter them in place. We have to remove them and then put the new ones in.
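A quick sketch shows why. Here a Set stands in for the key/value store and the hypothetical `sixKeys` helper uses the "|" notation from earlier, so "editing" one object value turns into six deletes and six adds.

```javascript
// Hypothetical helper: the six key orderings for one triple.
function sixKeys(s, p, o) {
  return [
    [s, p, o], [s, o, p],
    [p, o, s], [p, s, o],
    [o, s, p], [o, p, s]
  ].map(function (parts) { return parts.join("|"); });
}

const oldUrl = "https://github.com/bigbluehat";
const newUrl = "https://github.com/BigBlueHat/";

// A Set stands in for the key/value store; the stored values don't matter here.
const store = new Set(sixKeys("BigBlueHat", "postsCodeAt", oldUrl));

// There is no value to overwrite, so the old keys go out and the new keys go in.
sixKeys("BigBlueHat", "postsCodeAt", oldUrl).forEach(function (key) { store.delete(key); });
sixKeys("BigBlueHat", "postsCodeAt", newUrl).forEach(function (key) { store.add(key); });

console.log(Array.from(store));
```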
Let's change the https://github.com/bigbluehat URL to https://github.com/BigBlueHat/ (which is technically more accurate). We'll start with the db.del() operation above, and then db.put() the replacement:
db.del(
  { subject: "BigBlueHat", predicate: "postsCodeAt", object: "https://github.com/bigbluehat" },
  function(err) {
    if (err) throw err;
    db.put(
      { subject: "BigBlueHat", predicate: "postsCodeAt", object: "https://github.com/BigBlueHat/" },
      function(err) {
        if (err) throw err;
      }
    );
  }
);
That should do it. In fact, you could reverse those, so you'd be sure that the new stuff made it in before you removed the old one. Let's do that in the next example.
The code above works when we know a single triple that has the string we need to update in it. If we need to update all occurrences of the string in the graph, then we'll need to search for them all and then re-put them. Like so:
db.search([
  { subject: db.v('s'), predicate: db.v('p'), object: "https://github.com/bigbluehat" }
], {}, function(err, rv) {
  if (err) throw err;
  rv.forEach(function(r) {
    // putting the new stuff in first, to be sure it makes it in
    db.put({ subject: r.s, predicate: r.p, object: "https://github.com/BigBlueHat/" }, function(err) {
      if (err) throw err;
      // new stuff is in, so let's remove the old
      db.del({ subject: r.s, predicate: r.p, object: "https://github.com/bigbluehat" }, function(err) {
        if (err) throw err;
      });
    });
  });
});
While that may seem tedious, it has the advantage of being progressive improvement -- good stuff in, then bad stuff out -- or the other way around if you'd like. It's up to you.
One thing worth noting is that we've not had to name any columns or figure out primary or secondary keys. We didn't even have to determine what we were going to index before we stored something. Since the data has this consistent, triple-based shape throughout the process, those things are implicitly handled.
Let's move on to Linked Data.
Linked Data Land
LevelGraph is one of many hexastore implementations, but it runs in both Node.js and the browser, which makes it an ideal candidate for building apps that work with Linked Data.
Linked Data uses this same graph-based thinking plus the webby nature of graphs (or is it the graphy nature of the web...) and brings that to data -- not just documents.
Our examples above were fine for what they are, but they assumed (rashly) a closed-world environment in which "writesFor" is seen as sufficient. However, what happens when the two of us exchange data and you used something like "authorsFor"? What happens if English isn't your native language?
That's where Linked Data comes in.
Let's go back to that very first triple:
BigBlueHat writesFor Codeship
In this triple, we're using plain strings made of English words to be the identifiers of the things themselves. Trouble is, that don't scale so good.
Here's what that same triple would look like using the Schema.org Linked Data vocabulary, created by Google, Microsoft, Yahoo, Yandex, and friends and developed in a W3C Community Group:
https://blog.codeship.com/json-ld-building-meaningful-data-apis/ http://schema.org/author http://bigbluehat.com/#me
https://blog.codeship.com/json-ld-building-meaningful-data-apis/ http://schema.org/publisher https://codeship.com/#
We've now made two global-friendly statements using five total identifiers. Interestingly, we've now introduced an identifier (and in this case it's also a locator) of something published and authored -- which is information we'd lacked before.
However, there's now no direct relationship between the "BigBlueHat" identifier and the "Codeship" (blog) identifier. That said, we now specify one of the things that "BigBlueHat" actually wrote for "Codeship" and a URL of where it's published, so those are wins to be sure.
But how do we find out "who writes for the Codeship blog" now?
db.search([
  { subject: db.v('article'), predicate: "http://schema.org/author", object: db.v('author') },
  { subject: db.v('article'), predicate: "http://schema.org/publisher", object: db.v('publisher') }
], {}, function(err, rv) {
  if (err) throw err;
  console.log('rv', rv);
});
The output should look like this:
rv = [
  {
    author: "http://bigbluehat.com/#me",
    article: "https://blog.codeship.com/json-ld-building-meaningful-data-apis/",
    publisher: "https://codeship.com/#"
  }
];
If we were to update Brightball's triple statements to use the Schema.org vocabulary, there'd be more results. Additionally, we could use a filter function at the end to clear out the article names if we were only interested in authors, for instance. However, in this case, we have even richer data returned than we did earlier, not to mention more accurately defined data!
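If authors are all we're after, one option is to trim the results with plain array methods inside the search callback. This is only a sketch: the first entry below is the result shown above, while the second is a made-up extra article added so the de-duplication has something to do.

```javascript
// First entry: the result shown above. Second entry: hypothetical, for illustration only.
const rv = [
  {
    author: "http://bigbluehat.com/#me",
    article: "https://blog.codeship.com/json-ld-building-meaningful-data-apis/",
    publisher: "https://codeship.com/#"
  },
  {
    author: "http://bigbluehat.com/#me",
    article: "https://blog.example.com/hypothetical-second-post/",
    publisher: "https://codeship.com/#"
  }
];

const codeshipAuthors = rv
  // keep only things published by Codeship
  .filter(function (solution) { return solution.publisher === "https://codeship.com/#"; })
  // drop the article and publisher, keeping just the author identifier
  .map(function (solution) { return solution.author; })
  // an author with several articles should only be listed once
  .filter(function (author, i, all) { return all.indexOf(author) === i; });

console.log(codeshipAuthors);
```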
Conclusion
This only scratches the surface of both LevelGraph and Linked Data. The graph model can take some time to get your head around, but it's one that's been increasingly delightful to work with as I learn the tooling, understand graph querying, and build more tools around it.
If you're new to all this and want to play with LevelGraph without the installation setup, give the LevelGraph Playground a look.
You can add colloquial ("closed world") graphs if you'd like, but it also supports common Linked Data formats such as JSON-LD, Turtle, and N3. JSON-LD is increasingly common and if you've landed at the Schema.org site while reading this article, you might enjoy testing out their JSON-LD examples on the playground.
The graphs that will be stored don't always connect, so they're not necessarily great for testing queries against. However, those examples should give you a quick way to build up a larger graph.
Lastly, you can also then output any filtered results into JSON-LD or Turtle and send them off to other people who can add them to their graph. If you use the same vocabulary, then you'll both benefit from the shared meaningful data that you can query across your own personal knowledge store.