Time-Based Versioned Graphs

May 13th, 2014  |  Published in Cypher, Graph Databases, Neo4j by Ian Robinson

(Want to talk about this or any other data-modelling topic in person? Sign up for our GOTO Amsterdam tutorial, June 18.)

Many graph database applications need to version a graph so as to see what it looked like at a particular point in time. Neo4j doesn’t provide intrinsic support for versioning, either at the level of its labelled property graph model or in its Cypher query language. Therefore, to version a graph we need to make our application graph data model and queries version aware.

I’ll say this up front: versioning necessarily creates a lot more data – both more nodes and more relationships. In addition, queries will tend to be more complex, and slower, because every MATCH must take account of one or more versioned elements. Given these overheads, apply versioning with care. Perhaps not all of your graph needs to be versioned. If that’s the case, version only those portions of the graph that require it.

In this post I’m going to discuss what I consider to be the easiest form of versioning – time-based versioning. Time-based versioning gives us a universal versioning scheme: pick an instant, convert it to a long value, and the resulting version number applies to all parts of the graph. Version-number-based versioning, where each entity has its own independently varying version number (and even its own versioning scheme), is more complex to implement. I’ll be dealing with that in a future post.

All graph versioning strategies with Neo4j must make tradeoffs between the cost of writes, the amount of on-disk storage required, the degree of fragmentation incurred, and the performance of reads for both current and historical queries. The design described here strikes a balance between these competing considerations. At the end of the post I describe some variations and optimisations.

The examples here were written using Neo4j Community 2.0.3. I’ve also included the examples in a GraphGist.

Separate Structure From State

The key to versioning a graph is separating structure from state. This allows us to version the graph’s structure independently of its state.

To help describe how to design a version-aware graph model, I’m going to introduce some new terminology: identity nodes, state nodes, structural relationships and state relationships.

Identity Nodes

Identity nodes are used to represent the positions of entities in a domain-meaningful graph structure. Each identity node contains one or more immutable properties, which together constitute an entity’s identity. In a version-free graph (the kind of graph we build normally) nodes tend to identify an entity, locate it in the network structure, and represent its state. Identity nodes in a version-aware graph, in contrast, serve only to identify an entity and locate it in a network structure.

Structural Relationships

Identity nodes are connected to one another using timestamped structural relationships. These structural relationships are similar to the domain relationships we’d include in a version-free graph, except they have two additional properties, from and to, both of which are timestamps.

State Nodes and Relationships

Connected to each identity node are one or more state nodes. Each state node represents an immutable snapshot of an entity’s state. State nodes are connected to identity nodes using timestamped state relationships.

Example

The following example describes a simple domain made up of shops, products and suppliers. A Shop SELLS one or more Products, which are SUPPLIED_BY Suppliers. Over time prices change, shops sell different products, and source them from different suppliers. With a versioned graph we’ll be able to see which products a shop was selling, what price they were being sold at, and who was supplying them, at any point in the graph’s history.

The diagram below shows a version-aware graph representation of this domain. To help you see how structural and state-based elements are combined in the same graph, I’ve drawn identity nodes and structural relationships in blue, state nodes and state relationships in red. In the data itself, I’ve labelled the identity nodes Shop, Product, and Supplier, and the state nodes ShopState, ProductState and SupplierState. Structural relationships are named SELLS and SUPPLIED_BY. All state relationships are named STATE.

[Figure: initial-state, showing the version-aware graph on 1 January 2014]

This graph represents the domain in its initial state on 1 January 2014. All the relationships, both structural and state, are current. Each relationship has a from property with a long millisecond value representing 1 January 2014, and a to property with a very large long value (End-Of-Time, or EOT, which is a magic number – in this case Long.MAX_VALUE) that effectively means there is no current upper bound to the period associated with the relationship.

Here’s the Cypher for creating this graph in its initial state:

CREATE
(s1:Shop{shop_id:1})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (ss1:ShopState{name:'General Store'}),
(p1:Product{product_id:1})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (ps1:ProductState{name:'Cheese',price:1.00}),
(p2:Product{product_id:2})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (ps2:ProductState{name:'Crisps',price:0.50}),
(p3:Product{product_id:3})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (ps3:ProductState{name:'Orange Juice',price:1.50}),
(u1:Supplier{supplier_id:1})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (us1:SupplierState{name:'International Imports'}),
(u2:Supplier{supplier_id:2})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (us2:SupplierState{name:'Local Markets'}),
(s2:Shop{shop_id:2})
  -[:STATE{from:1388534400000,to:9223372036854775807}]->
  (ss2:ShopState{name:'Cornershop'}),
(s1)-[:SELLS{from:1388534400000,to:9223372036854775807}]->(p1),
(s1)-[:SELLS{from:1388534400000,to:9223372036854775807}]->(p2),
(s2)-[:SELLS{from:1388534400000,to:9223372036854775807}]->(p3),
(p1)-[:SUPPLIED_BY{from:1388534400000,to:9223372036854775807}]->(u2),
(p2)-[:SUPPLIED_BY{from:1388534400000,to:9223372036854775807}]->(u1),
(p3)-[:SUPPLIED_BY{from:1388534400000,to:9223372036854775807}]->(u2)

Current Queries

To query the current state of a version-aware graph, we need to match current structural and state relationships, which we can do simply by matching the to property on relationships against our EOT value. Here’s a query for finding all products that currently retail for one pound or less, together with their suppliers, and the shops where they can be purchased:

MATCH (u:Supplier)<-[:SUPPLIED_BY{to:9223372036854775807}]-
      (p:Product)-[:STATE{to:9223372036854775807}]->(ps:ProductState)
WHERE ps.price <= 1.0
MATCH (p)<-[:SELLS{to:9223372036854775807}]-(s:Shop)
MATCH (u)-[:STATE{to:9223372036854775807}]->(us:SupplierState)
MATCH (s)-[:STATE{to:9223372036854775807}]->(ss:ShopState)
RETURN us.name AS supplier, 
       ps.name AS product, 
       ps.price AS price, 
       ss.name AS shop
ORDER BY price DESC

Results:

+--------------------------------------------------------------+
| supplier                | product  | price | shop            |
+--------------------------------------------------------------+
| "Local Markets"         | "Cheese" | 1.0   | "General Store" |
| "International Imports" | "Crisps" | 0.5   | "General Store" |
+--------------------------------------------------------------+

As you can see, because of the need to make all parts of the query version-aware, even simple queries can become quite complex.
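
In practice you would pass the EOT value (and any other timestamps) as query parameters rather than repeating the literals. Here’s a minimal sketch of the product-and-price part of the query above, assuming the application supplies an EOT parameter of 9223372036854775807:

MATCH (p:Product)-[:STATE{to:{EOT}}]->(ps:ProductState)
WHERE ps.price <= 1.0
RETURN ps.name AS product, ps.price AS price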

Modifying the Graph

The kinds of changes we may need to track in a version-aware graph include changes to state and changes to structure. Changes to state involve adding, removing and modifying node properties. Structural changes involve adding and deleting relationships, as well as modifying the strength, weight or quality of relationships by changing one or more of their properties.

Modifying Structure

In the following query, we move product 1 from shop 1 to shop 2. This change will be effective from 1 February 2014 (1391212800000 milliseconds since 1970-01-01T00:00:00Z).

MATCH (p:Product{product_id:1})<-[r:SELLS]-(:Shop)
WHERE r.to = 9223372036854775807
MATCH (s:Shop{shop_id:2})
SET r.to = 1391212800000
CREATE (s)-[:SELLS{from:1391212800000,to:9223372036854775807}]->(p)

As part of this query, we archive the old SELLS relationship, setting its to property to the millisecond value representing 1 February 2014, and introduce a new SELLS relationship that connects product 1’s identity node to shop 2’s identity node. We set the from property of this new relationship to 1 February 2014, and its to property to EOT.

This is what the graph looks like after we’ve made the change. I’ve used a dashed blue arrow to show the archived SELLS relationship, and a thick blue arrow to show the new SELLS relationship.

[Figure: move-product, showing the graph after product 1 moves from shop 1 to shop 2]

Modifying State

In the following query, we update product 1’s price. As with our structural change, this change will be effective from 1 February 2014.

MATCH (p:Product{product_id:1})
       -[r1:STATE{to:9223372036854775807}]->(currentState:ProductState)
SET r1.to = 1391212800000
CREATE (p)-[r2:STATE{from:1391212800000,to:9223372036854775807}]->
       (newState:ProductState)
SET newState = currentState
SET newState.price = 2.0

This time, we archive the product’s current STATE relationship, and then create a new state node and a new STATE relationship. We copy the product’s properties from the old state node to the new state node (SET newState = currentState), and then update the price.

[Figure: update-product, showing the archived and new state of product 1]

Historical Queries

We’ve seen that to query the current state of the graph, we need to do an exact match on the to property of both structural and state relationships, matching against our EOT value. Historical queries are slightly different, in that we’re looking for relationships whose from-to period covers the millisecond instant we’re interested in; that is, relationships where the from value is less than or equal to the instant we’re interested in, and the to value is greater than that instant. These kinds of comparisons must be done in the WHERE clause.

The following query finds all the products sold by shop 1 on 5 January 2014 (1388880000000 milliseconds from the Java epoch):

MATCH (s:Shop{shop_id:1})-[r1:SELLS]->(p:Product)
WHERE (r1.from <= 1388880000000 AND r1.to > 1388880000000)
MATCH (p)-[r2:STATE]->(ps:ProductState)
WHERE (r2.from <= 1388880000000 AND r2.to > 1388880000000)
RETURN p.product_id AS productId, 
       ps.name AS product, 
       ps.price AS price
ORDER BY price DESC

Results:

+------------------------------+
| productId | product  | price |
+------------------------------+
| 1         | "Cheese" | 1.0   |
| 2         | "Crisps" | 0.5   |
+------------------------------+

And here’s the same query for 5 February 2014:

MATCH (s:Shop{shop_id:1})-[r1:SELLS]->(p:Product)
WHERE (r1.from <= 1391558400000 AND r1.to > 1391558400000)
MATCH (p)-[r2:STATE]->(ps:ProductState)
WHERE (r2.from <= 1391558400000 AND r2.to > 1391558400000)
RETURN p.product_id AS productId, 
       ps.name AS product, 
       ps.price AS price
ORDER BY price DESC

Results:

+------------------------------+
| productId | product  | price |
+------------------------------+
| 2         | "Crisps" | 0.5   |
+------------------------------+

Observations

Some further comments on this approach:

Additive

This design is purely additive in nature. To delete a structural relationship we simply archive it; that is, timestamp its to property. To delete a node we must archive both its current state relationship and any current structural relationships. Current relationships (both structural and state) are those with an EOT value.
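
As a sketch, logically deleting an entire entity, effective 1 February 2014, might look like this: we archive every current relationship attached to product 2’s identity node, whether structural or state:

MATCH (p:Product{product_id:2})-[r]-()
WHERE r.to = 9223372036854775807
SET r.to = 1391212800000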

This additive model strikes a reasonable balance between the competing considerations mentioned at the beginning of this post: the cost of writes, the amount of on-disk storage required, the degree of fragmentation incurred, and the performance of reads for both current and historical queries. Changes in state require the addition of nothing more than a single new state relationship and one new state node, which keeps the cost of writes down. Nothing is rewritten; therefore, nothing is deleted, which in turn keeps store growth to a minimum. On the other hand, both current and historical queries require scanning timestamped relationships, and every state lookup requires traversing at least one state relationship.

Indexes and Constraints

Indexes and constraints are not version aware. When using indexes (or label scans), you will have to test the results of each lookup. For a node to be included in the results for a particular point in time, it must be reachable via a state relationship whose period includes that point. For entities that change frequently, or are (logically) deleted frequently, this check can impact the performance of index lookups.
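
As a sketch of that test: a lookup of product states by name has to be filtered against the state relationship’s period before the owning identity node can be returned (the instant parameter is a placeholder for the point in time we’re interested in):

MATCH (ps:ProductState{name:'Cheese'})<-[r:STATE]-(p:Product)
WHERE r.from <= {instant} AND r.to > {instant}
RETURN p.product_id AS productId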

Having multiple states for a particular entity poses problems for unique constraints. If a uniquely constrained property value doesn’t change between versions, the constraint will fail because there will now be duplicate property values. To get round this problem, we have to introduce a version property that increments with each state change, and a synthetic property that combines the current version number with the value of the property we want to constrain. We then apply the unique constraint to this synthetic property instead of the underlying domain property. I’ll write this up in more detail in the near future.
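
Pending that write-up, here’s a minimal sketch of the idea; the version and name_version property names, and the scheme for combining the two values, are mine, purely for illustration:

CREATE CONSTRAINT ON (ps:ProductState) ASSERT ps.name_version IS UNIQUE

CREATE (ps:ProductState{name:'Cheese', version:2, name_version:'Cheese#2'})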

Variations and Optimisations

The design strategy described here can be modified in several ways, depending on your needs:

Linked Lists

I’ve attached state nodes to identity nodes in a radial fashion. This works well when there are no more than several tens of changes to an entity’s state over its lifetime. If an entity changes far more often than that, you might consider using a linked list.

Linked lists are also useful when you need to compare successive states: the ordering imposed by the linked list helps you identify next and previous states. In fact, you can combine radial relationships with linked lists:

[Figure: radial-and-linked-lists, combining radial state relationships with a linked list of states]
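
A sketch of a state update under this combined scheme: we archive the current STATE relationship and create the new state exactly as before, then additionally link the new state back to its predecessor. The PREV relationship name here is my assumption, not part of the model described above:

MATCH (p:Product{product_id:1})
       -[r:STATE{to:9223372036854775807}]->(currentState:ProductState)
SET r.to = 1391212800000
CREATE (p)-[:STATE{from:1391212800000,to:9223372036854775807}]->
       (newState:ProductState)
SET newState = currentState
SET newState.price = 2.0
CREATE (newState)-[:PREV]->(currentState)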

CURRENT_STATE Relationships

For graphs with entities that change frequently, you might consider adding a CURRENT_STATE relationship that connects an entity’s identity node directly to its latest state node. This allows you to navigate directly to that entity’s current state, without having to iterate many radial state relationships, or traverse a linked list. The use of a CURRENT_STATE relationship, however, requires more housekeeping when updating state: you’ll have to delete the old CURRENT_STATE relationship, and then create a new one.
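
A sketch of that housekeeping, folded into the earlier state update; the CURRENT_STATE relationship is maintained alongside, not instead of, the timestamped STATE relationships:

MATCH (p:Product{product_id:1})-[cs:CURRENT_STATE]->(currentState:ProductState)
MATCH (p)-[r:STATE{to:9223372036854775807}]->(currentState)
SET r.to = 1391212800000
DELETE cs
CREATE (p)-[:STATE{from:1391212800000,to:9223372036854775807}]->
       (newState:ProductState)
SET newState = currentState
SET newState.price = 2.0
CREATE (p)-[:CURRENT_STATE]->(newState)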

Overloading Identity Node with Initial State

For graphs where most entities change infrequently, or not at all, you might consider overloading each entity’s identity node with that entity’s initial state. You then add state nodes and relationships only with the first change to an entity’s state. If you adopt this strategy, however, you should continue to adjust structure at the level of the identity nodes, not the latest state nodes. Further, while this approach helps reduce the number of additional nodes and relationships in the graph, it does impose some additional cognitive overhead when visualising the resulting structure, because you now have to make sense of a structure in which current structural relationships are connected to historical initial state nodes.

Overloading Identity Node with Current State

Following on from the idea of overloading the identity node with initial state, and as an alternative to using a CURRENT_STATE relationship, you might consider overloading the identity node with the current state of the entity, and building a linked list of previous states. The following diagram shows what our shop-product-supplier model would look like if we were to adopt this approach:

[Figure: alternate-solution, the model with current state overloaded onto the identity nodes]

At first glance, this looks simpler. There are fewer nodes, for starters. But it comes with its own compromises. Updating a node requires us not only to create a new node at the head of the linked list, but also to physically delete all the structural relationships attached to the old head of the list, before then recreating them on behalf of the new head of the list. The model is optimised for querying the current state of the graph, but at the expense of more costly writes and increased space on disk (because each time we recreate a batch of relationships, we add them to the end of the stores on disk). I’ll be looking at this model in more detail in a future post.


Cypher: Calculating Shortest Weighted Path

June 24th, 2013  |  Published in Cypher, Neo4j by Ian Robinson

This example shows a simple solution for calculating the shortest weighted path between two nodes in a reasonably dense graph (say 30 relationships per node) using Cypher.

Consider the following graph:

[Figure: shortest-weighted-path, a graph of nodes A to I connected by weighted CONNECTED_TO relationships]

Let’s say we want to calculate the shortest weighted path between A and I. All we have to do is sum the weights along each path. Doing this, we find that (A)->(B)->(C)->(I) is weight 15, (A)->(D)->(E)->(I) is weight 12, and (A)->(F)->(G)->(H)->(I) is weight 8. Therefore, (A)->(F)->(G)->(H)->(I) is the shortest weighted path, even though it has one more relationship than either of the other two paths.

Implementation

The Cypher to accomplish this is as follows:

START  startNode=node:node_auto_index(name={startNode}),
       endNode=node:node_auto_index(name={endNode})
MATCH  p=(startNode)-[:CONNECTED_TO*1..4]->(endNode)
RETURN p AS shortestPath, 
       reduce(weight=0, r in relationships(p) : weight+r.weight) AS totalWeight
ORDER BY totalWeight ASC
LIMIT 1

The summing of weights is accomplished using Cypher’s reduce function.

For modestly sized graphs with up to 30 relationships per node, this query performs reasonably well. For very large, extremely densely connected graphs, however, it’s not going to cut it: those types of graph require an implementation of a shortest weighted path algorithm such as Dijkstra or A*. Running on a MacBook Air against a graph comprising 1 million nodes, 12.5 million relationships, with between 4 and 20 outgoing CONNECTED_TO relationships per node (average 12), average execution time is 120 ms. Increasing the depth of the search and/or the density of the graph will increase query times.

This example has been tested against Neo4j 1.9. The code for the example can be downloaded from here.

Alternative implementation

Instead of using r in relationships(p) to iterate the relationships along each path, we could bind a path’s CONNECTED_TO relationships to an identifier instead, like this:

MATCH   p=(startNode)-[rels:CONNECTED_TO*1..4]->(endNode)
RETURN  p AS shortestPath, 
        reduce(weight=0, r in rels : weight+r.weight) AS totalWeight

My tests, however, show that using r in relationships(p) is faster than using identifiers.


Cypher: Adding a Node to a Linked List

June 20th, 2013  |  Published in Cypher, Neo4j by Ian Robinson

This example shows how to add a node to the head of a linked list using Cypher.

Let’s assume we have a graph containing an indexed node that can act as the list’s owner. What we want to do is build up a linked list, with the head of the list attached to the owner node using a HEAD relationship. When a new node is added to the list, it becomes the head node. The HEAD relationship is detached from the old head node and attached to this new head node. At the same time, the new head node is linked to the old head node using a PREV relationship.

The diagram below shows a linked list comprising three nodes. The list is attached to an owner node named ‘my-list’.

[Figure: linked-list, a three-node linked list attached to the ‘my-list’ owner node]

Here’s the Cypher for adding a node to the head of a linked list:

START owner=node:node_auto_index(name={listName})
CREATE UNIQUE (owner)-[:HEAD]->(newHead{newElement})
WITH owner, newHead
MATCH (newHead)<-[:HEAD]-(owner)-[oldRel:HEAD]->(oldHead)
DELETE oldRel
CREATE (newHead)-[:PREV]->(oldHead)

To build the linked list shown above, we would execute this Cypher statement three times.

How it works

The START clause looks up the node representing the list’s owner in an index. The CREATE UNIQUE clause then creates a new node and links it to the owner using a HEAD relationship. The {newElement} parameter here allows us to pass a map of the new node’s properties to the query.

[Figure: list-1, the new head node attached to the owner via a HEAD relationship]

Having created the new head node, we then pipe it and the owner node to a second query using WITH.

In this second query we try to match the list owner, the new head node and the old head node (if there is one):

[Figure: list-2, matching the owner, the new head node and the old head node]

If we find an old head node, we delete its HEAD relationship, because it is no longer valid. We then create a PREV relationship that links the new head node to this old head node:

[Figure: list-3, the old HEAD relationship deleted and the new head linked to the old head via PREV]

The MATCH clause here is similar to a pattern matching clause in a functional language: if Cypher can match the (newHead)<-[:HEAD]-(owner)-[oldRel:HEAD]->(oldHead) pattern in the graph, it will invoke the subsequent DELETE and CREATE clauses; otherwise it will terminate without deleting or creating any further relationships.

This example has been tested against Neo4j 1.8.1 and 1.9. The code for the example can be downloaded from here.


Running the Graph Databases Book Examples

June 11th, 2013  |  Published in Graph Databases, Neo4j by Ian Robinson

The current version of Neo4j has two different types of index: named indexes and automatic indexes. The Cypher query examples throughout the Graph Databases book use named indexes. In this query, for example, venue, city and author are named node indexes:

START theater=node:venue(name='Theatre Royal'), 
      newcastle=node:city(name='Newcastle'), 
      bard=node:author(lastname='Shakespeare')
MATCH (newcastle)<-[:STREET|CITY*1..2]-(theater) 
      <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]-> 
      (play)<-[:WROTE_PLAY]-(bard)
RETURN DISTINCT play.title AS play

There’s a problem, however: data created using a Cypher CREATE statement won’t be indexed in a named index. This has led to some confusion for anyone wanting to code along with the examples: if you use the CREATE statements as published, the query examples won’t work.

(You can find out more about Neo4j indexes in the documentation.)

If you want to code along with the examples, the easiest thing to do is create the data using Cypher and automatic indexes. You’ll need to modify the queries to use the automatic indexes, but the changes aren’t complicated.

Configure Automatic Indexes

The examples below show how to run the Shakespeare examples (pp.47-50) using automatic indexes.

Before you create any data, you’ll need to configure the automatic indexes. The approach you take will depend on whether you are running Neo4j in embedded or server mode.

Embedded mode

If running in embedded mode, configure the database at startup:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.factory.GraphDatabaseSettings;

GraphDatabaseService graphDb = new GraphDatabaseFactory().
    newEmbeddedDatabaseBuilder( storeDirectory ).
    setConfig( GraphDatabaseSettings.node_keys_indexable, "name,lastname" ).
    setConfig( GraphDatabaseSettings.node_auto_indexing, "true" ).
    newGraphDatabase();

Server mode

If running in server mode and executing Cypher statements through the console or using the REST API, add the following to the conf/neo4j.properties file.

node_auto_indexing=true
node_keys_indexable=name,lastname

Modify the Queries

The Shakespeare queries should then be modified as follows. In each case we’ve replaced a named node index (e.g. node:author) with an automatic node index (node:node_auto_index).

p.47:

START theater=node:node_auto_index(name='Theatre Royal'), 
      newcastle=node:node_auto_index(name='Newcastle'), 
      bard=node:node_auto_index(lastname='Shakespeare')
MATCH (newcastle)<-[:STREET|CITY*1..2]-(theater)
      <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]-> 
      (play)<-[:WROTE_PLAY]-(bard)
RETURN DISTINCT play.title AS play

p.48:

START theater=node:node_auto_index(name='Theatre Royal'),
      newcastle=node:node_auto_index(name='Newcastle'), 
      bard=node:node_auto_index(lastname='Shakespeare')
MATCH (newcastle)<-[:STREET|CITY*1..2]-(theater)
      <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]-> 
      (play)<-[w:WROTE_PLAY]-(bard)
WHERE w.year > 1608
RETURN DISTINCT play.title AS play

p.49:

START theater=node:node_auto_index(name='Theatre Royal'), 
      newcastle=node:node_auto_index(name='Newcastle'), 
      bard=node:node_auto_index(lastname='Shakespeare')
MATCH (newcastle)<-[:STREET|CITY*1..2]-(theater)
      <-[:VENUE]-()-[p:PERFORMANCE_OF]->()-[:PRODUCTION_OF]-> 
      (play)<-[:WROTE_PLAY]-(bard)
RETURN play.title AS play, count(p) AS performance_count 
ORDER BY performance_count DESC

p.50:

START bard=node:node_auto_index(lastname='Shakespeare') 
MATCH (bard)-[w:WROTE_PLAY]->(play)
WITH play
ORDER BY w.year DESC
RETURN collect(play.title) AS plays

Download Working Examples

The code that we used to generate most of the examples for the book, including the use cases, can be found on my GitHub account.


The Power of the Daleks (and Neo4j)

July 11th, 2011  |  Published in Neo4j by Ian Robinson

Jim Webber and I have been busy these last few months writing a tutorial for Neo4j. As we did with REST in Practice, we’ve chosen an example domain that falls a little outside the enterprise norm. With REST in Practice we chose the everyday world of the coffee shop; this time round, we’ve gone for the grand old universe of Doctor Who.

As we’ve worked on the tutorial, we’ve built up a Doctor Who data model that shows how Neo4j can be used to address several different data and domain concerns. For example, part of the dataset includes timeline data, comprising seasons, stories and episodes; elsewhere we have social network-like data, with characters connected to one another through being companions, allies or enemies of the Doctor. It’s a messy and densely-connected dataset – much like the data you might find in a real-world enterprise. Some of it is of high quality, some of it is lacking in detail. And for every seeming invariant in the domain, there’s an awkward exception – again, much like the real world. For example, each incarnation of the Doctor has been played by one actor, except the first, who was originally played by William Hartnell, and then later by Richard Hurndall, who reprised the tetchy first incarnation some years after William Hartnell’s death for the twentieth anniversary story, The Five Doctors. Each episode has a title; at least, they did when the series first began, in 1963, and they do today. But towards the end of the third season, the original series stopped assigning individual episode titles (in the original series, stories typically took place over several episodes); a story’s episodes were simply labelled Part 1, Part 2, etc. And so on.

The Dalek Props

[Figure: a Dalek prop]

The earliest alien monsters to appear in Doctor Who, the Daleks have long held a fascination for the British viewing public, and for the enquiring minds of Doctor Who fandom. In 2002, Jon Green began researching the history of the Dalek props created for the original series, which ran from 1963 to 1988. Jon was later joined by a second researcher, Gav, and the result of their hard work is the wonderful Dalek 6388.

Recently, I imported some of this Dalek research into our Doctor Who database. From the wealth of detail available on Dalek 6388 I chose to focus on the ways the different prop parts were reused in different stories.

Each episode or story featuring the Daleks made use of one or more Dalek props. Each prop came in two parts: the top half, called the shoulders, and the bottom half, called the skirt. These parts were handmade, and as a result there were many differences between them: the hemisphere spacing, collars, slats, eyestalks, and gun boxes in particular varied in subtle but noticeable ways. Reconstructing the history of the props has been largely a matter of identifying specific parts in screenshots and photos based on these visual clues.

Over time, the different shoulders and skirts were mixed and matched. For example, the shoulders originally belonging to Dalek 1, one of the props created for the very first Dalek story, The Daleks, were later paired with the skirt belonging to Dalek 5 for the story Day of the Daleks. This composite prop is known as Dalek One-5. A couple of seasons later, the shoulders were paired with the skirt belonging to Dalek 7 for the Daleks’ last outing with Jon Pertwee’s Doctor, Death to the Daleks. This composite prop is known as Dalek One-7.

Structuring the Data

With the details from Dalek 6388 added to our dataset we now have something that resembles a supply chain traceability scheme – another excellent use for Neo4j. Here’s how some of that data is structured (click the image for a closeup view):

[Figure: Dalek props data viewed in Neoclipse]

The screenshot above shows some of the Dalek prop data as viewed in Neoclipse. Neoclipse allows you to view, create and edit nodes, as well as search any indexes you’ve created for your store. Neo4j’s browser-based Webadmin tool also includes a visualisation tab for viewing a database running in server mode.

The screenshot above shows three of the stories in which the Daleks appeared: The Daleks, The Dalek Invasion of Earth and The Power of the Daleks. (They’ve appeared in many more stories – the view here shows only a fraction of the data we have. Even here I’ve only shown full details of the props for one of the stories, The Power of the Daleks. Portions of the data associated with the other two stories are then included to show how the data is interconnected.)

Below each story node is a node representing the group of Dalek props used in that story. I’ve chosen to create these group nodes because there is potentially other data to be included at the group level. For example, each story was assigned a budget for building or maintaining the props used in that story. Moreover, in the research, the voice artists and operators responsible for bringing the Daleks to life are not associated with individual props but with the group as a whole (the details of the voice artists and operators are not, however, currently in our dataset). Hence the Dalek props group nodes, each of which is USED_IN a story.

Focusing now on the props used in The Power of the Daleks, we can see that four props appeared in the story: Dalek 1, Dalek 2, Dalek 7, and Dalek Six-5. Daleks 1 and 2 also appeared in the two earlier stories, The Daleks and The Dalek Invasion of Earth (together with some other props not shown here), as well as in several other stories not shown in the screenshot. Daleks 7 and Six-5 are here not associated with any other stories, but in the full dataset, Dalek 7 is shown to have appeared in two other stories, The Evil of the Daleks and The Chase, while Dalek Six-5 appeared in four other stories following its debut in The Power of the Daleks.

Dalek Six-5 is one of those composite props I mentioned previously. It is COMPOSED_OF shoulders whose ORIGINAL_PROP was Dalek 6, and a skirt originally made for Dalek 5. The shoulders, as the graph in the screenshot shows, appeared as part of at least one other composite prop, Dalek Six-7, where they were paired with the skirt belonging to Dalek 7, which also appears in The Power of the Daleks. Dalek Six-5‘s skirt, whose ORIGINAL_PROP was Dalek 5, was at some other time paired with the shoulders from Dalek 1 to create the composite prop Dalek One-5. (To find out which story Dalek One-5 appeared in, all we’d have to do is expand the graph beyond the nodes shown here.)

Querying the Data

The Dalek prop data captures many of the significant domain concepts and relationships described in the Dalek 6388 research. Once we’d built the dataset, I set about asking it some tricky questions. At the top of my list of pressing questions was this one: what’s the hardest working Dalek prop part in showbiz? That is, which of those shoulders and skirts has appeared in more stories than any other part?

Neo4j offers several different ways to query a graph. There’s the Core API, which allows you to deal imperatively with nodes and relationships, and the original Traverser API, which offers a slightly more declarative route into a graph. For more sophisticated traversals there’s the Traversal Framework, which, being more powerful than the Traverser API, has a slightly steeper learning curve. For querying the server there’s the REST API. Next, there’s Gremlin, a powerful graph traversal language with a plugin for Neo4j, and the Pattern Matching library, which is like a regex language for graphs. And finally, introduced in the latest 1.4 milestones, there’s Cypher, a new declarative graph pattern matching language – a SQL for graphs.

In the rest of this post we’ll look at three different ways of figuring out which was the hardest working Dalek prop part. In the first example we’ll look at the Traversal Framework. In the next we’ll use the Pattern Matching library. Finally, we’ll use Cypher.

There’s quite a bit of code here. If you haven’t the patience to wade through it all, do just one thing: go look at the Cypher query. It’s a thing of Dalek-like precision.

Traversal Framework

Here’s what the query to find the hardest working Dalek prop part looks like using the Traversal Framework:

@Test
public void shouldFindTheHardestWorkingPropsUsingTraversalFramework() {
  Node theDaleks = database.index().forNodes("species")
    .get("species", "Dalek").getSingle();

  Traverser traverser = Traversal.description()
    .relationships(Rels.APPEARED_IN, Direction.OUTGOING)
    .relationships(Rels.USED_IN, Direction.INCOMING)
    .relationships(Rels.MEMBER_OF, Direction.INCOMING)
    .relationships(Rels.COMPOSED_OF, Direction.OUTGOING)
    .relationships(Rels.ORIGINAL_PROP, Direction.OUTGOING)
    .depthFirst()
    .evaluator(new Evaluator() {
      @Override
      public Evaluation evaluate(Path path) {
        if (path.lastRelationship() != null && 
            path.lastRelationship().isType(Rels.ORIGINAL_PROP)) {
          return Evaluation.INCLUDE_AND_PRUNE;
        }
        return Evaluation.EXCLUDE_AND_CONTINUE;
      }
    })
    .uniqueness(Uniqueness.NONE)
    .traverse(theDaleks);

  assertHardestWorkingPropParts(getSortedResults(traverser),
    "Dalek 1 shoulder", 12,
    "Dalek 5 skirt", 12,
    "Dalek 6 shoulder", 12);
}

This example shows an idiomatic usage of the Neo4j Traversal Framework. First, we look up a starting node for the traversal in an index. In this instance, we look up the Dalek species node. Then we build a traversal description and execute the traversal from this starting node. The traversal crawls the subgraph below the starting node. With each node that it visits, it calls the evaluate() method on the custom evaluator supplied with the description. This method determines whether the traverser is positioned over a node of interest (in this case, an original prop node). If it is, the traverser returns the path to this node to the calling code, and then begins a new traversal further up the subtree. If it isn’t a node of interest, the traverser continues deeper into the subgraph.

The description specifies the types of relationship we’re prepared to follow, and their direction. Every relationship in Neo4j has a type label and a direction. The direction helps provide semantic clarity when dealing with asymmetric relationships between nodes. While a relationship between two nodes is always directed – it has a start node, a direction, and an end node – a traverser can be programmed to follow either an INCOMING or an OUTGOING relationship, and even to ignore the direction entirely (using Direction.BOTH).

Note that we’ve specified that the traversal proceed depth first. This means that from the Dalek species node it will strike out on a depth first hunt for an original prop, following the first APPEARED_IN relationship to an episode and then searching the nodes below that episode. Once it has found a first matching original prop, the traverser returns the path to the prop to the calling code (INCLUDE_AND_PRUNE), and then moves successively back up that path, trying alternate branches, until it works its way all the way back to the species node, from where it tries the next APPEARED_IN relationship.

Here’s the route the traverser takes from the starting Dalek species node through all the nodes below The Power of the Daleks:

  • (Dalek)-[:APPEARED_IN]->(The Power of the Daleks)<-[:USED_IN]-(Daleks)<-[:MEMBER_OF]-(Dalek 7)-[:COMPOSED_OF]->(skirt)-[:ORIGINAL_PROP]->(Dalek 7) INCLUDE_AND_PRUNE
  • -[:COMPOSED_OF]->(shoulder)-[:ORIGINAL_PROP]->(Dalek 7) INCLUDE_AND_PRUNE
  • <-[:MEMBER_OF]-(Dalek Six-5)-[:COMPOSED_OF]->(skirt)-[:ORIGINAL_PROP]->(Dalek 5) INCLUDE_AND_PRUNE
  • -[:COMPOSED_OF]->(shoulder)-[:ORIGINAL_PROP]->(Dalek 6) INCLUDE_AND_PRUNE
  • <-[:MEMBER_OF]-(Dalek 2)-[:COMPOSED_OF]->(skirt)-[:ORIGINAL_PROP]->(Dalek 2) INCLUDE_AND_PRUNE
  • -[:COMPOSED_OF]->(shoulder)-[:ORIGINAL_PROP]->(Dalek 2) INCLUDE_AND_PRUNE
  • <-[:MEMBER_OF]-(Dalek 1)-[:COMPOSED_OF]->(skirt)-[:ORIGINAL_PROP]->(Dalek 1) INCLUDE_AND_PRUNE
  • -[:COMPOSED_OF]->(shoulder)-[:ORIGINAL_PROP]->(Dalek 1) INCLUDE_AND_PRUNE

Now that the traverser is back at the Dalek species node, it tries the next APPEARED_IN relationship:

  • -[:APPEARED_IN]->(The Dalek Invasion of Earth)<-[:USED_IN]-(Daleks)<-[:MEMBER_OF]-(Dalek 6)-[:COMPOSED_OF]->(skirt)-[:ORIGINAL_PROP]->(Dalek 6) INCLUDE_AND_PRUNE
  • -[:COMPOSED_OF]->(shoulder)-[:ORIGINAL_PROP]->(Dalek 6) INCLUDE_AND_PRUNE

And so on.

When it’s run, our traverser will return all the paths from the Dalek species node to the original props nodes associated with each prop part. Each path contains all the details of the nodes and relationships along that path, starting with the Dalek species node and ending with the original prop node, with the relevant episode and prop part nodes somewhere in between.

By itself, the traverser result doesn’t tell us which prop part has been used the most; for that, we need to do some accumulation. To generate the counted and sorted results, our code calls getSortedResults(), which is simply a helper method that delegates to a ResultsAssembler instance. This ResultsAssembler is responsible for accumulating the results for each prop part.

Here’s the getSortedResults() method:

private Map<String, Integer> getSortedResults(Traverser traverser) {
  Map<String, Integer> results = new ResultsAssembler<Path>(traverser)
    .sortResults(new PathAccessor());
  return results;
}

And here is the code for the ResultsAssembler:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import com.google.common.base.Functions;
import com.google.common.collect.ImmutableSortedMap;
import com.google.common.collect.Ordering;

public interface QueryResultAccessor<T> {
  String getOriginalProp(T queryResult);
  String getPart(T queryResult);
}
  
public class ResultsAssembler<T>{
  private final Iterable<T> queryResults;

  public ResultsAssembler(Iterable<T> queryResults) {
    super();
    this.queryResults = queryResults;
  }
    
  public Map<String, Integer> sortResults(QueryResultAccessor<T> accessor){
    Map<String, Integer> unsortedResults = new HashMap<String, Integer>();
      
    for (T queryResult : queryResults){
      String originalProp = accessor.getOriginalProp(queryResult);
      String part = accessor.getPart(queryResult);
            
      String key = originalProp + " " + part;
      if (!unsortedResults.containsKey(key)){
        unsortedResults.put(key, 0);
      }
      unsortedResults.put(key, unsortedResults.get(key) + 1);
    }
      
    Comparator<String> valueComparator = Ordering.natural()
      .onResultOf(Functions.forMap(unsortedResults))
      .reverse().compound(Ordering.natural());
    return ImmutableSortedMap.copyOf(unsortedResults, valueComparator);
  }
    
}

The ResultsAssembler builds a map of prop part keys and episode count values. The assembler uses a QueryResultAccessor to retrieve the original prop name and prop part from a query result. Here the query result is a path, but when we get round to the next example, which uses the Pattern Matching library, it’s a pattern match object. Hence the generic parameters.

Here’s the PathAccessor we use to access the original prop name and prop part from a path. The original prop details are held in the last node in the path, while the prop part details are held in the penultimate node. The getOriginalProp() method can retrieve the original prop details quite easily from the path’s endNode() property. To get the part name (shoulders or skirt) the getPart() method uses the path’s nodes() iterator and an iterator helper function to access the last but one node.

private class PathAccessor implements QueryResultAccessor<Path>{

  @Override
  public String getOriginalProp(Path queryResult) {
    Node originalPropNode = queryResult.endNode();
    return getPropertyValueOrNull(originalPropNode, "prop");
  }

  @Override
  public String getPart(Path queryResult) {
    Iterable<Node> pathNodes = IteratorUtil.asIterable(
      queryResult.nodes().iterator());
    Node partNode = IteratorUtil.fromEnd(pathNodes, 1);
    return getPropertyValueOrNull(partNode, "part");
  }
    
  private String getPropertyValueOrNull(Node node, String property){
    if (!node.hasProperty(property)){
      return null;
    }
    return node.getProperty(property).toString();
  }
    
}

Having accumulated and sorted the results, our unit test can then assert that the results contain the top three most used prop parts: Dalek 1‘s shoulders, Dalek 5‘s skirt, and Dalek 6‘s shoulders. Each of these parts appeared in 12 stories.

What do Dalek 1‘s shoulders look like today, nearly 50 years after they first appeared on TV? Dalek 6388, of course, has the answer.

Pattern Matching

The Traversal Framework helped us find the hardest working Dalek prop parts in showbiz, but we still had to do quite a bit of work to accumulate the results. While the traversal description itself was relatively trivial, having to iterate nodes in each of the returned paths proved quite tedious. With the Pattern Matching library we can avoid some of the messy iterative code that cropped up in the PathAccessor.

Using the Pattern Matching library we create an abstract subgraph that describes the graph patterns we want to match in our real graph. When matching, we anchor this abstract graph to a starting node in the real graph, much as we started the traversal in the previous example from a node we’d looked up in an index. The matcher then matches subgraph instances corresponding to the nodes, relationships and constraints defined in our abstract subgraph.

@Test
public void shouldFindTheHardestWorkingPropsUsingPatternMatchers(){
  Node theDaleksNode = database.index().forNodes("species")
    .get("species", "Dalek").getSingle();
      
  final PatternNode theDaleks = new PatternNode();    
  final PatternNode episode = new PatternNode();
  final PatternNode props = new PatternNode();
  final PatternNode prop = new PatternNode();
  final PatternNode part = new PatternNode();
  final PatternNode originalProp = new PatternNode();

  theDaleks.setAssociation(theDaleksNode);
  theDaleks.createRelationshipTo(episode, Rels.APPEARED_IN);
  props.createRelationshipTo(episode, Rels.USED_IN);
  prop.createRelationshipTo(props, Rels.MEMBER_OF);
  prop.createRelationshipTo(part, Rels.COMPOSED_OF);
  part.createRelationshipTo(originalProp, Rels.ORIGINAL_PROP);
        
  PatternMatcher pm = PatternMatcher.getMatcher();
  final Iterable<PatternMatch> matches = pm.match(theDaleks, theDaleksNode);
        
  assertHardestWorkingPropParts(getSortedResults(matches, originalProp, part),
    "Dalek 1 shoulder", 12,
    "Dalek 5 skirt", 12,
    "Dalek 6 shoulder", 12); 
}

The pattern node objects that we define as part of our abstract subgraph can then be used as keys into the match results. Below is the getSortedResults() method and MatchAccessor class for the pattern matching example. You’ll notice that we pass the originalProp and part pattern node objects into the MatchAccessor. The accessor then calls the match’s getNodeFor() method with one or other of these pattern nodes in order to retrieve the real, matched node.

private Map<String, Integer> getSortedResults(Iterable<PatternMatch> matches, 
    PatternNode originalProp, PatternNode part) {
  Map<String, Integer> results = new ResultsAssembler<PatternMatch>(matches)
    .sortResults(new MatchAccessor(originalProp, part));
  return results;
}

private class MatchAccessor implements QueryResultAccessor<PatternMatch>{

  private final PatternNode originalProp;
  private final PatternNode part;
    
  public MatchAccessor(PatternNode originalProp, PatternNode part) {
    super();
    this.originalProp = originalProp;
    this.part = part;
  }

  @Override
  public String getOriginalProp(PatternMatch queryResult) {
    return queryResult.getNodeFor(originalProp).getProperty("prop").toString();
  }

  @Override
  public String getPart(PatternMatch queryResult) {
    return queryResult.getNodeFor(part).getProperty("part").toString();
  }  
}

Cypher

The Pattern Matching example still required us to accumulate the results in order to work out which was the hardest working prop part in the original series. The abstract pattern nodes provided convenient keys into matched nodes, but we still then had to pass the matched nodes’ contents to the ResultsAssembler in order to build a map of prop parts and usage counts. With Cypher, all that manual accumulation disappears. Better still, the entire query is expressed using a very simple, declarative syntax. Unencumbered by a “fluent” interface, a Cypher query concisely describes our desired route into the graph:

@Test
public void shouldFindTheHardestWorkingPropsUsingCypher() throws Exception {
  CypherParser parser = new CypherParser();
  ExecutionEngine engine = new ExecutionEngine(universe.getDatabase());
  
  String cql = "start daleks=(Species,species,'Dalek')"
    + " match (daleks)-[:APPEARED_IN]->(episode)"
    + "<-[:USED_IN]-(props)"
    + "<-[:MEMBER_OF]-(prop)"
    + "-[:COMPOSED_OF]->(part)"
    + "-[:ORIGINAL_PROP]->(originalprop)"
    + " return originalprop.prop, part.part, count(episode.title)"
    + " order by count(episode.title) desc"
    + " limit 3";

  Query query = parser.parse(cql);
  ExecutionResult result = engine.execute(query);
      
  assertHardestWorkingPropParts(result.javaIterator(),
    "Dalek 1", "shoulder", 12,
    "Dalek 5", "skirt", 12,
    "Dalek 6", "shoulder", 12);
}
  
private void assertHardestWorkingPropParts(
    Iterator<Map<String, Object>> results, Object... partsAndCounts) {
  for (int index = 0; index < partsAndCounts.length; index = index + 3){
    Map<String, Object> row = results.next();
    assertEquals(partsAndCounts[index], row.get("originalprop.prop"));
    assertEquals(partsAndCounts[index + 1], row.get("part.part"));
    assertEquals(partsAndCounts[index + 2], row.get("count(episode.title)"));
  }
  
  assertFalse(results.hasNext());
}

Most of that is just boilerplate code for setting up the Cypher engine and executing the query. The really concise part is the query itself. Here it is in full:

start daleks=(Species,species,'Dalek')
match (daleks)-[:APPEARED_IN]->(episode)<-[:USED_IN]-(props)<-[:MEMBER_OF]-
      (prop)-[:COMPOSED_OF]->(part)-[:ORIGINAL_PROP]->(originalprop)
return originalprop.prop, part.part, count(episode.title)
order by count(episode.title) desc
limit 3

Spend a couple of moments looking at it and you’ll soon appreciate its simplicity and declarative expressiveness. If only the Daleks had such power…

Cypher is still in its infancy, and its syntax is still in flux. In the 1.4 release timeframe, which culminates this week with the release of Neo4j 1.4 GA, Cypher performs only non-mutating operations. Over the course of the 1.5 release, we expect to add mutating operations.

The Tutorial

Our Neo4j tutorial is freely available from GitHub. Download it now and try some of the exercises. And keep checking back for updates: as Neo4j grows, so will the tutorial.

Jim and I are also running one- and two-day versions of the tutorial at a number of conferences and other events.

(Dalek prop image belongs to the Project Dalek Forum gallery, another excellent resource for Dalek prop researchers and Dalek builders.)
