I recently took a small detour to learing about graph databases. Here are some notes about my findings:

A friend of a friend (Donald), runs a business that sells school books to various schools (both public and charter). Just like any other business, he feels that he can improve his bottom line and reaches out to me, a “Consultant” :) for help

Here are his primary set of questions:

  • How much inventory should I keep?
  • How can I track my sales in all states where I have stores?
  • How can I analyze what books are sold in what markets?
  • When should I use promotions?
  • How can I negotiate my RMAs (Return to Manufacturer Agreements)?

We engage and here is what emerges:

  • BOOKs are published by PUBLISHERs
  • PUBLISHERs have PRINTING_PRESSes
  • PRINTING_PRESSes exist in COUNTRYs
  • BOOKs have a PACKAGING
  • BOOKs have a COLOR
  • Some BOOKS have a SHIPPING_POLICY
  • All BOOKs sold are recorded as a TRANSACTION where attributes of the sale are recorded.

At this time I’m tempted to place this information in a fully normalized RDBMS like MySQL or Oracle.

What would that entail? Perhaps this ERD will be a good indication:

Challenges

However, here are some challenges that we have to solve:

  1. Not all BOOKs have the same attributes. This will cause the BOOK table to be sparsely populated.
  2. Since BOOKs arrive at different STOREs from several PRINTING_PRESSes around the country, the Id’s of these books do not always sync up. So, Harry Potter - The Deathly Hallows sold in Seattle may have a different id that the same book sold in NewYork. Therefore, in the TRANSACTION table, we cannot rely on the fact that the same ID will refer to the same book. This causes reports (that are used by the PUBLISHERs for analysis) built off of the TRANSACTION table, to be messed up.
  3. This equivalence of Ids is different for different PUBLISHERs. For example, some PUBLISHERs consider a blue Harry Potter - Deathly Hallows to be the same as a red book by the same name, but some PUBLISHERs would like them to be treated separately.
  4. As more reading material is published, the attributes that need to be tracked with each BOOK, keep changing.

As we can see, the above challenges can be solved using a RDBMS based storage solution (and have been for years). But in this post, I’d like to suggest how to use a combination of a Graph db and an RDBMS to reduce the pain in addressing these challenges.


A Graph based solution

Approach:

First we will identify fact tables and dimension tables: Based on our model above, it is obvious that the TRANSACTION table is our fact table and all others would be dimension tables.

Then we will store all dimension data in a Graph db and store the transaction data in an RDBMS.

The rationale for this is that dimensions are what Graphs are good at manipulating and maintaining. There is not much churn on fact tables, but their normalized nature tends to make them good candidates for the RDBMS.

Note that dimension tables tend to be smaller than fact tables in most domains. But that has no bearing on the above strategy.

The implementation

We will use Neo4J for our Graph db and MySQL for our RDBMS.

Please download and install neo4j (Should take ~ 5 mins!)

Then experiment with the sample movie db (should take no more than an hour max)

Then, in your browser at localhost:7474, copy and paste this into your command window and execute it.

CREATE (physicsPendulum: Book {id: '001', name: 'Physics Pendulum', price: 15})
CREATE (mathMayhem: Book {id: '002', name: 'Math Mayhem', price: 9})
CREATE (chemistryChaos: Book {id: '003', name: 'Chemistry Chaos', price: 11})
CREATE (trigTurpitude: Book {id: '004', name: 'Trignometry Turpitude', price: 17})

CREATE (theCornerPub: Publisher {id: '005', name: 'The Corner Pub'})
CREATE (theGoodPub: Publisher {id: '006', name: 'The Good Pub'})
CREATE (peteThePublisher: Publisher {id: '007', name: 'Pete the Publisher'})

CREATE (inBlackAndWhite: PrintingPress {id: '008', name: 'In Black And White'})
CREATE (theStencil: PrintingPress {id: '009', name: 'The Stencil'})
CREATE (featherAndInk: PrintingPress {id: '010', name: 'Feather And Ink'})

CREATE (gautemala: Country {id: '011', name: 'Gautemala'})
CREATE (syria: Country {id: '012', name: 'Syria'})

CREATE (red: Color {id: '013', name: 'red'})
CREATE (blue: Color {id: '014', name: 'blue'})
CREATE (green: Color {id: '015', name: 'green'})

CREATE (hardcover: Packaging {id: '016', name: 'Hardcover'})
CREATE (paperback: Packaging {id: '017', name: 'Paperback'})

CREATE (booksNThings: Store {id: '018', name: 'Books and Things', city: 'Pittsburgh', state: 'PA'})
CREATE (bookWorm: Store {id: '019', name: 'Book Worm', city: 'Anahiem', state: 'CA'})
CREATE (nerdCorner: Store {id: '020', name: 'Nerd Corner', city: 'Armadillo', state: 'TX'})



// Relationships
CREATE
   (physicsPendulum)-[:PUBLISHED_BY]->(printedPurgatory),
   (physicsPendulum)-[:REFLECTS]->(red),
   (physicsPendulum)-[:PACKAGED_IN]->(hardcover),
   (mathMayhem)-[:PUBLISHED_BY]->(onlyGoodBooks),
   (mathMayhem)-[:REFLECTS]->(blue),
   (mathMayhem)-[:PACKAGED_IN]->(hardcover),
   (chemistryChaos)-[:PUBLISHED_BY]->(printedPurgatory),
   (chemistryChaos)-[:REFLECTS]->(green),
   (chemistryChaos)-[:PACKAGED_IN]->(paperback),
   (trigTurpitude)-[:PUBLISHED_BY]->(peteThePublisher),
   (trigTurpitude)-[:REFLECTS]->(red),
// Note that the trig book is in both packaging
   (trigTurpitude)-[:PACKAGED_IN]->(hardcover),
   (trigTurpitude)-[:PACKAGED_IN]->(paperback)


CREATE
   (printedPurgatory)-[:PRINTS_AT]->(inBlackAndWhite),
   (onlyGoodBooks)-[:PRINTS_AT]->(theStencil),
   (peteThePublisher)-[:PRINTS_AT]->(featherAndInk)

CREATE
   (inBlackAndWhite)-[:LOCATED_IN]->(gautemala),
   (theStencil)-[:LOCATED_IN]->(gautemala),
   (featherAndInk)-[:LOCATED_IN]->(syria)

CREATE
   (physicsPendulum)-[:SOLD_IN]->(booksNThings),
   (booksNThings)-[:SELLS]->(physicsPendulum),
   (mathMayhem)-[:SOLD_IN]->(bookWorm),
   (bookWorm)-[:SELLS]->(mathMayhem),
   (chemistryChaos)-[:SOLD_IN]->(booksNThings),
   (booksNThings)-[:SELLS]->(chemistryChaos),
   (trigTurpitude)-[:SOLD_IN]->(nerdCorner),
   (nerdCorner)-[:SELLS]->(trigTurpitude)

By doing above we have created the Dimension entities in the Graph Db and also used the data to establish relationship between them.

Let’s query and see what we have got so far:

Run the following to view the schema:

call db.schema;

This yields:

But since we don’t have much data, we can actually pull the entire dataset and view relationships (without blowing up the browser!):

The first thing to notice is that the optional “properties” of color and packaging is handled via a relationship instead of as an inline property of the book (such as is the case with price).

Now let’s create the TRANSACTION table in MySQL.

Please create a database and run the appropriate DDL and INSERT statements to yield the following data in the TRANSACTION table:

id bookId, qty, price, storeId
1  001      2    10.00   020
2  002      3    11.50   030
3  003      4    12.5    040
4  001      3    36.00   030
5  002      4    54.00   040
6  002      1    05.00   020

So now let us address the problem where the blue and red Harry Potter books are the same for PUBLISHER1 but different for PUBLISHER2

For PUBLISHER1 book 001 and book 002 are euivalent For PUBLISHER2 book 002 and 003 are equivalent.

This Cypher statement will establish that equivalence:

MATCH (b:Book {id: '001'})-[*]-(c: Book {id: '002'})
MERGE (b)-[e: EQUIVALENT {publisher: 'PUBLISHER1', equivalentId: '1001'}]->(c)

MATCH (b:Book {id: '002'})-[*]-(c: Book {id: '003'})
MERGE (b)-[e: EQUIVALENT {publisher: 'PUBLISHER2', equivalentId: '1002'}]->(c)

Notice how we have:

  • Created publisher-specific relationships
  • Identified the equivalent id for each pair in the relationship properties and associated it to the relationship.

This yields the following schema:

And the same relationship is displayed with the data:

Querying

Now let’s see how we can write queries for the two publishers using this equivalence relationship and the transaction data stored in a relational db.

First pull dimension data using Neo4J for PUBLISHER1.

MATCH (b1: Book)-[r1]-(b2: Book), (b: Book)-[*]-(p)-[*]-(r)-[*]-(c: Country)
WHERE r1.publisher = 'PUBLISHER2'
RETURN DISTINCT b1, b2, r1.equivalentId, b, p, r, c;

Fact data that is stored in MySQL can be accessed in either of two ways:

  • Use the JDBC plugin for Neo4J to make a direct connection .
  • Use the Neo4J API to programatically access data.

Here’s an example of how this model can be queried using plain Cypher querries:

Give me all books that are red in color, printed in Gautemala, sold in PA and cost more than $10.00

MATCH (b:Book)-[:PUBLISHED_BY]->()-[:PRINTS_AT]->()-[:LOCATED_IN]->(c:Country),
(b:Book)-[:REFLECTS]-(r:Color),
(b:Book)-[:SOLD_IN]->(s:Store)
WHERE c.name = 'Gautemala'
AND b.price > 10
AND r.name = 'red'
AND s.state = 'PA'
RETURN b;

CREATE (easyPeasy:ShippingPolicy {id: '021', name: 'The Easy Policy'})
CREATE (hardAsNails:ShippingPolicy {id: '022', name: 'The Hard Policy'})

MATCH (b:Book)-[:PUBLISHED_BY]->()-[:PRINTS_AT]->()-[:LOCATED_IN]->(c:Country)
WHERE c.name = 'Gautemala'
CREATE
  (b)-[:GOVERNED_BY]->(s:ShippingPolicy)

But now delete relationship at schema level

MATCH (a)-[r:GOVERNED_BY]->(b) DELETE r;

And create the same at data level (for specific data that meets a condition):

MATCH (b:Book)-[:PUBLISHED_BY]->()-[:PRINTS_AT]->()-[:LOCATED_IN]->(c:Country)
MATCH (p:ShippingPolicy)
WHERE c.name = 'Gautemala'
AND p.name = 'The Easy Policy'
CREATE
  (b)-[:GOVERNED_BY]->(p)


MATCH (b:Book)-[:PUBLISHED_BY]->()-[:PRINTS_AT]->()-[:LOCATED_IN]->(c:Country)
MATCH (p:ShippingPolicy)
WHERE c.name = 'Syria'
AND p.name = 'The Hard Policy'
CREATE
  (b)-[:GOVERNED_BY]->(p)