A guide for modeling a graph database - A lunch with Neo4j chief scientist Jim Webber, London

Since NoSQL databases first appeared, they have drawn more and more attention from the developer community. One of the more remarkable and unusual databases in the NoSQL family is Neo4j, an open-source graph database that stores its data as a graph.

Modeling a graph database is quite different from modeling a regular RDBMS, and even from modeling other NoSQL databases such as key-value stores. In the relational world we are used to breaking the data down into its atomic pieces, saving those pieces as columns, and grouping similar data together into tables. Since Neo4j is a flexible graph database, that approach does not carry over.

To model the data, we first need to identify the queries we are going to run against the database. These queries give us the basic logical sentences that are the key to modeling the database (more on that in a second).

We need to identify where to store each piece of information:
- as a node
- as an edge
- as a property of a node or of an edge

Nodes should store items of information, the nouns: User, Receipt, Comment, Document.
Edges should store the relations between those items, which form the verbs: Created by, Voted by, Is the son of, etc.
Properties are additional descriptions of either nodes or edges. Put data in a property when it does not need to be related to anything else. If we look at the information as a sentence, the properties are the adjectives.
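
To make the noun / verb / adjective split concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not Neo4j's API: the `Node` and `Edge` classes, the `as_sentence` helper, and the sample data are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """A noun: a user, a receipt, a comment, a document."""
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A verb that connects a start node to an end node."""
    type: str
    start: Node
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def as_sentence(edge: Edge) -> str:
    """Read an edge back as a noun-verb-noun sentence."""
    return f"{edge.start.label} {edge.type} {edge.end.label}"


# Hypothetical sample data: a document that was created by a user.
alice = Node("User", {"name": "Alice"})
doc = Node("Document", {"title": "Quarterly report"})
created_by = Edge("CREATED_BY", doc, alice)

print(as_sentence(created_by))
```

The name and the title sit in `properties` because nothing else needs to point at them; if titles ever had to be related to other things, they would become nodes of their own.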

So nodes, edges, and properties should together form a logical sentence as part of the modeling:
User A VOTED_YES on Question Q
This sentence creates the model for us:
[User (name=A)] -- VOTED_YES --> [Question (content=Q)]

Here User and Question are nodes, each with properties that describe it (user name, question content/title).
Pay special attention to the VOTED_YES edge.
We could have chosen VOTED (vote_data = 'yes') instead: an edge of type VOTED whose content is stored as a property. Why didn't we model it that way?

Jim Webber, the chief scientist of Neo4j, explains: performance-wise, it is better to create granularity in the model, so that each kind of vote is its own type of edge.
And what if we want to collect all the votes for a specific question? We can create an index for the votes and easily find all the votes in the index that point to a specific question.
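
The trade-off can be sketched with a toy example in plain Python. This is not how Neo4j stores edges and not its index API; the tuples, the `build_type_index` function, and the other helpers are assumptions made up for illustration. The point is only that with one edge type per answer, finding the "yes" votes becomes a lookup by type, while the generic VOTED edge forces us to open every edge and inspect its property.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Option 1: the answer is part of the edge type.
# Each edge is (start node, edge type, end node).
typed_edges = [
    ("user:A", "VOTED_YES", "question:Q"),
    ("user:B", "VOTED_NO", "question:Q"),
    ("user:C", "VOTED_YES", "question:Q"),
]

# Option 2: one generic edge type, with the answer kept as a property.
generic_edges = [
    ("user:A", "VOTED", "question:Q", {"vote_data": "yes"}),
    ("user:B", "VOTED", "question:Q", {"vote_data": "no"}),
    ("user:C", "VOTED", "question:Q", {"vote_data": "yes"}),
]


def build_type_index(edges) -> Dict[Tuple[str, str], List[str]]:
    """Toy index: (end node, edge type) -> list of start nodes."""
    index = defaultdict(list)
    for start, edge_type, end in edges:
        index[(end, edge_type)].append(start)
    return index


def yes_voters_by_property(edges, question: str) -> List[str]:
    """Option 2 lookup: scan every edge and read its property."""
    return [
        start
        for start, edge_type, end, props in edges
        if end == question
        and edge_type == "VOTED"
        and props["vote_data"] == "yes"
    ]


def all_voters(index, question: str,
               vote_types=("VOTED_YES", "VOTED_NO")) -> List[str]:
    """Collect every vote on a question by combining the typed lookups."""
    voters: List[str] = []
    for vote_type in vote_types:
        voters.extend(index.get((question, vote_type), []))
    return voters


index = build_type_index(typed_edges)
print(index[("question:Q", "VOTED_YES")])       # lookup by type, no scan
print(yes_voters_by_property(generic_edges, "question:Q"))
print(all_voters(index, "question:Q"))
```

Both options answer the "who voted yes?" question with the same users; the difference is that the typed version never has to read a property to get there, and collecting all the votes is still possible by asking the index for each vote type.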

__ to be continued __
