I have been looking into Graph Databases, specifically Neo4j, for the past couple of months while building a project. After spending time reading articles on https://neo4j.com/ and browsing through the documentation to understand where the dev team is taking Neo4j, it has become more and more apparent that, while Graph Databases solve many of the problems that Data Engineers, Analysts and Scientists face with a number of RDBMS, that is not an indication that we will leave the relational paradigm anytime soon.
Even with all that, I do not believe that they will replace SQL at any time. The two have largely separate use cases in the industry. The RDBMS setup works as needed in many of them, but it struggles with things like analysis within densely and sparsely connected graphs, which is exactly what Graph Databases rely on as a selling point. And while Graph Databases are much, much better at traversing nodes to find insights and at giving easier access to detailed analysis, there will always be a use case for simply storing tabular relations, which the different RDBMS and SQL are much better optimized for. Adding onto that, Neo4j, as far as I understand, keeps its data on disk and caches it in memory, and scales horizontally with a master-child style of architecture, which allows room for clusters. SQL and RDBMS are notorious for being hard to scale horizontally; not that it cannot be done, but it is one of the considerations that DBAs have to look into.
This is an opinion piece, so feel free to disagree and let me know what your thoughts are!
What Are Graph Databases Good For?
Graph Databases, such as GrapheneDB or Neo4j, offer a set of advantages that help counter the enormous performance overhead that comes with performing joins across an extensive number of relations, where the depth grows past 6-7, or maybe even up to 10, joins just to find and retrieve meaningful information. Writing really long queries for that does not help either, since every added join means double-checking that the relations still retrieve the information we expect. On top of that, a JOIN is conceptually a filtered version of a Cartesian product of the two relations it operates on.
Please note that RDBMS, joins, and Cartesian products can quickly heat up into a very strong discussion among DBA professionals, which is out of the scope of this blog. JOINs are certainly optimized in the different RDBMS, and I will even quote myself saying that Relational Databases are among the most complex beasts in Computer Science, along with compilers, operating systems, and distributed computing. There are ways to optimize these operations, and you should definitely look into discussions about this before believing that JOINs are not really optimized.
Take a look at "When And Why Are Joins Expensive?". For an article that dives into optimizing JOINs and SELECT in SQL, see "How to design SQL queries with better performance: SELECT * and EXISTS vs IN vs JOINs".
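To make the Cartesian product remark a bit more concrete, here is a minimal Python sketch. The `people` and `friendships` relations and the `naive_join` helper are made up for this illustration, and real RDBMS engines use indexes and smarter join strategies rather than literally building the full product, so treat this as the conceptual definition only.

```python
from itertools import product

# Hypothetical toy relations, invented for this illustration.
people = [
    {"id": 1, "name": "A"},
    {"id": 2, "name": "B"},
    {"id": 3, "name": "C"},
]
friendships = [
    {"person_id": 1, "friend_id": 2},
    {"person_id": 2, "friend_id": 3},
]


def naive_join(left, right, predicate):
    """Join two relations by filtering their full Cartesian product."""
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]


# Every pairing is a candidate before the predicate discards most of them.
candidate_pairs = len(people) * len(friendships)

joined = naive_join(
    people,
    friendships,
    lambda person, friendship: person["id"] == friendship["person_id"],
)
joined_names = [(person["name"], friendship["friend_id"]) for person, friendship in joined]

print(candidate_pairs)  # 6
print(joined_names)     # [('A', 2), ('B', 3)]
```

Only 2 of the 6 candidate pairings survive the filter. Chaining more joins multiplies the candidate space again, which is the overhead that optimizers work hard to avoid.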
Most of the references and information below come from Neo4j themselves, so they may be a bit biased. Take them with a grain of salt.
- Performance. Searching across multiple relationships and multiple records is much easier. You can attach as many relationships to a node as you want, which means that two nodes with the same label can have different numbers of incoming and outgoing relationships, giving some nodes more weight than others. Querying, searching, and ranking these nodes for whatever OLAP analysis you may want certainly seems promising here.
- Scaling. Graph Databases, though still moving towards maturity from the industry perspective, are gaining the ability to scale horizontally with clusters. This makes a microservice architecture feasible.
- Agility. It is easier to change data types and the schema without the hassle of dealing with old data types, given a source and a target for what the data should look like and in what way.
- Graph Algorithms. Graph Databases bring with them a whole suite of Data Science and Graph Algorithm packages. Check out GDSL (the Graph Data Science Library), which lists the algorithms you can test out in something like Neo4j.
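As a rough illustration of the "weight" idea from the Performance point, the sketch below counts incoming and outgoing relationships per node in plain Python. The names and the `relationships` list are hypothetical, and this is not Neo4j's Graph Data Science library, just the simplest degree count.

```python
from collections import Counter

# Hypothetical directed relationships as (source, target) pairs.
relationships = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
    ("carol", "dave"),
]


def degrees(edges):
    """Count incoming and outgoing relationships for every node."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


in_degree, out_degree = degrees(relationships)

# The node with the most incoming relationships carries the most "weight".
most_referenced, incoming_count = in_degree.most_common(1)[0]
print(most_referenced, incoming_count)  # bob 3
```

All five nodes here would share one label, yet "bob" clearly matters more than the rest, which is the kind of signal that degree-based analysis picks up.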
What Are Graph Databases Bad For?
Just so we are on the same page, Graph Databases are based on 4 main ideas:
- Node: An instance that carries Labels and has its own Attributes. Think of classes in Object Oriented Programming, each with its own set of information that is initialized, or pre-declared, when an object of that class is instantiated.
- Relationships: A connection that captures what 'topic' or 'idea' links two different nodes, and what the type of that link is.
- Label: The type of the Node that is being created.
- Direction: Whether a relationship is directed or undirected and, if directed, which node it points from and which it points to.
For example, take the query below:
MATCH (a:Person),(b:Person) WHERE a.name = 'A' AND b.name = 'B' CREATE (a)-[r:RELTYPE]->(b) RETURN type(r)
The query above is written in Cypher, Neo4j's query language, and tries to 'match' nodes with certain characteristics. Here, those are two nodes, 'a' and 'b', with the label 'Person', each carrying the attribute 'name'. 'RELTYPE' is the type of the relationship that gets created to join them, with the 'direction' going from 'a' to 'b', and the query returns the type of that newly created relationship.
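If the moving parts are easier to follow in a general-purpose language, here is a small Python sketch that mimics the same match-then-create steps. `TinyGraph`, `Node` and `Relationship` are hypothetical classes written for this post; they say nothing about how Neo4j stores or matches data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node  # the relationship points from start ...
    end: Node    # ... to end, which gives it a direction
    rel_type: str


class TinyGraph:
    """Hypothetical in-memory stand-in for a graph store."""

    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, labels, **properties):
        node = Node(set(labels), properties)
        self.nodes.append(node)
        return node

    def match(self, label, **properties):
        """Return nodes that carry the label and all given property values."""
        return [
            node for node in self.nodes
            if label in node.labels
            and all(node.properties.get(key) == value for key, value in properties.items())
        ]

    def create_relationship(self, start, end, rel_type):
        relationship = Relationship(start, end, rel_type)
        self.relationships.append(relationship)
        return relationship


graph = TinyGraph()
graph.add_node(["Person"], name="A")
graph.add_node(["Person"], name="B")

# MATCH both nodes, CREATE a relationship per matched pair, RETURN its type.
created_types = [
    graph.create_relationship(a, b, "RELTYPE").rel_type
    for a in graph.match("Person", name="A")
    for b in graph.match("Person", name="B")
]
print(created_types)  # ['RELTYPE']
```

The nested loop mirrors the fact that the query creates one relationship for every matched pair of 'a' and 'b', so two people named 'A' would produce two relationships.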
So, now that the above is out of the way, let's discuss some drawbacks:
- Making your entire data reside in a single graph can mean searching the whole graph just to perform some aggregate operation. Many RDBMS already optimize their aggregate functions to be as fast as possible. The performance overhead of each query in Neo4j is directly proportional to the number of nodes and relationships visited, whereas in an RDBMS all of the records are stored in a single relation, which makes an aggregate operation simple and relatively easy to perform.
- Limited support may be a concern for new projects, as there are few vendors out there at the moment, Neo4j being the most popular, in contrast to the range of enterprise-level support available for RDBMS such as SQL Server, Oracle, etc.
- Searching, which is one of the use cases that Neo4j might be good for, depends entirely on how you design your relationships. While creating and assigning relationships is easy, it really falls on the team to decide what kind of relationships to design, and whether they actually give you a better understanding of the data that connects different nodes.
- A personal drawback that I think exists is having to be careful not to put too much information into individual nodes. Doing so may induce heavier search times and seek operations on disk, and pulling data from the hard disk into memory, and the reverse of that, can very quickly be affected by how many nodes you address, even at something along the lines of 4000-5000 nodes. Redundancy hasn't always been the highest priority for NoSQL databases, but I am still unaware of how exactly something like Neo4j stores and retrieves nodes, and what kind of data structure optimizations it performs.
And since the idea of Graph Databases is to connect node to node and traverse the depth of relationships, if you are in a team there needs to be a discussion of exactly how different kinds of nodes relate to one another.
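To put the first drawback in more concrete terms, the sketch below computes the same average twice: once with SQLite's built-in `AVG` over a single table, and once by walking a toy graph while counting how many nodes and relationships get touched. The data, the `knows` adjacency map and `average_age_by_traversal` are all invented here, and this is only a toy model of the cost argument, not a description of how Neo4j's query planner actually runs aggregates.

```python
import sqlite3

# Hypothetical data: four people and their ages.
rows = [("A", 30), ("B", 40), ("C", 50), ("D", 20)]

# Relational version: one table, one built-in aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", rows)
sql_average = conn.execute("SELECT AVG(age) FROM person").fetchone()[0]
conn.close()

# Graph version: the same people, reachable only by following relationships.
ages = dict(rows)
knows = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def average_age_by_traversal(start):
    """Walk the graph from start, counting the work done along the way."""
    seen = set()
    stack = [start]
    relationships_followed = 0
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        for neighbour in knows[name]:
            relationships_followed += 1
            stack.append(neighbour)
    average = sum(ages[name] for name in seen) / len(seen)
    return average, len(seen), relationships_followed


graph_average, nodes_visited, relationships_followed = average_age_by_traversal("A")
print(sql_average)                            # 35.0
print(graph_average, nodes_visited, relationships_followed)  # 35.0 4 4
```

Both give the same answer, but the traversal has to touch every node and follow every relationship to get there, and that work grows with the graph.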
So, should you use Graph Databases?
It depends entirely on what your use case is. If you have the freedom of tons and tons of disk space, and the capacity to scale nodes pretty efficiently for your application, then you are most probably more concerned with the degree and number of relationships forming between those nodes than with the overall number of nodes. That serves as a pretty good use case, since traversal becomes much simpler.
Or are you unsure how deep the information you are looking for is? You may not be sure whether applying JOINs to a certain depth will get you what you need. With a graph, you can go as deep as you want to find the information you need.
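The "go as deep as you want" idea can be sketched as a breadth-first search in which the depth is just a parameter. The `follows` map and the `reachable_within` helper below are hypothetical and exist only for this example. In SQL, each extra hop would typically mean another self-join or a recursive query.

```python
from collections import deque

# Hypothetical chain of relationships: A -> B -> C -> D -> E.
follows = {
    "A": ["B"],
    "B": ["C"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def reachable_within(graph, start, max_depth):
    """Return {node: hops} for every node reachable in at most max_depth hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand past the requested depth
        for neighbour in graph.get(node, []):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    del depths[start]  # the start node is not a result of the search
    return depths


shallow = reachable_within(follows, "A", 2)
deep = reachable_within(follows, "A", 10)
print(shallow)  # {'B': 1, 'C': 2}
print(deep)     # {'B': 1, 'C': 2, 'D': 3, 'E': 4}
```

Changing how deep to look is a one-argument change here, and the query does not need to be restructured when you find out the answer sits one hop further away.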
Perhaps your use case falls under one of the suggestions above. If so, check out the Use Cases On Neo4j.
That's pretty much it. Happy Database Designing!