Neo4j in the graph databases landscape
Even when restricting the scope to graph databases, there are still different ways to envision such data stores:
- Resource Description Framework (RDF): Each record is a triple of the Subject Predicate Object type. This is a complex vocabulary that expresses a relationship of a certain type (the predicate) between a subject and an object; for instance: Alice (Subject) KNOWS (Predicate) Bob (Object)
Very famous knowledge bases such as DBpedia and Wikidata use the RDF format. We will talk about this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).
- Labeled-property graph (LPG): A labeled-property graph contains nodes and relationships. Both of these entities can be labeled (for instance, Alice and Bob are nodes with the Person label, and the relationship between them has the KNOWS label) and have properties (people have names; an acquaintance relationship can contain the date when both people first met as a property).
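To make the difference between the two models more concrete, here is a minimal Python sketch of the Alice/Bob example. It is only an illustration, not how any RDF store or Neo4j represents data internally: the Node and Relationship classes, the as_triple helper, and the since property with its date are all hypothetical names and values introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

# RDF view: every fact is one (subject, predicate, object) triple.
Triple = Tuple[str, str, str]
rdf_facts: List[Triple] = [("Alice", "KNOWS", "Bob")]


# LPG view: nodes and relationships both carry labels and properties.
@dataclass
class Node:
    labels: Set[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


alice = Node({"Person"}, {"name": "Alice"})
bob = Node({"Person"}, {"name": "Bob"})
# "since" and its value are made up for illustration only.
knows = Relationship(alice, bob, "KNOWS", {"since": "2020-01-01"})


def as_triple(rel: Relationship) -> Triple:
    """Flatten an LPG relationship into a triple, dropping its properties."""
    return (rel.start.properties["name"], rel.label, rel.end.properties["name"])
```

In this simplified sketch, as_triple(knows) yields the same fact as the RDF triple, but the since property has nowhere to go in a plain triple, whereas the LPG relationship carries it directly.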
Neo4j is a labeled-property graph database. And even there, just as MySQL, PostgreSQL, and Microsoft SQL Server are all relational databases, you will find different vendors offering LPG graph databases. They differ in many aspects:
- Whether they use a native graph engine or not: As we discussed earlier, it is possible to use a KV store or even a SQL database to store graph data. In this case, we’re talking about non-native storage engines since the storage does not reflect the graph structure of the data.
- The query language: Unlike SQL, the query language to deal with graph data has not yet been standardized, even if there is an ongoing effort being led by the GQL group (see, for instance, https://gql.today/). Neo4j uses Cypher, a declarative query language developed by the company in 2011 and then open-sourced in the openCypher project, allowing other databases to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors have created their own languages (AQL for ArangoDB or GSQL for TigerGraph, for instance). To me, this is a key point to take into account since the learning curve can be very different from one language to another. Cypher has the advantage of being very intuitive: a few minutes are enough to start writing your own queries.
- Their (integrated or not) support for graph analytics and data science.
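As a rough illustration of what non-native storage looks like, the following Python sketch stores a tiny graph in a relational database using the standard sqlite3 module. The person and knows tables, the friends_of_friends helper, and the extra person Carol are hypothetical and only serve this example; it is not a benchmark nor a description of how any particular vendor stores data.

```python
import sqlite3

# Graph data squeezed into relational tables: one for nodes, one for edges.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE knows (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
)
# Alice KNOWS Bob, Bob KNOWS Carol
conn.executemany("INSERT INTO knows VALUES (?, ?)", [(1, 2), (2, 3)])


def friends_of_friends(connection: sqlite3.Connection, name: str) -> list:
    """Return the names reachable in exactly two KNOWS hops from `name`."""
    rows = connection.execute(
        """
        SELECT p3.name
        FROM person p1
        JOIN knows k1 ON k1.src = p1.id   -- first hop
        JOIN knows k2 ON k2.src = k1.dst  -- second hop
        JOIN person p3 ON p3.id = k2.dst
        WHERE p1.name = ?
        ORDER BY p3.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


result = friends_of_friends(conn, "Alice")
```

Here, the tables know nothing about paths: every additional hop in the traversal has to be written as another join on the knows table, which is what is meant by storage that does not reflect the graph structure.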
A note about performance
Almost every vendor claims to be the best one, at least in some aspects. This book won’t create another debate about that. The best option, if performance is crucial for your application, is to test the candidates with a scenario close to your final goal in terms of data volume and the type of queries/analysis.
Neo4j ecosystem
The Neo4j database is already very helpful by itself, but the number of extensions, libraries, and applications related to it makes it one of the most complete solutions. In addition, it has a very active community of members always keen to help each other, which is one of the reasons to choose it.
The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the database and Cypher capabilities. We will use it later in this book to load JSON data.
The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the Graph Algorithm Library, was first released in 2018 by the Neo4j Labs team. It was quickly replaced by the Graph Data Science Library, a fully production-ready plugin, with improved performance. Algorithms are improved and added regularly. Version 2.0, released in 2021, takes graph data science even further, allowing us to train models and build analysis pipelines directly from the library. It also comes with a handy Python client, which is very convenient for including graph algorithms in your usual machine learning processes, whether you use scikit-learn or other machine learning libraries such as TensorFlow or PyTorch.
Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your installed plugins and applications.
Applications installed into Neo4j Desktop are granted access to your active database. While reading this book, you will use the following:
- Neo4j Browser: A simple but powerful application that lets you write Cypher queries and visualize the result as a graph, table, or JSON:
Figure 1.4 – Neo4j Browser
- Neo4j Bloom: A graph visualization application in which you can customize node styles (size, color, and so on) based on their labels and/or properties:
Figure 1.5 – Neo4j Bloom
- Neodash: This is a dashboard application that allows us to draw plots from the data stored in Neo4j, without having to extract this data into a DataFrame first. Plots can be organized into nice dashboards that can be shared with other users:
Figure 1.6 – Neodash
This list of applications is non-exhaustive. You can find out more here: https://install.graphapp.io/.
Good to know
You can create your own graph application to be run within Neo4j Desktop. This is why there are so many diverse applications, some of which are being developed by community members or Neo4j partners.
This section described Neo4j as a database and the various extensions that can be added to it to make it more powerful. Now, it is time to start using it. In the following section, you are going to install Neo4j locally on your computer so that you can run the code examples provided in this book (which you are highly encouraged to do!).