Graph Data Science with Neo4j

Graph Data Science with Neo4j: Learn how to use Neo4j 5 with Graph Data Science library 2.0 and its Python driver for your project

By Estelle Scifo

4.5 (6 Ratings)
Paperback | Jan 2023 | 288 pages | 1st Edition

eBook: Mex$179.99 (Mex$771.99)
Paperback: Mex$963.99
Subscription: Free Trial

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

  • Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
  • 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
  • Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
  • Thousands of reference materials covering every tech concept you need to stay up to date.

Graph Data Science with Neo4j

Introducing and Installing Neo4j

Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years. They provide a natural way of modeling entities and relationships and take into account the context of observations, which is often crucial to extract the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. A lot of tools have been developed by the company itself or the community to make the whole ecosystem consistent and easy to use: from storage to querying, visualization, and graph data science. As you will see throughout this book, there is a well-integrated application or plugin for each of these topics.

In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases. We will also introduce the aforementioned plugins that are used for graph data science.

Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your first Cypher queries to populate the database with some data and retrieve it.

In this chapter, we’re going to cover the following main topics:

  • What is a graph database?
  • Finding or creating a graph database
  • Neo4j in the graph databases landscape
  • Setting up Neo4j
  • Inserting data into Neo4j with Cypher, the Neo4j query language
  • Extracting data from Neo4j with Cypher pattern matching

Technical requirements

To follow this chapter well, you will need access to the following resources:

What is a graph database?

Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how it differs from the data storage engines you are used to. In this section, we are going to discuss (quickly) the different types of databases you can find today, and why graph databases are so interesting and popular among both developers and data professionals.

Databases

Databases make up an important part of computer science. Discussing the evolution and state of the art of the different implementations in detail would require several books like this one – fortunately, this is not a requirement for using such systems effectively. However, it is important to be aware of the existing tools related to data storage and how they differ from each other, to be able to choose the right tool for the right task. The fact that, after reading this book, you'll be able to use graph databases and Neo4j in your data science projects doesn't mean you will have to use them every single time you start a new project, whatever the context is. Sometimes, they won't be suitable; this introduction will explain why.

A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.

As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

  • Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows represent each entity. They have a predefined schema, defining how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when attributes are required or when we can have many relationships between the same objects, an intermediate junction (join) table is required.
  • NoSQL databases contain many different types of databases:
    • Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string, and the value can be a more complex object that can’t be used to filter the query – it can only be retrieved. They are known to be very efficient for caching in a web context, where the key is the page URL and the value is the HTML content of the page, which is dynamically generated. KV stores can also be used to model graphs when building a native graph engine is not an option.
    • Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents (usually JSON objects, or a derivative format). They are much more flexible compared to relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely a lot on nested JSON and information duplication instead of joining multiple tables.

The preceding list is non-exhaustive; other types of data stores have been created over time and abandoned, while still others have appeared only in the past few years, so we will have to wait and see how useful they turn out to be. We can mention, for instance, vector databases, such as Weaviate, which store data together with its vector representation to ease searching in the vector space, with many applications in machine learning once a vector representation (embedding) of an observation has been computed.

Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data modeling phase.

Graph database

In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.

A graph is a mathematical object defined by the following:

  • A set of vertices or nodes (the dots)
  • A set of edges (the connections between these dots)
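
If you prefer code to notation, the following minimal sketch (plain Python, with node names invented for illustration) expresses the same definition: a graph is nothing more than a set of nodes plus a set of edges connecting pairs of them:

    # A graph as two sets: vertices (nodes) and edges (pairs of nodes)
    nodes = {"A", "B", "C"}
    edges = {("A", "B"), ("C", "A")}  # each edge connects two nodes

    # Sanity check: every edge must reference known nodes
    assert all(src in nodes and dst in nodes for src, dst in edges)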

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs

As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:

  • Time series: Each observation is connected to the next one
  • Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
  • Texts: Here, each word is connected to its surrounding words, or to other words through a more complex mapping depending on its meaning (see the following figure):
Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques

A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.

Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used to represent networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, and especially of the shortest path in a graph, is a long-studied topic. But the analysis of graphs doesn’t stop there – much more information can be extracted by carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are some groups of nodes more tightly tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
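
To make these ideas concrete, here is a small sketch using the networkx Python library (used here purely for illustration; the graph and its node names are made up, not taken from the book's datasets). It traverses a toy network, computes a shortest path, and inspects its structure:

    import networkx as nx

    # A tiny, made-up road-like network; note that Berlin-Munich is
    # disconnected from the other cities
    G = nx.Graph()
    G.add_edges_from([
        ("Paris", "Lyon"), ("Lyon", "Marseille"),
        ("Paris", "Lille"), ("Berlin", "Munich"),
    ])

    # Traversal: go from one node to another by following edges
    print(nx.shortest_path(G, "Paris", "Marseille"))  # ['Paris', 'Lyon', 'Marseille']

    # Structure: are there groups of nodes disconnected from the rest?
    print(nx.number_connected_components(G))  # 2

    # Node ranking: a simple importance measure such as degree centrality
    print(nx.degree_centrality(G))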

So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.

At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common, which might be confusing to some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.

Finding or creating a graph database

Data scientists know how to find or generate datasets that fit their needs. Randomly generating values that follow some probability distribution is one of the first things you’ll learn in a statistics course. Similarly, graph datasets can be randomly generated, following some rules. However, this book is not a graph theory book, so we are not going to dig into these details here. Just be aware that this can be done. Please refer to the references in the Further reading section to learn more.
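
As a rough illustration of what “randomly generating a graph following some rules” can look like, the networkx library ships with several classical random graph generators; the parameters below are arbitrary and only meant as an example:

    import networkx as nx

    # Erdos-Renyi model: 50 nodes, each possible edge exists with probability 0.1
    er_graph = nx.erdos_renyi_graph(n=50, p=0.1, seed=42)

    # Barabasi-Albert model: preferential attachment, each new node brings 2 edges
    ba_graph = nx.barabasi_albert_graph(n=50, m=2, seed=42)

    print(er_graph.number_of_edges(), ba_graph.number_of_edges())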

Regarding existing datasets, some of them are very popular and data scientists know about them because they have used them while learning data science and/or because they are the topic of well-known Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.

The same holds for graph data science, where some datasets are very common for teaching or benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note: there is even a ZKC trophy, which is awarded to the first person at a graph conference who mentions this dataset). The ZKC dataset is very simple (34 nodes, as we’ll see in Chapter 3, Characterizing a Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, which cover how to characterize a graph dataset), but bigger and more complex datasets are also available.

There are websites referencing graph datasets, which can be used for benchmarking in a research context or for educational purposes, as in this book. Two of the most popular ones are the following:

  • The Stanford Network Analysis Project (SNAP) (https://snap.stanford.edu/data/index.html) lists different types of networks in different categories (social networks, citation networks, and so on)
  • The Network Repository Project, via its website at https://networkrepository.com/index.php, provides hundreds of graph datasets from real-world examples, classified into categories (for example, biology, economics, recommendations, road, and so on)

If you browse these websites and start downloading some of the files, you’ll notice the data comes in unfamiliar formats. We’re going to list some of them next.

A note about graph dataset formats

The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with nodes on one side and edges on the other, several dedicated formats have been devised.

The main data formats used to save graph data as text files are the following (a short Python sketch showing how to read them follows the list):

  • Edge list: This is a text file where each row contains an edge definition. For instance, a graph with three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:
    A B
    C A
  • Matrix Market (with the .mtx extension): This format is an extension of the previous one. It is quite common on the Network Repository website.
  • Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in the graph) where the ij element is 1 if nodes i and j are connected through an edge and 0 otherwise. The adjacency matrix of the simple graph with three nodes and two edges is a 3x3 matrix, as shown in the following code block. I have explicitly displayed the row and column names only for convenience, to help you identify what i and j are:
      A B C
    A 0 1 0
    B 0 0 0
    C 1 0 0

Note

The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning.

  • GraphML: Derived from XML, the GraphML format is much more verbose but lets us define more complex graphs, especially those where nodes and/or edges carry properties. The following example uses the preceding graph but adds a name property to nodes and a length property to edges:
    <?xml version='1.0' encoding='utf-8'?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
    >
        <!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
        <key attr.name="name" attr.type="string" for="node" id="d1"/>
        <key attr.name="length" attr.type="double" for="edge" id="d2"/>
        <graph edgedefault="directed">
           <!-- DEFINING NODES -->
           <node id="A">
                 <!-- SETTING NODE PROPERTY -->
                <data key="d1">"Point A"</data>
            </node>
            <node id="B">
                <data key="d1">"Point B"</data>
            </node>
            <node id="C">
                <data key="d1">"Point C"</data>
            </node>
            <!-- DEFINING EDGES
            with source and target nodes and properties
        -->
            <edge id="AB" source="A" target="B">
                <data key="d2">123.45</data>
            </edge>
            <edge id="CA" source="C" target="A">
                <data key="d2">56.78</data>
            </edge>
        </graph>
    </graphml>
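
As a hedged sketch of how these text formats can be read in Python (networkx is used here only as a convenient reader; the file names are hypothetical), the following snippet writes and reads back a small edge list, builds a graph from an adjacency matrix, and shows where a GraphML file would be loaded:

    import networkx as nx
    import numpy as np

    # Edge list: one edge per line ("A B", "C A"), as in the example above
    with open("graph.edgelist", "w") as f:
        f.write("A B\nC A\n")
    g_edges = nx.read_edgelist("graph.edgelist", create_using=nx.DiGraph)

    # Adjacency matrix: an NxN 0/1 matrix turned into a directed graph
    adj = np.array([[0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0]])
    g_adj = nx.from_numpy_array(adj, create_using=nx.DiGraph)

    # GraphML: nodes and edges together with their properties
    # g_graphml = nx.read_graphml("graph.graphml")  # hypothetical file

    print(sorted(g_edges.edges()), sorted(g_adj.edges()))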

If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats. However, most of the time, you will want to use your own data, which is not yet in graph format – it might be stored in one of the previously described databases, or in CSV or JSON files. If that is the case, then the next section is for you! There, you will learn how to transform your data into a graph.

Modeling your data as a graph

The second answer to the main question in this section is: your data is probably a graph, without you being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph), but let me give you a quick overview.

Let’s take the example of an e-commerce website, which has customers (users) and products. As in every e-commerce website, users can place orders to buy some products. In the relational world, the data schema that’s traditionally used to represent such a scenario is represented on the left-hand side of the following screenshot:

Figure 1.3 – Modeling e-commerce data as a graph

The relational data model works as follows:

  • A table is created to store users, with a unique identifier (id) and a username (apart from security and personal information required for such a website, you can easily imagine how to add columns to this table).
  • Another table contains the data about the available products.
  • Each time a customer places an order, a new row is added to an order table, referencing the user by its ID (a foreign key with a one-to-many relationship, where a user can place many orders).
  • To remember which products were part of which orders, a many-to-many relationship is created (an order contains many products and a product is part of many orders). We usually create a relationship table, linking orders to products (the order product table, in our example).

Note

Please refer to the colored version of the preceding figure, which can be found in the graphics bundle link provided in the Preface, for a better understanding of the correspondence between the two sides of the figure.

In a graph database, all the _id columns are replaced by actual relationships, which are real entities in graph databases, not just conceptual ones as in the relational model. You can also get rid of the order product table, since information specific to a product in a given order, such as the ordered quantity, can be stored directly on the relationship between the order and the product nodes. The data model is much more natural and easier to document and present to other people on your team.
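
To give a feel for what this looks like in practice, here is a hedged sketch using the official neo4j Python driver (the connection details, labels, and relationship names below are placeholders chosen for illustration, not necessarily those used in the book's figures). It creates a user, a product, and an order, and stores the ordered quantity directly on the relationship between the order and the product:

    from neo4j import GraphDatabase

    # Placeholder connection details -- adapt them to your own Neo4j instance
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    CREATE (u:User {name: $username})
    CREATE (p:Product {name: $product})
    CREATE (o:Order {placedAt: datetime()})
    CREATE (u)-[:BUYS]->(o)
    CREATE (o)-[:CONTAINS {quantity: $quantity}]->(p)
    """

    with driver.session() as session:
        # The quantity lives on the CONTAINS relationship itself,
        # so no intermediate "order product" table is needed
        session.run(query, username="alice", product="keyboard", quantity=2)

    driver.close()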

Now that we have a better understanding of what a graph database is, let’s explore the different implementations out there. Like the other types of databases, there is no single implementation for graph databases, and several projects provide graph database functionalities.

In the next section, we are going to discuss some of the differences between them, and where Neo4j is positioned in this technology landscape.


Key benefits

  • Extract meaningful information from graph data with Neo4j's latest version 5
  • Use graph algorithms in a regular machine learning pipeline in Python
  • Learn the core principles of the Graph Data Science library to make predictions and create data science pipelines

Description

Neo4j, along with its Graph Data Science (GDS) library, is a complete solution to store, query, and analyze graph data. As graph databases become more popular among developers, data scientists are likely to face such databases in their careers, making the ability to work with graph algorithms an indispensable skill for extracting context information and improving overall model prediction performance. Data scientists working with Python will be able to put their knowledge to work with this practical guide to Neo4j and the GDS library, which offers step-by-step explanations of essential concepts and practical instructions for implementing data science techniques on graph data using the latest Neo4j version 5 and its associated libraries. You’ll start by querying Neo4j with Cypher and learn how to characterize graph datasets. As you get the hang of running graph algorithms on graph data stored in Neo4j, you’ll understand the new and advanced capabilities of the GDS library that enable you to make predictions and write data science pipelines. Using the newly released GDS Python driver, you’ll be able to integrate graph algorithms into your ML pipeline. By the end of this book, you’ll be able to take advantage of the relationships in your dataset to improve your current model and make other types of elaborate predictions.

Who is this book for?

If you’re a data scientist or data professional with a foundation in the basics of Neo4j and are now ready to understand how to build advanced analytics solutions, you’ll find this graph data science book useful. Familiarity with the major components of a data science project in Python and Neo4j is necessary to follow the concepts covered in this book.

What you will learn

  • Use the Cypher query language to query graph databases such as Neo4j
  • Build graph datasets from your own data and public knowledge graphs
  • Make graph-specific predictions such as link prediction
  • Explore the latest version of Neo4j to build a graph data science pipeline
  • Run a scikit-learn prediction algorithm with graph data
  • Train a predictive embedding algorithm in GDS and manage the model store

Product Details

Publication date: Jan 31, 2023
Length: 288 pages
Edition: 1st
Language: English
ISBN-13: 9781804612743





Table of Contents

15 Chapters
Part 1 – Creating Graph Data in Neo4j
Chapter 1: Introducing and Installing Neo4j
Chapter 2: Importing Data into Neo4j to Build a Knowledge Graph
Part 2 – Exploring and Characterizing Graph Data with Neo4j
Chapter 3: Characterizing a Graph Dataset
Chapter 4: Using Graph Algorithms to Characterize a Graph Dataset
Chapter 5: Visualizing Graph Data
Part 3 – Making Predictions on a Graph
Chapter 6: Building a Machine Learning Model with Graph Features
Chapter 7: Automatically Extracting Features with Graph Embeddings for Machine Learning
Chapter 8: Building a GDS Pipeline for Node Classification Model Training
Chapter 9: Predicting Future Edges
Chapter 10: Writing Your Custom Graph Algorithms with the Pregel API in Java
Index
Other Books You May Enjoy

Customer reviews

Rating distribution: 4.5 (6 Ratings)
5 star: 50%
4 star: 50%
3 star: 0%
2 star: 0%
1 star: 0%

Nathan Smith, Mar 30, 2023 (5 stars)
The examples in this book are really useful, and people to whom I have recommended the book have also liked it. The author does a good job of walking you through the whole graph data science lifecycle, end-to-end. It is especially valuable in providing guidance that goes beyond the Neo4j product documentation on complex topics like link prediction pipelines. I have not seen a better explanation of the Neo4j Pregel API anywhere.
Amazon Verified review
kbpr, Mar 14, 2023 (5 stars)
Graph Data Science with Neo4j is a fantastic overview for those starting out in their graph data science journey. The author walks the reader through everything from installation, graph theory and graph algorithms, all the way to supervised machine learning pipelines. She has very clear descriptions and explanations throughout, building up concepts step by step to *show* the reader how to interact with graphs, rather than “telling” and providing a list of features. The result is a solid jumping off point for aspiring graph data scientists, with many great tidbits for the more experienced graph practitioners to return to and reference.
Amazon Verified review
Phani Dathar, Mar 17, 2023 (5 stars)
Estelle has done an excellent job of making this book a must-read for both beginners and graph practitioners alike. Given her extensive experience building solutions using graph technology, the practical advice to the readers on how to leverage the Neo4j graph database and Neo4j graph data science to address real-world use cases is very valuable.

The book starts with a step-by-step walkthrough of creating a graph database, an introduction to Cypher, and easy-to-follow examples of ingesting data and building a knowledge graph. It is very important for readers to understand how to store and extract the connections in their data and understand the topology of the network before using graph machine learning and/or building applications on graph databases. The first three chapters in this book, combined with the chapter on graph visualization, do an excellent job in laying that foundation for moving on to the advanced concepts in the graph data science world.

Graph algorithms and graph machine learning are the advanced topics that are also covered in this book, with a good set of explanations on why, where and how to use them. Leveraging graph algorithms to generate graph features, graph embeddings for dimensionality reduction, and building machine learning pipelines by integrating graph features are all explained very well, and the author's expertise shines through in this book.

I would recommend this book as a must-read for developers, data scientists and analysts who are starting their journey with graph technology and graph data science, as well as graph practitioners who are familiar with the concepts of network science.
Amazon Verified review
Shanthababu Pandian, May 26, 2023 (4 stars)
Data Science is a buzzing word in recent eras; collecting data from various sources and representing them is a challenging task. A graph is a mathematical object defined by vertices or nodes and edges. It has the characteristic of being easily traversed, going from one node to another by following edges.

The author starts in Part 1 by defining what a graph database is, answering that “Data is saved into nodes, which can be connected through edges to model relationships between them.” The author introduces the Neo4j ecosystem and outlines the Neo4j Browser, Neo4j Bloom, and NeoDash, helping with clear steps to create a Neo4j database and insert data into Neo4j with Cypher, the Neo4j query language.

Importing Data into Neo4j to Build a Knowledge Graph covers loading CSV data into Neo4j with Cypher, with the classical data from netflix.zip; from defining the graph schema to creating nodes and relationships, the author provides detailed steps. Introducing the Awesome Procedures on Cypher (APOC) Neo4j plugin and playing with JSON data is a great credit for readers to understand how to utilise the Neo4j environment.

In Part 2 the author takes us on a tour of exploring and characterising graph data with Neo4j, where we learn how to characterise a graph dataset, the Neo4j Graph Data Science (GDS) library, and the most common graph algorithms to analyse the graph topology, characterising a graph from its node and edge properties like link direction, link weight and node type.

Coming to computing the graph degree distribution, the author gives the definition of a node's degree (incoming, outgoing and total degree) precisely, and helps us understand computing the node degree with Cypher, building the degree distribution of a graph, and how to improve degree distribution. Under characterising metrics, the author gives a heads-up on triangle count, various clustering coefficients, and their calculations. Furthermore, digging into the Neo4j GDS library and installing it with Neo4j Desktop is a power pack for the reader and makes them hands-on with GDS.

Visualizing graph data is the final goal for graph data; the author has provided a detailed walkthrough of the complexity of graph data visualisation, small graphs with networkx and matplotlib, and large graphs with Gephi.

In the end, we have to make predictions, which the author covers in Part 3, Making Predictions on a Graph. Here the author re-introduces a well-known Python library, namely scikit-learn, extracts data from Neo4j to build a model, and shows how the GDS library helps us use embeddings to build node classification and link prediction pipelines. Building a machine learning model with graph features using the GDS Python client, the author explains how the GDS library allows running graph algorithms directly from Python without writing any Cypher. The very detailed notes on graph embedding techniques and algorithms (classification, Node2Vec), building a GDS pipeline for classification model training, and predicting future edges are fruitful topics and a must-read.

Overall, I can give 4.0/5.0 for this. Certainly, a special effort from the author is really much appreciable.

Shanthababu Pandian
Artificial Intelligence and Analytics | Cloud Data and ML Architect | Scrum Master
National and International Speaker | Blogger
Amazon Verified review
Om S, Mar 13, 2023 (4 stars)
"Graph Data Science with Neo4j" is a practical guide for data scientists who are looking to enhance their skill set by learning how to extract meaningful information from graph data. The book covers various essential concepts and practical instructions for implementing data science techniques on graph data using the latest Neo4j version 5 and its associated libraries. The book is divided into ten chapters, starting with an introduction to Neo4j, installation, and building a knowledge graph.

The book then moves on to characterizing a graph dataset and using graph algorithms to analyze it. Readers will also learn how to visualize graph data and build a machine learning model with graph features. The book introduces readers to the newly released GDS Python driver, which allows for the integration of graph algorithms into a machine learning pipeline.

The latter part of the book covers more advanced topics such as building a GDS pipeline for node classification model training and predicting future edges. The book concludes by teaching readers how to write their custom graph algorithm with the Pregel API.

Overall, "Graph Data Science with Neo4j" is a useful and practical guide for data scientists looking to learn about graph data science. The book covers a range of topics, from the basics to advanced techniques. The step-by-step instructions and real-world examples make the book easy to follow and understand.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online; this includes exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription with us, simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - and from there you will see the 'cancel subscription' button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage - subscription.packtpub.com - by clicking on the 'My Library' dropdown and selecting 'Credits'.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.