Graph Data Modeling in Python

Introducing Graphs in the Real World

Social network analysis, fraud detection, modeling the stability of systems (for example, rail and energy grids), and recommendation systems all rely on graphs. In each of these examples, the relationships between individual people, bank accounts, or other single units are fundamental to describing and modeling the data. Compared with traditional data models, a graph is a natural way to represent groups of interacting elements.

This chapter introduces why graphs are important and the fundamentals of what makes up a graph network. We will also look at how to move from traditional data storage strategies, such as relational databases (RDBs), to graph databases (GDBs). Throughout this book, we will be working with a popular graph database, namely Neo4j. This will be followed by an explanation of how graphs are utilized in the real world and then a gentle introduction to the two workhorse packages, igraph and NetworkX, which are widely used, stable Python packages for graph data analysis and modeling.

In this chapter, we’re going to cover the following topics:

Why should you use graphs?
The fundamentals of nodes and edges and the properties of a graph
Comparing RDBs and GDBs
The use of graphs across various industries
Introduction to NetworkX and igraph

Why should you use graphs?

In modern, data-driven enterprises, graph data structures are becoming more and more common. This is because relationships between things are becoming as important as, if not more important than, the things themselves, and graphs are a powerful way to understand how entities are connected. We will demonstrate examples of real-life graphs in our use cases in the following chapters, with detailed instructions on how to build these networks and the core considerations you need to make for the graph design.

Graphs are fundamental to many systems we use every day. Each time you are online and receive a product recommendation, it is likely that a graph solution is powering this recommendation. This is why learning how to work with graph data and leveraging these types of networks is a fast-growing and key skill in data science.

Composite components of a graph

Networks are a tool to represent complex systems and the complex nature of the connections arising in today’s data. We have already mentioned how graphs power some of the major recommendation systems in use today.

Graph methods tend to fall into four different areas:

Movement: Movement is concerned with how things travel (move) through a network. These types of graphs are the drivers behind routing and GPS solutions and are utilized by the biggest players in finding the optimal path across a road network.
Influence: On social media, this area specifies who the known influencers are and how they propagate this influence across a network.
Groups and interactions: This area involves identifying groups and how actors in the network interact with each other. We will look at an example of how to apply community detection methods to find these communities through the node (the actor involved) and its connections (the edges). Don’t worry if you don’t know what these terms are for now; we will focus on these in the Fundamentals of nodes, edges, and the properties of a graph section.
Pattern detection: Pattern detection involves using a graph to find similarities in the network that can be explored. We must look at this from the actor’s (node’s) point of view and find similarities between that actor and other actors in the network. Here, actor is taken to mean person, author profile, and so on.

In this section, we have explained the core components of a graph by providing simple working definitions. In the following section, we will delve deeper into these fundamental elements, which make up every graph you will come across in the industry. We will look at nodes, edges, and the various properties of a graph.

The fundamentals of nodes and edges and the properties of a graph

Graphs, or networks, are particularly powerful data structures for modeling and describing relationships between things, whether these things are people, products, or machines. In a graph, these things are represented by nodes (sometimes known as vertices). Nodes are connected by edges (sometimes referred to as relationships). An edge represents a relationship between two things, indicating that they are linked in some way.

The following sections will look at the structures and types of graphs. First, we will start with undirected graphs before moving on to directed graphs. After that, we will look at node properties, then delve into heterogeneous graphs, and end by looking at schema design considerations.

Undirected graphs

To illustrate, a simple example is that of a real-life social network. In the following example, Jeremy and Mark are each represented by a node. Jeremy is friends with Mark, and the friend relationship is represented by an edge connecting the two nodes. The following diagram shows an undirected graph:

Figure 1.1 – Two friend nodes are linked together with a single edge
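As a minimal sketch of the idea in Figure 1.1, an undirected graph can be held in plain Python as a mapping from each node to the set of its neighbours. The helper name add_friendship is an assumption made for illustration and does not come from any library; the point is only that an undirected edge is readable from both ends.

```python
from collections import defaultdict

# Undirected graph stored as an adjacency mapping: node -> set of neighbours.
friends = defaultdict(set)


def add_friendship(graph, a, b):
    """Record a mutual (undirected) edge by storing it in both directions."""
    graph[a].add(b)
    graph[b].add(a)


add_friendship(friends, "Jeremy", "Mark")

# The relationship can be read from either end of the edge.
jeremy_knows_mark = "Mark" in friends["Jeremy"]
mark_knows_jeremy = "Jeremy" in friends["Mark"]
print(jeremy_knows_mark, mark_knows_jeremy)
```

Later in this chapter, the same structure is built with NetworkX and igraph rather than by hand.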

However, not all social networks are the same, and in some online social media platforms, relationships between users of a social network may not be mutual.

For example, on Twitter, one user can follow another, but this doesn’t mean the inverse must be true. On Twitter, Jeremy may follow Mark, but Mark may not follow Jeremy.

Directed graphs

Here, a directional edge is used to show that Jeremy follows Mark, while the absence of an edge in the reverse direction shows that Mark does not follow Jeremy in return:

Figure 1.2 – Two friend nodes are linked together with a single edge

This type of graph is known as a directed graph. For reference, sometimes, undirected edges like those in Figure 1.1 are shown as bidirectional edges, pointing to both nodes. This bidirectional representation is equivalent to an undirected edge in most senses and represents a mutual relationship.

When creating a data model with directional edges, naming relationships appropriately becomes important. In our Twitter example, if the edge representing the interaction between Mark and Jeremy is named follows, then the edge goes from the Jeremy node to the Mark node.

On the other hand, if the edge represents a concept such as followed by, then this should be in the other direction – that is, from Mark to Jeremy. This has particularly strong implications for some more complex graph modeling and use cases, which we will cover in Chapter 2, Working with Graph Data Models.
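The effect of edge naming on direction can be sketched in plain Python by storing each directed edge as a (source, target) pair. The helper names reverse_edges and has_edge are hypothetical, chosen for this illustration only.

```python
# Directed edges as (source, target) pairs; the direction carries the meaning.
follows = [("Jeremy", "Mark")]  # Jeremy follows Mark


def reverse_edges(edges):
    """Flip every edge, turning a 'follows' edge list into 'followed by'."""
    return [(target, source) for source, target in edges]


def has_edge(edges, source, target):
    """Check whether a directed edge exists from source to target."""
    return (source, target) in edges


followed_by = reverse_edges(follows)  # Mark is followed by Jeremy

print(has_edge(follows, "Jeremy", "Mark"))  # the edge exists in this direction
print(has_edge(follows, "Mark", "Jeremy"))  # but not in the reverse direction
print(followed_by)
```

Both edge lists describe the same fact; only the chosen relationship name decides which node is the source.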

Node properties

While nodes and edges (directional or not) are the basic building blocks of a graph, they are often not sufficient to fully describe a dataset. Nodes can have data associated with them that may not be relational, so it would not be expressed as a relationship with another node.

In these cases, to represent data associated with nodes, we can use node properties (sometimes known as node attributes). Similarly, where an edge has additional information associated with it, in addition to representing a relationship, edge properties can be used to hold that data.

The following diagram shows a black line, indicating that Jeremy follows Mark but that Mark does not follow Jeremy – therefore, the black line indicates directionality:

Figure 1.3 – Two friend nodes are linked together with a single edge

In the preceding model, node properties are used to describe the number of followers Mark and Jeremy have, as well as the locations listed in their Twitter bios. This kind of additional node information is particularly important for querying graph data when asking questions that involve filtering.

In our Twitter example, properties would need to be present in the graph if, for example, we wanted to know who followed users with above 1,000 followers. We will revisit answering graphical questions using nodes, edges, and properties in later chapters.
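The filtering question above can be sketched in plain Python, assuming the follower counts used for Jeremy and Mark later in this chapter (130 and 2,100). The function name followers_of_popular_users is hypothetical.

```python
# Node properties: one dictionary of attributes per node.
nodes = {
    "Jeremy": {"followers": 130},
    "Mark": {"followers": 2100},
}
# Directed edges: (source, target) means "source follows target".
follows = [("Jeremy", "Mark")]


def followers_of_popular_users(nodes, edges, threshold=1000):
    """Return users who follow at least one account with more than `threshold` followers."""
    return sorted({src for src, dst in edges if nodes[dst]["followers"] > threshold})


result = followers_of_popular_users(nodes, follows)
print(result)
```

Without the followers property on each node, the edge list alone could not answer this question.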

Depending on the dataset, there may be cases where different nodes have different sets of properties. In this case, it is common to have several distinct types of nodes in the same graph.

Heterogeneous graphs

Node types can also be referred to as layers, or nodes with different labels, though for this book, they will be known simply as types.

The following diagram shows the nodes representing Jeremy and Mark as people, where each node type has different properties, and there are multiple relationship types. Because it mixes several node and relationship types, we can describe this graph as heterogeneous:

Figure 1.4 – Example of a heterogeneous Twitter graph

Now, we have added nodes representing Mark and Jeremy as people, relationships representing their relationship outside of Twitter, and their ownership of their respective accounts. Note that since we have increased the number of node types, we also need new edge types to refer to the different interactions between different types of nodes.

Graphs with multiple node types are known as heterogeneous, multilayer, or multilevel graphs, though going forward we will use the term heterogeneous to refer to graphs with multiple types of nodes. In contrast, graphs with only one node type, as in the previous examples, are referred to as homogeneous graphs.
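A heterogeneous graph of this kind might be sketched as follows, with a type stored on every node and a relationship type stored on every edge. The node identifiers and the relationship names OWNS, FOLLOWS, and FRIENDS_WITH are assumptions made for illustration; the figure may label them differently.

```python
# Each node carries a type; each edge carries a relationship type.
nodes = {
    "person:Jeremy": {"type": "Person", "name": "Jeremy"},
    "person:Mark": {"type": "Person", "name": "Mark"},
    "account:Jeremy": {"type": "TwitterUser", "followers": 130},
    "account:Mark": {"type": "TwitterUser", "followers": 2100},
}
edges = [
    ("person:Jeremy", "OWNS", "account:Jeremy"),
    ("person:Mark", "OWNS", "account:Mark"),
    ("account:Jeremy", "FOLLOWS", "account:Mark"),
    ("person:Jeremy", "FRIENDS_WITH", "person:Mark"),
]

node_types = {attrs["type"] for attrs in nodes.values()}
edge_types = {rel for _, rel, _ in edges}

# A graph with more than one node type is heterogeneous.
is_heterogeneous = len(node_types) > 1
print(sorted(node_types), sorted(edge_types), is_heterogeneous)
```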

Schema design

At this point, it is reasonable to ask the question: What features of a dataset should be nodes, edges, and properties?

For any given dataset, there are multiple ways to represent data as a graph, and each is more suited to different purposes. Herein lies the trick to good graph modeling: a question or use case-driven schema design.

If we were particularly interested in the locations of Twitter users in our network, then we could move the location node property on the Twitter user nodes to a separate node type and create the LOCATED_IN relationship type. This is shown in the following diagram:

Figure 1.5 – The same graph but with the location property moved from a node property to a node type

We could even go one step further to represent the information we know about these locations, adding the country related to each location as a separate, abstracted node.
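The refactoring of a property into its own node type can be sketched as a small function. This is an illustration under stated assumptions: the function name promote_property is hypothetical, and the locations "Bristol" and "London" are invented sample values, since the chapter does not state where Jeremy and Mark are located.

```python
def promote_property(nodes, edges, prop, rel_type, new_type):
    """Move `prop` off each node into its own node, linked by `rel_type`."""
    # Copy every node without the promoted property.
    new_nodes = {
        name: {k: v for k, v in attrs.items() if k != prop}
        for name, attrs in nodes.items()
    }
    new_edges = list(edges)
    for name, attrs in nodes.items():
        if prop in attrs:
            value = attrs[prop]
            # One node per distinct property value, shared by all users.
            new_nodes.setdefault(value, {"type": new_type})
            new_edges.append((name, rel_type, value))
    return new_nodes, new_edges


# Hypothetical sample data: the locations are invented for this sketch.
nodes = {
    "Jeremy": {"type": "TwitterUser", "followers": 130, "location": "Bristol"},
    "Mark": {"type": "TwitterUser", "followers": 2100, "location": "London"},
}
edges = [("Jeremy", "FOLLOWS", "Mark")]

new_nodes, new_edges = promote_property(
    nodes, edges, "location", "LOCATED_IN", "Location"
)
print(new_nodes)
print(new_edges)
```

Once locations are nodes, a question such as "which users share a location?" becomes a traversal over LOCATED_IN edges rather than a comparison of property values.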

This graph structure models the same data in a different way, which may be more or less suitable or performant for particular use cases. We will explore the effects of schema design on the types of questions that can be asked, and performance, in later chapters.

In the next section, we will compare graph data structures with traditional RDBs. This will expand on why GDBs can be more performant when the problem is naturally modeled as a graph.

Comparing RDBs and GDBs

RDBs have been a standard for data storage and data analysis across most industries for a very long time. Their strength lies in being able to hold multiple tables of different information, where some table fields are found across tables, enabling data linkage.

With this data linkage, complex questions can be asked of data in an RDB. However, there are drawbacks to this relational structure. While RDBs are useful for returning a large number of rows that meet particular criteria, they are not suited to questions involving many chained relationships.

To illustrate this, consider a standard database containing train services and their station stops, alongside a graph that might represent the same information:

Figure 1.6 – Relational data structure of trains and their stops

In an RDB structure, it would not be difficult to retrieve all trains that service a particular stop. On the other hand, returning a series of trains that can be taken between two chosen stations may be a slow operation.

Consider the steps needed in a traditional RDB to find the route between Truro and Glasgow Central in the preceding table. Starting at Truro, we would know the GW1426 train service stops at Truro, Liskeard, and Plymouth. Knowing that these stations can be reached from Truro, we would then need to find what train services stop at each of these stations to find our route.

Upon finding that Plymouth station is reachable and that a separate service runs to many more stations, we would need to repeat this process over and over until Glasgow Central is reached.

These steps essentially result in a series of computationally costly join operations on the same table, where one resulting row would give us the path between our stations of interest.
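The repeated self-joins can be sketched with Python’s built-in sqlite3 module. The table layout (a single stops table with train and station columns) is an assumption about the data in Figure 1.6, and the second service, XC0001, is a hypothetical train invented here to connect Plymouth to Glasgow Central.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (train TEXT, station TEXT)")
conn.executemany(
    "INSERT INTO stops VALUES (?, ?)",
    [
        ("GW1426", "Truro"),
        ("GW1426", "Liskeard"),
        ("GW1426", "Plymouth"),
        # Hypothetical second service, invented for this sketch.
        ("XC0001", "Plymouth"),
        ("XC0001", "Glasgow Central"),
    ],
)

# A route with a single change of train already needs the table
# joined to itself three times (four copies of stops in total).
query = """
SELECT a.train, b.station, c.train
FROM stops AS a
JOIN stops AS b ON b.train = a.train
JOIN stops AS c ON c.station = b.station AND c.train <> a.train
JOIN stops AS d ON d.train = c.train
WHERE a.station = ? AND d.station = ?
"""
routes = conn.execute(query, ("Truro", "Glasgow Central")).fetchall()
print(routes)  # (first train, interchange station, second train)
conn.close()
```

Every additional change of train adds further self-joins to the query, which is why long routes become expensive in this representation.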

GDBs to the rescue

Using a graph structure to represent this train network puts greater emphasis on relationships between our data points, as illustrated in the following diagram:

Figure 1.7 – Graph data structure of trains and their stops

Starting from Truro station, as in the RDB example, we find the train that services that station. However, when traversing the graph to find a possible route between Truro and Glasgow Central, at each station or train node we are considering fewer data points, and therefore fewer options.

This is in contrast to the RDB example, where repeated table joins are necessary to return a path. In this case, the complexity of the operations required over the graph representation is lower, which equates to a faster, more efficient method. Among many other use cases, those that require some sort of pathfinding often benefit from a graph data model.
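A graph traversal for the same question can be sketched with a breadth-first search over an adjacency mapping. This is a plain-Python illustration rather than how a GDB executes queries internally; the function name shortest_path is hypothetical, and the XC0001 service is an invented train linking Plymouth to Glasgow Central.

```python
from collections import deque

# Station nodes are linked to the train nodes that call at them.
edges = [
    ("Truro", "GW1426"),
    ("Liskeard", "GW1426"),
    ("Plymouth", "GW1426"),
    ("Plymouth", "XC0001"),         # hypothetical service
    ("Glasgow Central", "XC0001"),  # hypothetical service
]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node sequence, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Only the neighbours of the current node are considered at each step.
        for neighbour in sorted(graph[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


route = shortest_path(adjacency, "Truro", "Glasgow Central")
print(route)
```

At each step, the search only inspects the handful of nodes adjacent to the current one, rather than scanning or joining a whole table.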

In addition to being more suitable for specific types of queries, graphs are typically useful where a flexible, evolving data model is needed. Again, using the example of the train network, imagine that, as the database administrator, you have received a request to add bus transport links to the data model.

With an RDB, a new table would be required, since several bus services would likely serve each train station. In this new table, the names of each station would need to be duplicated from the existing table, to list alongside their related bus services.

Not only does this duplication increase the size of data stored, but it also increases the complexity of the database schema:

Figure 1.8 – Adding a new data type (buses) to the train station graph

Where the train station data is represented with a graph, the new information on buses can be added directly to the existing database as a new node type.

There is no requirement for a new table, and no need to duplicate each station node to represent the required information; the existing train station nodes can be directly linked to new Bus nodes. This holds for any new data type that would require the addition of a new table in a traditional RDB.

Where new data would be represented in an equivalent RDB as a new column in an existing table, it may be a good candidate for a node property in the graph, as opposed to a new node type.

Here, an example suitable for being represented as a node property would be a code for each train station, where stations and their codes have a 1-to-1 relationship.
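The two kinds of schema change can be sketched side by side in plain Python. The relationship names STOPS_AT and SERVES, the bus service "Bus 12", and the station code values are assumptions made for illustration only.

```python
graph = {
    "nodes": {
        "Truro": {"type": "Station"},
        "Plymouth": {"type": "Station"},
        "GW1426": {"type": "Train"},
    },
    "edges": [
        ("GW1426", "STOPS_AT", "Truro"),
        ("GW1426", "STOPS_AT", "Plymouth"),
    ],
}

# New column in an RDB -> node property (1-to-1 with each station).
# The code values are illustrative.
graph["nodes"]["Truro"]["code"] = "TRU"
graph["nodes"]["Plymouth"]["code"] = "PLY"

# New table in an RDB -> new node type, linked to an existing station node.
graph["nodes"]["Bus 12"] = {"type": "Bus"}  # hypothetical bus service
graph["edges"].append(("Bus 12", "SERVES", "Plymouth"))

# No station node was duplicated to accommodate the new data.
station_count = sum(
    1 for attrs in graph["nodes"].values() if attrs["type"] == "Station"
)
print(station_count)
```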

A comparison, in short, is captured in the following:

RDBs have a rigid data format and a new table must be added for a new type of data. GDBs are more flexible when it comes to the format of the data and can be extended with new node types.
RDBs can answer path-based queries (for example, how many steps separate two people in a friend network), but doing so involves multiple joins and can be extremely slow as the paths become longer. GDBs traverse paths directly, with no join operations, so information retrieval is more streamlined and typically faster.

In summary, where the use case for a database concerns querying many relationships between objects – that is, paths – or when a flexible data schema is needed, a graph data model is likely to be a good fit to represent your data.

Introduction to NetworkX and igraph

In this chapter, we will introduce two Python packages for creating in-memory graphs: NetworkX and igraph.

NetworkX lets you create and manipulate graphs, and study and visualize their structures. Its website (https://networkx.org/) contains details of the major changes to the package and the intended usage of the tool.

igraph contains a suite of useful and practical analysis tools, with the aim being to make these efficient and easy to use, in a reproducible way. What is great about igraph is that it is open source and free, and it is available for R, Python, Mathematica, and C/C++. This is our recommended package for large networks, which it can load much more quickly than NetworkX. To read more about igraph, go to https://igraph.org/.

In the following subsections, we will look at the basics of both NetworkX and igraph, with easy-to-follow coding steps. This is the first time you are going to get your hands dirty with graph data modeling.

NetworkX basics

NetworkX is one of the earliest graph libraries available for Python and is particularly focused on being user-friendly and Pythonic in its style. It also natively includes methods for calculating some classic network analysis measures:

To import NetworkX into Python, use the following command:
```
import networkx as nx
```
And to create an empty graph, g, use the following command:
```
g = nx.Graph()
```
Now, we need to add nodes to our graph, which can be done using methods of the Graph object belonging to g. There are multiple ways to do this, with the simplest being adding one node at a time:
```
g.add_node("Jeremy")
```
Alternatively, multiple nodes can be added to the graph at once, like so:
```
g.add_nodes_from(["Mark", "Jeremy"])
```
Properties can be added to nodes during creation by passing a list of (node, dictionary) tuples to Graph.add_nodes_from:
```
g.add_nodes_from([("Mark", {"followers": 2100}), ("Jeremy", {"followers": 130})])
```
To add an edge to the graph, we can use the Graph.add_edge method, and reference the nodes already present in the graph:
```
g.add_edge("Jeremy", "Mark")
```

It is worth noting that, in NetworkX, when adding an edge, any nodes specified as part of that edge not already in the graph will be added implicitly.
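The implicit node creation described above can be checked with a short, self-contained sketch. Here, only Jeremy is added explicitly, so Mark is created by the add_edge call.

```python
import networkx as nx

g = nx.Graph()
g.add_node("Jeremy", followers=130)

# "Mark" is not in the graph yet; add_edge creates the node on the fly.
g.add_edge("Jeremy", "Mark")

node_names = sorted(g.nodes)
# A node created implicitly starts with no properties.
mark_attrs = dict(g.nodes["Mark"])
print(node_names, mark_attrs)
```

Implicit creation is convenient, but a mistyped node name in an edge will silently add an extra node, so it can be worth checking the node list after loading data.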

To confirm that our graph now contains nodes and edges, we may want to plot it, using matplotlib and networkx.draw(). The with_labels parameter adds the names of the nodes to the plot:
```
import matplotlib.pyplot as plt
nx.draw(g, with_labels=True)
plt.show()
```

This section showed you how you can get up and running with NetworkX in a couple of lines of Python code. In the next section, we will turn our focus to igraph, which allows us to perform calculations over larger datasets much more quickly than NetworkX.

igraph basics

NetworkX, while user-friendly, suffers from slow speeds when working with larger graphs. This is largely because it is implemented in pure Python, relying on compiled libraries such as NumPy and SciPy only for some of its algorithms.

In contrast, the core of igraph is implemented in C, giving the library an advantage when working with large graphs and complex network algorithms. While not as immediately accessible as NetworkX for beginners, igraph is a useful tool to have under your belt when code efficiency is paramount.

Initially, working with igraph is very similar to working with NetworkX. Let’s take a look:

To import igraph into Python, use the following command:
```
import igraph as ig
```
And to create an empty graph, g, use the following command:
```
g = ig.Graph()
```

In contrast to NetworkX, in igraph, all nodes have a prescribed internal integer ID. The first node that’s added has an ID of 0, with all subsequent nodes assigned increasing integer IDs.

Similar to NetworkX, changes can be made to a graph by using the methods of a Graph object. Nodes can be added to the graph with the Graph.add_vertices method (note that a vertex is another way to refer to a node). Two nodes can be added to the graph with the following code:
```
g.add_vertices(2)
```
This will add nodes 0 and 1 to the graph. To name them, we have to assign properties to the nodes. We can do this by accessing the vertices of the Graph object. Similar to how you would access elements of a list, each node’s properties can be accessed by using the following notation. Here, we are setting the name and followers attributes of nodes 0 and 1:
```
g.vs[0]["name"] = "Jeremy"
g.vs[1]["name"] = "Mark"
g.vs[0]["followers"] = 130
g.vs[1]["followers"] = 2100
```
Node properties can also be added listwise, where the first list element corresponds to node ID 0, the second to node ID 1, and so on. The following two lines are equivalent to the four lines shown above:
```
g.vs["name"] = ["Jeremy", "Mark"]
g.vs["followers"] = [130, 2100]
```
To add an edge, we can use the Graph.add_edges() method:
```
g.add_edges([(0, 1)])
```

Here, we are only adding one edge, but additional edges can be added to the list parameter required by add_edges. Unlike NetworkX, igraph does not create nodes implicitly when an edge refers to nodes that are not currently in the graph. Since igraph identifies nodes by sequential integer IDs, attempting to add the edge pair (1, 3) to a graph with two vertices will fail.

Summary

In this chapter, we looked at why you should start to think in graphs and why these methods are becoming widely utilized and discussed across various industries. We looked at what a graph is and explained the main areas that graph methods fall into: how things move through a network, influence (who is influencing whom on social media), groups and interactions, and pattern detection.

Moving on from that, we examined the fundamentals of what makes up a graph. Here, we looked at the fundamental elements of nodes, edges, and properties and delved into the difference between an undirected and directed graph. Additionally, we examined the properties of nodes, looked at heterogeneous graphs, and considered how to approach schema design for a graph.

This led to how GDBs compare to legacy RDBs and why, in many cases, graphs are much easier and faster to traverse and query, with examples of how graphs can be utilized to find a route across the stops on a train journey and how this can be extended, with ease, to add bus services as a new data source.

Following this, we looked at how graphs are being deployed across various industries and some use cases for why graphs are important in those industries, such as fraud detection in the finance sector, intelligence profiling in the government sector, patient journeys in hospitals, churn across networks, and customer segmentation in marketing. Graphs truly are becoming ubiquitous across various industries.

We wrapped up this chapter by providing an introduction to the powerhouses of graph analytics and network analysis – igraph and NetworkX. We showed you how, in a few lines of Python code, you can easily start to populate a graph.

In the next chapter, we will look at how to work with and create graph data models. The next chapter will contain many more hands-on examples of how to structure your data using graph data models in Python.
