In this section, we will give a general introduction to graph theory. Moreover, in order to merge theoretical concepts with their practical implementation, we will enrich our explanation with code snippets in Python, using networkx
.
A simple undirected graph (or simply, a graph) G is defined as a couple G=(V,E) , where V={, .., } is a set of nodes (also called vertices) and E={{, .., {,}} is a set of two-sets (set of two elements) of edges (also called links), representing the connection between two nodes belonging to V.
It is important to underline that since each element of E is a two-set, there is no order between each edge. To provide more detail, {, and {, represent the same edge.
We now provide definitions for some basic properties of graphs and nodes, as follows:
- The order of a graph is the number of its vertices |V|. The size of a graph is the number of its edges |E|.
- The degree of a vertex is the number of edges that are adjacent to it. The neighbors of a vertex v in a graph G is a subset of vertex induced by all vertices adjacent to v.
- The neighborhood graph (also known as an ego graph) of a vertex v in a graph G is a subgraph of G, composed of the vertices adjacent to v and all edges connecting vertices adjacent to v.
An example of what a graph looks like can be seen in the following screenshot:
Figure 1.1 – Example of a graph
According to this representation, since there is no direction, an edge from Milan to Paris is equal to an edge from Paris to Milan. Thus, it is possible to move in the two directions without any constraint. If we analyze the properties of the graph depicted in Figure 1.1, we can see that it has order and size equal to 4
(there are, in total, four vertices and four edges). The Paris and Dublin vertices have degree 2
, Milan has degree 3
, and Rome has degree 1
. The neighbors for each node are shown in the following list:
Paris
= {Milan
, Dublin
}
Milan
= {Paris
, Dublin
, Rome
}
Dublin
= {Paris
, Milan
}
Rome
= {Milan
}
The same graph can be represented in networkx
, as follows:
import networkx as nx
G = nx.Graph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Paris'), ('Paris','Dublin'), ('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)
Since by default, the nx.Graph()
command generates an undirected graph, we do not need to specify both directions of each edge. In networkx
, nodes can be any hashable object: strings, classes, or even other networkx
graphs. Let's now compute some properties of the graph we previously generated.
All the nodes and edges of the graph can be obtained by running the following code:
print(f"V = {G.nodes}")
print(f"E = {G.edges}")
Here is the output of the previous commands:
V = ['Rome', 'Dublin', 'Milan', 'Paris']
E = [('Rome', 'Milan'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris')]
We can also compute the graph order, the graph size, and the degree and neighbors for each of the nodes, using the following commands:
print(f"Graph Order: {G.number_of_nodes()}")
print(f"Graph Size: {G.number_of_edges()}")
print(f"Degree for nodes: { {v: G.degree(v) for v in G.nodes} }")
print(f"Neighbors for nodes: { {v: list(G.neighbors(v)) for v in G.nodes} }")
The result will be the following:
Graph Order: 4
Graph Size: 4
Degree for nodes: {'Rome': 1, 'Paris': 2, 'Dublin':2, 'Milan': 3}
Neighbors for nodes: {'Rome': ['Milan'], 'Paris': ['Milan', 'Dublin'], 'Dublin': ['Milan', 'Paris'], 'Milan': ['Dublin', 'Paris', 'Rome']}
Finally, we can also compute an ego graph of a specific node for the graph G
, as follows:
ego_graph_milan = nx.ego_graph(G, "Milan")
print(f"Nodes: {ego_graph_milan.nodes}")
print(f"Edges: {ego_graph_milan.edges}")
The result will be the following:
Nodes: ['Paris', 'Milan', 'Dublin', 'Rome']
Edges: [('Paris', 'Milan'), ('Paris', 'Dublin'), ('Milan', 'Dublin'), ('Milan', 'Rome')]
The original graph can be also modified by adding new nodes and/or edges, as follows:
#Add new nodes and edges
new_nodes = {'London', 'Madrid'}
new_edges = [('London','Rome'), ('Madrid','Paris')]
G.add_nodes_from(new_nodes)
G.add_edges_from(new_edges)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")
This would output the following lines:
V = ['Rome', 'Dublin', 'Milan', 'Paris', 'London', 'Madrid']
E = [('Rome', 'Milan'), ('Rome', 'London'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris'), ('Paris', 'Madrid')]
Removal of nodes can be done by running the following code:
node_remove = {'London', 'Madrid'}
G.remove_nodes_from(node_remove)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")
This is the result of the preceding commands:
V = ['Rome', 'Dublin', 'Milan', 'Paris']
E = [('Rome', 'Milan'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris')]
As expected, all the edges that contain the removed nodes are automatically deleted from the edge list.
Also, edges can be removed by running the following code:
node_edges = [('Milan','Dublin'), ('Milan','Paris')]
G.remove_edges_from(node_edges)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")
The final result will be as follows:
V = ['Dublin', 'Paris', 'Milan', 'Rome']
E = [('Dublin', 'Paris'), ('Milan', 'Rome')]
The networkx
library also allows us to remove a single node or a single edge from a graph G
by using the following commands: G. remove_node('Dublin')
and G.remove_edge('Dublin', 'Paris')
.
Types of graphs
In the previous section, we described how to create and modify simple undirected graphs. Here, we will show how we can extend this basic data structure in order to encapsulate more information, thanks to the introduction of directed graphs (digraphs), weighted graphs, and multigraphs.
Digraphs
A digraph G is defined as a couple G=(V, E), where V={, .., } is a set of nodes and E={(, .., (,)} is a set of ordered couples representing the connection between two nodes belonging to V.
Since each element of E is an ordered couple, it enforces the direction of the connection. The edge , means the node goes into . This is different from , since it means the node goes to .The starting node is called the head, while the ending node is called the tail.
Due to the presence of edge direction, the definition of node degree needs to be extended.
Indegree and outdegree
For a vertex v, the number of head ends adjacent to v is called the indegree (indicated by of v, while the number of tail ends adjacent to v is its outdegree (indicated by ).
An example of what a digraph looks like is available in the following screenshot:
Figure 1.2 – Example of a digraph
The direction of the edge is visible from the arrow—for example, Milan -> Dublin means from Milan to Dublin. Dublin has = 2 and = 0, Paris has = 0 and = 2, Milan has = 1 and = 2, and Rome has = 1 and = 0.
The same graph can be represented in networkx
, as follows:
G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Paris','Milan'), ('Paris','Dublin'), ('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)
The definition is the same as that used for simple undirected graphs; the only difference is in the networkx
classes that are used to instantiate the object. For digraphs, the nx.DiGraph()
class is used.
Indegree
and Outdegree
can be computed using the following commands:
print(f"Indegree for nodes: { {v: G.in_degree(v) for v in G.nodes} }")
print(f"Outdegree for nodes: { {v: G.out_degree(v) for v in G.nodes} }")
The results will be as follows:
Indegree for nodes: {'Rome': 1, 'Paris': 0, 'Dublin': 2, 'Milan': 1}
Outdegree for nodes: {'Rome': 0, 'Paris': 2, 'Dublin': 0, 'Milan': 2}
As for the undirected graphs, G.add_nodes_from()
, G.add_edges_from()
, G.remove_nodes_from()
, and G.remove_edges_from()
functions can be used to modify a given graph G
.
Multigraph
We will now introduce the multigraph object, which is a generalization of the graph definition that allows multiple edges to have the same pair of start and end nodes.
A multigraph G is defined as G=(V, E), where V is a set of nodes and E is a multi-set (a set allowing multiple instances for each of its elements) of edges.
A multigraph is called a directed multigraph if E is a multi-set of ordered couples; otherwise, if E is a multi-set of two-sets, then it is called an undirected multigraph.
An example of a directed multigraph is available in the following screenshot:
Figure 1.3 – Example of a multigraph
In the following code snippet, we show how to use networkx
in order to create a directed or an undirected multigraph:
directed_multi_graph = nx.MultiDiGraph()
undirected_multi_graph = nx.MultiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Dublin'), ('Paris','Milan'), ('Paris','Dublin'), ('Milan','Rome'), ('Milan','Rome')]
directed_multi_graph.add_nodes_from(V)
undirected_multi_graph.add_nodes_from(V)
directed_multi_graph.add_edges_from(E)
undirected_multi_graph.add_edges_from(E)
The only difference between a directed and an undirected multigraph is in the first two lines, where two different objects are created: nx.MultiDiGraph()
is used to create a directed multigraph, while nx.MultiGraph()
is used to build an undirected multigraph. The function used to add nodes and edges is the same for both objects.
Weighted graphs
We will now introduce directed, undirected, and multi-weighted graphs.
An edge-weighted graph (or simply, a weighted graph) G is defined as G=(V, E ,w) where V is a set of nodes, E is a set of edges, and is the weighted function that assigns at each edge a weight expressed as a real number.
A node-weighted graph G is defined as G=(V, E ,w) ,where V is a set of nodes, E is a set of edges, and is the weighted function that assigns at each node a weight expressed as a real number.
Please keep the following points in mind:
- If E is a set of ordered couples, then we call it a directed weighted graph.
- If E is a set of two-sets, then we call it an undirected weighted graph.
- If E is a multi-set, we will call it a weighted multigraph (directed weighted multigraph).
- If E is a multi-set of ordered couples, it is an undirected weighted multigraph.
An example of a directed edge-weighted graph is available in the following screenshot:
Figure 1.4 – Example of a directed edge-weighted graph
From Figure 1.4, it is easy to see how the presence of weights on graphs helps to add useful information to the data structures. Indeed, we can imagine the edge weight as a "cost" to reach a node from another node. For example, reaching Dublin from Milan has a "cost" of 19
, while reaching Dublin from Paris has a "cost" of 11
.
In networkx
, a directed weighted graph can be generated as follows:
G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin', 19), ('Paris','Milan', 8), ('Paris','Dublin', 11), ('Milan','Rome', 5)]
G.add_nodes_from(V)
G.add_weighted_edges_from(E)
Bipartite graphs
We will now introduce another type of graph that will be used in this section: multipartite graphs. Bi- and tripartite graphs—and, more generally, kth-partite graphs—are graphs whose vertices can be partitioned in two, three, or more k-th sets of nodes, respectively. Edges are only allowed across different sets and are not allowed within nodes belonging to the same set. In most cases, nodes belonging to different sets are also characterized by particular node types. In Chapters 7, Text Analytics and Natural Language Processing Using Graphs, and Chapter 8, Graphs Analysis for Credit Cards Transaction, we will deal with some practical examples of graph-based applications and you will see how multipartite graphs can indeed arise in several contexts—for example, in the following scenarios:
- When processing documents and structuring the information in a bipartite graph of documents and entities that appear in the documents
- When dealing with transactional data, in order to encode the relations between the buyers and the merchants
A bipartite graph can be easily created in networkx
with the following code:
import pandas as pd
import numpy as np
n_nodes = 10
n_edges = 12
bottom_nodes = [ith for ith in range(n_nodes) if ith % 2 ==0]
top_nodes = [ith for ith in range(n_nodes) if ith % 2 ==1]
iter_edges = zip(
np.random.choice(bottom_nodes, n_edges),
np.random.choice(top_nodes, n_edges))
edges = pd.DataFrame([
{"source": a, "target": b} for a, b in iter_edges])
B = nx.Graph()
B.add_nodes_from(bottom_nodes, bipartite=0)
B.add_nodes_from(top_nodes, bipartite=1)
B.add_edges_from([tuple(x) for x in edges.values])
The network can also be conveniently plotted using the bipartite_layout
utility function of networkx
, as illustrated in the following code snippet:
from networkx.drawing.layout import bipartite_layout
pos = bipartite_layout(B, bottom_nodes)
nx.draw_networkx(B, pos=pos)
The bipatite_layout
function produces a graph, as shown in the following screenshot:
Figure 1.5 – Example of a bipartite graph
Graph representations
As described in the previous sections, with networkx
, we can actually define and manipulate a graph by using node and edge objects. In different use cases, such a representation would not be as easy to handle. In this section, we will show two ways to perform a compact representation of a graph data structure—namely, an adjacency matrix and an edge list.
Adjacency matrix
The adjacency matrix M of a graph G=(V,E) is a square matrix (|V| × |V|) matrix such that its element is 1 when there is an edge from node i to node j, and 0 when there is no edge. In the following screenshot, we show a simple example where the adjacency matrix of different types of graphs is displayed:
Figure 1.6 – Adjacency matrix for an undirected graph, a digraph, a multigraph, and a weighted graph
It is easy to see that adjacency matrices for undirected graphs are always symmetric, since no direction is defined for the edge. The symmetry instead is not guaranteed for the adjacency matrix of a digraph due to the presence of constraints in the direction of the edges. For a multigraph, we can instead have values greater than 1 since multiple edges can be used to connect the same couple of nodes. For a weighted graph, the value in a specific cell is equal to the weight of the edge connecting the two nodes.
In networkx
, the adjacency matrix for a given graph can be computed in two different ways. If G
is the networkx
of Figure 1.6, we can compute its adjacency matrix as follows:
nx.to_pandas_adjacency(G) #adjacency matrix as pd DataFrame
nt.to_numpy_matrix(G) #adjacency matrix as numpy matrix
For the first and second line, we get the following results respectively:
Rome Dublin Milan Paris
Rome 0.0 0.0 0.0 0.0
Dublin 0.0 0.0 0.0 0.0
Milan 1.0 1.0 0.0 0.0
Paris 0.0 1.0 1.0 0.0
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 1. 0. 0.]
[0. 1. 1. 0.]]
Since a numpy
matrix cannot represent the name of the nodes, the order of the element in the adjacency matrix is the one defined in the G.nodes
list.
Edge list
As well as an adjacency matrix, an edge list is another compact way to represent graphs. The idea behind this format is to represent a graph as a list of edges.
The edge list L of a graph G=(V,E) is a list of size |E| matrix such that its element is a couple representing the tail and the end node of the edge i. An example of the edge list for each type of graph is available in the following screenshot:
Figure 1.7 – Edge list for an undirected graph, a digraph, a multigraph, and a weighted graph
In the following code snippet, we show how to compute in networkx
the edge list of the simple undirected graph G available in Figure 1.7:
print(nx.to_pandas_edgelist(G))
By running the preceding command, we get the following result:
source target
0 Milan Dublin
1 Milan Rome
2 Paris Milan
3 Paris Dublin
Other representation methods, which we will not discuss in detail, are also available in networkx
. Some examples are nx.to_dict_of_dicts(G)
and nx.to_numpy_array(G)
, among others.