Graph Machine Learning

Chapter 1: Getting Started with Graphs

Graphs are mathematical structures that are used for describing relations between entities and are used almost everywhere. For example, social networks are graphs, where users are connected depending on whether one user "follows" the updates of another user. They can be used for representing maps, where cities are linked through streets. Graphs can describe biological structures, web pages, and even the progression of neurodegenerative diseases.

Graph theory, the study of graphs, has received major interest for years, leading people to develop algorithms, identify properties, and define mathematical models to better understand complex behaviors.

This chapter will review some of the concepts behind graph-structured data. Theoretical notions will be presented, together with examples to help you understand some of the more general concepts and put them into practice. In this chapter, we will introduce and use some of the most widely used libraries for the creation, manipulation, and study of the structure dynamics and functions of complex networks, specifically looking at the Python networkx library.

The following topics will be covered in this chapter:

Introduction to graphs with networkx
Plotting graphs
Graph properties
Benchmarks and repositories
Dealing with large graphs

Plotting graphs

As we have seen in previous sections, graphs are intuitive data structures represented graphically. Nodes can be plotted as simple circles, while edges are lines connecting two nodes.

Despite their simplicity, it could be quite difficult to make a clear representation when the number of edges and nodes increases. The source of this complexity is mainly related to the position (space/Cartesian coordinates) to assign to each node in the final plot. Indeed, it could be unfeasible to manually assign to a graph with hundreds of nodes the specific position of each node in the final plot.

In this section, we will see how we can plot graphs without specifying coordinates for each node. We will exploit two different solutions: networkx and Gephi.

networkx

networkx offers a simple interface to plot graph objects through the nx.draw library. In the following code snippet, we show how to use the library in order to plot graphs:

def draw_graph(G, nodes_position, weight):
      nx.draw(G, pos_ position, with_labels=True, font_size=15, node_size=400, edge_color='gray', arrowsize=30)
             if plot_weight:
             edge_labels=nx.get_edge_attributes(G,'weight')
         nx.draw_networkx_edge_labels(G, pos_ position, edge_labels=edge_labels)

Here, nodes_position is a dictionary where the keys are the nodes and the value assigned to each key is an array of length 2, with the Cartesian coordinate used for plotting the specific node.

The nx.draw function will plot the whole graph by putting its nodes in the given positions. The with_labels option will plot its name on top of each node with the specific font_size value. node_size and edge_color will respectively specify the size of the circle, representing the node and the color of the edges. Finally, arrowsize will define the size of the arrow for directed edges. This option will be used when the graph to be plotted is a digraph.

In the following code example, we show how to use the draw_graph function previously defined in order to plot a graph:

G = nx.Graph()
V = {'Paris', 'Dublin','Milan', 'Rome'}
E = [('Paris','Dublin', 11), ('Paris','Milan', 8),
     ('Milan','Rome', 5), ('Milan','Dublin', 19)]
G.add_nodes_from(V)
G.add_weighted_edges_from(E)
node_position = {"Paris": [0,0], "Dublin": [0,1], "Milan": [1,0], "Rome": [1,1]}
draw_graph(G, node_position, True)

The result of the plot is available to view in the following screenshot:

Figure 1.8 – Result of the plotting function

The method previously described is simple but unfeasible to use in a real scenario since the node_position value could be difficult to decide. In order to solve this issue, networkx offers a different function to automatically compute the position of each node according to different layouts. In Figure 1.9, we show a series of plots of an undirected graph, obtained using the different layouts available in networkx. In order to use them in the function we proposed, we simply need to assign node_position to the result of the layout we want to use—for example, node_position = nx.circular_layout(G). The plots can be seen in the following screenshot:

Figure 1.9 – Plots of the same undirected graph with different layouts

networkx is a great tool for easily manipulating and analyzing graphs, but it does not offer good functionalities in order to perform complex and good-looking plots of graphs. In the next section, we will investigate another tool to perform complex graph visualization: Gephi.

Gephi

In this section, we will show how Gephi (an open source network analysis and visualization software) can be used for performing complex, fancy plots of graphs. For all the examples showed in this section, we will use the Les Miserables.gexf sample (a weighted undirected graph), which can be selected in the Welcome window when the application starts.

The main interface of Gephi is shown in Figure 1.10. It can be divided into four main areas, as follows:

Graph: This section shows the final plot of the graph. The image is automatically updated each time a filter or a specific layout is applied.
Appearance: Here, it is possible to specify the appearance of nodes and edges.
Layout: In this section, it is possible to select the layout (as in networkx) to adjust the node position in the graph. Different algorithms, from a simple random position generator to a more complex Yifan Hu algorithm, are available.
Filters & Statistics: In this set area, two main functions are available, outlined as follows:
a. Filters: In this tab, it is possible to filter and visualize specific subregions of the graph according to a set property computed using the Statistics tab.
b. Statistics: This tab contains a list of available graph metrics that can be computed on the graph using the Run button. Once metrics are computed, they can be used as properties to specify the edges' and nodes' appearance (such as node and edge size and color) or to filter a specific subregion of the graph.
You can see the main interface of Gephi in the following screenshot:

Figure 1.10 – Gephi main window

Our exploration of Gephi starts with the application of different layouts to the graph. As previously described, in networkx the layouts allow us to assign to each node a specific position in the final plot. In Gephi 1.2, different layouts are available. In order to apply a specific layout, we have to select from the Layout area one of the available layouts, and then click on the Run button that appears after the selection.

The graph representation, visible in the Graph area, will be automatically updated according to the new coordinates defined by the layout. It should be noted that some layouts are parametric, hence the final graph plot can significantly change according to the parameters used. In the following screenshot, we propose several examples for the application of three different layouts:

Figure 1.11 – Plot of the same graph with different layout

We will now introduce the available options in the Appearance menu visible in Figure 1.10. In this section, it is possible to specify the style to be applied to edges and nodes. The style to be applied can be static or can be dynamically defined by specific properties of the nodes/edges. We can change the color and the size of the nodes by selecting the Nodes option in the menu.

In order to change the color, we have to select the color palette icon and decide, using the specific button, if we want to assign a Unique color, a Partition (discrete values), or a Ranking (range of values) of colors. For Partition and Ranking, it is possible to select from the drop-down menu a specific Graph property to use as reference for the color range. Only the properties computed by clicking Run in the Statistics area are available in the drop-down menu. The same procedure can be used in order to set the size of the nodes. By selecting the concentric circles icon, it is possible to set a Unique size to all the nodes or to specify a Ranking of size according to a specific property.

As for the nodes, it is also possible to change the style of the edges by selecting the Edges option in the menu. We can then select to assign a Unique color, a Partition (discrete values), or a Ranking (range of values) of colors. For Partition and Ranking, the reference value to build the color scale is defined by a specific Graph property that can be selected from the drop-down menu.

It is important to remember that in order to apply a specific style to the graph, the Apply button should be clicked. As a result, the graph plot will be updated according to the style defined. In the following screenshot, we show an example where the color of the nodes is given by the Modularity Class value and the size of each node is given by its degree, while the color of each edge is defined by the edge weight:

Figure 1.12 – Example of graph plot changing nodes' and edges' appearance

Another important section that needs to be described is Filters & Statistics. In this menu, it is possible to compute some statistics based on graph metrics.

Finally, we conclude our discussion on Gephi by introducing the functionalities available in the Statistics menu, visible in the right panel in Figure 1.10. Through this menu, it is possible to compute different statistics on the input graph. Those statistics can be easily used to set some properties of the final plot, such as nodes'/edges' color and size, or to filter the original graph to plot just a specific subset of it. In order to compute a specific statistic, the user then needs to explicitly select one of the metrics available in the menu and click on the Run button (Figure 1.10, right panel).

Moreover, the user can select a subregion of the graph, using the options available in the Filters tab of the Statistics menu, visible in the right panel in Figure 1.10. An example of filtering a graph can be seen in Figure 1.13. To provide more details of this, we build and apply to the graph a filter, using the Degree property. The result of the filters is a subset of the original graph, where only the nodes (and their edges) having the specific range of values for the degree property are visible.

This is illustrated in the following screenshot:

Figure 1.13 – Example of a graph filtered according to a range of values for Degree

Of course, Gephi allows us to perform more complex visualization tasks and contains a lot of functionalities that cannot be fully covered in this book. Some good references to better investigate all the features available in Gephi are the official Gephi guide (https://gephi.org/users/) or the Gephi Cookbook book by Packt Publishing.

Graph properties

As we have already learned, a graph is a mathematical model that is used for describing relations between entities. However, each complex network presents intrinsic properties. Such properties can be measured by particular metrics, and each measure may characterize one or several local and global aspects of the graph.

In a graph for a social network such as Twitter, for example, users (represented by the nodes of the graph) are connected to each other. However, there are users that are more connected than others (influencers). On the Reddit social graph, users with similar characteristics tend to group into communities.

We have already mentioned some of the basic features of graphs, such as the number of nodes and edges in a graph, which constitute the size of the graph itself. Those properties already provide a good description of the structure of a network. Think about the Facebook graph, for example: it can be described in terms of the number of nodes and edges. Such numbers easily allow it to be distinguished from a much smaller network (for example, the social structure of an office) but fail to characterize more complex dynamics (for example, how similar nodes are connected). To this end, more advanced graph-derived metrics can be considered, which can be grouped into four main categories, outlined as follows:

Integration metrics: These measure how nodes tend to be interconnected with each other.
Segregation metrics: These quantify the presence of groups of interconnected nodes, known as communities or modules, within a network.
Centrality metrics: These assess the importance of individual nodes inside a network.
Resilience metrics: These can be thought of as a measure of how much a network is able to maintain and adapt its operational performance when facing failures or other adverse conditions.

Those metrics are defined as global when expressing a measure of an overall network. On the other hand, local metrics measure values of individual network elements (nodes or edges). In weighted graphs, each property may or may not account for the edge weights, leading to weighted and unweighted metrics.

In the following section, we describe some of the most commonly used metrics that measure global and local properties. For simplicity, unless specified differently in the text, we illustrate the global unweighted version of the metric. In several cases, this is obtained by averaging the local unweighted properties of the node.

Integration metrics

In this section, some of the most frequently used integration metrics will be described.

Distance, path, and shortest path

The concept of distance in a graph is often related to the number of edges to traverse in order to reach a target node from a given source node.

In particular, consider a source node and a target node . The set of edges connecting node to node is called a path. When studying complex networks, we are often interested in finding the shortest path between two nodes. A shortest path between a source node and a target node is the path having the lowest number of edges compared to all the possible paths between and . The diameter of a network is the number of edges contained in the longest shortest path among all possible shortest paths.

Take a look at the following screenshot. There are different paths to reach Tokyo from Dublin. However, one of them is the shortest (the edges on the shortest path are highlighted):

Figure 1.14 – The shortest path between two nodes

The shortest_path function of the networkx Python library enables users to quickly compute the shortest path between two nodes in a graph. Consider the following code, in which a seven-node graph is created by using networkx:

G = nx.Graph()
nodes = {1:'Dublin',2:'Paris',3:'Milan',4:'Rome',5:'Naples',
         6:'Moscow',7:'Tokyo'}
G.add_nodes_from(nodes.keys())
G.add_edges_from([(1,2),(1,3),(2,3),(3,4),(4,5),(5,6),(6,7),(7,5)])

The shortest path between a source node (for example, 'Dublin', identified by the key 1) and a target node (for example, 'Tokyo', identified by the key 7) can be obtained as follows:

path = nx.shortest_path(G,source=1,target=7)

This should output the following:

[1,3,4,5,6]

Here, [1,3,4,5,7] are the nodes contained in the shortest path between 'Tokyo' and 'Dublin'.

Characteristic path length

The characteristic path length is defined as the average of all the shortest path lengths between all possible pair of nodes. If is the average path length between the node and all the other nodes, the characteristic path length is computed as follows:

Here, is the set of nodes in the graph and represents its order. This is one of the most commonly used measures of how efficiently information is spread across a network. Networks having shorter characteristic path lengths promote the quick transfer of information and reduce costs. Characteristic path length can be computed through networkx using the following function:

nx.average_shortest_path_length(G)

This should give us the following:

2.1904761904761907

However, this metric cannot be always defined since it is not possible to compute a path among all the nodes in disconnected graphs. For this reason, network efficiency is also widely used.

Global and local efficiency

Global efficiency is the average of the inverse shortest path length for all pairs of nodes. Such a metric can be seen as a measure of how efficiently information is exchanged across a network. Consider that is the shortest path between a node and a node . The network efficiency is defined as follows:

Efficiency is at a maximum when a graph is fully connected, while it is minimal for completely disconnected graphs. Intuitively, the shorter the path, the lower the measure.

The local efficiency of a node can be computed by considering only the neighborhood of the node in the calculation, without the node itself. Global efficiency is computed in networkx using the following command:

nx.global_efficiency(G)

The output should be as follows:

0.6111111111111109

Average local efficiency is computed in networkx using the following command:

nx.local_efficiency(G)

The output should be as follows:

0.6666666666666667

In the following screenshot, two examples of graphs are depicted. As observed, a fully connected graph on the left presents a higher level of efficiency compared to a circular graph on the right. In a fully connected graph, each node can be reached from any other node in the graph, and information is exchanged rapidly across the network. However, in a circular graph, several nodes should instead be traversed to reach the target node, making it less efficient:

Figure 1.15 – Global efficiency of a fully connected graph (left) and a circular graph (right)

Integration metrics well describe the connection among nodes. However, more information about the presence of groups can be extracted by considering segregation metrics.

Segregation metrics

In this section, some of the most common segregation metrics will be described.

Clustering coefficient

The clustering coefficient is a measure of how much nodes cluster together. It is defined as the fraction of triangles (complete subgraph of three nodes and three edges) around a node and is equivalent to the fraction of the node's neighbors that are neighbors of each other. A global clustering coefficient is computed in networkx using the following command:

nx.average_clustering(G)

This should output the following:

0.6666666666666667

The local clustering coefficient is computed in networkx using the following command:

nx.clustering(G)

This should output the following:

{1: 1.0,
 2: 1.0,
 3: 0.3333333333333333,
 4: 0,
 5: 0.3333333333333333,
 6: 1.0,
 7: 1.0}

The output is a Python dictionary containing, for each node (identified by the respective key), the corresponding value. In the graph represented in Figure 1.16, two clusters of nodes can be easily identified. By computing the clustering coefficient for each single node, it can be observed that Rome has the lowest value. Tokyo and Moscow, as well as Paris and Dublin, are instead very well connected within their respective groups (notice the size of each node is drawn proportionally to each node's clustering coefficient). The graph can be seen in the following screenshot:

Figure 1.16 – Local clustering coefficient representation

Transitivity

A common variant of the clustering coefficient is known as transitivity. This can simply be defined as the ratio between the observed number of closed triplets (complete subgraph with three nodes and two edges) and the maximum possible number of closed triplets in the graph. Transitivity can be computed using networkx, as follows:

nx.transitivity(G)

The output should be as follows:

0.5454545454545454

Modularity

Modularity was designed to quantify the division of a network in aggregated sets of highly interconnected nodes, commonly known as modules, communities, groups, or clusters. The main idea is that networks having high modularity will show dense connections within the module and sparse connections between modules.

Consider a social network such as Reddit: members of communities related to video games tend to interact much more with other users in the same community, talking about recent news, favorite consoles, and so on. However, they will probably interact less with users talking about fashion. Differently from many other graph metrics, modularity is often computed by means of optimization algorithms.

Modularity in networkx is computed using the modularity function of the networkx.algorithms.community module, as follows:

import networkx.algorithms.community as nx_comm
nx_comm.modularity(G, communities=[{1,2,3}, {4,5,6,7}])

Here, the second argument—communities—is a list of sets, each representing a partition of the graph. The output should be as follows:

0.3671875

Segregation metrics help to understand the presence of groups. However, each node in a graph has its own importance. To quantify it, we can use centrality metrics.

Centrality metrics

In this section, some of the most common centrality metrics will be described.

Degree centrality

One of the most common and simple centrality metrics is the degree centrality metric. This is directly connected with the degree of a node, measuring the number of incident edges on a certain node .

Intuitively, the more a node is connected to an other node, the more its degree centrality will assume high values. Note that, if a graph is directed, the in-degree centrality and out-degree centrality will be considered for each node, related to the number of incoming and outcoming edges, respectively. Degree centrality is computed in networkx by using the following command:

nx.degree_centrality(G)

The output should be as follows:

{1: 0.3333333333333333, 2: 0.3333333333333333, 3: 0.5, 4: 0.3333333333333333, 5: 0.5, 6: 0.3333333333333333, 7: 0.3333333333333333}

Closeness centrality

The closeness centrality metric attempts to quantify how much a node is close (well connected) to other nodes. More formally, it refers to the average distance of a node to all other nodes in the network. If is the shortest path between node and node , the closeness centrality is defined as follows:

Here, V is the set of nodes in the graph. Closeness centrality can be computed in networkx using the following command:

nx.closeness_centrality(G)

The output should be as follows:

{1: 0.4, 2: 0.4, 3: 0.5454545454545454, 4: 0.6, 5: 0.5454545454545454, 6: 0.4, 7: 0.4}

Betweenness centrality

The betweenness centrality metric evaluates how much a node acts as a bridge between other nodes. Even if poorly connected, a node can be strategically connected, helping to keep the whole network connected.

If is the total number of shortest paths between node and node and is the total number of shortest paths between and passing through node , then the betweenness centrality is defined as follows:

If we observe the formula, we can notice that the higher the number of shortest paths passing through node , the higher the value of the betweenness centrality. Betweenness centrality is computed in networkx by using the following command:

nx.betweenness_centrality(G)

The output should be as follows:

{1: 0.0, 2: 0.0, 3: 0.5333333333333333, 4: 0.6, 5: 0.5333333333333333, 6: 0.0, 7: 0.0}

In Figure 1.17, we illustrate the difference between degree centrality, closeness centrality, and betweenness centrality. Milan and Naples have the highest degree centrality. Rome has the highest closeness centrality since it is the closest to any other node. It also shows the highest betweenness centrality because of its crucial role in connecting the two visible clusters and keeping the whole network connected.

You can see the differences here:

Figure 1.17 – Degree centrality (left), closeness centrality (center), and betweenness centrality (right)

Centrality metrics allow us to measure the importance of a node inside the network. Finally, we will mention resilience metrics, which enable us to measure the vulnerability of a graph.

Resilience metrics

There are several metrics that measure a network's resilience. Assortativity is one of the most used.

Assortativity coefficient

Assortativity is used to quantify the tendency of nodes being connected to similar nodes. There are several ways to measure such correlations. One of the most commonly used methods is the Pearson correlation coefficient between the degrees of directly connected nodes (nodes on two opposite ends of a link). The coefficient assumes positive values when there is a correlation between nodes of a similar degree, while it assumes negative values when there is a correlation between nodes of a different degree. Assortativity using the Pearson correlation coefficient is computed in networkx by using the following command:

nx.degree_pearson_correlation_coefficient(G)

The output should be as follows:

-0.6

Social networks are mostly assortative. However, the so-called influencers (famous singers, football players, fashion bloggers) tend to be followed (incoming edges) by several standard users, while tending to be connected with each other and showing a disassortative behavior.

It is important to remark that the previously presented properties are a subset of all the possible metrics used to describe graphs. A wider set of metrics and algorithms can be found at https://networkx.org/documentation/stable/reference/algorithms/.

Benchmarks and repositories

Now that we have understood the basic concepts and notions about graphs and network analysis, it is now time to dive into some practical examples that will help us to start to put into practice the general concepts we have learned so far. In this section, we will present some examples and toy problems that are generally used to study the properties of networks, as well as benchmark performances and effectiveness of networks' algorithms. We will also provide some useful links of repositories where network datasets can be found and downloaded, together with some tips on how to parse and process them.

Examples of simple graphs

We start by looking at some very simple examples of networks. Fortunately, networkx already comes with a number of graphs already implemented, ready to be used and played with. Let's start by creating a fully connected undirected graph, as follows:

complete = nx.complete_graph(n=7)

This has edges and a clustering coefficient C=1. Although fully connected graphs are not very interesting on their own, they represent a fundamental building block that may arise within larger graphs. A fully connected subgraph of n nodes within a larger graph is generally referred to as a clique of size n.

Definition

A clique, C, in an undirected graph is defined a subset of its vertices, C V, such that every two distinct vertices in the subset are adjacent. This is equivalent to the condition that the induced subgraph of G induced by C is a fully connected graph.

Cliques represent one of the basic concepts in graph theory and are often also used in mathematical problems where relations need to be encoded. Besides, they also represent the simplest unit when constructing more complex graphs. On the other hand, the task of finding cliques of a given size n in larger graphs (clique problem) is of great interest and it can be shown that it is a nondeterministic polynomial-time complete (NP-complete) problem often studied in computer science.

Some simple examples of networkx graphs can be seen in the following screenshot:

Figure 1.18 – Simple examples of graphs with networkx: (left) fully connected graph; (center) lollipop graph; (right) barbell graph

In Figure 1.18, we showed a complete graph along with two other simple examples containing cliques that can be easily generated with networkx, outlined as follows:

A lollipop graph formed by a clique of size n and a branch of m nodes, as shown in the following code snippet:
```
lollipop = nx.lollipop_graph(m=7, n=3)
```
A barbell graph formed by two cliques of size m1 and m2 joined by a branch of nodes, which resembles the sample graph we used previously to characterize some of the global and local properties. The code to generate this is shown in the following snippet:
```
barbell = nx.barbell_graph(m1=7, m2=4)
```

Such simple graphs are basic building blocks that can be used to generate more complex networks by combining them. Merging subgraphs is very easy with networkx and can be done with just a few lines of code, as shown in the following code snippet, where the three graphs are merged together into a single graph and some random edges are placed to connect them:

def get_random_node(graph):
    return np.random.choice(graph.nodes)
allGraphs = nx.compose_all([complete, barbell, lollipop])
allGraphs.add_edge(get_random_node(lollipop), get_random_node(lollipop))
allGraphs.add_edge(get_random_node(complete), get_random_node(barbell))

Other very simple graphs (that can then be merged and played around with) can be found at https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.classic.

Generative graph models

Although creating simple subgraphs and merging them is a way to generate new graphs of increasing complexity, networks may also be generated by means of probabilistic models and/or generative models that let a graph grow by itself. Such graphs usually share interesting properties with real networks and have long been used to create benchmarks and synthetic graphs, especially in times when the amount of data available was not as overwhelming as today. Here, we present some examples of random generated graphs, briefly describing the models that underlie them.

Watts and Strogatz (1998)

This model was used by the authors to study the behavior of small-world networks—that is to say, networks that resemble, to some extent, common social networks. The graph is generated by first displacing n nodes in a ring and connecting each node with its k neighbors. Each edge of such a graph then has a probability p of being rewired to a randomly chosen node. By ranging p, the Watts and Strogatz model allows a shift from a regular network (p=0) to a completely random network (p=1). In between, graphs exhibit small-world features; that is, they tend to bring this model closer to social network graphs. These kinds of graphs can be easily created with the following command:

graph = nx.watts_strogatz_graph(n=20, k=5, p=0.2)

Barabási-Albert (1999)

The model proposed by Albert and Barabási is based on a generative model that allows the creation of random scale-free networks by using a preferential attachment schema, where a network is created by progressively adding new nodes and attaching them to already existing nodes, with a preference for nodes that have more neighbors. Mathematically speaking, the underlying idea of this model is that the probability for a new node to be attached to an existing node i depends on the degree of the i-th node, according to the following formula:

Thus, nodes with a large number of edges (hubs) tend to develop even more edges, whereas nodes with few links will not develop other links (periphery). Networks generated by this model exhibit a power-law distribution for the connectivity (that is, degree) between nodes. Such a behavior is also found in real networks (for example, the World Wide Web (WWW) network and the actor collaboration network), interestingly showing that it is the popularity of a node (how many edges it already has) rather than its intrinsic node properties that influences the creation of new connections. The initial model has then been extended (and this is the version that is available on networkx) to also allow the preferential attachment of new edges or rewiring of existing edges.

The Barabási-Albert model is illustrated in the following screenshot:

Figure 1.19 – Barabási-Albert model (left) with 20 nodes (right) distribution of connectivity with n=100.000 nodes, showing the scale-free power law distribution

In Figure 1.19, we showed an example of the Barabasi-Albert model for a small network, where you can already observe the emergence of hubs (on the left), as well as the probability distribution of the degree of the nodes, which exhibits a scale-free power-law behavior (on the right). The preceding distribution can easily be replicated in networkx, as follows:

ba_model = nx.extended_barabasi_albert_graph(n,m=1,p=0,q=0)
degree = dict(nx.degree(ba_model)).values()
bins = np.round(np.logspace(np.log10(min(degree)), np.log10(max(degree)), 10))
cnt = Counter(np.digitize(np.array(list(degree)), bins))

Benchmarks

Digitalization has profoundly changed our lives, and today, any activity, person, or process generates data, providing a huge amount of information to be drilled, analyzed, and used to promote data-driven decision making. A few decades ago, it was hard to find datasets ready to be used to develop or test new algorithms. On the other hand, there exist today plenty of repositories that provide us with datasets, even of fairly large dimensions, to be downloaded and analyzed. These repositories, where people can share datasets, also provide a benchmark where algorithms can be applied, validated, and compared with each other.

In this section, we will briefly go through some of the main repositories and file formats used in network science, in order to provide you with all the tools needed to import datasets—of different sizes—to analyze and play around with.

In such repositories, you will find network datasets coming from some of the common areas of network science, such as social networks, biochemistry, dynamic networks, documents, co-authoring and citations networks, and networks arising from financial transactions. In Part 3, Advanced Applications of Graph Machine Learning, we will discuss some of the most common type of networks (social networks, graphs arising when processing corpus documents, and financial networks) and analyze them more thoroughly by applying the techniques and algorithms described in Part 2, Machine Learning on Graphs.

Also, networkx already comes with some basic (and very small) networks that are generally used to explain algorithms and basic measures, which can be found at https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.social. These datasets are, however, generally quite small. For larger datasets, refer to the repositories we present next.

Network Data Repository

The Network Data Repository is surely one of the largest repositories of network data (http://networkrepository.com/) with several thousand different networks, featuring users and donations from all over the world and top-tier academic institutions. If a network dataset is freely available, chances are that you will find it there. Datasets are classified in about 30 domains, including biology, economics, citations, social network data, industrial applications (energy, road), and many others. Besides providing the data, the website also provides a tool for interactive visualization, exploration, and comparison of datasets, and we suggest you check it out and explore it.

The data in the Network Data Repository is generally available under the Matrix Market Exchange Format (MTX) file format. The MTX file format is basically a file format for specifying dense or sparse matrices, real or complex, via readable text files (American Standard Code for Information Interchange, or ASCII). For more details, please refer to http://math.nist.gov/MatrixMarket/formats.html#MMformat.

A file in MTX format can be easily read in Python using scipy. Some of the files we downloaded from the Network Data Repository seemed slightly corrupted and required a minimal fix on a 10.15.2 OSX system. In order to fix them, just make sure the header of the file is compliant with the format specifications; that is, with a double % and no spaces at the beginning of the line, as in the following line:

%%MatrixMarket matrix coordinate pattern symmetric

Matrices should be in coordinate format. In this case, the specification points also to an unweighted, undirected graph (as understood by pattern and symmetric). Some of the files have some comments after the first header line, which are preceded by a single %.

As an example, we consider the Astro Physics (ASTRO-PH) collaboration network. The graph is generated using all the scientific papers available from the e-print arXiv repository published in the Astrophysics category in the period from January 1993 to April 2003. The network is built by connecting (via undirected edges) all the authors that co-authored a publication, thus resulting in a clique that includes all authors of a given paper. The code to generate the graph can be seen here:

from scipy.io import mmread
adj_matrix = mmread("ca-AstroPh.mtx")
graph = nx.from_scipy_sparse_matrix(adj_matrix)

The dataset has 17,903 nodes, connected by 196,072 edges. Visualizing so many nodes cannot be done easily, and even if we were to do it, it might not be very informative, as understanding the underlying structure would not be very easy with so much information. However, we can get some insights by looking at specific subgraphs, as we will do next.

First, we can start by computing some basic properties we described earlier and put them into a pandas DataFrame for our convenience to later use, sort, and analyze. The code to accomplish this is illustrated in the following snippet:

stats = pd.DataFrame({
    "centrality": nx.centrality.betweenness_centrality(graph), 
    "C_i": nx.clustering(graph), 
    "degree": nx.degree(graph)
})

We can easily find out that the node with the largest degree centrality is the one with ID 6933, which has 503 neighbors (surely a very popular and important scientist in astrophysics!), as illustrated in the following code snippet:

neighbors = [n for n in nx.neighbors(graph, 6933)]

Of course, also plotting its ego network (the node with all its neighbors) would still be a bit messy. One way to produce some subgraphs that can be plotted is by sampling (for example, with a 0.1 ratio) its neighbors in three different ways: random (sorting by index is a sort of random sorting), selecting the most central neighbors, or selecting the neighbors with the largest C_i values. The code to accomplish this is shown in the following code snippet:

nTop = round(len(neighbors)*sampling)
idx = {
    "random": stats.loc[neighbors].sort_index().index[:nTop], 
    "centrality": stats.loc[neighbors]\
         .sort_values("centrality", ascending=False)\
         .index[:nTop],
    "C_i": stats.loc[neighbors]\
         .sort_values("C_i", ascending=False)\
         .index[:nTop]
}

We can then define a simple function for extracting and plotting a subgraph that includes only the nodes related to certain indices, as shown in the following code snippet:

def plotSubgraph(graph, indices, center = 6933):
    nx.draw_kamada_kawai(
        nx.subgraph(graph, list(indices) + [center])
    )

Using the preceding function, we can plot the different subgraphs, obtained by filtering the ego network using the three different criteria, based on random sampling, centrality, and the clustering coefficient we presented previously. An example is provided here:

plotSubgraph(graph, idx["random"])

In Figure 1.20, we compare these results where the other networks have been obtained by changing the key value to centrality and C_i. The random representation seems to show some emerging structure with separated communities. The graph with the most central nodes clearly shows an almost fully connected network, possibly made up of all full professors and influential figures in astrophysics science, publishing on multiple topics and collaborating frequently with each other. Finally, the last representation, on the other hand, highlights some specific communities, possibly connected with a specific topic, by selecting the nodes that have a higher clustering coefficient. These nodes might not have a large degree of centrality, but they very well represent specific topics. You can see examples of the ego subgraph here:

Figure 1.20 – Examples of the ego subgraph for the node that has largest degree in the ASTRO-PH dataset. Neighbors are sampled with a ratio=0.1. (left) random sampling; (center) nodes with largest betweenness centrality; (right) nodes with largest clustering coefficient

Another option to visualize this in networkx could also be to use the Gephi software that allows for fast filtering and visualizations of graphs. In order to do so, we need to first export the data as Graph Exchange XML Format (GEXF) (which is a file format that can be imported in Gephi), as follows:

nx.write_gext(graph, "ca-AstroPh.gext")

Once data is imported in Gephi, with few filters (by centrality or degree) and some computations (modularity), you can easily do plots as nice as the one shown in Figure 1.21, where nodes have been colored using modularity in order to highlight clusters. Coloring also allows us to easily spot nodes that connect the different communities and that therefore have large betweenness.

Some of the datasets in the Network Data Repository may also be available in the EDGE file format (for instance, the citations networks). The EDGE file format slightly differs from the MTX file format, although it represents the same information. Probably the easiest way to import such files into networkx is to convert them by simply rewriting its header. Take, for instance, the Digital Bibliography and Library (DBLP) citation network.

A sample plot can be seen in the following screenshot:

Figure 1.21 – Example of the visualization ASTRO-PH dataset with Gephi. Nodes are filtered by degree centrality and colored by modularity class; node sizes are proportional to the value of the degree

Here is the code for the header of the file:

% asym unweighted
% 49743 12591 12591

This can be easily converted to comply with the MTX file format by replacing these lines with the following code:

%%MatrixMarket matrix coordinate pattern general
12591 12591 49743

Then, you can use the import functions described previously.

Stanford Large Network Dataset Collection

Another valuable source of network datasets is the website of the Stanford Network Analysis Platform (SNAP) (https://snap.stanford.edu/index.html), which is a general-purpose network analysis library that was written in order to handle even fairly large graphs, with hundreds of millions of nodes and billions of edges. It is written in C++ to achieve top computational performance, but it also features interfaces with Python in order to be imported and used in native Python applications.

Although networkx is currently the main library to study networkx, SNAP or other libraries (more on this shortly) can be orders of magnitude faster than networkx, and they may be used in place of networkx for tasks that require higher performance. In the SNAP website, you will find a specific web page for Biomedical Network Datasets (https://snap.stanford.edu/biodata/index.html), besides other more general networks (https://snap.stanford.edu/data/index.html), covering similar domains and datasets as the Network Data Repository described previously.

Data is generally provided in a text file format containing a list of edges. Reading such files can be done with networkx in one code line, using the following command:

g = nx.read_edgelist("amazon0302.txt")

Some graphs might have extra information, other than about edges. Extra information is included in the archive of the dataset as a separated file—for example, where some metadata of the nodes is provided and is related to the graph via the id node.

Graphs can also be read directly using the SNAP library and its interface via Python. If you have a working version of SNAP on your local machine, you can easily read the data as follows:

from snap import LoadEdgeList, PNGraph
graph = LoadEdgeList(PNGraph, "amazon0302.txt", 0, 1, '\t')

Keep in mind that at this point, you will have an instance of a PNGraph object of the SNAP library, and you can't directly use networkx functionalities on this object. If you want to use some networkx functions, you first need to convert the PNGraph object to a networkx object. To make this process simpler, in the supplementary material for this book (available at https://github.com/PacktPublishing/Graph-Machine-Learning), we have written some functions that will allow you to seamlessly swap back and forth between networkx and SNAP, as illustrated in the following code snippet:

networkx_graph = snap2networkx(snap_graph)
snap_graph = networkx2snap(networkx_graph)

Open Graph Benchmark

This is the most recent update (dated May 2020) in the graph benchmark landscape, and this repository is expected to gain increasing importance and support in the coming years. The Open Graph Benchmark (OGB) has been created to address one specific issue: current benchmarks are actually too small compared to real applications to be useful for machine learning (ML) advances. On one hand, some of the models developed on small datasets turn out to not be able to scale to large datasets, proving them unsuitable in real-world applications. On the other hand, large datasets also allow us to increase the capacity (complexity) of the models used in ML tasks and explore new algorithmic solutions (such as neural networks) that can benefit from a large sample size to be efficiently trained, allowing us to achieve very high performance. The datasets belong to diverse domains and they have been ranked on three different dataset sizes (small, medium, and large) where the small-size graphs, despite their name, already have more than 100,000 nodes and/or more than 1 million edges. On the other hand, large graphs feature networks with more than 100 million nodes and more than 1 billion edges, facilitating the development of scalable models.

Beside the datasets, the OGB also provides, in a Kaggle fashion, an end-to-end ML pipeline that standardizes the data loading, experimental setup, and model evaluation. OGB creates a platform to compare and evaluate models against each other, publishing a leaderboard that allows tracking of the performance evolution and advancements on specific tasks of node, edge, and graph property prediction. For more details on the datasets and on the OGB project, please refer to https://arxiv.org/pdf/2005.00687.pdf.

Dealing with large graphs

When approaching a use case or an analysis, it is very important to understand how large the data we focus on is or will be in the future, as the dimension of the datasets may very well impact both the technologies we use and the analysis that we can do. As already mentioned, some of the approaches that have been developed on small datasets hardly scale to real-world applications and larger datasets, making them useless in practice.

When dealing with (possibly) large graphs, it is crucial to understand potential bottlenecks and limitation of the tools, technologies, and/or algorithms we use, assessing which part of our application/analysis may not scale when increasing the number of nodes or edges. Even more importantly, it is crucial to structure a data-driven application, however simple or at early proof of concept (POC) stages, in a way that would allow its scaling out in the future when data/users would increase, without rewriting the whole application.

Creating a data-driven application that resorts to graphical representation/modeling is a challenging task that requires a design and implementation that is a lot more complicated than simply importing networkx. In particular, it is often useful to decouple the component that processes the graph—named graph processing engine—from the one that allows querying and traversing the graph—the graph storage layer. We will further discuss these concepts in Chapter 9, Building a Data-Driven Draft-Powered Application. Nevertheless, given the focus of the book on ML and analytical techniques, it makes sense to focus more on graph processing engines than on graph storage layers. We therefore find it useful to provide you already at this stage with some of the technologies that are used for graph processing engines to deal with large graphs, crucial when scaling out an application.

In this respect, it is important to classify graph processing engines into two categories (that impact the tools/libraries/algorithms to be used), depending whether the graph can fit a shared memory machine or requires distributed architectures to be processed and analyzed.

Note that there is no absolute definition of large and small graphs, but it also depends on the chosen architecture. Nowadays, thanks to the vertical scaling of infrastructures, you can find servers with random-access memory (RAM) larger than 1 terabyte (TB) (usually called fat nodes), and with tens of thousands of central processing units (CPUs) for multithreading in most cloud-provider offerings, although these infrastructures might not be economically viable. Even without scaling out to such extreme architectures, graphs with millions of nodes and tens of millions of edges can nevertheless be easily handled in single servers with ~100 gigabytes (GB) of RAM and ~50 CPUs.

Although networkx is a very popular, user-friendly, and intuitive library, when scaling out to such reasonably large graphs it may not be the best available choice. networkx, being natively written in pure Python, which is an interpreted language, can be substantially outperformed by other graph engines fully or partly written in more performant programming languages (such as C++ and Julia) and that make use of multithreading, such as the following:

SNAP (http://snap.stanford.edu/), which we have already seen in the previous section, is a graph engine developed at Stanford and is written in C++ with available bindings in Python.
igraph (https://igraph.org/) is a C library and features bindings in Python, R, and Mathematica.
graph-tool (https://graph-tool.skewed.de/), despite being a Python module, has core algorithms and data-structures written in C++ and uses OpenMP parallelization to scale on multi-core architectures.
NetworKit (https://networkit.github.io/) is also written in C++ with OpenMP boost for parallelization for its core functionalities, integrated in a Python module.
LightGraphs (https://juliagraphs.org/LightGraphs.jl/latest/) is a library written in Julia that aims to mirroring networkx functionalities in a more performant and robust library.

All the preceding libraries are valid alternatives to networkx when achieving better performance becomes an issue. Improvements can be very substantial, with speed-ups varying from 30 to 300 times faster, with the best performance generally achieved by LightGraphs.

In the forthcoming chapters, we will mostly focus on networkx in order to provide a consistent presentation and provide the user with basic concepts on network analysis. We want you to be aware that other options are available, as this becomes extremely relevant when pushing the edge from a performance standpoint.

Filter reviews by

All

Feefo verified reviews

Amazon verified reviews

Carlo Estopia Feb 18, 2024

Feefo Verified review

Vlasta Kus Jul 30, 2021

Disclaimer: I received a free copy of this book from the publisher. To the best of my knowledge, this did not influence my review.This book is exactly what it promises to be: thorough introduction to ML on graph data that will please any data enthusiasts (engineers and scientist alike) looking to get introduced into this hot topic of this decade.It discusses important bits of graph theory as well as graph technologies, shallow as well as deep learning on graphs, several interesting use cases, even basics on Natural Language Processing in connection to building Knowledge Graphs (well covered intro here!) and some advice for building graph applications. All this accompanied by a python code that shows, among others, how easy it can be to build graph applications with various graph-libraries that already exist and keep getting more and more evolved over the years.It is a bit surprising that a book of this topic almost completely omits the leading property graph database, Neo4j, which makes the dive into graph domain and building applications so much easier, especially thanks to their Graph Data Science library. On the other hand, it’s good to see this topic from a purely analytical point of view and get an overview of the various technologies that are out there.Perhaps I would appreciate more balanced approach to the use of graph ML libraries. This book almost seems to promote StellarGraph, not even mentioning/using at all other projects such as Deep Graph Library which seems to me more general (for e.g. framework agnostic) and evolved.I appreciate that this book does not try to serve all, resulting in not satisfying anyone - you have to have some ML basics. But don’t worry, it’s very readable.Altogether I can only recommend this book as an accessible, easy to read and comprehensive enough introduction into graph machine learning. This is the kind of book that won’t serve was a dust collector on your shelf.

Amazon Verified review

akhil101 Sep 17, 2021

We at work use graph databases to make ml models for better recommendations and things. Having such a detailed explanation of various aspects helped me to come up with many improvements which we are currently Implementing

Mohit Jul 27, 2021

This books fill a critical void in an emerging field of graphs and ML. The latest research done in field of GNN, graph embeddings to more application of graphs (like NLP, SNAs etc).Easy to read with code walkthroughs helps in understanding a complicated concept.Overall, really great book.

S M Aug 29, 2021

The book is easy to read and well-written. It starts with basic concepts and ends with more advanced ones. It is also very comprehensive, covering unsupervised methods to use cases for social networks, natural language processing, and financial transactions. There's code available to see how to apply the knowledge imparted in every chapter and visuals to assist with its understanding.

Graph Machine Learning: Take graph data to the next level by applying machine learning techniques and algorithms

What do you get with a Packt Subscription?

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the 3 authors

FAQs