Benchmarks and repositories
Now that we have covered the basic concepts and notions of graphs and network analysis, it is time to dive into some practical examples that will help us start putting into practice the general concepts we have learned so far. In this section, we will present some examples and toy problems that are commonly used to study the properties of networks, as well as to benchmark the performance and effectiveness of network algorithms. We will also provide some useful links to repositories where network datasets can be found and downloaded, together with some tips on how to parse and process them.
Examples of simple graphs
We start by looking at some very simple examples of networks. Fortunately, networkx already comes with a number of graphs implemented, ready to be used and played with. Let's start by creating a fully connected undirected graph, as follows:
import networkx as nx

complete = nx.complete_graph(n=7)
This graph has n(n-1)/2 = 21 edges and a clustering coefficient C=1. Although fully connected graphs are not very interesting on their own, they represent a fundamental building block that may arise within larger graphs. A fully connected subgraph of n nodes within a larger graph is generally referred to as a clique of size n.
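As a quick sanity check (the variable names here are our own), these two properties can be verified directly with networkx:

```python
import networkx as nx

# A complete graph on n nodes has n * (n - 1) / 2 edges
complete = nx.complete_graph(n=7)
n_edges = complete.number_of_edges()              # 7 * 6 / 2 = 21
# Every neighborhood is itself fully connected, so C = 1.0
avg_clustering = nx.average_clustering(complete)
```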
Definition
A clique, C, in an undirected graph G is defined as a subset of its vertices, C ⊆ V, such that every two distinct vertices in the subset are adjacent. This is equivalent to the condition that the subgraph of G induced by C is fully connected.
Cliques represent one of the basic concepts in graph theory and are often used in mathematical problems where relations need to be encoded. They also represent the simplest unit when constructing more complex graphs. On the other hand, the task of finding a clique of a given size n in a larger graph (the clique problem) is of great interest: it can be shown to be a nondeterministic polynomial-time complete (NP-complete) problem, and it is often studied in computer science.
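Although the general clique problem is NP-complete, small graphs can still be searched exhaustively; as a sketch (the toy graph below is our own), networkx exposes the maximal cliques of a graph through nx.find_cliques:

```python
import networkx as nx

# Two triangles sharing the edge (1, 2): the maximal cliques
# of this graph are {0, 1, 2} and {1, 2, 3}
g = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])
maximal_cliques = sorted(sorted(c) for c in nx.find_cliques(g))
largest_clique_size = max(len(c) for c in maximal_cliques)
```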
Some simple examples of networkx graphs can be seen in the following screenshot:
In Figure 1.18, we showed a complete graph along with two other simple examples containing cliques that can be easily generated with networkx, outlined as follows:
- A lollipop graph, formed by a clique of size m and a branch (path) of n nodes, as shown in the following code snippet:
lollipop = nx.lollipop_graph(m=7, n=3)
- A barbell graph, formed by two cliques of size m1 joined by a branch of m2 nodes, which resembles the sample graph we used previously to characterize some of the global and local properties. The code to generate this is shown in the following snippet:
barbell = nx.barbell_graph(m1=7, m2=4)
Such simple graphs are basic building blocks that can be combined to generate more complex networks. Merging subgraphs is very easy with networkx and can be done with just a few lines of code, as shown in the following snippet, where the three graphs are merged into a single graph and some random edges are added to connect them:
import numpy as np

def get_random_node(graph):
    return np.random.choice(graph.nodes)

allGraphs = nx.compose_all([complete, barbell, lollipop])
allGraphs.add_edge(get_random_node(lollipop), get_random_node(complete))
allGraphs.add_edge(get_random_node(complete), get_random_node(barbell))
Other very simple graphs (that can then be merged and played around with) can be found at https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.classic.
Generative graph models
Although creating simple subgraphs and merging them is a way to generate new graphs of increasing complexity, networks may also be generated by means of probabilistic models and/or generative models that let a graph grow by itself. Such graphs usually share interesting properties with real networks and have long been used to create benchmarks and synthetic graphs, especially in times when the amount of data available was not as overwhelming as today. Here, we present some examples of random generated graphs, briefly describing the models that underlie them.
Watts and Strogatz (1998)
This model was used by the authors to study the behavior of small-world networks—that is to say, networks that resemble, to some extent, common social networks. The graph is generated by first placing n nodes in a ring and connecting each node with its k nearest neighbors. Each edge of such a graph then has a probability p of being rewired to a randomly chosen node. By varying p, the Watts and Strogatz model allows a shift from a regular network (p=0) to a completely random network (p=1). In between, graphs exhibit small-world features—properties that bring this model closer to real social networks. These kinds of graphs can be easily created with the following command:
graph = nx.watts_strogatz_graph(n=20, k=5, p=0.2)
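The transition described above can be checked empirically; the following sketch (parameter values and seed are our own choices) compares the average clustering coefficient of the regular lattice (p=0) with that of a fully rewired graph (p=1):

```python
import networkx as nx

# p=0 keeps the regular ring lattice; p=1 rewires every edge at random
regular = nx.watts_strogatz_graph(n=200, k=8, p=0.0, seed=42)
rewired = nx.watts_strogatz_graph(n=200, k=8, p=1.0, seed=42)

c_regular = nx.average_clustering(regular)  # high: neighbors of a node are also neighbors
c_rewired = nx.average_clustering(rewired)  # low: roughly k/n for a random graph
```

Clustering drops sharply as edges are rewired, while (as Watts and Strogatz showed) the average path length shrinks much earlier, already at small p.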
Barabási-Albert (1999)
The model proposed by Barabási and Albert is based on a generative process that allows the creation of random scale-free networks by using a preferential attachment schema, where a network is created by progressively adding new nodes and attaching them to already existing nodes, with a preference for nodes that have more neighbors. Mathematically speaking, the underlying idea of this model is that the probability for a new node to be attached to an existing node i depends on the degree $k_i$ of the i-th node, according to the following formula:

$$p_i = \frac{k_i}{\sum_j k_j}$$
Thus, nodes with a large number of edges (hubs) tend to develop even more edges, whereas nodes with few links will not develop other links (periphery). Networks generated by this model exhibit a power-law distribution for the connectivity (that is, the degree) of the nodes. Such behavior is also found in real networks (for example, the World Wide Web (WWW) network and the actor collaboration network), interestingly showing that it is the popularity of a node (how many edges it already has), rather than its intrinsic properties, that influences the creation of new connections. The initial model has since been extended (and this is the version that is available in networkx) to also allow the preferential attachment of new edges or the rewiring of existing edges.
The Barabási-Albert model is illustrated in the following screenshot:
In Figure 1.19, we showed an example of the Barabási-Albert model for a small network, where you can already observe the emergence of hubs (on the left), as well as the probability distribution of the degree of the nodes, which exhibits a scale-free power-law behavior (on the right). This distribution can easily be replicated in networkx, as follows:
from collections import Counter
import numpy as np

n = 1000  # number of nodes, for example
ba_model = nx.extended_barabasi_albert_graph(n, m=1, p=0, q=0)
degree = dict(nx.degree(ba_model)).values()
bins = np.round(np.logspace(np.log10(min(degree)), np.log10(max(degree)), 10))
cnt = Counter(np.digitize(np.array(list(degree)), bins))
Benchmarks
Digitalization has profoundly changed our lives, and today any activity, person, or process generates data, providing a huge amount of information to be mined, analyzed, and used to support data-driven decision making. A few decades ago, it was hard to find datasets ready to be used to develop or test new algorithms. Today, by contrast, there exist plenty of repositories that provide us with datasets, even of fairly large dimensions, to download and analyze. These repositories, where people can share datasets, also provide benchmarks on which algorithms can be applied, validated, and compared with each other.
In this section, we will briefly go through some of the main repositories and file formats used in network science, in order to provide you with all the tools needed to import datasets—of different sizes—to analyze and play around with.
In such repositories, you will find network datasets coming from some of the common areas of network science, such as social networks, biochemistry, dynamic networks, documents, co-authorship and citation networks, and networks arising from financial transactions. In Part 3, Advanced Applications of Graph Machine Learning, we will discuss some of the most common types of networks (social networks, graphs arising when processing corpus documents, and financial networks) and analyze them more thoroughly by applying the techniques and algorithms described in Part 2, Machine Learning on Graphs.
Also, networkx already comes with some basic (and very small) networks that are generally used to explain algorithms and basic measures, which can be found at https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.social. These datasets are, however, generally quite small. For larger datasets, refer to the repositories we present next.
Network Data Repository
The Network Data Repository is surely one of the largest repositories of network data (http://networkrepository.com/), with several thousand different networks donated by users and top-tier academic institutions from all over the world. If a network dataset is freely available, chances are you will find it there. Datasets are classified into about 30 domains, including biology, economics, citations, social networks, industrial applications (energy, roads), and many others. Besides providing the data, the website also offers a tool for interactive visualization, exploration, and comparison of datasets, and we suggest you check it out and explore it.
The data in the Network Data Repository is generally available in the Matrix Market Exchange Format (MTX) file format, a format for specifying dense or sparse matrices, real or complex, via readable ASCII (American Standard Code for Information Interchange) text files. For more details, please refer to http://math.nist.gov/MatrixMarket/formats.html#MMformat.
A file in MTX format can be easily read in Python using scipy. Some of the files we downloaded from the Network Data Repository seemed slightly corrupted and required a minimal fix on a macOS 10.15.2 system. To fix them, just make sure the header of the file is compliant with the format specification; that is, it starts with a double % and has no spaces at the beginning of the line, as in the following line:
%%MatrixMarket matrix coordinate pattern symmetric
Matrices should be in coordinate format. In this case, the specification also indicates an unweighted, undirected graph (as denoted by pattern and symmetric). Some of the files have comments after the first header line, which are preceded by a single %.
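The same fix can be applied programmatically; the following sketch (the function name and the assumed kinds of corruption, leading whitespace or a single %, are our own) rewrites the first line of a file so that it complies with the banner format:

```python
def fix_mtx_header(path):
    """Rewrite the first line of an MTX file so that it starts with '%%'
    and has no leading whitespace, as the format specification requires."""
    with open(path) as f:
        lines = f.readlines()
    header = lines[0].lstrip()          # drop leading spaces
    if not header.startswith("%%"):
        # restore the double '%' banner prefix
        header = "%%" + header.lstrip("%").lstrip()
    lines[0] = header
    with open(path, "w") as f:
        f.writelines(lines)
```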
As an example, we consider the Astro Physics (ASTRO-PH) collaboration network. The graph is generated using all the scientific papers available from the e-print arXiv repository published in the Astrophysics category in the period from January 1993 to April 2003. The network is built by connecting (via undirected edges) all the authors that co-authored a publication, thus resulting in a clique that includes all authors of a given paper. The code to generate the graph can be seen here:
from scipy.io import mmread

adj_matrix = mmread("ca-AstroPh.mtx")
graph = nx.from_scipy_sparse_matrix(adj_matrix)
The dataset has 17,903 nodes, connected by 196,072 edges. Visualizing so many nodes cannot be done easily, and even if we were to do it, it might not be very informative, as understanding the underlying structure would not be very easy with so much information. However, we can get some insights by looking at specific subgraphs, as we will do next.
First, we can start by computing some of the basic properties we described earlier and put them into a pandas DataFrame for convenient later use, sorting, and analysis. The code to accomplish this is illustrated in the following snippet:
import pandas as pd

stats = pd.DataFrame({
    "centrality": nx.centrality.betweenness_centrality(graph),
    "C_i": nx.clustering(graph),
    "degree": dict(nx.degree(graph))
})
We can easily find out that the node with the largest degree is the one with ID 6933, which has 503 neighbors (surely a very popular and important scientist in astrophysics!), as illustrated in the following code snippet:
neighbors = list(nx.neighbors(graph, 6933))
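For reference, such a hub can be located by taking the argmax of the degree mapping; the snippet below (our own illustration) demonstrates the pattern on a small star graph, where node 0 is the hub by construction (on the ASTRO-PH graph, the same expression applied to its degree mapping would single out node 6933):

```python
import networkx as nx

# Small stand-in graph: node 0 is connected to nodes 1..5
g = nx.star_graph(5)
degree = dict(g.degree())
# The node whose degree is maximal
hub = max(degree, key=degree.get)
```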
Of course, plotting its ego network (the node together with all its neighbors) would still be a bit messy. One way to produce subgraphs that can be plotted is by sampling its neighbors (for example, with a 0.1 sampling ratio) in three different ways: randomly (sorting by index is effectively a random ordering here), selecting the most central neighbors, or selecting the neighbors with the largest C_i values. The code to accomplish this is shown in the following snippet:
sampling = 0.1
nTop = round(len(neighbors) * sampling)
idx = {
    "random": stats.loc[neighbors].sort_index().index[:nTop],
    "centrality": stats.loc[neighbors]\
        .sort_values("centrality", ascending=False)\
        .index[:nTop],
    "C_i": stats.loc[neighbors]\
        .sort_values("C_i", ascending=False)\
        .index[:nTop]
}
We can then define a simple function for extracting and plotting a subgraph that includes only the nodes related to certain indices, as shown in the following code snippet:
def plotSubgraph(graph, indices, center=6933):
    nx.draw_kamada_kawai(
        nx.subgraph(graph, list(indices) + [center])
    )
Using the preceding function, we can plot the different subgraphs, obtained by filtering the ego network using the three different criteria, based on random sampling, centrality, and the clustering coefficient we presented previously. An example is provided here:
plotSubgraph(graph, idx["random"])
In Figure 1.20, we compare these results, where the other networks have been obtained by changing the key value to centrality and C_i. The random sample seems to show some emerging structure with separated communities. The graph with the most central nodes clearly shows an almost fully connected network, possibly made up of full professors and influential figures in astrophysics, publishing on multiple topics and collaborating frequently with each other. Finally, the last representation highlights some specific communities, possibly connected with a specific topic, by selecting the nodes that have a higher clustering coefficient. These nodes might not have a large degree centrality, but they represent specific topics very well. You can see examples of the ego subgraphs here:
Another option for visualizing this graph is to use the Gephi software, which allows for fast filtering and visualization of graphs. To do so, we first need to export the data in Graph Exchange XML Format (GEXF), a file format that can be imported into Gephi, as follows:
nx.write_gexf(graph, "ca-AstroPh.gexf")
Once the data is imported into Gephi, with a few filters (by centrality or degree) and some computations (modularity), you can easily produce plots as nice as the one shown in Figure 1.21, where nodes have been colored by modularity class in order to highlight clusters. Coloring also allows us to easily spot the nodes that connect different communities and therefore have large betweenness.
Some of the datasets in the Network Data Repository may also be available in the EDGE file format (for instance, the citation networks). The EDGE file format differs slightly from the MTX file format, although it represents the same information. Probably the easiest way to import such files into networkx is to convert them by simply rewriting their header. Take, for instance, the Digital Bibliography and Library (DBLP) citation network.
A sample plot can be seen in the following screenshot:
Here is the code for the header of the file:
% asym unweighted
% 49743 12591 12591
This can be easily converted to comply with the MTX file format by replacing these lines with the following code:
%%MatrixMarket matrix coordinate pattern general
12591 12591 49743
Then, you can use the import functions described previously.
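This conversion can also be scripted; the sketch below (the function name and the assumed two-line header layout, with '% <edges> <rows> <cols>' on the second line, are our own reading of the snippet above) rewrites an EDGE header into an MTX banner plus size line:

```python
def edge_to_mtx(src_path, dst_path):
    """Convert an EDGE file to MTX by rewriting its two-line header."""
    with open(src_path) as f:
        lines = f.readlines()
    # The second header line carries the counts: edges, rows, cols
    n_edges, n_rows, n_cols = lines[1].lstrip("% ").split()
    header = [
        "%%MatrixMarket matrix coordinate pattern general\n",
        f"{n_rows} {n_cols} {n_edges}\n",
    ]
    # Keep the edge list itself unchanged
    with open(dst_path, "w") as f:
        f.writelines(header + lines[2:])
```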
Stanford Large Network Dataset Collection
Another valuable source of network datasets is the website of the Stanford Network Analysis Platform (SNAP) (https://snap.stanford.edu/index.html), which is a general-purpose network analysis library that was written in order to handle even fairly large graphs, with hundreds of millions of nodes and billions of edges. It is written in C++ to achieve top computational performance, but it also features interfaces with Python in order to be imported and used in native Python applications.
Although networkx is currently the main library for studying networks, SNAP and other libraries (more on this shortly) can be orders of magnitude faster than networkx, and they may be used in its place for tasks that require higher performance. On the SNAP website, you will find a specific web page for Biomedical Network Datasets (https://snap.stanford.edu/biodata/index.html), besides other more general networks (https://snap.stanford.edu/data/index.html), covering similar domains and datasets to the Network Data Repository described previously.
Data is generally provided in a text file format containing a list of edges. Reading such files can be done with networkx
in one code line, using the following command:
g = nx.read_edgelist("amazon0302.txt")
Some graphs might have extra information beyond the list of edges. Such information is included in the dataset archive as a separate file, for example, a file providing some metadata of the nodes, related to the graph via the node IDs.
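A common pattern for attaching such metadata is to build a dictionary keyed by node ID and use nx.set_node_attributes; the graph and metadata values below are purely hypothetical stand-ins for data parsed from such an archive:

```python
import networkx as nx

# Toy graph standing in for one loaded with nx.read_edgelist
g = nx.Graph([("0", "1"), ("1", "2")])

# Metadata keyed by node ID, as it would be parsed from the separate
# metadata file shipped with the dataset (values are made up)
titles = {"0": "product A", "1": "product B", "2": "product C"}
nx.set_node_attributes(g, titles, name="title")
```

After this call, each node carries its metadata as an attribute, accessible via `g.nodes[node_id]["title"]`.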
Graphs can also be read directly using the SNAP library and its interface via Python. If you have a working version of SNAP on your local machine, you can easily read the data as follows:
from snap import LoadEdgeList, PNGraph

graph = LoadEdgeList(PNGraph, "amazon0302.txt", 0, 1, '\t')
Keep in mind that at this point, you will have an instance of a PNGraph object of the SNAP library, and you can't directly use networkx functionalities on it. If you want to use some networkx functions, you first need to convert the PNGraph object to a networkx object. To make this process simpler, in the supplementary material for this book (available at https://github.com/PacktPublishing/Graph-Machine-Learning), we have written some functions that allow you to seamlessly swap back and forth between networkx and SNAP, as illustrated in the following code snippet:
networkx_graph = snap2networkx(snap_graph)
snap_graph = networkx2snap(networkx_graph)
Open Graph Benchmark
This is the most recent addition (dated May 2020) to the graph benchmark landscape, and this repository is expected to gain increasing importance and support in the coming years. The Open Graph Benchmark (OGB) has been created to address one specific issue: current benchmarks are too small compared to real applications to be useful for machine learning (ML) advances. On one hand, some of the models developed on small datasets turn out to be unable to scale to large datasets, proving them unsuitable for real-world applications. On the other hand, large datasets also allow us to increase the capacity (complexity) of the models used in ML tasks and explore new algorithmic solutions (such as neural networks) that benefit from a large sample size to be trained efficiently, allowing us to achieve very high performance. The datasets belong to diverse domains and are classified into three sizes (small, medium, and large), where even the small graphs, despite their name, already have more than 100,000 nodes and/or more than 1 million edges. Large graphs, on the other hand, feature networks with more than 100 million nodes and more than 1 billion edges, facilitating the development of scalable models.
Besides the datasets, OGB also provides, in a Kaggle fashion, an end-to-end ML pipeline that standardizes data loading, experimental setup, and model evaluation. OGB creates a platform to compare and evaluate models against each other, publishing a leaderboard that allows tracking of performance evolution and advancements on specific tasks of node, edge, and graph property prediction. For more details on the datasets and the OGB project, please refer to https://arxiv.org/pdf/2005.00687.pdf.