Data preparation and feature engineering
Before we can use ML, we need to collect our data and convert it into a format that the model can use. We can’t feed the graph G itself to Random Forest and call it a day. We could feed the graph’s adjacency matrix and a set of labels to Random Forest, and it would work, but I want to showcase some feature engineering that we can do.
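As a rough sketch of the adjacency-matrix approach mentioned above, the snippet below turns a graph into a matrix and fits a Random Forest on it. It makes several assumptions that are not from this chapter: the karate club graph that ships with NetworkX stands in for G, the labels come from that graph's built-in club node attribute, and scikit-learn's RandomForestClassifier is used as the Random Forest implementation.

```python
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

# Stand-in graph (an assumption): replace with your own graph G.
G = nx.karate_club_graph()
nodes = list(G.nodes())

# One row and one column per node; a cell is nonzero if the two nodes share an edge.
X = nx.to_numpy_array(G, nodelist=nodes)

# Stand-in labels (an assumption): 1 if the node belongs to the "Officer" club, else 0.
y = [1 if G.nodes[n]["club"] == "Officer" else 0 for n in nodes]

# Fit on the raw adjacency rows; each node's row of connections is its feature vector.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

This works, but every feature is just "is this node connected to node j?", which tells the model little about a node's role in the network. Predicting on the training rows, as done here, only confirms that the pipeline runs; it is not an evaluation.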
Feature engineering is the use of domain knowledge to create additional features (most people call them columns) that will be useful for our models. For instance, looking at the networks from the previous section, if we want to be able to spot the revolutionaries, we may want to give our model additional data such as each node’s degree (number of connections), betweenness centrality, closeness centrality, PageRank, clustering coefficient, and triangle count:
- Let’s start by building our network. This should be easy by now, as we have done it several times:
import networkx as nx
import pandas...