Predicting web traffic with node regression
In machine learning, regression refers to the prediction of continuous values. It is often contrasted with classification, where the goal is to find the correct categories (which are not continuous). In graph data, their counterparts are node classification and node regression. In this section, we will try to predict a continuous value instead of a categorical variable for each node.
The dataset we will use is the Wikipedia Network (GNU General Public License v3.0), introduced by Rozemberckzi et al. in 2019 [2]. It is composed of three page-page networks: chameleons (2,277 nodes and 31,421 edges), crocodiles (11,631 nodes and 170,918 edges), and squirrels (5,201 nodes and 198,493 edges). In these datasets, nodes represent articles and edges are mutual links between them. Node features reflect the presence of particular words in the articles. Finally, the goal is to predict the log average monthly traffic of December 2018.
In this section...