Classifying nodes on PubMed
In this section, we will implement a GraphSAGE architecture to perform node classification on the PubMed dataset (available under the MIT license from https://github.com/kimiyoung/planetoid) [4].
Previously, we saw two other citation network datasets from the same Planetoid family – Cora and CiteSeer. The PubMed dataset displays a similar but larger graph, with 19,717 nodes and 88,648 edges. Figure 8.3 shows a visualization of this dataset as created by Gephi (https://gephi.org/).
Figure 8.4 – A visualization of the PubMed dataset
Node features are 500-dimensional TF-IDF-weighted word vectors. The goal is to classify each node into one of three categories – diabetes mellitus experimental, diabetes mellitus type 1, and diabetes mellitus type 2. Let’s implement this step by step using PyG:
- We load the PubMed dataset from the Planetoid class and print some information about the graph:
from...
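The loading step described above might look like the following minimal sketch. It assumes PyTorch Geometric is installed and that the dataset can be downloaded into the current directory. The helper names `load_pubmed`, `summarize`, and `print_summary` are hypothetical and introduced only for illustration; `Planetoid`, `num_nodes`, `num_edges`, `num_features`, and `num_classes` are standard PyG names.

```python
def load_pubmed(root="."):
    """Download (if needed) the PubMed dataset and return it with its single graph."""
    # Imported lazily so the summary helpers below also work without PyG installed.
    from torch_geometric.datasets import Planetoid

    dataset = Planetoid(root=root, name="PubMed")
    data = dataset[0]  # Planetoid datasets contain exactly one graph
    return dataset, data


def summarize(dataset, data):
    """Collect basic statistics about the graph into a dict (hypothetical helper)."""
    return {
        "nodes": data.num_nodes,
        "edges": data.num_edges,
        "features": dataset.num_features,
        "classes": dataset.num_classes,
        # Planetoid graphs store each undirected edge in both directions,
        # so this ratio is the mean number of neighbors per node.
        "avg_degree": data.num_edges / data.num_nodes,
    }


def print_summary(stats):
    """Print one statistic per line."""
    for key, value in stats.items():
        print(f"{key:>10}: {value}")
```

Calling `print_summary(summarize(*load_pubmed()))` should report node, edge, feature, and class counts that match the figures quoted earlier in this section.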