Building a document topic classifier
To show how to leverage a graph structure, we will use the topological information and the connections between entities provided by the bipartite entity-document graph to train multi-label classifiers that predict document topics. We will analyze two different approaches:
- A shallow machine-learning approach, where we will use the embeddings we extracted from the bipartite network to train traditional classifiers, such as a RandomForest classifier.
- A more integrated and differentiable approach based on a graph neural network applied to heterogeneous graphs (such as the bipartite graph).
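As a rough illustration of the first approach, the sketch below trains a RandomForest classifier on multi-label targets using scikit-learn. It is a minimal example under stated assumptions, not the pipeline used in this chapter: `doc_embeddings` is random data standing in for the embeddings extracted from the bipartite network, and `doc_topics` is a hypothetical toy set of topic lists.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)

# Assumption: random 8-dimensional vectors stand in for real document embeddings.
doc_embeddings = rng.normal(size=(6, 8))
# Assumption: toy topic lists, one list per document.
doc_topics = [["earn"], ["acq"], ["earn", "acq"], ["grain"], ["grain", "earn"], ["acq"]]

# Turn each list of topics into a binary indicator row (one column per topic).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(doc_topics)

# RandomForestClassifier accepts a 2D indicator matrix as a multi-output target.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(doc_embeddings, y)

# Predict indicator rows and map them back to topic names.
predicted = clf.predict(doc_embeddings[:2])
predicted_topics = binarizer.inverse_transform(predicted)
```

Encoding the targets as an indicator matrix lets a single model predict several topics per document, which is what distinguishes the multi-label setting from ordinary multi-class classification.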
Let's consider the first 10 topics, for which we have enough documents to train and evaluate our models:
from collections import Counter
topics = Counter(
    [label for document_labels in corpus["label"...
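As a self-contained illustration of this selection step, the sketch below counts topic frequencies on a small hypothetical corpus and keeps only the most frequent topics. The `toy_labels` data, the `top_k` value, and the final filtering of documents are assumptions made for this example and may differ from what is done with the real `corpus`.

```python
from collections import Counter

# Assumption: each inner list holds the topics assigned to one toy document.
toy_labels = [
    ["earn"], ["acq"], ["earn", "acq"], ["grain"],
    ["earn"], ["acq", "grain"], ["earn"], ["trade"],
]

# Count how many documents carry each topic.
topic_counts = Counter(
    label for document_labels in toy_labels for label in document_labels
)

# Keep the most frequent topics (the text uses 10; 2 is enough for this toy data).
top_k = 2
top_topics = [topic for topic, _ in topic_counts.most_common(top_k)]

# Restrict each document's labels to the selected topics,
# then drop documents that are left with no label at all.
filtered = [
    [label for label in document_labels if label in top_topics]
    for document_labels in toy_labels
]
filtered = [labels for labels in filtered if labels]
```

Restricting the label set in this way keeps only topics with enough examples to split into training and test sets, at the cost of discarding documents that belong exclusively to rare topics.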