Tutorial – clustering and topic modeling
As in several of the previous examples, most of our data can either be classified in a supervised setting or clustered in an unsupervised one. Text-based data usually reaches us as real-world data, which is to say in a raw and unlabeled form.
Let's look at an example in which we make sense of such data and label it from an unsupervised perspective. Our main objective here is to preprocess the raw text, group the documents into five clusters, and then determine the main topics of each cluster. If you are following along using the provided code and documentation, please note that your results may differ because the dataset is dynamic: its contents change as new data is added to the PubMed database. I would urge you to customize the queries to topics that interest you. With that in mind, let's go ahead and begin.
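Before working with real data, it may help to see the three steps named above (preprocess, cluster into five groups, extract the main topics per cluster) in miniature. The following is only a sketch under stated assumptions and not the tutorial's own code: the ten documents are hypothetical stand-ins for PubMed records, the stop-word list is a tiny illustrative one, and a simplified k-means written with the standard library takes the place of a library implementation. The helper names preprocess, tfidf_vectors, kmeans, and top_terms are invented for this illustration.

```python
import math
import re
from collections import Counter

# Hypothetical documents standing in for PubMed records (assumption).
DOCS = [
    "Insulin therapy in diabetes patients",
    "Antibiotic resistance in bacterial infection",
    "Chemotherapy outcomes for breast cancer",
    "Vaccine efficacy against influenza virus",
    "Statin treatment and cardiovascular risk",
    "Diabetes and insulin sensitivity",
    "Bacterial infection and antibiotic dosing",
    "Breast cancer screening and chemotherapy",
    "Influenza vaccine uptake",
    "Cardiovascular risk with statin use",
]

# Tiny illustrative stop-word list (assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "with", "is", "on"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(token_lists):
    """Return the vocabulary and one L2-normalised TF-IDF vector per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vec = [counts[t] * (math.log(n_docs / doc_freq[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Simplified k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Highest-weighted vocabulary terms of a centroid, ties broken alphabetically."""
    ranked = sorted(zip(centroid, vocab), key=lambda pair: (-pair[0], pair[1]))
    return [term for weight, term in ranked[:n] if weight > 0]


tokens = [preprocess(doc) for doc in DOCS]
vocab, vectors = tfidf_vectors(tokens)
labels, centroids = kmeans(vectors, k=5)
topics = [top_terms(vocab, centroid) for centroid in centroids]

for cluster_id, terms in enumerate(topics):
    print(cluster_id, terms)
```

Seeding the centroids with the first five documents keeps this toy run reproducible, but it is a shortcut: on real data a library routine with a proper initialization strategy and a fixed random seed would be the more sensible choice. Reading the top-weighted terms of each centroid is likewise just one simple way to describe a cluster's topic.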
We will begin by querying...