Classifying Topics of Newsgroup Posts
Large corporations and organizations must sort large volumes of unstructured text every day, which makes automating tedious, time-consuming manual tasks a necessity. The good news is that machine learning (ML) can help with analyzing this type of data. This chapter shows how to tag a text document with a topic drawn from a list of predefined topics. The aim is to assign each sample exactly one label, a task that becomes more challenging as the number of topics increases.
We will approach the problem with both supervised and unsupervised ML techniques. First, we expand on the basic exploratory data analysis presented in the previous chapter and create richer, more informative visualizations. Transforming data from a high-dimensional space into a low-dimensional one helps with this task, so we will discuss the relevant dimensionality reduction techniques throughout the chapter. Then, we will implement two classifiers using one of Python’...