Introduction to text classification
Text classification (also known as text categorization) maps a document (a sentence, a Twitter/X post, a book chapter, an email, and so on) to a category from a predefined list of classes. With only two classes, the task is binary classification; sentiment analysis with positive and negative labels is a common example. With more than two classes, the task is either multi-class classification, where the classes are mutually exclusive, or multi-label classification, where they are not, so a document can receive more than one label. For instance, a news article may be related to sports and politics at the same time. Beyond classification, we may want to score documents in the range [-1, 1] or rate them on a scale from 1 to 5. We can solve this kind of problem with a regression model, where the output is numeric, not categorical.
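In practice, the difference between these task types shows up in how a model's raw outputs are turned into predictions. The following is a minimal sketch in plain Python, not tied to any particular library. The label names, the example logits, the 0.5 threshold, and the use of tanh to squash a regression output into [-1, 1] are all illustrative assumptions.

```python
import math

# Hypothetical label set, used only for illustration.
LABELS = ["sports", "politics", "technology"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map one raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits):
    """Mutually exclusive classes: pick the single most probable label."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


def predict_multilabel(logits, threshold=0.5):
    """Non-exclusive classes: keep every label whose probability passes the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


def predict_score(raw_output):
    """Regression: return a number. tanh is one assumed way to bound it to [-1, 1]."""
    return math.tanh(raw_output)


logits = [2.0, 1.5, -3.0]          # made-up raw model outputs
print(predict_multiclass(logits))  # sports
print(predict_multilabel(logits))  # ['sports', 'politics']
print(predict_score(0.3))          # a float between -1 and 1
```

The same logits yield one label in the multi-class setting but two labels in the multi-label setting, because softmax makes the classes compete while sigmoid scores each class independently. Binary classification is the special case with two mutually exclusive classes.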
Luckily, the transformer architecture...