Classical Approaches to Text Representation
Text representation approaches have evolved significantly over the years, and the advent of neural networks and deep neural networks has changed the way we now represent text (more on that later). We have come a long way indeed: from handcrafting features that mark whether a certain word is present in the text, to learning powerful representations such as word embeddings. While there are many approaches, some better suited to a given task than others, we will discuss a few major classical approaches and work with all of them in Python.
One-Hot Encoding
One-hot encoding is perhaps one of the most intuitive approaches to text representation. A one-hot encoded feature for a word is a binary indicator of that term being present in the text. It's a simple approach that is easy to interpret – the presence or absence of a word. To understand this better, let's consider our sample text before stemming...
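Before that walkthrough, here is a minimal sketch of the idea in Python. The sentences and the helper function one_hot below are hypothetical placeholders, not the chapter's actual sample text.

# Hypothetical sample sentence standing in for the chapter's sample text.
sample_text = "dog bites man man bites dog"

# Build the vocabulary: one index per unique word.
vocabulary = sorted(set(sample_text.split()))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # Binary vector with a single 1 at the word's vocabulary index.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))

# A text can then be represented by marking which vocabulary words it contains;
# words outside the vocabulary (here "eats" and "food") are simply ignored.
new_text = "man eats food"
presence_vector = [1 if word in new_text.split() else 0 for word in vocabulary]
print(presence_vector)

Note that each vector is as long as the vocabulary and contains a single 1, so these representations grow sparse quickly as the vocabulary expands; this limitation motivates the denser representations discussed later.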