Summary
In this chapter, we began by discussing the peculiarities of text data and how ambiguity makes NLP difficult. We identified the two key ideas in working with text – preprocessing and representation. We covered the many tasks involved in preprocessing, that is, getting your data cleaned up and ready for analysis, and saw various approaches to removing imperfections from the data.
Representation was the next big aspect – we examined the considerations in representing text and converting it into numbers. We looked at various approaches, beginning with the classical ones: one-hot encoding, the count-based approach, and the TF-IDF method.
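As a quick refresher, the three classical representations can be sketched in a few lines of plain Python. This is an illustrative toy example, not the chapter's implementation; the two-document corpus and the unsmoothed IDF formula are assumptions made for brevity.

```python
import math

# Hypothetical toy corpus to illustrate the three classical representations.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})

# One-hot encoding: each word maps to a vector with a single 1.
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Count-based (bag-of-words): raw term frequency per document.
counts = [[doc.count(w) for w in vocab] for doc in tokens]

# TF-IDF: term frequency weighted by inverse document frequency,
# so words that appear everywhere (like "the") are down-weighted.
def idf(word):
    df = sum(1 for doc in tokens if word in doc)
    return math.log(len(tokens) / df)

tfidf = [[doc.count(w) * idf(w) for w in vocab] for doc in tokens]
```

Note how "the", which occurs in every document, receives an IDF of zero here, while rarer words like "cat" keep a positive weight – the intuition behind TF-IDF.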
Word embeddings are a whole new approach to representing text that leverages an idea from distributional semantics – terms that appear in similar contexts have similar meanings. The word2vec algorithm smartly exploits this idea by formulating a prediction problem: predict a target word given its surrounding context words (the CBOW variant), or conversely, predict the context words given the target word (the skip-gram variant).
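One way to make this prediction problem concrete is to look at the (target, context) training pairs a skip-gram model is trained on. The function below is a minimal sketch, assuming a simple symmetric window; real implementations add subsampling, negative sampling, and dynamic window sizes.

```python
# Hypothetical sketch of skip-gram pair generation: for each target word,
# every word within the window becomes a context word to predict.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
```

With a window of 1, "cat" yields the pairs ("cat", "the") and ("cat", "sat"); the model learns embeddings that make these context words easy to predict from the target.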