Working with Text
Text is a huge source of data: it appears in books, reports, social media posts, and transcriptions of speech. We can apply data science to text in several ways to extract useful information and uncover hidden patterns. Much of the data science that deals with text falls under natural language processing, or NLP, which is the use of computers to extract information from, or gain an understanding of, natural human language. Most machine learning and analytics tools operate on numbers, so we first need to convert our text into a numeric representation, which adds another step to the process. Text analysis also has many nuances, which we'll learn about along the way. In this chapter, we'll cover:
- Basic text preprocessing and cleaning, including TF-IDF and word vectors
- Text analytics such as word counts and word collocations
- Unsupervised learning for text analysis, including topic modeling
- Supervised learning (classification) with text ...
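
To make the idea of turning text into numbers concrete before we begin, here is a minimal sketch that uses only the Python standard library. It builds a simple word-count vector for each document. The two example sentences, the whitespace tokenizer, and the `count_vector` helper are all assumptions made for illustration; they are not the methods developed later in the chapter, which are more robust.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Lowercase and split on whitespace: a deliberately crude tokenizer.
tokenized = [doc.lower().split() for doc in docs]

# Sorted vocabulary of every unique word across all documents.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    # Counter returns 0 for words that are missing from this document.
    return [counts[word] for word in vocabulary]


vectors = [count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of integers with one position per vocabulary word, which is a form that numeric tools can work with. Real text needs more care, such as handling punctuation and very common words.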