We've talked about what text analysis is, where we can find the data, and some of the things to keep in mind before diving into text analysis. But what motivation do you, the reader, actually have to go about doing text analysis?
For starters, there is the sheer abundance of easily available data. In the age of big data, there really is no excuse not to have a look at what our data actually means. Apart from the massive data sets we can download off the internet, we also have access to small data: text messages, emails, or a collection of poems, for example. You could even do a meta-analysis and run an analysis on this very book! Textual data is not only easy to get hold of, but, far more importantly, the results of analyzing it are easy to interpret and understand. Numbers might not always make sense and are not always appealing to look at, but words are easier for us human beings to appreciate.
Text analysis also remains exciting because we can use data that directly involves the user: our own text conversations, our favorite childhood book, or tweets by our favorite celebrity. The personal nature of text data adds an extra bit of motivation, and it also means we are likely to be aware of the nature of the data and what kind of results to expect.
NLP techniques can also help us construct tools that assist personal businesses or enterprises. Chatbots, for example, are becoming increasingly common on major websites, and with the right approach it is possible to build a personal chatbot. This is largely due to a subfield of machine learning called deep learning, in which we use algorithms and structures inspired by the structure of the human brain. These algorithms and structures are also referred to as neural networks. Advances in deep learning have introduced powerful neural networks such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Now, even with minimal knowledge of the mathematical workings of these algorithms, high-level APIs allow us to use these tools. Integrating them into our daily life is no longer reserved for computer science researchers or full-time engineers; with the right collection of data and open source packages, this is well within our capabilities.
Open source packages have become industry standard: Google has released and maintains TensorFlow [21]; packages such as scikit-learn [22] are used by Apple and Spotify; and spaCy [23], which we will discuss extensively throughout this book, is used by Quora, a popular question-and-answer website.
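To give a sense of just how accessible these high-level APIs have become, here is a minimal sketch using spaCy. It assumes the small English model has already been installed (python -m spacy download en_core_web_sm), and the example sentence is purely illustrative.

import spacy

# Load spaCy's small English model (assumed to be downloaded beforehand)
nlp = spacy.load("en_core_web_sm")

# Running the text through the pipeline performs tokenization, part-of-speech
# tagging, dependency parsing, and named entity recognition in one call
doc = nlp("Text analysis helps us make sense of the words around us.")

# A few lines of code already give us structured linguistic information
for token in doc:
    print(token.text, token.pos_, token.dep_)

A handful of lines like these, with no knowledge of the neural networks running underneath, is exactly the kind of accessibility we will rely on throughout this book.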
We are no longer limited by either data or tools, the only two things we need to do text analysis.
The programming language Python will be our friend throughout the book, and all the tools we use will be free, open source software. As we move towards open science, we also move towards open source code, and this will remain a key philosophy throughout the book. In the world of research, open source code means academic results are reproducible and available to all those interested. Python remains an easy-to-use yet powerful language and serves as a great way to enter the world of natural language processing.
One could argue that the last thing needed is the knowledge of how to apply these tools and wrangle the data, and that is precisely the purpose of this book: we hope that, by the end of the journey, the reader will be able to build their own natural language processing pipelines and models.