While it is important to be aware of the techniques and the tools involved in NLP and CL, they are, of course, of little use without any data. Luckily for us, we have access to an abundance of data if we look in the right places. The easiest way to find textual data to work on is to look for a corpus.
A text corpus is a large and structured set of texts and is a great way to start off with text analysis. Examples of freely available corpora are the Open American National Corpus [5] and the British National Corpus [6]. Wikipedia has a useful list of the largest available corpora in its article on text corpora [7]. These are not limited to the English language; various corpora exist in European and Asian languages, and there are constant efforts worldwide to create corpora for a majority of languages. University research labs are another valuable source of corpora – indeed, one of the most iconic English-language corpora, the Brown Corpus, was put together at Brown University.
Different corpora tend to contain varying levels of information, usually dependent on their primary purpose – corpora whose primary function is to aid translation, for example, would have the same sentence present in multiple languages. Another way corpora carry extra information is through annotation. Common examples of annotation in text include Part-Of-Speech (POS) tagging and Named Entity Recognition (NER). POS-tagging refers to marking each word in a sentence with its part of speech (noun, verb, adverb, and so on), while a corpus annotated for NER would have all named entities recognized, such as places, people, and times. We'll go into further detail on both POS-tagging and NER later in the book, in Chapter 5, POS-Tagging and its Applications, and Chapter 6, NER-Tagging and its Applications.
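As a quick, concrete illustration, here is a minimal sketch of what POS-tagged and NER-annotated output looks like, using the spaCy library (one popular choice; the model name en_core_web_sm is an assumption and must be downloaded separately):

```python
import spacy

# Assumes the small English model has been installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The Brown Corpus was put together at Brown University in 1961.")

# POS-tagging: each word is marked with its part of speech
for token in doc:
    print(token.text, token.pos_)

# NER: named entities such as places, people, and times are recognized
for ent in doc.ents:
    print(ent.text, ent.label_)
```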
Based on its structure and the level of information present, a corpus will serve different purposes. Some corpora are built to evaluate clustering or classification tasks, where it is not the annotation that matters but the label or class. In other words, these corpora are designed to aid machine learning tasks such as clustering or classification by providing text with labels tagged by humans. Clustering refers to the task of grouping similar objects together, while classification is the process of deciding which predefined class an object belongs to. Identifying what exactly your dataset is going to be used for is a crucial part of text analysis and an important first step.
Apart from downloading datasets or scraping data off the internet, there are still some rich sources for gathering our textual data – in particular, literature. One example of this is the research done at the University of Pennsylvania, where Alejandro Ribeiro, Santiago Segarra, Mark Eisen, and Gabriel Egan discovered possible collaborators of Shakespeare, a literary history problem that had stumped many researchers [14]. They approached the problem by identifying literary styles – an emerging area of study in computational linguistics called style analysis.
The increased use of computational tools to perform research in the humanities has also led to the growth of Digital Humanities labs in universities, where traditional research approaches are either aided or overtaken by computer science, in particular machine learning and, by extension, natural language processing. Speeches of politicians and proceedings in parliament are another data source often used in this community. TheyWorkForYou [17], a UK parliament tracking system that collects and uploads parliamentary speeches, is one example of the many sites doing this kind of work.
Project Gutenberg is likely the best resource for downloading books, containing over 50,000 free eBooks and many literary classics. Personal PDFs and eBooks also remain a resource, but again, it is important to know the legal nature of your text before analyzing it. Downloading a pirated copy of, say, Harry Potter off the internet and publishing text analysis results might not be the best idea if you cannot explain where you got the text from! Similarly, text analysis on private text messages might not only annoy your friends but could also infringe on privacy laws.
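As a rough sketch of how simple this can be, a public-domain book can be fetched with nothing more than the standard library; the URL below is a sample Project Gutenberg plain-text link and may need adjusting for the book you actually want:

```python
from urllib import request

# Sample Project Gutenberg plain-text URL (book IDs and file names vary,
# so adjust this to the title you actually want to download)
url = "https://www.gutenberg.org/files/1342/1342-0.txt"

with request.urlopen(url) as response:
    raw_text = response.read().decode("utf-8")

print(len(raw_text), "characters downloaded")
print(raw_text[:300])  # peek at the opening of the file
```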
Fig 1.4 An example of a text dataset list – here, it is of reviews datasets found on
So where else, apart from downloading a structured dataset straight off the internet, do we get our textual data? Well, the internet, of course. Even if it isn't labeled, the sheer amount of text on the internet means that we can access large parts of it – Wikipedia is one such example [7]: the media dump of all the content on Wikipedia, after unzipping, is about 58 GB (as of April 2018), more than enough text to play around with. The popular news aggregation website reddit.com [9] allows for easy web-scraping and is another great resource for text analysis.
Python again remains a great choice for any such web-scraping, and libraries such as BeautifulSoup [10], urllib [11], and scrapy [12] are designed particularly for this. It is important to be careful about the legal side of things here, and to check the terms and conditions of the website you are scraping data from – a number of websites do not allow you to use their content for commercial purposes.
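As a rough sketch of what such scraping can look like with BeautifulSoup [10] and the standard library (the URL and tag choices here are placeholders, and you should always check the site's terms and robots.txt first):

```python
from urllib import request
from bs4 import BeautifulSoup

# Placeholder URL - replace with a page you are allowed to scrape
url = "https://example.com/articles"

html = request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every paragraph tag on the page
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:5])
```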
Twitter is another website that is fast becoming a very important part of text analysis – even academia takes this resource very seriously (What is Twitter, a social network or a news media? [13] has over 5,000 citations!), with multiple papers written on the text analysis of tweets, and even full-fledged tools [15] built to do sentiment analysis. The Twitter streaming API allows us to easily mine Twitter for textual data, and the Python interface [16] is straightforward. Most world leaders are users of Twitter, as are celebrities and major news corporations – there are plenty of interesting insights Twitter can offer us.
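A minimal sketch of what streaming tweets might look like, using tweepy, one common Python interface to the streaming API (an assumption here, not necessarily the interface referenced in [16]); the credentials are placeholders obtained from Twitter's developer portal, and class names may differ between tweepy versions:

```python
import tweepy

# Placeholder credentials, obtained from Twitter's developer portal
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"


class PrintListener(tweepy.StreamListener):
    """Print the text of every incoming tweet."""

    def on_status(self, status):
        print(status.text)


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Stream English-language tweets containing a keyword of interest
stream = tweepy.Stream(auth=auth, listener=PrintListener())
stream.filter(track=["nlp"], languages=["en"])
```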
Fig 1.5 An example of the rich text resource Twitter has become, with multiple structured datasets available [7]. These datasets, all mined from Twitter, are built for particular tasks and fall under the category of labeled datasets we discussed before.
Other examples of textual information you can get off the internet include research articles, medical reports, restaurant reviews (the Yelp! dataset comes to mind), and other social media websites. Sentiment analysis is usually the prime objective in these cases. As the name suggests, sentiment analysis refers to the task of identifying sentiment in text. These sentiments can be basic, such as positive or negative sentiment, but we could have more complex sentiment analysis tasks where we analyze whether a sentence contains happy, sad, or angry sentiments.
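As a tiny illustration of basic positive/negative sentiment scoring (here using NLTK's VADER analyzer, which is just one of many possible tools and not one this chapter depends on):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The VADER lexicon needs to be downloaded once
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# Each result contains negative, neutral, and positive scores,
# plus a 'compound' value summarizing the overall sentiment
print(sia.polarity_scores("The food was absolutely wonderful!"))
print(sia.polarity_scores("The service was slow and the staff were rude."))
```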
It's clear that if we look hard enough, it's easy to find data to play around with. But let's take a small step back from downloading data off the internet – where else can we try to find information?
Right in our hands, as it turns out – we send and receive text messages and emails every day, and we can use this text for text analysis. Most text messaging applications have interfaces to download chats. WhatsApp, for example, will mail the data to you [18], with both media and text. Most mail clients have the same option, and the advantage in both these cases is that this kind of data is often well organized, allowing for easy cleaning and preprocessing before we dive into the analysis.
One aspect we've ignored so far while talking about data is the noise that is often present in text – tweets, for example, are full of short forms and emoticons, and in some cases we have multilingual data where a simple analysis might fail. This brings us to arguably the most important aspect of text analysis: preprocessing.