Introduction
Many interesting analysis techniques can be used on a large corpus of words. Whether it be examining the structure of a sentence or the content of a book, these recipes will introduce us to some useful tools.
When manipulating strings for data analysis, substring search and edit distance computations are among the most commonly used operations. Since numbers are often found in a corpus of text, this chapter will start by showing how to represent numbers in an arbitrary base as a string. We will then cover a couple of string-searching algorithms and focus on extracting text to study not only the words themselves but also how the words are used together.
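As a small taste of the first topic, the following is a minimal sketch of writing a number in an arbitrary base as a string. It is an illustration only, not the recipe's implementation: the function name `to_base` and the choice of digit symbols (0-9 followed by a-z, which limits it to bases 2 through 36) are assumptions made for this example.

```python
import string

# Digit symbols for bases up to 36: 0-9 followed by a-z (an assumed convention).
DIGITS = string.digits + string.ascii_lowercase


def to_base(n: int, base: int) -> str:
    """Return the non-negative integer n written in the given base (2 to 36)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    symbols = []
    while n > 0:
        # The remainder is the next least-significant digit.
        n, remainder = divmod(n, base)
        symbols.append(DIGITS[remainder])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(symbols))


if __name__ == "__main__":
    print(to_base(255, 16))
    print(to_base(5, 2))
```

Repeated division by the base yields the digits from right to left, which is why the collected symbols are reversed before being joined.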
Many practical applications can be built from the simple set of tools provided in this chapter. For example, in the last recipe, we will demonstrate a way to correct spelling mistakes. How we use these algorithms is entirely up to our creativity, but having them at our disposal is an excellent start.
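To make the link between edit distance and spelling correction concrete, here is a rough sketch under stated assumptions. It uses the standard Levenshtein distance and picks the closest word from a small hand-written vocabulary; the names `edit_distance` and `suggest` and the sample vocabulary are hypothetical, and the recipe at the end of the chapter may take a different approach.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def suggest(word: str, vocabulary: list[str]) -> str:
    """Return the vocabulary entry closest to word (first one wins on ties)."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))


if __name__ == "__main__":
    vocabulary = ["hello", "world", "string"]  # hypothetical word list
    print(suggest("helo", vocabulary))
```

Comparing against every vocabulary entry is fine for a short list, but a real corpus would call for a smarter candidate search.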