Chapter 6. Words and Pixels – Working with Unstructured Data
Most of the data we have looked at thus far is composed of rows and columns with numerical or categorical values. This sort of information fits in both traditional spreadsheet software and the interactive Python notebooks used in the previous exercises. However, data is increasingly available in both this form, usually called structured data, and more complex formats such as images and free text. These other data types, also known as unstructured data, are more challenging than tabular information to parse and transform into features that can be used in machine learning algorithms.
What makes unstructured data challenging to use? It is challenging largely because images and text are extremely high dimensional, consisting of a much larger number of columns or features than we have seen previously. For example, this means that a document may have thousands of words, or an image thousands of individual pixels. Each...