Data wrangling
It is often said that 80–90% of a data scientist's job is dealing with data. At a minimum, you should understand the granularity of the data (that is, what each row represents) and know what each column in the dataset means. Turning a raw data source into a modeling-ready dataset takes several steps of cleaning, organizing, and transforming.
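As a minimal sketch of what checking granularity and column meaning can look like in practice, the snippet below uses pandas on a small made-up table. The frame `loans` and its columns (`loan_id`, `amount`, `term`) are hypothetical stand-ins, not the contents of the raw file used in this section.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
loans = pd.DataFrame({
    "loan_id": [101, 102, 103, 103],
    "amount": [5000, 12000, 7500, 7500],
    "term": ["36 months", "60 months", "36 months", "36 months"],
})

# Granularity check: if each row is meant to be one loan, no loan_id
# should repeat. Count the rows that repeat an earlier loan_id.
n_duplicate_ids = int(loans["loan_id"].duplicated().sum())

# Column overview: inferred type and number of missing values per column.
summary = pd.DataFrame({
    "dtype": loans.dtypes.astype(str),
    "n_missing": loans.isna().sum(),
})

print(loans.shape)
print(n_duplicate_ids)
print(summary)
```

A nonzero duplicate count suggests that the rows are not at the granularity you assumed (or that the file contains repeated records), which is worth resolving before any modeling.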
The dataset used for the Lending Club example in Chapters 3, 5, and 7 was derived from the raw data file that we start from here. In this section, we will illustrate the following steps:
- Import the raw data and determine which columns to keep.
- Define the problem, and create a response variable.
- Convert the implied numeric data from strings into numeric values.
- Clean up any messy categorical columns.
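To make the last three steps concrete, here is a rough preview on a tiny made-up table. The column names, the sample values, and the choice of "Charged Off" as the positive class are assumptions for illustration; the actual columns and problem definition come from the raw file and the discussion that follows.

```python
import pandas as pd

# Hypothetical raw values, with the kinds of mess that raw exports often have.
raw = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "int_rate": ["13.5%", " 7.9%", "21.0%"],
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "rent ", "Mortgage"],
})

clean = pd.DataFrame()

# Response variable: 1 if the loan was charged off, else 0 (assumed definition).
clean["default"] = (raw["loan_status"] == "Charged Off").astype(int)

# Implied numerics stored as strings: strip whitespace and the percent sign,
# then convert; pull the leading digits out of the term.
clean["int_rate"] = raw["int_rate"].str.strip().str.rstrip("%").astype(float)
clean["term_months"] = raw["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Messy categorical: normalize whitespace and case so that equal levels match.
clean["home_ownership"] = raw["home_ownership"].str.strip().str.upper()

print(clean)
```

Building a separate `clean` frame, rather than overwriting columns in `raw`, keeps the original values available for comparison while the cleaning rules are still being worked out.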
Let's begin with the first step: importing the data.
Importing the raw data
We import the raw data file using the following code:
input_csv = "rawloans...