Cleansing the dataset
This step can come before or after the data aggregation we talked about in the previous section. We introduced some concepts around data cleansing in Chapter 2, Machine Learning Basics, so let's look at how to actually do it on a dataset. For this, let's start with the Automobile Dataset. Please refer to the Technical requirements section to access the UCI repository for this dataset:
- Let's download two files: imports-85.data and imports-85.names. The data file is in .csv format, so let's rename the file with the .csv extension and open it using Excel (you can use any text editor). You will now see the data (Figure 4.6):
- You will notice in the preceding screenshot that it is missing the header information. To retrieve the header information, open the .names file in any text editor. You will see the names of attributes as well as their definitions. Create an empty row at the top of your .csv file...
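If you would rather script the header step than edit the file by hand, the following is a minimal sketch of one way to do it, using only Python's standard csv module. The column names in COLUMN_NAMES are our reading of the attribute list in imports-85.names, so please verify them against your own copy of that file. The function names read_with_header and write_with_header are hypothetical helpers made up for this illustration, and SAMPLE_ROW is just an illustrative row in the dataset's format, not a substitute for the downloaded file:

```python
import csv
import io

# Attribute names, assumed to match the order given in imports-85.names.
# Check them against your copy of the file before relying on them.
COLUMN_NAMES = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg",
    "price",
]

# Illustrative row in the same comma-separated layout as the data file.
SAMPLE_ROW = (
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495\n"
)


def read_with_header(source, column_names=COLUMN_NAMES):
    """Read header-less CSV rows and pair every value with a column name."""
    records = []
    for row in csv.reader(source):
        if not row:
            continue  # skip blank lines
        if len(row) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} values, got {len(row)}"
            )
        records.append(dict(zip(column_names, row)))
    return records


def write_with_header(records, target, column_names=COLUMN_NAMES):
    """Write the records back out as CSV, with a header row on top."""
    writer = csv.DictWriter(target, fieldnames=column_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    # In-memory demo; swap io.StringIO for open("imports-85.data", newline="")
    # once you have downloaded the file.
    records = read_with_header(io.StringIO(SAMPLE_ROW))
    output = io.StringIO()
    write_with_header(records, output)
    print(output.getvalue())
```

The length check is there so that a row with a missing or extra value fails loudly instead of silently shifting values under the wrong column name.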