Chapter 6: Feature Selection and Dimensionality Reduction
Activity 11: Converting the CBWD Feature of the Beijing PM2.5 Dataset into One-Hot Encoded Columns
Read the Beijing PM2.5 dataset into the DataFrame PM25:
PM25 <- read.csv("PRSA_data_2010.1.1-2014.12.31.csv")
Create a variable cbwd_one_hot for storing the result of the dummyVars function with ~ cbwd as its first argument:
library(caret)
cbwd_one_hot <- dummyVars(" ~ cbwd", data = PM25)
Pass cbwd_one_hot to the predict() function and cast the output as a DataFrame:
cbwd_one_hot <- data.frame(predict(cbwd_one_hot, newdata = PM25))
Remove the original cbwd variable from the PM25 DataFrame:
PM25$cbwd <- NULL
Using the cbind() function, add cbwd_one_hot to the PM25 DataFrame:
PM25 <- cbind(PM25, cbwd_one_hot)
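The encode, drop, and bind steps above can be tried on a small data frame before running them on the full dataset. The following is a minimal sketch: the toy data frame, its values, and the names encoder and toy_one_hot are hypothetical and stand in for PM25. The factor levels are set explicitly so that the column order does not depend on the locale.

```r
library(caret)

# Hypothetical toy data standing in for PM25 (values are illustrative only).
toy <- data.frame(
  TEMP = c(-11, -12, -11, -14),
  cbwd = factor(c("NW", "cv", "SE", "NE"),
                levels = c("cv", "NE", "NW", "SE"))
)

# Build the encoder from the formula, then apply it to the data.
encoder <- dummyVars(" ~ cbwd", data = toy)
toy_one_hot <- data.frame(predict(encoder, newdata = toy))

# Drop the original column and attach the indicator columns.
toy$cbwd <- NULL
toy <- cbind(toy, toy_one_hot)

print(names(toy))
```

The resulting data frame should contain TEMP followed by one indicator column per level of cbwd, named with the cbwd. prefix as in the activity's output.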
Print the top 6 rows of PM25:
head(PM25)
The output of the previous command is as follows:
##   No year month day hour pm2.5 DEWP TEMP PRES   Iws Is Ir cbwd.cv cbwd.NE
## 1  1 2010     1   1    0    NA  -21  -11 1021  1.79  0  0       0       0
## 2  2 2010     1   1    1    NA  -21  -12 1020  4.92  0  0       0       0
## 3  3 2010     1   1    2    NA  -21  -11 1019  6.71  0  0       0       0
## 4  4 2010     1   1    3    NA  -21  -14 1019  9.84  0  0       0       0
## 5  5 2010     1   1    4    NA  -20  -12 1018 12.97  0  0       0       0
## 6  6 2010     1   1    5    NA  -19  -10 1017 16.10  0  0       0       0
##   cbwd.NW cbwd.SE
## 1       1       0
## 2       1       0
## 3       1       0
## 4       1       0
## 5       1       0
## 6       1       0
Observe the output of the head(PM25) command: the cbwd variable has been replaced by four one-hot encoded columns, cbwd.cv, cbwd.NE, cbwd.NW, and cbwd.SE, one for each wind direction. In the first six rows, only cbwd.NW is set to 1.
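A quick way to confirm that an encoding is truly one-hot is to check that every row contains exactly one 1, and that the original label can be recovered from the position of that 1. The following base R sketch assumes a small, hypothetical encoded data frame that reuses the column names from the output above; it is not taken from the dataset itself.

```r
# Hypothetical encoded rows with the same column names as the activity's output.
encoded <- data.frame(
  cbwd.cv = c(0, 1, 0, 0),
  cbwd.NE = c(0, 0, 0, 1),
  cbwd.NW = c(1, 0, 0, 0),
  cbwd.SE = c(0, 0, 1, 0)
)

# Each row of a one-hot encoding should sum to exactly 1.
stopifnot(all(rowSums(encoded) == 1))

# Recover the original label from the column that holds the 1 in each row.
labels <- sub("^cbwd\\.", "", names(encoded))
recovered <- labels[max.col(as.matrix(encoded), ties.method = "first")]

print(recovered)
```

On the real data, the same check could be run against the four cbwd columns of PM25, provided the cbwd column has no missing values.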