Creating new variables
Creating new variables is useful when data scientists need to analyze something that is not present in the data as it was acquired. Common ways to create new data are splitting a column, computing a calculation, encoding text, and applying a custom function to a variable.
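As a rough illustration of these four tasks, here is a minimal sketch on a small made-up data frame. The toy data, its column names, and the label_total() helper are assumptions invented for this example; they are not part of the Census Income dataset or of any library.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  date   = c("2021-01-15", "2022-06-30"),
  price  = c(10, 20),
  qty    = c(3, 5),
  status = c("yes", "no"),
  stringsAsFactors = FALSE
)

# Hypothetical custom function: label a total as "high" or "low"
label_total <- function(x) ifelse(x >= 50, "high", "low")

new_vars <- toy %>%
  # 1. Split one column into several, using "-" as the separator
  separate(date, into = c("year", "month", "day"), sep = "-") %>%
  mutate(
    # 2. Create a calculation from existing columns
    total = price * qty,
    # 3. Encode text as a number
    status_code = ifelse(status == "yes", 1, 0),
    # 4. Apply a custom function over a variable
    total_label = label_total(total)
  )

print(new_vars)
```

Each step adds columns without touching the original toy object, so the result can be inspected before deciding whether to keep it.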
We went over some good examples of column splitting in this book, such as a datetime split. Now, to illustrate the separate() function from tidyr, the example to be used is based on the Census Income dataset. Look at the target column: it has values such as <=50k and > 50k. Let’s say we wanted to separate only the > or <= signs and put them in a separate column; here is how to do that:
# Split variable target into sign and amount
df_no_na %>%
  separate(target, into=c("sign", "amt"), sep="\\b")
We took the dataset clean of NAs and separated the target column into two new variables: sign and amt. To accomplish that, the...