Tidying when multiple variables are stored as column names
One particular flavor of messy data appears whenever the column names contain multiple different variables themselves. A common example of this scenario occurs when age and sex are concatenated together. To tidy datasets like this, we must manipulate the columns with the pandas .str
attribute. This attribute contains additional methods for string processing.
In this recipe, we will first identify all the variables, of which some will be concatenated together as column names. We then reshape the data and parse the text to extract the correct variable values.
How to do it…
- Read in the men's weightlifting dataset, and identify the variables:
>>> weightlifting = pd.read_csv('data/weightlifting_men.csv') >>> weightlifting Weight Category M35 35-39 ... M75 75-79 M80 80+ 0 56 137 ... 62 55 1 62 152...