Tidying when variables are stored in column names and values
One particularly difficult form of messy data to diagnose appears whenever variables are stored both horizontally across the column names and vertically down column values. You will typically encounter this type of dataset, not in a database, but from a summarized report that someone else has already generated.
Getting ready
In this recipe, variables are identified both vertically and horizontally and reshaped into tidy data with the melt
and pivot_table
methods.
How to do it...
- Read in the
sensors
dataset and identify the variables:
>>> sensors = pd.read_csv('data/sensors.csv') >>> sensors
- The only variable placed correctly in a vertical column is
Group
. TheProperty
column appears to have three unique variables,Pressure
,Temperature
, andFlow
. The rest of the columns2012
to2016
are themselves a single variable, which we can sensibly nameYear
. It isn't possible to restructure this kind of messy data with a single...