Using subsetting to examine logical inconsistencies in variable relationships
At a certain point, data issues come down to deductive logic problems, such as: variable x has to be greater than some quantity a when variable y is less than some quantity b. Once we are through some initial data cleaning, it is important to check for logical inconsistencies. pandas makes this kind of error checking relatively straightforward with subsetting tools such as loc and Boolean indexing. These can be combined with summary methods on series and data frames, which lets us easily compare values for a particular row to values for the whole dataset or some subset of rows. We can also easily aggregate over columns. Just about any question we might have about the logical relationships between variables can be answered with these tools. We work through some examples in this recipe.
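Before turning to the real data, here is a minimal sketch of the kind of check described above. It uses a small made-up data frame, not the NLS data; the column names (weeksworked, wageincome), the index name (personid), and all values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative, not from the NLS
df = pd.DataFrame(
    {
        "weeksworked": [52, 0, 30, 0, 45],
        "wageincome": [42000, 15000, 18000, 0, 0],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="personid"),
)

# Boolean masks, one per logical rule we expect the data to satisfy
income_without_work = (df.weeksworked == 0) & (df.wageincome > 0)
work_without_income = (df.weeksworked > 0) & (df.wageincome == 0)

# Use loc with a mask to pull the rows that violate each rule
no_weeks_but_income = df.loc[income_without_work]
weeks_but_no_income = df.loc[work_without_income]

# Combine with a summary method: rows above the median for the whole dataset
median_income = df.wageincome.median()
above_median = df.loc[df.wageincome > median_income]

# Aggregate over columns: violations per rule, and rows with any violation
flags = pd.DataFrame(
    {
        "income_without_work": income_without_work,
        "work_without_income": work_without_income,
    }
)
violations_per_rule = flags.sum()
rows_with_any_violation = int(flags.any(axis=1).sum())

print(no_weeks_but_income)
print(weeks_but_no_income)
print(violations_per_rule)
```

Keeping each rule as a named Boolean series makes it easy to reuse the same mask for selecting rows, counting violations, and combining rules with & and |.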
Getting ready
We will work with the National Longitudinal Survey of Youth (NLS), mainly with data on employment and education...