Using subsetting to examine logical inconsistencies in variable relationships
At a certain point, data issues come down to deductive logic problems, such as: variable x has to be greater than some quantity a whenever variable y is less than some quantity b. Once we are through some initial data cleaning, it is important to check for logical inconsistencies of this kind. pandas makes this sort of error checking relatively straightforward with subsetting tools such as loc and Boolean indexing. These can be combined with summary methods on Series and DataFrames to compare values for a particular row with values for the whole dataset or for some subset of rows. We can also easily aggregate over columns. Most questions we might have about the logical relationships between variables can be answered with these tools. We work through some examples in this recipe.
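As a rough illustration of the idea, here is a minimal sketch of a logical consistency check using loc and Boolean indexing. The DataFrame, its column names (weeksworked, wageincome, highestgrade), and its values are made up for this example and are not taken from the NLS data; the rule being tested (no wage income without any weeks worked) is likewise only an assumed example of such a constraint.

```python
import pandas as pd

# Hypothetical data for illustration only; not the NLS columns or values.
df = pd.DataFrame(
    {
        "weeksworked": [52, 0, 40, 0, 30],
        "wageincome": [45000, 0, 28000, 12000, 21000],
        "highestgrade": [16, 12, 12, 10, 14],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="personid"),
)

# Assumed rule: a person with zero weeks worked should report no wage income.
# Rows that break the rule are selected with a Boolean mask passed to loc.
inconsistent = df.loc[(df.weeksworked == 0) & (df.wageincome > 0)]
n_inconsistent = len(inconsistent)
print(inconsistent)

# Compare the flagged rows with a summary value for the whole dataset.
print(inconsistent.wageincome - df.wageincome.median())

# Aggregate over columns: count how many of the two work variables are zero
# for each row.
zerocount = (df[["weeksworked", "wageincome"]] == 0).sum(axis="columns")
print(zerocount)
```

The same pattern extends to other rules: write the condition that should never be true as a Boolean expression, pass it to loc, and inspect whatever rows come back.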
Getting ready
We will work with the NLS data, mainly with data on employment and education. We use apply and lambda...