Data scrubbing
It is a commonly repeated statistic that at least 80 percent of a data scientist's work is data scrubbing. This is the process of detecting potentially corrupt or incorrect data and either correcting or filtering it out.
Note
Data scrubbing is one of the most important (and time-consuming) aspects of working with data. It's a key step to ensuring that subsequent analysis is performed on data that is valid, accurate, and consistent.
The nil
value at the end of the election year column may indicate dirty data that ought to be removed. We've already seen that filtering columns of data can be accomplished with Incanter's i/$
function. For filtering rows of data we can use Incanter's i/query-dataset
function.
We let Incanter know which rows we'd like it to filter by passing a Clojure map of column names and predicates. Only rows for which all predicates return true will be retained. For example, to select only the nil
values from our dataset:
(-> (load-data :uk) (i/query-dataset {"Election Year" {:$eq nil}}))
If you know SQL, you'll notice this is very similar to a WHERE
clause. In fact, Incanter also provides the i/$where
function, an alias to i/query-dataset
that reverses the order of the arguments.
The query is a map of column names to predicates and each predicate is itself a map of operator to operand. Complex queries can be constructed by specifying multiple columns and multiple operators together. Query operators include:
:$gt
greater than:$lt
less than:$gte
greater than or equal to:$lte
less than or equal to:$eq
equal to:$ne
not equal to:$in
to test for membership of a collection:$nin
to test for non-membership of a collection:$fn
a predicate function that should return a true response for rows to keep
If none of the built-in operators suffice, the last operator provides the ability to pass a custom function instead.
We'll continue to use Clojure's thread-last macro to make the code intention a little clearer, and return the row as a map of keys and values using the i/to-map
function:
(defn ex-1-5 [] (->> (load-data :uk) (i/$where {"Election Year" {:$eq nil}}) (i/to-map))) ;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}
Looking at the results carefully, it's apparent that all (but one) of the columns in this row are nil
. In fact, a bit of further exploration confirms that the non-nil row is a summary total and ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne
operator, returning only rows where the election year is not equal to nil
:
(->> (load-data :uk) (i/$where {"Election Year" {:$ne nil}}))
The preceding function is one we'll almost always want to make sure we call in advance of using the data. One way of doing this is to add another implementation of our load-data
multimethod, which also includes this filtering step:
(defmethod load-data :uk-scrubbed [_] (->> (load-data :uk) (i/$where {"Election Year" {:$ne nil}})))
Now with any code we write, can choose whether to refer to the :uk
or :uk-scrubbed
datasets.
By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us—and future readers of our code—what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.