Dealing with messy data
The first thing that we need to deal with is the qualitative data in the shape and description fields.
The shape field seems like a likely place to start. Let's see how many items have good data for it:
user=> (def data (m/read-data "data/ufo_awesome.tsv"))
user=> (count (remove (comp str/blank? :shape) data))
58870
user=> (count (filter (comp str/blank? :shape) data))
2523
user=> (count data)
61393
user=> (float 2506/61137)
0.04098991
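The same check can be wrapped in a small helper. The following is only a minimal sketch: `blank-ratio` and `sample` are hypothetical names that do not appear in the original session, the sample records are made up for illustration, and `str` is assumed to be an alias for `clojure.string`, as it appears to be in the session above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample records, standing in for the output of m/read-data.
(def sample
  [{:shape "light"} {:shape " "} {:shape nil} {:shape "disk"} {:shape ""}])

(defn blank-ratio
  "Returns the fraction of records whose value under k is nil, empty,
  or whitespace only. Returns 0 for an empty collection."
  [k records]
  (if (empty? records)
    0
    (/ (count (filter (comp str/blank? k) records))
       (count records))))

;; Three of the five sample records have no usable :shape value.
(blank-ratio :shape sample)
```

Because `str/blank?` treats `nil`, the empty string, and whitespace-only strings alike, one predicate covers every way the field can be missing. The result is an exact ratio; wrap it in `float` to get a decimal value.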
So about 4 percent of the records do not have the shape field set to meaningful data. Let's see what the most popular values for that field are:
user=> (def shape-freqs
         (frequencies
           (map str/trim
                (map :shape
                     (remove (comp str/blank? :shape) data)))))
#'user/shape-freqs
user=> (pprint (take 10 (reverse (sort-by second shape-freqs))))
(["light" 12202]
 ["triangle" 6082]
 ["circle" 5271]
 ["disk" 4825]
 ["other" 4593]
 ["unknown" 4490]
 ["sphere" 3637]
 ["fireball...