Maintaining consistency with synonym maps
One common problem with data is inconsistency. The same value may be capitalized in one record and not in another, abbreviated in one place and spelled out in full elsewhere, or simply misspelled.
When the field is an open domain, such as words in a free-text field, the problem can be quite difficult. However, when the data draws on a limited vocabulary (such as US state names, our example here), there's a simple trick that can help. Full state names are common, but standard postal codes are also often used. A mapping from common variants or mistakes to a normalized form is an easy way to fix inconsistent values in a field.
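As a rough illustration of the idea, the following minimal sketch normalizes a few state values with a lookup map. The names state-synonyms and normalize-state, and the handful of entries in the map, are hypothetical and chosen only for this example. The sketch assumes that trimming and upper-casing the input before the lookup is acceptable for the data at hand.

```clojure
(require '[clojure.string :as string])

;; Hypothetical synonym map: upper-cased variants -> normalized postal code.
;; Only a few illustrative entries are listed here.
(def state-synonyms
  {"ALABAMA" "AL"
   "ALASKA"  "AK"
   "ARIZONA" "AZ"
   "ARIZ."   "AZ"})

(defn normalize-state
  "Trims and upper-cases value, then looks it up in state-synonyms.
  Falls back to the cleaned-up value when the map has no entry for it."
  [value]
  (let [cleaned (string/upper-case (string/trim value))]
    (get state-synonyms cleaned cleaned)))

(map normalize-state ["Alabama" "alaska" " Ariz. " "AZ" "Ohio"])
;; => ("AL" "AK" "AZ" "AZ" "OHIO")
```

Upper-casing before the lookup means the map needs only one entry per spelling, not one per capitalization. Values the map does not know about, such as "Ohio" here, pass through upper-cased but otherwise unchanged, so gaps in the map stay visible in the output.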
Getting ready
For the project.clj file, we'll use a very simple configuration:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
We just need to make sure that the clojure.string/upper-case function is available to us:
(use '[clojure.string :only (upper...