Names de-duplication
Because we pulled entities from an NLP extraction process without any validation, the names we retrieved may be written in many different ways. They can appear in a different order, or contain middle names or initials, a salutation or a nobility title, nicknames, or even typos and spelling mistakes. Although we do not aim to fully de-duplicate the content (such as learning that both Ziggy Stardust and David Bowie stand for the same person), we will introduce two simple techniques that de-duplicate a large amount of data at minimal cost by combining the MapReduce paradigm with functional programming.
Functional programming with Scalaz
This section is all about enriching data as part of an ingestion pipeline. We are therefore less interested in building the most accurate system using advanced machine learning techniques than in building the most scalable and efficient one. We want to keep a dictionary of alternative names for each record...
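As a rough illustration of what such a dictionary could look like, the sketch below uses only the Scala standard library. The names `NameDictionary`, `Alternatives`, and `merge` are hypothetical, and tracking a count per spelling is an assumption made for this example rather than something stated above. The merge is associative and commutative, which is the property a MapReduce-style reduce step needs; Scalaz is understood to offer comparable behaviour for maps through its `|+|` operator, but that is not used here.

```scala
object NameDictionary {
  // Hypothetical type: each alternative spelling mapped to how often it was seen.
  type Alternatives = Map[String, Int]

  // Merge two dictionaries by summing the counts of spellings they share.
  // Associative and commutative, so it can serve as a reduce function.
  def merge(a: Alternatives, b: Alternatives): Alternatives =
    b.foldLeft(a) { case (acc, (name, count)) =>
      acc.updated(name, acc.getOrElse(name, 0) + count)
    }

  // Pick the most frequently observed spelling as the preferred one (an assumption).
  def preferred(dict: Alternatives): String = dict.maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    // Illustrative data only, standing in for two partitions of extracted names.
    val partition1 = Map("David Bowie" -> 2, "Bowie, David" -> 1)
    val partition2 = Map("David Bowie" -> 1, "David R. Bowie" -> 1)

    val merged = merge(partition1, partition2)
    println(merged)
    println(preferred(merged))
  }
}
```

Because partial dictionaries built on different partitions can be merged in any order, the same function could in principle be passed to a reduce operation over records keyed by a common identifier.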