Summary
Data science is not just about machine learning. In fact, machine learning is only a small portion of it. As we understand modern data science, much of the science happens right here, in the data enrichment process. The real magic occurs when one can transform a meaningless dataset into a valuable set of information and draw new insights from it. We have described how to build a fully functional data insight system using nothing more than a simple collection of URLs (and a bit of elbow grease).
In this chapter, we demonstrated how to create an efficient web scraper with Spark using the Goose library, and how to extract and de-duplicate features from raw text using NLP techniques and the GeoNames database. We also covered some useful design patterns, such as mapPartitions and Bloom filters, which are discussed further in Chapter 14, Scalable Algorithms.
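As a quick reminder of how those two patterns can fit together, here is a minimal sketch in plain Python with no Spark dependency. It is not the code used in this chapter: the names BloomFilter and keep_unseen_urls, the filter sizes, and the example URLs are all hypothetical, chosen only for illustration. The function takes an iterator and returns an iterator, which is the shape a per-partition function passed to mapPartitions would have, so any one-time setup runs once per partition rather than once per record.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def keep_unseen_urls(partition: Iterable[str], seen: BloomFilter) -> Iterator[str]:
    """Per-partition function: iterator in, iterator out."""
    # One-time setup per partition; a stand-in for building an expensive object
    # (such as a scraper client) once instead of once per record.
    normalize = str.strip
    for url in partition:
        url = normalize(url)
        if url in seen:
            # Probably processed already; a rare false positive skips a new URL.
            continue
        yield url
```

For example, after `seen = BloomFilter()` and `seen.add("http://a.example/1")`, calling `list(keep_unseen_urls(["http://a.example/1", "http://a.example/2"], seen))` keeps only the second URL, barring a false positive. The trade-off is the usual one for Bloom filters: a small, fixed memory footprint in exchange for a low probability of wrongly treating a new item as already seen.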
In the next chapter, we will be focusing on the people we were able to extract from all...