Summary
In Big Data, HDFS solves the problem of storing and distributing data across multiple nodes. MapReduce solves the problem of distributing execution and takes advantage of data locality. NoSQL databases solve the problem of driving real-time applications and storing frequently updated data efficiently. Some NoSQL databases, however, by design do not provide the commonly requested ability to query the data. Distributed search platforms provide this capability.
As presented, Scalding is capable of tapping into multiple systems. The ubiquity and expressiveness of the language make it a valid technology for completing tasks such as transferring data between SQL, NoSQL, and search systems. Because the taps are also testable components, Scalding can be used to integrate various distributed systems in a very wide range of use cases.
In the next chapter, we will look at some advanced statistical calculations based on matrix operations, and we will see how Scalding can be...