About R
We dabbled a little in R programming in Chapter 2, Access, Speed, and Storage with Hadoop, but in this chapter we formally introduce R as the tool for our data profiling exercises and for adding perspectives (establishing context) to data that will be used in visualizations.
R is a language and environment that is easy to learn, very flexible, and strongly focused on statistical computing. That focus makes it well suited to manipulating, cleaning, and summarizing data, producing probability statistics, and so on (as well as actually creating visualizations with your data), so it's a great choice for the exercises required for profiling, establishing context, and identifying additional perspectives.
In addition, here are a few more reasons to use R when profiling your big data:
R is used by a large number of academic statisticians, so it's a tool that is not going away.
R is pretty much platform independent; what you develop will run almost anywhere.
R has awesome help resources--just Google...
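To make the cleaning and summarizing tasks mentioned earlier a little more concrete, here is a minimal sketch using only base R. The data frame, its column names, and its values are hypothetical and made up purely for illustration; they are not taken from this chapter's datasets.

```r
# Hypothetical sample data (invented for illustration only).
records <- data.frame(
  age    = c(34, 51, NA, 47, 29),
  gender = c("F", "M", "M", NA, "F"),
  visits = c(2, 5, 1, 3, 4)
)

# Profiling: count the missing values in each column of the raw data.
missing_per_column <- colSums(is.na(records))
print(missing_per_column)

# Cleaning: keep only the rows that have no missing values.
clean <- records[complete.cases(records), ]

# Summarizing: basic descriptive statistics for each column.
print(summary(clean))

# Summarizing by group: mean number of visits for each gender.
mean_visits <- aggregate(visits ~ gender, data = clean, FUN = mean)
print(mean_visits)
```

Dropping incomplete rows is only one possible cleaning choice; depending on the data, you might prefer to impute or flag missing values instead.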