Code and Datasets for the rest of the book
The first order of business is to look at the code and Datasets that we will be using for the rest of the chapters.
Code
It is time for you to experiment with Spark APIs and wrangle with data. We have been using the Scala and Python shell in this book and you can continue to do so. You should also explore using an iPython notebook, which is an excellent way for data engineers and data scientists to experiment with data. The iPython notebooks and its Datasets are available at https://github.com/xsankar/fdps-v3. You'll have to download some of the data yourselves due to the restrictions in distributing them. We have provided the appropriate URL as and when the need to download data arises.
IDE
For this book, we will use scala-shell
and pyspark
. The Zeppelin IDE is another fine choice. Python is a better language for data scientists and has a tradition of strong scientific libraries. For those of you who prefer Scala, it is not that hard to map Python...