Loading a simple text file
Let's download a dataset and do some experimentation. One of the best books on machine learning, if not the best, is The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman (Springer). The book's site has an interesting set of datasets. Let's grab the spam dataset using the following command:
wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data
Alternatively, you can get the spam dataset from the GitHub repository at https://github.com/xsankar/fdps-v3.
Note
All the examples assume that you have downloaded the repository into the fdps-v3 directory in your home folder, that is, ~/fdps-v3/. Please adjust the directory name if you have downloaded it somewhere else.
Now, load it as a text file into Spark with the following commands inside your Spark shell:
scala> val inFile = sc.textFile("data/spam.data")
scala> inFile.count()
This loads the spam.data file into Spark with each line being a separate entry...
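To see what "each line being a separate entry" means in practice, here is a minimal sketch for the Spark shell, where sc is already defined. It assumes that every line of spam.data holds numeric values separated by whitespace. The names parseLine and nums are hypothetical helpers introduced only for this illustration.

```scala
// Hypothetical helper: split one line of text into numbers.
// Assumption: fields are separated by whitespace and are all numeric.
val parseLine = (line: String) => line.trim.split("\\s+").map(_.toDouble)

// Quick check on a hand-written line, no Spark needed.
println(parseLine("0 0.64 1").mkString(", "))

// In the Spark shell: load the file, look at one raw entry (a String),
// then convert every entry into an Array[Double].
val inFile = sc.textFile("data/spam.data")
println(inFile.first())
val nums = inFile.map(parseLine)
println(nums.first().length) // number of fields on the first line
```

Because textFile returns an RDD of strings, first() gives back one unparsed line, and map applies parseLine to each entry to produce a new RDD without changing inFile.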