Interactively loading data from S3
Now let's try another exercise with the Spark shell. As part of its EMR Spark support, Amazon has provided sample Wikipedia traffic statistics in S3, in a format that Spark can read. To access the data, you first need to make your AWS access credentials available to the shell. For instructions on signing up for EC2 and setting up the shell parameters, see the Running Spark on EC2 with the scripts section in Chapter 1, Installing Spark and Setting Up Your Cluster. Note that S3 access requires additional keys, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, or the use of the s3n://user:pw@ syntax. You can also set the credentials as the shell environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. We will leave the AWS configuration out of this discussion, but it must be completed before you continue.
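As a rough illustration of the fs.s3n keys mentioned above, the following is a minimal sketch of one way to pass the credentials to Hadoop from inside the Spark shell. It assumes that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are already exported in the environment that launched the shell, and that the s3n filesystem is available in your Hadoop build; your setup may need something different.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is already defined.
// Assumption: both variables were exported before the shell was started;
// sys.env(...) throws NoSuchElementException if either one is missing.
val accessKey = sys.env("AWS_ACCESS_KEY_ID")
val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

// Hand the credentials to the Hadoop s3n filesystem used by Spark.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
```

Reading the keys from the environment keeps them out of the shell history and out of the s3n:// URL, which is the main reason to prefer this over the s3n://user:pw@ form.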
Tip
This is a slightly advanced topic that requires some S3 configuration, which we won't cover here. Stack Overflow has two good links on this, namely http://stackoverflow...