Using S3 with Azure Databricks
To work with S3, we assume that you already have a bucket in AWS S3 containing the objects we want to access, and that you have already set up an access key ID and a secret access key. We will store those access keys in the notebook as variables and use them to access our files in S3 directly, using Spark DataFrames.
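As a minimal sketch, the credentials could be assigned to plain Python variables or, preferably, retrieved from a Databricks secret scope. The scope name and key names below are placeholders, not values from this recipe:

# Hypothetical example: pull the keys from a Databricks secret scope
# (the scope name "aws" and the key names are placeholders).
aws_access_key_id = dbutils.secrets.get(scope="aws", key="access_key_id")
aws_secret_access_key = dbutils.secrets.get(scope="aws", key="secret_access_key")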
Connecting to S3
To make a connection to S3, store your AWS credentials in two variables named aws_access_key_id and aws_secret_access_key, and then register them in the Hadoop configuration. Use the following commands, which assume that you have already saved your credentials in those variables:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)
Once these properties are set, we can access the bucket directly by referencing the location of our files.
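For example, to confirm that the bucket is reachable, we can list its contents. The bucket name here is a placeholder for your own bucket:

# List the objects in the bucket to confirm the connection works
# ("my-bucket" is a placeholder bucket name).
display(dbutils.fs.ls("s3n://my-bucket/"))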
Loading data into a Spark DataFrame
After our credentials have been registered in the Hadoop configuration, we can load our data from S3 into a Spark DataFrame.
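A minimal sketch of such a load, assuming a CSV file in the bucket (the bucket and file paths are placeholders):

# Read a CSV file from S3 into a Spark DataFrame
# (the bucket name and file path are placeholders).
df = spark.read.csv("s3n://my-bucket/data/sample.csv", header=True, inferSchema=True)
display(df)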