For this recipe, we will create an RDD by reading a local file in PySpark. To create RDDs in Apache Spark, you first need to install Spark, as described in the previous chapter. You can run these code samples in the PySpark shell or in a Jupyter notebook. Although this recipe reads local files, a similar syntax applies to Hadoop HDFS, AWS S3, Azure WASBs, and Google Cloud Storage:
Storage type | Example |
Local files | sc.textFile('/local folder/filename.csv') |
Hadoop HDFS | sc.textFile('hdfs://folder/filename.csv') |
AWS S3 | sc.textFile('s3://bucket/folder/filename.csv') |
Azure WASBs | sc.textFile('wasb://bucket/folder/filename... |
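
As a minimal sketch of the local-file case, the following wraps `sc.textFile` in a small helper. The names `read_local_lines` and `demo` are hypothetical and not part of PySpark, and the sample CSV content is made up for illustration. Because `textFile` is lazy, a wrong path would normally surface only when the first action runs, so the helper checks the path up front. This check assumes Spark runs in local mode, where the driver and the executors see the same filesystem.

```python
import os
import tempfile


def read_local_lines(sc, path):
    """Return an RDD with one element per line of the local text file at `path`.

    `sc` is an existing SparkContext (the PySpark shell provides one as `sc`).
    """
    if not os.path.isfile(path):
        # textFile is lazy: without this check, a bad path would only
        # raise an error when the first action (count, take, ...) runs.
        raise FileNotFoundError(path)
    return sc.textFile(path)


def demo():
    # Imported here so the helper above can be loaded without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Write a small, made-up CSV file to read back.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
        f.write("id,name\n1,alpha\n2,beta\n")

    rdd = read_local_lines(sc, f.name)
    print(rdd.count())  # number of lines in the file: 3
    print(rdd.first())  # first line: 'id,name'

    os.remove(f.name)
```

Calling `demo()` from the PySpark shell or a notebook cell reuses the running context through `SparkContext.getOrCreate()`. For the other storage types, the same `sc.textFile` call would be used directly with the corresponding URI, without the local existence check.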