Reading Data in Spark from Different Data Sources
One of the advantages of Spark is its ability to read data from a variety of data sources. However, the read API has not been fully consistent and has changed between Spark versions, so the exact syntax can vary with the version you are using. This section of the chapter explains how to read files in CSV and JSON format.
Exercise 47: Reading Data from a CSV File Using the PySpark Object
To read CSV data, call the spark.read.csv("the file name with .csv") function. Here, we are reading the bank data that was used in the earlier chapters.
Note
The sep parameter is used here. We have to ensure that the correct sep value is used, based on how the values are separated in the source data.
Now let's perform the following steps to read the data from the bank.csv file:
First, let's import the required packages into the Jupyter notebook:
import os
import pandas as pd
import numpy as np
import collections
from sklearn.base import TransformerMixin
import random
import pandas_profiling
Next, import all the required libraries, as illustrated...