Scaling out EDA
Exploratory data analysis (EDA) is a data science process in which you analyze a dataset to understand its main characteristics, sometimes graphically through visualizations and other times simply by aggregating and slicing the data. You have already learned some visual EDA techniques in Chapter 11, Data Visualization with PySpark. In this section, we will explore non-graphical EDA using pandas and compare it with the same process using PySpark and Koalas.
EDA using pandas
Typical EDA in standard Python involves using pandas for data manipulation and matplotlib for data visualization. Let's take a sample dataset that comes with scikit-learn and perform some basic EDA steps on it, as shown in the following code example:
import pandas as pd
from sklearn.datasets import load_boston

# load_boston was removed in scikit-learn 1.2, so this needs an older release
boston_data = load_boston()
boston_pd = pd.DataFrame(boston_data.data, ...
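As a rough illustration of what non-graphical EDA with pandas can look like (inspecting shape and types, summary statistics, missing values, slicing, and aggregating), here is a minimal sketch. It does not use the scikit-learn dataset. Instead it assumes a small made-up DataFrame named houses, whose column names and values are hypothetical and chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset (all values are made up).
houses = pd.DataFrame({
    "rooms": [4, 6, 5, 7, 6, 8],
    "near_river": [0, 1, 0, 1, 0, 1],
    "price": [18.0, 30.0, 21.0, 36.0, 24.0, 45.0],
})

# Size and schema: number of rows/columns and the type of each column.
n_rows, n_cols = houses.shape
dtypes = houses.dtypes

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column.
summary = houses.describe()

# Count of missing values in each column.
missing = houses.isnull().sum()

# Slicing: keep only the rows that satisfy a condition.
large = houses[houses["rooms"] >= 6]

# Aggregating: mean price for each value of near_river.
mean_price_by_river = houses.groupby("near_river")["price"].mean()

# Pairwise correlation between the numeric columns.
corr = houses.corr()

print(f"{n_rows} rows, {n_cols} columns")
print(summary)
print(mean_price_by_river)
```

Each step returns an ordinary pandas object, so the results can be inspected interactively or passed on to further analysis. The same sequence of calls is a reasonable starting checklist for any tabular dataset that fits in memory on a single machine.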