Exploring descriptive statistics
Descriptive statistics are the most fundamental measures you can calculate on your data. In this recipe, we will learn how easy it is to get familiar with our dataset in PySpark.
Getting ready
To execute this recipe, you need to have a working Spark environment. Also, we will be working off of the no_outliers DataFrame we created in the Handling outliers recipe, so we assume you have followed the steps to handle duplicates, missing observations, and outliers.
No other prerequisites are required.
How to do it...
Calculating the descriptive statistics for your data is extremely easy in PySpark. Here's how:
descriptive_stats = no_outliers.describe(features)
That's it!
How it works...
The preceding code barely needs an explanation. The .describe(...)
method takes a list of columns you want to calculate the descriptive statistics on and returns a DataFrame with basic descriptive statistics: count, mean, standard deviation, minimum value, and maximum value. Note that the values in the returned DataFrame are stored as strings, so cast them if you need numeric types for further computation.
Note
You can specify...