Performing aggregations with Apache Spark
In this recipe, we will discuss how to perform aggregations on DataFrames in Apache Spark. Using Python as our primary programming language and the PySpark API, we will go over the various techniques for aggregating your data.
How to do it...
- Import the libraries: Import the required libraries and create a SparkSession object:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max, count, min, approx_count_distinct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = (SparkSession.builder
         .appName("perform-aggregations")
         .master("spark://spark-master:7077")
         .config("spark.executor.memory", "512m")
         .getOrCreate())
spark.sparkContext.setLogLevel("ERROR")
```
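Note that the master URL `spark://spark-master:7077` assumes a standalone Spark cluster is reachable at that hostname; if you are experimenting on a single machine instead, you can swap it for `.master("local[*]")`, which runs Spark locally using all available cores.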
- Read file: Read the netflix_titles.csv file using the read... (a hedged sketch of this step follows).
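The step is cut off here, so the following is a minimal sketch of how the read and a few subsequent aggregations might look, not the recipe's exact code. The file path is hypothetical, the schema follows the publicly available Netflix titles dataset, and the aggregation calls simply exercise the functions imported in step 1.

```python
# Minimal sketch (assumptions: the path is hypothetical, and the column names
# follow the public netflix_titles.csv dataset, not the original recipe).
schema = StructType([
    StructField("show_id", StringType(), True),
    StructField("type", StringType(), True),
    StructField("title", StringType(), True),
    StructField("director", StringType(), True),
    StructField("cast", StringType(), True),
    StructField("country", StringType(), True),
    StructField("date_added", StringType(), True),  # could be parsed as DateType with a dateFormat option
    StructField("release_year", IntegerType(), True),
    StructField("rating", StringType(), True),
    StructField("duration", StringType(), True),
    StructField("listed_in", StringType(), True),
    StructField("description", StringType(), True),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .load("/data/netflix_titles.csv"))  # hypothetical path

# A few aggregations using the functions imported in step 1.
df.agg(
    count(col("show_id")).alias("total_titles"),
    min(col("release_year")).alias("earliest_release"),
    max(col("release_year")).alias("latest_release"),
    approx_count_distinct(col("country")).alias("approx_distinct_countries"),
).show()
```

One design note: `approx_count_distinct` estimates cardinality with HyperLogLog++ rather than computing it exactly, trading a small error margin for much lower memory and shuffle cost on large DataFrames, which is why it appears alongside the exact aggregates.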