Interactive development with Spark and Hudi
Our EMR cluster and notebook are now ready for use. Let's learn how to do interactive development using an EMR notebook.
For this exercise, our use case is to integrate the Hudi framework with Spark to perform UPSERT (update-or-insert, also called merge) operations on top of an S3 data lake.
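To make the UPSERT use case concrete, here is a minimal sketch of the write options such an operation typically needs. The option keys are the ones documented for Hudi's Spark datasource, so check them against the Hudi version on your cluster. The helper name `build_hudi_upsert_options`, the table name, the column names, and the S3 path are all hypothetical and used only for illustration.

```python
def build_hudi_upsert_options(table_name, record_key, precombine_field,
                              partition_field=None):
    """Build a dict of Hudi datasource options for an UPSERT write.

    record_key identifies a row uniquely; when two incoming rows share a
    key, Hudi keeps the one with the larger precombine_field value.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    # Partitioning is optional; only set the key when a column is given.
    if partition_field:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


# Hypothetical table and column names, for illustration only.
opts = build_hudi_upsert_options(
    table_name="orders",
    record_key="order_id",
    precombine_field="updated_at",
    partition_field="order_date",
)

# In a PySpark notebook with the Hudi libraries loaded, a DataFrame `df`
# could then be written along these lines (the path is a placeholder):
# df.write.format("hudi").options(**opts).mode("append").save("s3://<your-bucket>/<prefix>/")
```

The write uses `append` mode so that existing data at the target path is kept and matching records are updated in place, while records with new keys are inserted.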
Let's navigate to our EMR notebook to get started.
Creating a PySpark notebook for development
To get started, in Jupyter Notebook, choose New and then PySpark, as shown in the following screenshot:
This will create a new PySpark notebook. You can write code in each cell and execute the cells one at a time, which makes incremental development and debugging easier.
Next, we will learn how to integrate Hudi libraries with the notebook.
Integrating Hudi with our PySpark notebook
By default, Hudi libraries are not available in our EMR notebook. To make them...