Data Ingestion with Python Cookbook

A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Product type Paperback
Published in May 2023
Publisher Packt
ISBN-13 9781837632602
Length 414 pages
Edition 1st Edition
Author: Gláucia Esppenchutz

Installing PySpark

To process, clean, and transform vast amounts of data, we need a tool that provides resilience and distributed processing, which is why PySpark is a good fit. PySpark is the Python API for Apache Spark, and it lets you write Spark applications in Python.

Getting ready

Before starting the PySpark installation, we need to check which Java version is installed on our operating system:

  1. Here, we check the Java version:
    $ java -version

You should see output similar to this:

openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)

If everything is correct, the command prints a message like the preceding one, reporting OpenJDK version 1.8 (Java 8) or higher. However, some systems don’t have any Java version installed by default. If that is your case, proceed to step 2.

  2. Now, we download the Java Development Kit (JDK).

Go to https://www.oracle.com/java/technologies/downloads/, select your OS, and download the most recent version of JDK. At the time of writing, it is JDK 19.

The download page of the JDK will look as follows:

Figure 1.3 – The JDK 19 downloads official web page

Run the downloaded installer to start the installation process. The following window will appear:

Note

Depending on your OS, the installation window may appear slightly different.

Figure 1.4 – The Java installation wizard window

Click Next for the following two questions, and the application will start the installation. You don’t need to worry about where the JDK will be installed. The default location is chosen to be compatible with other tools’ installations.

  3. Next, we check our Java version again. You should see output similar to the following (the exact version string depends on the JDK you have installed):
    $ java -version
    openjdk version "1.8.0_292"
    OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10)
    OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
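
The same check can be scripted, which is handy when several machines need to be verified. The following is a minimal sketch, not part of the original recipe: the function names parse_java_major and installed_java_major are hypothetical, and the parsing assumes the version banner has the shape shown above (a quoted version string after the word "version").

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8.0_292" means Java 8. Modern scheme: "19.0.1" means Java 19.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major() -> Optional[int]:
    """Run `java -version` and parse it; None if java is missing or unparseable."""
    if shutil.which("java") is None:
        return None
    # `java -version` prints its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("No usable Java runtime found on PATH; install a JDK first.")
    else:
        print(f"Java major version: {major}")
```

Note that the legacy 1.8 string maps to Java 8, which is why the helper treats a leading 1 as a prefix rather than as the major version.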

How to do it…

Here are the steps to perform this recipe:

  1. Install PySpark from PyPI:
    $ pip install pyspark

If the command runs successfully, the last lines of the installation output will look like this:

Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.2

  2. Run the pyspark command to open the interactive shell. You should see output similar to this:
    $ pyspark
    Python 3.8.10 (default, Jun 22 2022, 20:18:18)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    22/10/08 15:06:11 WARN Utils: Your hostname, DESKTOP-DVUDB98 resolves to a loopback address: 127.0.1.1; using 172.29.214.162 instead (on interface eth0)
    22/10/08 15:06:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    22/10/08 15:06:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
          /_/
    Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
    Spark context Web UI available at http://172.29.214.162:4040
    Spark context available as 'sc' (master = local[*], app id = local-1665237974112).
    SparkSession available as 'spark'.
    >>>

The startup output includes some useful information, such as the Spark version and the Python version used by PySpark. The version numbers on your machine may differ from the ones shown here.

  3. Finally, we exit the interactive shell as follows:
    >>> exit()
    $
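
The steps above can also be verified from a plain Python script instead of the interactive shell. This is a minimal sketch under stated assumptions: installed_version and smoke_test are hypothetical helper names, and the application name install-check is arbitrary. The smoke test imports PySpark lazily, so the version check still works on a machine where the installation failed.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def smoke_test() -> str:
    """Start a local Spark session, run a tiny job, and return the Spark version."""
    # Imported lazily so installed_version() works even without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("install-check")  # arbitrary name, chosen for this sketch
        .getOrCreate()
    )
    try:
        # A trivial job that forces the JVM side to do some work.
        assert spark.range(5).count() == 5
        return spark.version
    finally:
        spark.stop()


if __name__ == "__main__":
    for name in ("pyspark", "py4j"):
        print(name, installed_version(name) or "not installed")
```

Once both packages are reported as installed and Java is available, calling smoke_test() should return the same Spark version that the shell banner prints.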

How it works…

As noted at the beginning of this recipe, Spark is a robust, open source framework that runs on top of the JVM and provides resilient, distributed processing of vast amounts of data. As Python grew in popularity over the past few years, it became necessary to have a solution that lets Spark work alongside Python.

PySpark is a Python interface to the Spark APIs. It relies on Py4J, which lets Python code interact dynamically with objects in the JVM. This is why Java must be installed on our OS before we can use Spark. Installing PySpark from PyPI also installs the Spark and Py4J components, making it easy to start the application and build the code.
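
To see the bundled components for yourself, you can look inside the installed package. The sketch below assumes the layout used by the PyPI distribution, where the Spark JAR files sit in a jars folder inside the pyspark package; other distributions may be organized differently. The function name bundled_jars is hypothetical.

```python
from importlib import util
from pathlib import Path
from typing import List


def bundled_jars(package: str = "pyspark") -> List[str]:
    """List JAR file names found in the `jars` folder of an installed package."""
    try:
        # find_spec locates a top-level package without importing it.
        spec = util.find_spec(package)
    except (ImportError, ValueError):
        return []
    if spec is None or not spec.submodule_search_locations:
        return []
    names: List[str] = []
    for location in spec.submodule_search_locations:
        jars_dir = Path(location) / "jars"  # assumed PyPI layout
        if jars_dir.is_dir():
            names.extend(sorted(p.name for p in jars_dir.glob("*.jar")))
    return names


if __name__ == "__main__":
    jars = bundled_jars()
    print(f"{len(jars)} JAR files found")
    # Show only the Spark core and Py4J JARs, if present.
    for name in jars:
        if name.startswith(("spark-core", "py4j")):
            print(name)
```

An empty result means either that PySpark is not installed or that the installation does not follow the assumed layout.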

There’s more…

Anaconda is a convenient way to install PySpark and other data science tools. It automates the manual installation steps and provides a friendly interface for installing and managing Python components, such as NumPy, pandas, or Jupyter:

  1. To install Anaconda, go to the official website and select Products | Anaconda Distribution: https://www.anaconda.com/products/distribution.
  2. Download the distribution according to your OS.

For more detailed information about how to install Anaconda and other powerful commands, refer to https://docs.anaconda.com/.

Using virtualenv with PySpark

It is possible to configure and use virtualenv with PySpark, and Anaconda does it automatically if you choose this type of installation. For the other installation methods, however, we need to take some additional steps so that our Spark cluster (running locally or on a server) uses the virtual environment. These steps include pointing Spark to the virtualenv /bin/ folder and to the location of your PySpark installation.
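
One common way to point Spark at a virtualenv interpreter is through the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, which Spark reads to decide which Python executable to run. The following is a minimal sketch of that idea, not the full procedure from the article referenced in the next section: the helper name pyspark_python_env and the path /opt/venvs/ingest are hypothetical, and on Windows the interpreter usually lives under Scripts rather than bin.

```python
from pathlib import Path
from typing import Dict


def pyspark_python_env(venv_dir: str, bin_dir: str = "bin") -> Dict[str, str]:
    """Build the environment variables that point PySpark at a virtualenv."""
    interpreter = str(Path(venv_dir) / bin_dir / "python")
    return {
        "PYSPARK_PYTHON": interpreter,         # Python used by the workers
        "PYSPARK_DRIVER_PYTHON": interpreter,  # Python used by the driver
    }


if __name__ == "__main__":
    # "/opt/venvs/ingest" is a hypothetical virtualenv location.
    for key, value in pyspark_python_env("/opt/venvs/ingest").items():
        print(f"{key}={value}")
```

These variables need to be set before the Spark session is created, for example by exporting them in the shell or by updating os.environ at the top of the script.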

See also

There is a nice article about this topic, Using VirtualEnv with PySpark, by jzhang, here: https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245932.
