Building ETL Pipelines with Python

Building ETL Pipelines with Python: Create and deploy enterprise-ready ETL pipelines by employing modern methods

Brij Kishore Pandey, Emily Ro Schoof
Paperback Sep 2023 246 pages 1st Edition

A Primer on Python and the Development Environment

Whether your production environment caters to only one data pipeline at a time or a whole multitude of overlapping systems, the core tenets of data environment management remain the same.

We have dedicated this chapter to breaking down the foundational roots of all successful applications by discussing the basic principles of the Python programming language and how utilizing package management applications can create clean, flexible, and reproducible development environments. We will walk you through a step-by-step tutorial on how to install and establish a basic Git-tracked development environment that will prevent future confounding modular incompatibilities from impacting the successful deployment of your data pipelines in production.

By the end of this chapter, you will have a strong understanding of why Python is a powerful tool for developing highly customized data transformation ecosystems. We will cover the following topics:

  • Python fundamentals
  • Using Python attributes to build an application’s foundation
  • Key attributes of an effective development environment
  • Downloading and installing a local integrated development environment (IDE)
  • Creating and cloning a Git-tracked repository into your IDE
  • Managing project packages and circular dependencies with a module management system (MMS)

Introducing Python fundamentals

Since Python is the language in which we will design pipelines, it is a good idea to go through Python’s core fundamentals.

Python is a general-purpose, dynamically typed programming language that supports scripting as well as object-oriented, procedural, and functional programming. Its reference implementation, CPython, is written in C, a compact, low-level procedural language that maps efficiently to machine instructions with minimal runtime support. Thanks to its human-readable syntax, Python has become one of the most popular programming languages of the 21st century.

Python’s ubiquitous nature is echoed by its vast online community and well-supported open source libraries.

In this section, we are going to cover the following topics:

  • Python data structures
  • if…else conditions
  • Looping techniques
  • Python functions
  • Object-oriented programming using Python
  • Working with files in Python

To brush up on the key concepts necessary for ETL pipelines, let’s take a look at each of these topics in detail so that we can understand them better.

An overview of Python data structures

Here are the data structures available natively in Python:

  • List: This is like a one-dimensional dynamic array that can hold heterogeneous or homogeneous elements separated by commas. Python lists are mutable and ordered. Like arrays, the index of the first item is 0. Negative indices count from the end of the list: the last item has index -1, the second-to-last has index -2, and so on. Lists are represented by []. An index can be used to get a value from a list. The following example shows some of the most important indexing operations on lists:
    sample_list = [1, "another", "list", "a", [3, 4]]
    print(sample_list[0])    # 1
    print(sample_list[1])    # another
    print(sample_list[-1])   # [3, 4]
    print(sample_list[1:4])  # ['another', 'list', 'a']
    print(sample_list[:4])   # [1, 'another', 'list', 'a']
    print(sample_list[3:])   # ['a', [3, 4]]

    Python lists also come with a variety of built-in methods, such as append(), extend(), and pop().

    Python list methods can be found in the following documentation: https://docs.python.org/3/tutorial/datastructures.html.

  • Dictionary: Dictionaries are Python’s equivalent of hash maps. Dictionaries are represented by comma-separated key-value pairs inside curly braces. Dictionary keys are unique and must be hashable, which covers most immutable data types, such as strings, numbers, and tuples that contain only hashable items. The main purpose of a dictionary is to store a value with some key and return the value for that key when needed. Like lists, dictionaries come with various methods:
    sample_dict = {"Key1": "Value1", 2: 3, "Age": 23}
    print(sample_dict["Age"])  # 23
    print(sample_dict[2])      # 3

    You can refer to the following documentation for more details on dictionary methods: https://docs.python.org/3/tutorial/datastructures.html#dictionaries.

  • Tuple: Tuples are an ordered collection of elements. Like lists, they can hold both heterogeneous and homogeneous data. The only major difference between a tuple and a list is that tuples are immutable while lists aren’t.

    Tuples are represented by parentheses or round brackets:

    sample_tuple = (1, "2", "brij")
    print(sample_tuple[0])  # 1

    Refer to the following documentation for more details on tuples: https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences.

  • Sets: Python sets are collections of items that are unordered, changeable, and do not allow duplicate values. It is useful to use sets when you need to store a collection of elements, but do not care about order or duplicates.

    Like dictionaries, sets are represented by curly braces:

    sample_set = {5, 9}
    sample_set.add(4)
    print(sample_set)  # contains 4, 5, and 9 (display order is not guaranteed)
    sample_set.remove(9)
    print(sample_set)  # contains 4 and 5
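
As a brief illustration of how these structures can work together, here is a minimal sketch that removes duplicate records from a list of dictionaries by tracking the keys already seen in a set. The function name dedupe_records, its key parameter, and the sample trip records are hypothetical and chosen only for this example.

```python
def dedupe_records(records, key):
    """Return the records with duplicate `key` values removed, keeping first occurrences."""
    seen = set()            # set: fast membership checks, no duplicates
    unique = []             # list: preserves the original order
    for record in records:  # each record is a dictionary
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Hypothetical sample data
trips = [
    {"trip_id": 1, "fare": 12.5},
    {"trip_id": 2, "fare": 7.0},
    {"trip_id": 1, "fare": 12.5},
]
unique_trips = dedupe_records(trips, "trip_id")
print(len(unique_trips))  # 2
```

The set gives fast lookups for values that have already been seen, while the list keeps the surviving records in their original order.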

In this section, we covered Python data structures. Other data structures, such as the frozenset, are used less frequently, so we will skip them.

In the next section, we will discuss if…else conditions in Python.

Python if…else conditions or conditional statements

We often need decision-making capabilities in programming languages to execute a block of code based on certain conditions. These are known as if…else conditions or conditional statements.

A conditional statement in Python allows you to specify a certain action to be taken only if a certain condition is met. This is useful because it allows you to make your code more efficient by only performing certain actions when they are necessary.

For example, to check whether a number is divisible by 3, you can test the remainder of dividing it by 3 with the modulo operator (%) and run a given block of code only when that remainder is zero.

The following is an example of a sample if…else condition:

'''In this program,we check if a number is divisible by 3 and
display an appropriate message'''
num = 9
# Try below two as well:
# num = 7
# num = 6
if num == 0:
    print("This is a Zero")
elif num % 3 == 0:
    print("Number is divisible by 3")
else:
    print("Number is not divisible by 3")

Note that Python spells the else-if branch as elif. The next section will be dedicated to looping techniques, so let’s get started!

Python looping techniques

A loop iterates over a sequence or block of code until a condition is met or until a fixed number of iterations are performed.

The Python language supports various looping techniques. They are very useful in programming and in maintaining code structures. Because a for loop iterates directly over the items of a sequence, there is often no need to declare and maintain a separate index variable. There are two types of loops in Python:

  • for loop: Python for loops are used to traverse sequences and other iterables, such as strings, lists, sets, and so on. Once started, the loop continues until it reaches the last element of the sequence. A break statement terminates the iteration early:
    # Program to return a list of even numbers from a given list
    # Given list
    numbers = [7, 4, 3, 5, 8, 9, 8, 6, 14]
    # Declare an empty list
    even_numbers = []
    # Iterate over the list
    for num in numbers:
        # Add the even numbers to the even_numbers list
        if num % 2 == 0:
            even_numbers.append(num)
    print("The Even number list is ", even_numbers)
  • while loop: while loops in Python are used to repeat a block of code for as long as a given condition remains true. The while loop will continue indefinitely if the condition never becomes false:
    count = 0
    while count < 11:
        print('The current number is :', count)
        count = count + 1
    print('While loop terminates here.')
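
The break statement mentioned above can be sketched as follows. This is a minimal example under assumed names: the first_negative function and its sample inputs are hypothetical and not part of the original text.

```python
def first_negative(values):
    """Return the first negative number in `values`, or None if there is none."""
    found = None
    for value in values:
        if value < 0:
            found = value
            break  # stop iterating as soon as a match is found
    return found


print(first_negative([7, 4, -3, 5, -8]))  # -3
print(first_negative([1, 2, 3]))          # None
```

Once break runs, the remaining elements are never visited, which matters when the sequence is long.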

Now that we have gained an understanding of Python loops, let’s move on to discussing functions. Functions are a vital part of Python programming as they allow us to create reusable blocks of code that can be easily executed multiple times. By utilizing functions, we can write more efficient and organized code, making it easier to debug and maintain. In the upcoming section, we will discuss how to create and utilize functions in our programs.

Python functions

A function is a reusable block of code that can be used to execute certain tasks. A Python function starts with the def keyword. The Python function can be thought of as a recipe in a cookbook – it tells the computer what steps to follow to complete a certain task. You can give a function a name, and then use that name to call the function whenever you need to perform that task. Functions can also accept input (arguments) and can return a result (return values). Functions are useful because they allow you to reuse code, making your programs more efficient and easier to read.

The following is an example of a function in Python:

def div_of_numbers(num1, num2):
    """This function returns the division of num1 by num2"""
    # This function takes two parameters, namely num1 and num2.
    if num2 == 0:
        # Below is a return statement.
        return 'num2 is zero.'
    else:
        # Below is another return statement.
        return num1 / num2
#This is how a python function is called.
print(div_of_numbers(8,0))
print(div_of_numbers(8,4))
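
The function above returns a string when num2 is zero, so callers receive either a number or a message. One alternative, sketched here under the assumption that raising an exception is acceptable to the caller, is to signal the error with ValueError. The name safe_divide is hypothetical.

```python
def safe_divide(num1, num2):
    """Return num1 / num2, raising ValueError when num2 is zero."""
    if num2 == 0:
        raise ValueError("num2 must not be zero")
    return num1 / num2


try:
    print(safe_divide(8, 4))  # 2.0
    print(safe_divide(8, 0))  # raises ValueError
except ValueError as error:
    print("Could not divide:", error)
```

With this design the return value is always a number, and the caller decides how to handle the error case.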

We just saw how to write a sample function in Python and call it with arguments. The next step will be to see how a function can be used as a method within a Python class.

Object-oriented programming with Python

Python is a programming language that allows the use of multiple programming paradigms, including object-oriented programming. In object-oriented programming, data and behavior are bundled into objects, which have attributes and methods. In Python, every value, including numbers, strings, functions, and classes, is an object. Additionally, Python supports multiple inheritance, which allows a class to inherit attributes and behaviors from more than one parent class. This allows greater flexibility and customization within the language. Let’s take a look at this in more detail:

  • Class: A class is the blueprint of objects. Python uses the class keyword to create a class:
    class DataPipeline():
        first_tool = "AirFlow"

    The following is how we can create an instance (or object) of this class:

    datapipeline = DataPipeline()

    We can pass parameters while creating an instance of the class. These parameters are collected in an initializer method known as __init__. This method is called as soon as the object is created.

    We can also write functions inside a class. A function inside a class is known as a method.

    A method can be called using a dot operator on the instance of the class. The first parameter of a method is conventionally named self and refers to the instance on which the method is called. Any other name would work, but this is discouraged as it impairs the readability of the code.

  • Inheritance: Inheritance is a fundamental aspect of object-oriented programming that allows developers to create new classes based on the properties and behaviors of existing classes.

    By utilizing inheritance, we can create a new class that inherits the data and methods of an existing class without having to modify the original class in any way. This means that we can create a new class that has all the same features as the original class but with the added ability to modify or extend the inherited data and methods to fit the needs of the new class. Inheritance is a powerful tool that enables developers to create complex and reusable code, and it is a key aspect of object-oriented programming languages such as Java, C++, and Python.

    The following example illustrates inheritance:

    # parent class
    class DataPipelineBook:
        def __init__(self):
            print("This book is very hot in market")
            self.pages = 300

        def what_is_this(self):
            print("Book")

        def get_pages(self):
            return self.pages

    The preceding code defines a class called DataPipelineBook in Python. A class is a template for creating objects, and an object is an instance of a class.

    The __init__ method is a special method in Python classes that is called when an object is created from a class. It is commonly known as the constructor. In this case, the __init__ method prints a message to the console and sets the value of the pages attribute to 300.

    The what_is_this method is a regular method that is defined within the class. It simply prints the Book string to the console.

    The get_pages method is also defined within the class. It returns the value of the pages attribute, which is set to 300 in the __init__ method. The method is named get_pages rather than pages because an instance attribute named pages would otherwise shadow a method of the same name.

    To use this class, you would need to create an object from it and then call its methods. The next step is to create a child class. An example of creating a child class is shown here:

    # child class
    class PythonDataPipelineBook(DataPipelineBook):
        def __init__(self):
            # call super() function
            super().__init__()
            print("Create Data Pipeline with Python")

        def what_technology_is_used(self):
            return "Python"

    pipeline = PythonDataPipelineBook()
    pipeline.what_is_this()
    pipeline.what_technology_is_used()

The preceding code defines a child class called PythonDataPipelineBook that inherits from a parent class called DataPipelineBook. The child class has a constructor method (__init__) that calls the super() function to initialize the parent class. It also has a method called what_technology_is_used that returns the string "Python".

The code then creates an object of the PythonDataPipelineBook class called pipeline and calls the what_is_this method from the parent class on it. Finally, it calls the what_technology_is_used method on the pipeline object.
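
The Class bullet earlier mentions passing parameters to __init__ and calling methods with the dot operator. The following minimal sketch illustrates both ideas. The name and tool parameters and the describe method are hypothetical additions made for illustration only.

```python
class DataPipeline:
    """A hypothetical pipeline description used only for illustration."""

    def __init__(self, name, tool="AirFlow"):
        # __init__ receives the arguments passed when the instance is created
        self.name = name
        self.tool = tool

    def describe(self):
        # self refers to the instance the method is called on
        return f"{self.name} runs on {self.tool}"


pipeline = DataPipeline("taxi_trips")
print(pipeline.describe())  # taxi_trips runs on AirFlow
```

Each instance keeps its own name and tool values, so several pipelines can be described with the same class.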

Next, we’ll talk about how Python handles files.

Working with files in Python

The Python language is powerful and makes handling files a breeze. In Python, several operations can be performed on files, including opening, reading, writing, and appending. Let’s take a look at how each of these operations can be performed in Python:

  • Open: To read or write to a file, you first need to open it. Here’s how you can open a file in Python:
    f = open("yellowtaxidata.txt")
  • Read, write, or append: When you open a file, you can specify the mode in which you want to open it. The three most common modes are r (for reading), w (for writing), and a (for appending).

    Here’s an example of how you can open a file in each of these modes:

    f = open("test_file.txt", "r")  # This file is opened in read mode.
    f = open("test_file.txt", "w")  # This file is opened in write mode.
    f = open("test_file.txt", "a")  # This file is opened in append mode.
  • Close: Every time you open a file, it’s important to close it once you’re finished with it. It’s easy to do – just call the f.close() method.

    But there’s an even better way to ensure your files are closed properly: using a context manager. When you open a file using a context manager, the file is automatically closed once you’re done with it. This helps prevent any potential issues or errors that could arise from leaving a file open.

    So, the next time you’re working with files, remember to close them properly by using either f.close() or a context manager. It’s a simple step that can save you a lot of headaches in the long run!

A context manager is especially handy when writing to a file, because it guarantees that the file is closed once the write operation is complete:

with open("test_file.txt", "w") as f:
    f.write("This is test data")

By using the with keyword, you can ensure that the test_file.txt file will be closed automatically after the write operation is completed. No more worrying about leaving your file open and causing issues down the line!
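
To see the three modes working together, here is a minimal sketch that writes, appends to, and then reads back a file, using a context manager each time. It assumes a temporary directory (created with the standard tempfile module) so that no files are left behind. The file name and its contents are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test_file.txt")

    with open(path, "w") as f:   # w: create the file (or overwrite it)
        f.write("first line\n")
    with open(path, "a") as f:   # a: add to the end of the file
        f.write("second line\n")
    with open(path, "r") as f:   # r: read the existing content
        lines = f.read().splitlines()

print(lines)     # ['first line', 'second line']
print(f.closed)  # True
```

The final check shows that the file object is already closed once its with block has ended.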

For more information on file handling in Python, check out the official Python documentation.

In this section, we introduced Python and its various capabilities as a programming language. We saw that Python can be used for scripting as well as object-oriented, procedural, and functional programming. We also discussed that Python has a large online community and many well-supported open source libraries. Finally, we covered several fundamental Python concepts, including data structures (lists, dictionaries, tuples, and sets), if…else conditions, looping techniques, functions, object-oriented programming, and working with files.

In the next section, we will learn how to set up a development environment for Python, which will allow us to easily start working with the language. This includes installing the necessary software and tools and configuring our system to support Python development.

Establishing a development environment

Before you hit the ground running creating an exciting project with Python, it is essential to create a development environment with a strong foundation in system integrity.

A unique way to think of a Python project is to think in terms of a lab experiment. When designing a lab experiment, a scientist first starts by jotting down the purpose of the experiment and all possible expected outcomes of the experiment: “Why are we creating this experiment?” and “What outputs do we reasonably expect to get from this experiment?” This frame of reference is important to maintain because it leads into the next, and arguably the most important, perspective: “How can we limit confounding factors from impacting the results?” This is where the idea of a clean, sterile, experimental environment comes to the forefront; this idea is synonymous with the needs of a programming environment with a clear and reproducible workflow.

So, how do we design a development environment that not only limits both known and unknown confounding factors from impacting end-pipeline products but is also highly reproducible and shareable? In this section, we will review the primary building blocks of a highly effective and “sterile” development environment:

  • Version control with Git tracking
  • Making development easy with local integrated development environments (IDEs)
  • Documenting environment dependencies with requirements.txt
  • Utilizing module management systems (MMSs)

Let’s get started!

Version control with Git tracking

The first step of any programming project is to instantiate a version control repository unique to your environment. This keeps the project’s development and production environments in separate buckets.

Several hosting platforms are built on the Git version control software and protocol. They each offer similar functionality, but some key differences may make one a better fit for your needs than the others.

We will be using GitHub to track and store our data pipelines throughout this book.

One reason to choose GitHub is that it is the most popular platform for hosting and collaborating on Git repositories. It has a large user base, which means that it is well-supported and has a wealth of resources and documentation available. GitHub also has several features specifically designed for collaboration, such as pull requests, which allow users to propose changes to a repository and discuss them with other contributors before merging them into the main code base.

Other internet-hosted Git providers, such as GitLab and Bitbucket, offer comparable features.

The importance of Git-tracking your code

You might have seen the term “Git” countless times, but as a general overview, Git is a distributed version control system (VCS) designed to track changes in source code and manage a code base. GitHub is a hosting platform built around Git. When you’ve created and cloned a new GitHub repository onto your local device, any changes you make can be tracked, committed, and pushed up to your online storage location.

It’s best practice to get into the habit of always committing and pushing your changes frequently so that your work is backed up with Git; this way, you won’t lose all of your hard-earned code if you ever do something silly such as spill coffee all over your laptop... (like we have... more than once...). Additionally, your GitHub repositories serve as a portfolio to showcase your code projects publicly (or keep them private, if you prefer).

GitHub also allows others to contribute to your code base. Git version control enables collaboration without the fear of losing or overwriting changes since multiple developers can work on different branches, and changes can be reviewed and merged through pull requests. Code reviews ensure that the code quality of your projects remains high, and helps catch bugs or issues early in the development life cycle.

Additionally, GitHub can be integrated with various services, such as project management tools and various CI/CD tools, and allows automated testing and deployment of your code. We will learn more about project testing and CI/CD tools later in this book.

For now, we encourage you to leverage Git-tracking with GitHub to push code changes “early and often” so that your work is not only backed up online but you can also add your developing data engineering skills to a public code portfolio.

Before moving on to the next section, take a moment to fork the GitHub repository associated with this book so that it is available in your own personal GitHub profile: https://github.com/PacktPublishing/Building-ETL-Pipelines-with-Python.

Making development easy with IDEs

As Python programmers, you most likely have a preference in terms of local development environments. However, in case you want to mimic the same workflow we’ll follow for this tutorial, we will walk you through our preferred local development landscape.

iTerm2

We pride ourselves on being superfans of strategic laziness, where we set up programming landscapes that are not only easy on the eyes but also take little to nothing to maintain. As Mac programmers, the Terminal interface can be quite dull; that’s why we recommend installing iTerm2, which works well on Macs that run macOS 10.14 or newer. As stated on their website, “iTerm2 brings the Terminal into the modern age with features you never knew you always wanted.” Take some time to install and customize your new iTerm2 Terminal so that it’s aesthetically pleasing; it’s much easier to fall into the creativity of development design when your eyes are intrigued by your Terminal.

You can follow the instructions mentioned here to download and set up iTerm2: https://iterm2.com/downloads.html.

PyCharm

Next, we recommend using your newly remodeled Terminal to download our favorite IDE: PyCharm. For those of you unfamiliar with IDEs, you can think of an IDE in a similar fashion to iTerm: a visual interface that not only creates an aesthetically pleasing coding environment but also allows you to quickly, efficiently format and structure files with a few short commands. Our local PyCharm environment will be where we choose to clone the Git repository that we created in the previous section.

You can follow the instructions mentioned here to download and set up PyCharm: https://www.jetbrains.com/pycharm/download/#section=mac.

You will also need to register your GitHub account to your new PyCharm app by following these steps: https://www.jetbrains.com/help/pycharm/github.html.

Jupyter Notebook

Lastly, since we will be working with data, visualizing sections of DataFrames can be quite difficult in a standard Python script without a bit of finagling. Staying with the theme of strategic laziness, we recommend downloading the beautiful and user-friendly Jupyter Notebook for easy data visualization. As a word of warning, Jupyter Notebook is an amazing tool for development, but we stress that it is not recommended that you deploy Jupyter scripts in a production environment. Jupyter’s friendly user interface and easy visualization of code are due to its memory- and processing-heavy framework, which is inevitably quite clunky and slow in a pipeline.

You can follow the instructions mentioned here to download and set up Jupyter Notebook: https://jupyter.org/install.

Next, we will document the environmental dependencies using a requirements.txt file.

Documenting environment dependencies with requirements.txt

Creating and maintaining a requirements.txt document is a standard practice in Python application development. Future updates or major changes to dependencies could potentially break the application, but developers can always install the recorded previous versions, ensuring that the code continues to run without errors. Freezing the application to specific versions of its dependencies ensures that, given the correct requirements, your project will maintain its original state. This benefits both developers and the application’s reliability.

Let’s look at how to install dependencies using the requirements.txt file:

(base) usr@project %   pip install -r requirements.txt

ex: requirements.txt

# The Python interpreter itself is not installed through pip, so it is not pinned here.
pandas==1.4.2
requests==2.28.0

Additionally, you can update and store the new package imports and versions with the following command to keep the requirements.txt file up to date:

(base) usr@project %   pip freeze > requirements.txt
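
To make the pinned format concrete, the following minimal sketch reads lines of the form name==version into a dictionary. It assumes only exact pins and # comments. Real requirements files support more syntax (extras, environment markers, URLs) that this sketch deliberately ignores. The function name parse_pinned_requirements is hypothetical.

```python
def parse_pinned_requirements(text):
    """Map package names to pinned versions for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


sample = """
pandas==1.4.2
requests==2.28.0
"""
print(parse_pinned_requirements(sample))  # {'pandas': '1.4.2', 'requests': '2.28.0'}
```

A dictionary like this could, for instance, be compared against the versions installed in the current environment.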

That is how we can collect dependencies in the requirements.txt file. The next section will review some key concepts that are essential to know before we start building data pipelines.

Accounting for circular dependencies

The concept of circular dependency is not always talked about when first learning Python, but it describes a situation where two or more modules depend on each other:

Figure 1.1: A circular dependency

While there are many useful aspects of this interdependency, underlying second- and third-degree inconsistencies, such as one Python module version being incompatible with another Python module version, can result in a cascading effect of uncontrolled errors that lead to a smorgasbord of application failures. Alluding back to our initial analogy about a development project being similar to a laboratory experiment, this is where system sterility comes into play. To create an internally consistent environment, versions of the dependencies must be flexibly adjusted to account for the circular interdependencies of imports. This is where the magic of an MMS begins!
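
To make the idea of a circular dependency concrete, here is a minimal sketch that searches a dependency map for a cycle with a depth-first search. The module names and the find_cycle helper are hypothetical. This illustrates the concept only and is not how any particular package manager resolves dependencies.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if there is none."""
    visiting, done = set(), set()
    path = []

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # node is already on the current path, so the path loops back to it
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for dep in dependencies.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for name in dependencies:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Hypothetical modules that depend on each other in a loop
deps = {"module_a": ["module_b"], "module_b": ["module_c"], "module_c": ["module_a"]}
print(find_cycle(deps))  # ['module_a', 'module_b', 'module_c', 'module_a']
```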

Utilizing module management systems (MMSs)

MMSs are like special folders that only work in certain environments. They do this by changing sys.prefix and sys.exec_prefix so that they point to the base directory of the virtual environment. This is helpful because it lets developers create “clean” applications and also makes sure that all the different parts of the project work well together.
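
The effect on sys.prefix can be observed directly from Python. The following minimal sketch assumes an environment created by venv or a recent virtualenv release, where sys.base_prefix still points at the base installation. The helper name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a virtual environment."""
    # sys.base_prefix keeps pointing at the base Python installation,
    # while sys.prefix points at the active environment's directory.
    return sys.prefix != sys.base_prefix


print("sys.prefix:     ", sys.prefix)
print("sys.exec_prefix:", sys.exec_prefix)
print("In a virtual environment:", in_virtual_environment())
```

Running this once outside and once inside an activated environment shows the two prefixes diverging.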

There are many different module management systems to choose from, and Anaconda is one of the most popular. However, it doesn’t always have the most up-to-date packages for data engineers, and mixing pip, the standard package installer for Python, with Anaconda’s own package manager can lead to conflicts. That’s why we’re using pipenv in this book. It’s a virtual environment and package management system that uses Pipfile and Pipfile.lock, which play a role similar to that of a requirements.txt file.

Instigating a virtual MMS environment within your local IDE

Creating a virtual MMS environment within your local IDE can be a helpful way to test and run your code before you implement it in a larger system. This virtual environment allows you to simulate different scenarios and conditions to ensure that your code is working properly and efficiently. It can also help you identify and fix any errors or bugs that may arise during the development process.

Overall, setting up a virtual MMS environment within your local IDE can be a valuable tool for streamlining your coding workflow and ensuring that your projects are successful.

Configuring a Pipenv environment in PyCharm

In Python development, managing project environments is crucial to keep your project’s dependencies organized and controlled. One way to achieve this is by using pipenv. Let’s start the process by installing Pipenv. Open your Terminal and execute the following command:

(base) usr@project %   pip install --user pipenv

This command instructs pip (a Python package manager) to install Pipenv in your user space. The --user option ensures that Pipenv is installed in the user install directory for your platform.

After successful installation, this is what your Terminal should look like:

Figure 1.2: Command-line view of installing pipenv

Once installed, remember to activate the pipenv environment before you begin to work on your new project. This way, the entirety of your project is developed within the isolated virtual environment.

The first time you activate pipenv for a project, the command line will display output similar to the following:

(base) usr@project %   pipenv shell
Creating a virtualenv for this project...
Pipfile: /Users/usr/project/Pipfile
Using /Users/usr/.pyenv/versions/3.10.4/bin/python3 (3.10.4) to create virtualenv...
 Creating virtual environment...created virtual environment CPython3.10.4.final.0-64 in 903ms
  creator CPython3Posix(dest=/Users/usr/.local/share/virtualenvs/project-dGXB4pbM, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/usr/Library/Application Support/virtualenv)
    added seed packages: pip==22.1.2, setuptools==62.6.0, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
 Successfully created virtual environment!
Virtualenv location: /Users/usr/.local/share/virtualenvs/TestFiles-dGXB4pbM
Launching subshell in virtual environment...
. /Users/usr/.local/share/virtualenvs/project-dGXB4pbM/bin/activate
(Project) usr@project %  . /Users/usr/.local/share/virtualenvs/project-dGXB4pbM/bin/activate
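
Once the subshell has launched, you may want to confirm from Python that the interpreter really is running inside the virtual environment. The following is a minimal sketch: the function names are hypothetical, and the PIPENV_ACTIVE variable is assumed to be set by pipenv shell, so verify that against your Pipenv version.

```python
import os
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base install."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the underlying Python install.
    return prefix != base_prefix


def pipenv_active(environ=None):
    """Return True when the (assumed) PIPENV_ACTIVE marker is set to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get("PIPENV_ACTIVE") == "1"


if __name__ == "__main__":
    print("virtual environment:", in_virtualenv())
    print("pipenv shell active:", pipenv_active())
```

Both functions accept optional arguments only so that they can be exercised without changing the real interpreter state; called with no arguments, they inspect the running process.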

Now that we have learned how to activate a virtual environment using Pipenv, we can move on to installing packages within that environment.

Installing packages

Packages can be added to or removed from the environment with the simple $ pipenv install and $ pipenv uninstall commands. Within a Pipenv-managed project, these commands replace direct calls to pip on the command line.
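
If you prefer to script these commands, for example in a project setup script, a small wrapper can assemble them. This is a sketch under stated assumptions: build_pipenv_command and run_pipenv are hypothetical helper names, and the code only builds the install and uninstall subcommands described above, plus the --dev flag for development-only packages.

```python
import subprocess


def build_pipenv_command(action, package, dev=False):
    """Return the argument list for a pipenv install or uninstall call."""
    if action not in ("install", "uninstall"):
        raise ValueError(f"unsupported action: {action!r}")
    command = ["pipenv", action, package]
    # --dev records the package under [dev-packages]; this helper only
    # adds it for installs.
    if dev and action == "install":
        command.append("--dev")
    return command


def run_pipenv(action, package, dev=False):
    """Run the command; raises CalledProcessError if pipenv exits non-zero."""
    return subprocess.run(build_pipenv_command(action, package, dev), check=True)
```

For example, build_pipenv_command("install", "numba") produces the same command that is typed by hand in the example later in this section.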

Pipfile and Pipfile.lock

When a Pipenv environment is first created, a Pipfile with no packages listed is generated automatically. As mentioned previously, Pipfile plays the same role as the requirements.txt file: it declares the project's dependencies.

Pipfile.lock records the exact version of every dependency referenced in Pipfile, including the packages those dependencies rely on, so that interdependent packages are not upgraded automatically between installs. You can run the $ pipenv lock command to regenerate Pipfile.lock from the current Pipfile. In practice, Pipenv updates both Pipfile and Pipfile.lock each time you install a package.

The following example shows how installing a package updates Pipfile and Pipfile.lock:

(Project) usr@project %  pipenv install numba
Installing numba...
Adding numba to Pipfile's [packages]...
 Installation Succeeded
Pipfile.lock (aa8734) out of date, updating to (d71de2)...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
 Success!
Updated Pipfile.lock (d71de2)!
Installing dependencies from Pipfile.lock (d71de2)...
     1/1 — 00
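
Because Pipfile.lock is a JSON file, the pinned versions it records can be read with a few lines of Python. The sketch below is illustrative only: pinned_versions is a hypothetical helper, and SAMPLE_LOCK is a trimmed stand-in whose version and hash values are placeholders, not real output. It assumes the usual layout with "default" and "develop" sections.

```python
import json

# Illustrative Pipfile.lock content; the version and hashes are placeholders.
SAMPLE_LOCK = """
{
    "_meta": {"hash": {"sha256": "<placeholder>"}, "pipfile-spec": 6},
    "default": {
        "numba": {"hashes": ["sha256:<placeholder>"], "version": "==0.0.0"}
    },
    "develop": {}
}
"""


def pinned_versions(lock_text, section="default"):
    """Map each package in a Pipfile.lock section to its pinned version."""
    lock = json.loads(lock_text)
    return {
        name: details.get("version", "")
        for name, details in lock.get(section, {}).items()
    }


if __name__ == "__main__":
    print(pinned_versions(SAMPLE_LOCK))
```

To inspect a real project, pass the text of its own Pipfile.lock to pinned_versions instead of the sample string.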

Now, let’s summarize what we have learned in this chapter in the next section.

Summary

In this chapter, we reviewed Python fundamentals: data structures, if-else conditions, and looping techniques, followed by functions and object-oriented programming. Then, we covered working with files in Python, version control with Git, and documenting environment dependencies with requirements.txt. Additionally, we learned about module management systems and how to set up a virtual MMS environment within a local IDE. Finally, we covered installing packages with Pipenv and working with Pipfile and Pipfile.lock.

In the next chapter, we will discuss the concept of data pipelines and how to create robust ones, including the process of automating ETL pipelines and how to ensure that data is consistently and accurately moved and transformed. See you in Chapter 2!


Key benefits

  • Understand how to set up a Python virtual environment with PyCharm
  • Learn functional and object-oriented approaches to create ETL pipelines
  • Create robust CI/CD processes for ETL pipelines
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Modern extract, transform, and load (ETL) pipelines for data engineering have favored the Python language for its broad range of uses and a large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing. In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment to create pipelines. Once you’ve explored the ETL pipeline design principles and the ETL development process, you’ll be equipped to design custom ETL pipelines. Next, you’ll get to grips with the steps in the ETL process, which involve extracting valuable data; performing transformations through cleaning, manipulation, and ensuring data integrity; and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons when building data pipelines and leveraging cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about the concept of test-driven development for ETL pipelines to ensure safe deployments. By the end of this book, you’ll have worked on several hands-on examples to create high-performance ETL pipelines to develop robust, scalable, and resilient environments using Python.

Who is this book for?

If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.

What you will learn

  • Explore the available libraries and tools to create ETL pipelines using Python
  • Write clean and resilient ETL code in Python that can be extended and easily scaled
  • Understand the best practices and design principles for creating ETL pipelines
  • Orchestrate the ETL process and scale the ETL pipeline effectively
  • Discover tools and services available in AWS for ETL pipelines
  • Understand different testing strategies and implement them with the ETL process

Product Details

Publication date: Sep 29, 2023
Length: 246 pages
Edition: 1st
Language: English
ISBN-13: 9781804615256
Vendor: Amazon

Table of Contents

21 Chapters
Part 1: Introduction to ETL, Data Pipelines, and Design Principles
Chapter 1: A Primer on Python and the Development Environment
Chapter 2: Understanding the ETL Process and Data Pipelines
Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
Part 2: Designing ETL Pipelines with Python
Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
Chapter 5: Data Cleansing and Transformation
Chapter 6: Loading Transformed Data
Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python
Chapter 8: Powerful ETL Libraries and Tools in Python
Part 3: Creating ETL Pipelines in AWS
Chapter 9: A Primer on AWS Tools for ETL Processes
Chapter 10: Tutorial – Creating an ETL Pipeline in AWS
Chapter 11: Building Robust Deployment Pipelines in AWS
Part 4: Automating and Scaling ETL Pipelines
Chapter 12: Orchestration and Scaling in ETL Pipelines
Chapter 13: Testing Strategies for ETL Pipelines
Chapter 14: Best Practices for ETL Pipelines
Chapter 15: Use Cases and Further Reading
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4.3 (6 Ratings)
5 star 83.3%
4 star 0%
3 star 0%
2 star 0%
1 star 16.7%

Top Reviews

Data Science Leader, Oct 18, 2023
Rating: 5
Whether you're a data scientist, data engineer, or machine learning engineer, this book is a goldmine. It breaks down the world of ETL in a way that's easy to get. Brij Kishore Pandey and Emily Schoof make it simple: They explain the difference between ETL and ELT. They share ETL design patterns. They give tips for every ETL step. I loved how they focus on doing things. They show you how to set up an ETL pipeline with Python and also with AWS tools like S3, Lambda, and Step Functions. Plus, they cover how to test your pipelines and keep everything running smooth. With so many companies using cloud tools, the bits about AWS in this book are a big win. You'll feel way more confident using these tools at work. All in all, this book is top-notch for anyone in the data world looking to step up their game. Don't miss out!
Amazon Verified review

Amir Esmaeili, Nov 15, 2023
Rating: 5
Simply great!
Amazon Verified review

H2N, Oct 18, 2023
Rating: 5
An insightful read for data scientists, engineers, and data enthusiasts. This book provides a comprehensive overview of ETL data pipelines in Python, distinguishing between ETL and ELT. It delves deep into the design principles essential for scalable and robust pipelines and offers strategies for data sourcing and extraction. The hands-on tutorials present an end-to-end ETL pipeline, both in Python and AWS, shedding light on the associated AWS tools. What sets it apart is its discussion on the limitations of ETL pipelines and effective scaling techniques to meet growing demands. A must-read for those eager to enhance their data pipeline knowledge.
Amazon Verified review

Deblina, Dec 02, 2023
Rating: 5
If you are looking for a book on how to build enterprise-ready ETL pipelines with python, you should check out this book. This book is a comprehensive guide that covers everything from data extraction, transformation, loading, testing, and deployment. You will learn how to use popular python libraries such as pandas, sqlalchemy, airflow, and luigi to create scalable and reliable data pipelines. You will also get practical tips and best practices on how to design, document, and monitor your pipelines. The book is full of examples and exercises that will help you apply the concepts to real-world scenarios. Whether you are a beginner or an expert in ETL, this book will help you take your skills to the next level.
Amazon Verified review

Om S, Oct 17, 2023
Rating: 5
This book guides you through the process of creating data pipelines using Python for real-world applications. You'll learn how to set up a Python environment, use both functional and object-oriented approaches, and build robust CI/CD processes. The book covers the fundamentals of ETL pipelines, from data extraction and cleansing to loading data. It also delves into various Python libraries and tools, including AWS services, for creating efficient pipelines. You'll discover testing strategies and best practices to ensure reliability. If you're a data engineer or software professional with basic Python knowledge, this book is for you. It provides practical insights and hands-on examples to develop high-performance ETL pipelines in Python.
Amazon Verified review

FAQs

What is the delivery time and cost of a print book?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela

What is custom duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

Customs duties or localized taxes may apply to shipments to recipient countries outside the EU27. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my custom duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.

How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., Packt Publishing agrees to replace your printed book because it arrives damaged or has a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal