Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Polars Cookbook

You're reading from   Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Arrow left icon
Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781805121152
Length 394 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Yuki Kakegawa Yuki Kakegawa
Author Profile Icon Yuki Kakegawa
Yuki Kakegawa
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Chapter 1: Getting Started with Python Polars FREE CHAPTER 2. Chapter 2: Reading and Writing Files 3. Chapter 3: An Introduction to Data Analysis in Python Polars 4. Chapter 4: Data Transformation Techniques 5. Chapter 5: Handling Missing Data 6. Chapter 6: Performing String Manipulations 7. Chapter 7: Working with Nested Data Structures 8. Chapter 8: Reshaping and Tidying Data 9. Chapter 9: Time Series Analysis 10. Chapter 10: Interoperability with Other Python Libraries 11. Chapter 11: Working with Common Cloud Data Sources 12. Chapter 12: Testing and Debugging in Polars 13. Index 14. Other Books You May Enjoy

Joining DataFrames

A join operation is used to merge rows from two or more datasets by utilizing a shared column that establishes a relationship between them. You may already be familiar with the use and concept of joining, but it’s commonly used in any data processing tools such as SQL and other DataFrame libraries such as pandas and Spark.

In this recipe, we’ll look at how to apply join operations in Polars DataFrames.

Getting ready

We’ll continuously use the same data we’ve used in previous recipes in this chapter. Execute the following code to do the same process and rename the DataFrame accordingly:

from polars import selectors as cs
academic_df = (
    pl.read_csv('../data/academic.csv')
    .select(
        pl.col('year').alias('academic_year'),
        cs.numeric().cast(pl.Int64)
    )
    .filter(pl.col('academic_year').str.slice(0,4).cast(pl.Int32)>=2018)
)

Also, we’ll use another set of data to join onto the other DataFrame. Run the following code to read the data into a DataFrame:

status_df = (
    pl.read_csv('../data/status.csv')
    .with_columns(
        cs.float().cast(pl.Int64)
    )
)
status_df

The DataFrame contains the following data:

Figure 8.11 – The status_df DataFrame

Figure 8.11 – The status_df DataFrame

This status.csv file contains student counts broken down by attributes such as gender, marital status, and visa status. Notice it’s in a wide format rather than a long format.

How to do it...

Here’s how to join data:

  1. Use the .join() method to combine two DataFrames. We’ll join status_df onto academic_df, leaving only columns relating to visa types:
    joined_df = (
        academic_df
        .join(
            status_df,
            left_on='academic_year',
            right_on='year',
            how='inner'
        )
        .select(
            'academic_year',
            'students',
            cs.contains('visa')
        )
    )
    joined_df.head()

    The preceding code will return the following output:

Figure 8.12 – The joined DataFrame

Figure 8.12 – The joined DataFrame

  1. Now that we have brought in student counts for each visa type, we can visualize the data and find some trends. To analyze the data by visa types, we’ll need to convert the DataFrame into a long format first so that we can treat all visa types altogether in one column, which can then be used to slice the data. We’ll look at the percent of total for each visa type:
    import plotly.express as px
    viz_df = (
        joined_df
        .unpivot(
            index=['academic_year', 'students'],
            on=cs.contains('visa'),
            variable_name='visa_type',
            value_name='count'
        )
        .with_columns(
            (pl.col('count') / pl.col('students')).alias('percent_of_total')
        )
    )
    fig = px.bar(
        x=viz_df['academic_year'],
        y=viz_df['percent_of_total'],
        color=viz_df['visa_type'],
        barmode = 'stack',
        text_auto='.1%',
        title='International Student Count by Visa Type',
        labels={
            'x': 'Year',
            'y': 'Student Count'
        },
    )
    fig.update_layout(
        autosize=True,
        uniformtext_minsize=10,
        uniformtext_mode='hide',
        yaxis_tickformat = '0%'
    )
    fig.update_traces(textposition='inside')
    fig.show()

    The preceding code will return the following output:

Figure 8.13 – International student counts percent of total by visa types

Figure 8.13 – International student counts percent of total by visa types

  1. Utilize the validate parameter to ensure the cardinality of columns you join on.

    Let’s say status_df comes in a long format. The following DataFrame will have the level of grain where each row represents data per year per visa type. There will be multiple rows per year value:

    status_long_df = (
        status_df
        .unpivot(
            index='year',
            on=cs.contains('visa'),
            variable_name='visa_type',
            value_name='count'
        )
    )
    status_long_df.head()

    The preceding code will produce the following output:

Figure 8.14 – status_df in a long format

Figure 8.14 – status_df in a long format

  1. Now, let’s join two DataFrames with the validate parameter:
    (
        academic_df
        .join(
            status_long_df,
            left_on='academic_year',
            right_on='year',
            how='inner',
            validate='1:m'
        )
        .select(
            'academic_year',
            'students',
            'visa_type',
            'count'
        )
    )

    The preceding code will return the following output:

Figure 8.15 – The joined DataFrame with status_df in a long format

Figure 8.15 – The joined DataFrame with status_df in a long format

The following code will raise an error since the join type is specified as 1:1, resulting in a validation failure:

(
    academic_df
    .join(
        status_long_df,
        left_on='academic_year',
        right_on='year',
        how='inner',
        validate='1:1'
    )
    .select(
        'academic_year',
        'students',
        'visa_type',
        'count'
    )
)

The preceding code will return the following output:

>> ComputeError: the join keys did not fulfil 1:1 validation

How it works...

The .join() method provides parameters to specify columns and additional configurations for your join. I used the left_on and right_on parameters to specify join columns as they had different names across two DataFrames. I could’ve used the on parameter instead if they had the same name, which is usually the case if your data is cleaned and validated in upstream processes.

There are several join types available, such as inner, left, outer, semi, anti, cross, and outer_coalesce. To learn more about join types, please refer to the resources in the See also section.

The validate parameter helps check the uniqueness of the join keys/columns. As shown in step 2, it’ll raise an error if the validation fails.

Important

As of the time of writing, the validate parameter is not supported in the streaming engine. Also, it only works when joined with single columns.

You can apply the join operation in LazyFrames as well as in DataFrames.

Note

The coalesce parameter for a left join decides whether to keep the join column(s) from the right table or not. Only when you set it to False, you'll get the join column with the _right suffix.

There’s more...

You can apply an asof join in Polars. It is like a left join, but instead of matching on equal keys, it matches based on the closest key. Here’s a simple example of how an asof join works:

a = pl.DataFrame({
    'int': [1,2,3], 'value': [10,20,30]
}).set_sorted('int')
b = pl.DataFrame({
    'int': [4,5,6]
}).set_sorted('int')
b.join_asof(a, on='int', strategy='backward')

The preceding code will return the following output:

Figure 8.16 – The joined DataFrame using an asof join

Figure 8.16 – The joined DataFrame using an asof join

Note that you need to set the sorting for each DataFrame for this to work. For more information, please refer to the Polars documentation page linked in the See also section.

See also

To learn more about joins in Polars, please use the following resources:

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image