You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805121152

Length 394 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Yuki Kakegawa

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Getting Started with Python Polars FREE CHAPTER

2. Chapter 2: Reading and Writing Files

3. Chapter 3: An Introduction to Data Analysis in Python Polars

4. Chapter 4: Data Transformation Techniques

5. Chapter 5: Handling Missing Data

6. Chapter 6: Performing String Manipulations

7. Chapter 7: Working with Nested Data Structures

8. Chapter 8: Reshaping and Tidying Data

9. Chapter 9: Time Series Analysis

10. Chapter 10: Interoperability with Other Python Libraries

11. Chapter 11: Working with Common Cloud Data Sources

12. Chapter 12: Testing and Debugging in Polars

13. Index

Why subscribe?

14. Other Books You May Enjoy

Joining DataFrames

A join operation is used to merge rows from two or more datasets by utilizing a shared column that establishes a relationship between them. You may already be familiar with the use and concept of joining, but it’s commonly used in any data processing tools such as SQL and other DataFrame libraries such as pandas and Spark.

In this recipe, we’ll look at how to apply join operations in Polars DataFrames.

Getting ready

We’ll continuously use the same data we’ve used in previous recipes in this chapter. Execute the following code to do the same process and rename the DataFrame accordingly:

from polars import selectors as cs
academic_df = (
    pl.read_csv('../data/academic.csv')
    .select(
        pl.col('year').alias('academic_year'),
        cs.numeric().cast(pl.Int64)
    )
    .filter(pl.col('academic_year').str.slice(0,4).cast(pl.Int32)>=2018)
)

Also, we’ll use another set of data to join onto the other DataFrame. Run the following code to read the data into a DataFrame:

status_df = (
    pl.read_csv('../data/status.csv')
    .with_columns(
        cs.float().cast(pl.Int64)
    )
)
status_df

The DataFrame contains the following data:

Figure 8.11 – The status_df DataFrame

This status.csv file contains student counts broken down by attributes such as gender, marital status, and visa status. Notice it’s in a wide format rather than a long format.

How to do it...

Here’s how to join data:

Use the .join() method to combine two DataFrames. We’ll join status_df onto academic_df, leaving only columns relating to visa types:

joined_df = (
    academic_df
    .join(
        status_df,
        left_on='academic_year',
        right_on='year',
        how='inner'
    )
    .select(
        'academic_year',
        'students',
        cs.contains('visa')
    )
)
joined_df.head()

The preceding code will return the following output:

Figure 8.12 – The joined DataFrame

Now that we have brought in student counts for each visa type, we can visualize the data and find some trends. To analyze the data by visa types, we’ll need to convert the DataFrame into a long format first so that we can treat all visa types altogether in one column, which can then be used to slice the data. We’ll look at the percent of total for each visa type:

import plotly.express as px
viz_df = (
    joined_df
    .unpivot(
        index=['academic_year', 'students'],
        on=cs.contains('visa'),
        variable_name='visa_type',
        value_name='count'
    )
    .with_columns(
        (pl.col('count') / pl.col('students')).alias('percent_of_total')
    )
)
fig = px.bar(
    x=viz_df['academic_year'],
    y=viz_df['percent_of_total'],
    color=viz_df['visa_type'],
    barmode = 'stack',
    text_auto='.1%',
    title='International Student Count by Visa Type',
    labels={
        'x': 'Year',
        'y': 'Student Count'
    },
)
fig.update_layout(
    autosize=True,
    uniformtext_minsize=10,
    uniformtext_mode='hide',
    yaxis_tickformat = '0%'
)
fig.update_traces(textposition='inside')
fig.show()

The preceding code will return the following output:

Figure 8.13 – International student counts percent of total by visa types

Utilize the validate parameter to ensure the cardinality of columns you join on.
Let’s say status_df comes in a long format. The following DataFrame will have the level of grain where each row represents data per year per visa type. There will be multiple rows per year value:
```
status_long_df = (
    status_df
    .unpivot(
        index='year',
        on=cs.contains('visa'),
        variable_name='visa_type',
        value_name='count'
    )
)
status_long_df.head()
```
The preceding code will produce the following output:

Figure 8.14 – status_df in a long format

Now, let’s join two DataFrames with the validate parameter:

(
    academic_df
    .join(
        status_long_df,
        left_on='academic_year',
        right_on='year',
        how='inner',
        validate='1:m'
    )
    .select(
        'academic_year',
        'students',
        'visa_type',
        'count'
    )
)

The preceding code will return the following output:

Figure 8.15 – The joined DataFrame with status_df in a long format

The following code will raise an error since the join type is specified as 1:1, resulting in a validation failure:

(
    academic_df
    .join(
        status_long_df,
        left_on='academic_year',
        right_on='year',
        how='inner',
        validate='1:1'
    )
    .select(
        'academic_year',
        'students',
        'visa_type',
        'count'
    )
)

The preceding code will return the following output:

>> ComputeError: the join keys did not fulfil 1:1 validation

How it works...

The .join() method provides parameters to specify columns and additional configurations for your join. I used the left_on and right_on parameters to specify join columns as they had different names across two DataFrames. I could’ve used the on parameter instead if they had the same name, which is usually the case if your data is cleaned and validated in upstream processes.

There are several join types available, such as inner, left, outer, semi, anti, cross, and outer_coalesce. To learn more about join types, please refer to the resources in the See also section.

The validate parameter helps check the uniqueness of the join keys/columns. As shown in step 2, it’ll raise an error if the validation fails.

Important

As of the time of writing, the validate parameter is not supported in the streaming engine. Also, it only works when joined with single columns.

You can apply the join operation in LazyFrames as well as in DataFrames.

Note

The coalesce parameter for a left join decides whether to keep the join column(s) from the right table or not. Only when you set it to False, you'll get the join column with the _right suffix.

There’s more...

You can apply an asof join in Polars. It is like a left join, but instead of matching on equal keys, it matches based on the closest key. Here’s a simple example of how an asof join works:

a = pl.DataFrame({
    'int': [1,2,3], 'value': [10,20,30]
}).set_sorted('int')
b = pl.DataFrame({
    'int': [4,5,6]
}).set_sorted('int')
b.join_asof(a, on='int', strategy='backward')

The preceding code will return the following output:

Figure 8.16 – The joined DataFrame using an asof join

Note that you need to set the sorting for each DataFrame for this to work. For more information, please refer to the Polars documentation page linked in the See also section.

You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Table of Contents (15) Chapters

Joining DataFrames

Getting ready

How to do it...

How it works...

There’s more...

See also

Authors (1)

Personalised recommendations for you

You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Table of Contents (15) Chapters

Joining DataFrames

Getting ready

How to do it...

How it works...

There’s more...

See also

Unlock this book and the full library FREE for 7 days

Authors (1)

Personalised recommendations for you