Joining DataFrames
A join operation is used to merge rows from two or more datasets by utilizing a shared column that establishes a relationship between them. You may already be familiar with the use and concept of joining, but it’s commonly used in any data processing tools such as SQL and other DataFrame libraries such as pandas and Spark.
In this recipe, we’ll look at how to apply join operations in Polars DataFrames.
Getting ready
We’ll continuously use the same data we’ve used in previous recipes in this chapter. Execute the following code to do the same process and rename the DataFrame accordingly:
from polars import selectors as cs academic_df = ( pl.read_csv('../data/academic.csv') .select( pl.col('year').alias('academic_year'), cs.numeric().cast(pl.Int64) ) .filter(pl.col('academic_year').str.slice(0,4).cast(pl.Int32)>=2018) )
Also, we’ll use another set of data to join onto the other DataFrame. Run the following code to read the data into a DataFrame:
status_df = ( pl.read_csv('../data/status.csv') .with_columns( cs.float().cast(pl.Int64) ) ) status_df
The DataFrame contains the following data:
Figure 8.11 – The status_df DataFrame
This status.csv
file contains student counts broken down by attributes such as gender, marital status, and visa status. Notice it’s in a wide format rather than a long format.
How to do it...
Here’s how to join data:
- Use the
.join()
method to combine two DataFrames. We’ll joinstatus_df
ontoacademic_df
, leaving only columns relating to visa types:joined_df = ( academic_df .join( status_df, left_on='academic_year', right_on='year', how='inner' ) .select( 'academic_year', 'students', cs.contains('visa') ) ) joined_df.head()
The preceding code will return the following output:
Figure 8.12 – The joined DataFrame
- Now that we have brought in student counts for each visa type, we can visualize the data and find some trends. To analyze the data by visa types, we’ll need to convert the DataFrame into a long format first so that we can treat all visa types altogether in one column, which can then be used to slice the data. We’ll look at the percent of total for each visa type:
import plotly.express as px viz_df = ( joined_df .unpivot( index=['academic_year', 'students'], on=cs.contains('visa'), variable_name='visa_type', value_name='count' ) .with_columns( (pl.col('count') / pl.col('students')).alias('percent_of_total') ) ) fig = px.bar( x=viz_df['academic_year'], y=viz_df['percent_of_total'], color=viz_df['visa_type'], barmode = 'stack', text_auto='.1%', title='International Student Count by Visa Type', labels={ 'x': 'Year', 'y': 'Student Count' }, ) fig.update_layout( autosize=True, uniformtext_minsize=10, uniformtext_mode='hide', yaxis_tickformat = '0%' ) fig.update_traces(textposition='inside') fig.show()
The preceding code will return the following output:
Figure 8.13 – International student counts percent of total by visa types
- Utilize the
validate
parameter to ensure the cardinality of columns you join on.Let’s say
status_df
comes in a long format. The following DataFrame will have the level of grain where each row represents data per year per visa type. There will be multiple rows per year value:status_long_df = ( status_df .unpivot( index='year', on=cs.contains('visa'), variable_name='visa_type', value_name='count' ) ) status_long_df.head()
The preceding code will produce the following output:
Figure 8.14 – status_df in a long format
- Now, let’s join two DataFrames with the
validate
parameter:( academic_df .join( status_long_df, left_on='academic_year', right_on='year', how='inner', validate='1:m' ) .select( 'academic_year', 'students', 'visa_type', 'count' ) )
The preceding code will return the following output:
Figure 8.15 – The joined DataFrame with status_df in a long format
The following code will raise an error since the join type is specified as 1:1
, resulting in a validation failure:
( academic_df .join( status_long_df, left_on='academic_year', right_on='year', how='inner', validate='1:1' ) .select( 'academic_year', 'students', 'visa_type', 'count' ) )
The preceding code will return the following output:
>> ComputeError: the join keys did not fulfil 1:1 validation
How it works...
The .join()
method provides parameters to specify columns and additional configurations for your join. I used the left_on
and right_on
parameters to specify join columns as they had different names across two DataFrames. I could’ve used the on
parameter instead if they had the same name, which is usually the case if your data is cleaned and validated in upstream processes.
There are several join types available, such as inner
, left
, outer
, semi
, anti
, cross
, and outer_coalesce
. To learn more about join types, please refer to the resources in the See also section.
The validate
parameter helps check the uniqueness of the join keys/columns. As shown in step 2, it’ll raise an error if the validation fails.
Important
As of the time of writing, the validate
parameter is not supported in the streaming engine. Also, it only works when joined with single columns.
You can apply the join operation in LazyFrames as well as in DataFrames.
Note
The coalesce
parameter for a left join decides whether to keep the join column(s) from the right table or not. Only when you set it to False, you'll get the join column with the _right suffix.
There’s more...
You can apply an asof
join in Polars. It is like a left join, but instead of matching on equal keys, it matches based on the closest key. Here’s a simple example of how an asof
join works:
a = pl.DataFrame({ 'int': [1,2,3], 'value': [10,20,30] }).set_sorted('int') b = pl.DataFrame({ 'int': [4,5,6] }).set_sorted('int') b.join_asof(a, on='int', strategy='backward')
The preceding code will return the following output:
Figure 8.16 – The joined DataFrame using an asof join
Note that you need to set the sorting for each DataFrame for this to work. For more information, please refer to the Polars documentation page linked in the See also section.
See also
To learn more about joins in Polars, please use the following resources:
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join_asof.html
- https://docs.pola.rs/user-guide/transformations/joins/