You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805121152

Length 394 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Yuki Kakegawa

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Getting Started with Python Polars FREE CHAPTER

2. Chapter 2: Reading and Writing Files

3. Chapter 3: An Introduction to Data Analysis in Python Polars

4. Chapter 4: Data Transformation Techniques

5. Chapter 5: Handling Missing Data

6. Chapter 6: Performing String Manipulations

7. Chapter 7: Working with Nested Data Structures

8. Chapter 8: Reshaping and Tidying Data

9. Chapter 9: Time Series Analysis

10. Chapter 10: Interoperability with Other Python Libraries

11. Chapter 11: Working with Common Cloud Data Sources

12. Chapter 12: Testing and Debugging in Polars

13. Index

Why subscribe?

14. Other Books You May Enjoy

Creating, modifying, and deleting columns

The key methods we’ll cover in this recipe are .select(), .with_columns(), and .drop(). We’ve seen in the previous recipe that both .select() and .with_columns() are essential for column selection in Polars.

In this recipe, you’ll learn how to leverage those methods to create, modify, and delete columns using Polars’ expressions.

Getting ready

This recipe requires the titanic dataset. Read it into your code by typing the following:

df = pl.read_csv('../data/titanic_dataset.csv')

How to do it...

Let’s dive into the recipe. Here are the steps:

Create a column based on another column:
```
df.with_columns(
    pl.col('Fare').max().alias('Max Fare')
).head()
```
The preceding code will return the following output:

Figure 1.25 – A DataFrame with a new column

We added a new column called max_fare. Its value is the max of the Fare column. We’ll cover aggregations in more detail in a later chapter.

You can name your column without using .alias(). You’ll need to specify the name at the beginning of your expression. Note that you won’t be able to use spaces in the column name with this approach:

df.with_columns(
    max_fare=pl.col('Fare').max()
).head()

The preceding code will return the following output:

Figure 1.26 – A different way to name a new column

If you don’t specify a new column name, then the base column will be overwritten:

df.with_columns(
    pl.col('Fare').max()
).head()

The preceding code will return the following output:

Figure 1.27 – A new column with the same name as the base column

To demonstrate how you can use multiple expressions for a column, let’s add another logic to this column:

df.with_columns(
    (pl.col('Fare').max() - pl.col('Fare').mean()).alias('Max Fare - Avg Fare')
).head()

The preceding code will return the following output:

Figure 1.28 – A new column with more complex expressions

We added a column that calculates the max and mean of the Fare column and does a subtraction. This is just one example of how you can use Polars’ expressions.

Create a column with a literal value using the pl.lit() method:
```
df.with_columns(pl.lit('Titanic')).head()
```
The preceding code will return the following output:

Figure 1.29 – The output with literal values

Add a row count with .with_row_index():
```
df.with_row_index().head()
```
The preceding code will return the following output:

Figure 1.30 – The output with a row number

Modify values in a column:
```
df.with_columns(pl.col('Sex').str.to_titlecase()).head()
```
The preceding code will return the following output:

Figure 1.31 – The output of the modified column

We transformed the Sex column into title case .str is what gives you access to string methods in Polars, which we’ll cover in Chapter 6, Performing String Manipulations.

You can delete a column with the help of the following code:
```
df.drop(['Pclass', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']).head()
```
The preceding code will return the following output:

Figure 1.32 – The output after dropping columns

You can use .select() instead to choose the columns that you want to keep:
```
df.select(['PassengerId', 'Survived', 'Sex', 'Age', 'Fare']).head()
```
The preceding code will return the following output:

Figure 1.33 – DataFrame with selected columns

How it works...

The pl.lit() method can be used whenever you want to specify a literal or constant value. You can use not only a string value but also various data types such as integer, boolean, list, and so on.

When creating or adding a new column, there are three ways you can name it:

Use .alias().
Define the column name at the beginning of your expression, like the one you saw earlier: max_fare=pl.col('Fare').max(). You can’t use spaces in your column name.
Don’t specify the column name, which would replace the existing column if the new column were created based on another column. Alternatively, the column will be named literal when using pl.lit().

Both the.select() and .with_columns() methods can create and modify columns. The difference is in whether you keep the unspecified columns or drop them. Essentially, you can use the .select() method for dropping columns while adding new columns. That way, you may avoid using both the.with_columns() and .drop() methods in combination when .select() alone can do the job.

Also, note that new or modified columns don’t persist when using the .select() or .with_columns() methods. You’ll need to store the result into a variable if needed:

df = df.with_columns(
    pl.col('Fare').max()
)

There’s more...

For best practice, you should put all your expressions into one method where possible instead of using multiple .with_columns(), for example. This makes sure that expressions are executed in parallel, whereas if you use multiple .with_columns(), then Polars’s engine might not recognize that they run in parallel.

You should write your code like this:

best_practice = (
    df.with_columns(
        pl.col('Fare').max().alias('Max Fare'),
        pl.lit('Titanic'),
        pl.col('Sex').str.to_titlecase()
    )
)

Avoid writing your code like this:

not_so_good_practice = (
    df
    .with_columns(pl.col('Fare').max().alias('Max Fare'))
    .with_columns(pl.lit('Titanic'))
    .with_columns(pl.col('Sex').str.to_titlecase())
)

Both of the preceding queries produce the following output:

Figure 1.34 – The output with new columns added

Note

You won’t be able to add a new column on top of another new column you’re trying to define in the same method (such as the .with_columns() method). The only time when you’ll need to use multiple methods is when your new column depends on another new column in your dataset that doesn’t yet exist.

You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Table of Contents (15) Chapters

Creating, modifying, and deleting columns

Getting ready

How to do it...

How it works...

There’s more...

See also

Authors (1)

Personalised recommendations for you