Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Polars Cookbook

You're reading from   Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Arrow left icon
Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781805121152
Length 394 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Yuki Kakegawa Yuki Kakegawa
Author Profile Icon Yuki Kakegawa
Yuki Kakegawa
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Chapter 1: Getting Started with Python Polars FREE CHAPTER 2. Chapter 2: Reading and Writing Files 3. Chapter 3: An Introduction to Data Analysis in Python Polars 4. Chapter 4: Data Transformation Techniques 5. Chapter 5: Handling Missing Data 6. Chapter 6: Performing String Manipulations 7. Chapter 7: Working with Nested Data Structures 8. Chapter 8: Reshaping and Tidying Data 9. Chapter 9: Time Series Analysis 10. Chapter 10: Interoperability with Other Python Libraries 11. Chapter 11: Working with Common Cloud Data Sources 12. Chapter 12: Testing and Debugging in Polars 13. Index 14. Other Books You May Enjoy

Creating, modifying, and deleting columns

The key methods we’ll cover in this recipe are .select(), .with_columns(), and .drop(). We’ve seen in the previous recipe that both .select() and .with_columns() are essential for column selection in Polars.

In this recipe, you’ll learn how to leverage those methods to create, modify, and delete columns using Polars’ expressions.

Getting ready

This recipe requires the titanic dataset. Read it into your code by typing the following:

df = pl.read_csv('../data/titanic_dataset.csv')

How to do it...

Let’s dive into the recipe. Here are the steps:

  1. Create a column based on another column:
    df.with_columns(
        pl.col('Fare').max().alias('Max Fare')
    ).head()

    The preceding code will return the following output:

Figure 1.25 – A DataFrame with a new column

Figure 1.25 – A DataFrame with a new column

We added a new column called max_fare. Its value is the max of the Fare column. We’ll cover aggregations in more detail in a later chapter.

You can name your column without using .alias(). You’ll need to specify the name at the beginning of your expression. Note that you won’t be able to use spaces in the column name with this approach:

df.with_columns(
    max_fare=pl.col('Fare').max()
).head()

The preceding code will return the following output:

Figure 1.26 – A different way to name a new column

Figure 1.26 – A different way to name a new column

If you don’t specify a new column name, then the base column will be overwritten:

df.with_columns(
    pl.col('Fare').max()
).head()

The preceding code will return the following output:

Figure 1.27 – A new column with the same name as the base column

Figure 1.27 – A new column with the same name as the base column

To demonstrate how you can use multiple expressions for a column, let’s add another logic to this column:

df.with_columns(
    (pl.col('Fare').max() - pl.col('Fare').mean()).alias('Max Fare - Avg Fare')
).head()

The preceding code will return the following output:

Figure 1.28 – A new column with more complex expressions

Figure 1.28 – A new column with more complex expressions

We added a column that calculates the max and mean of the Fare column and does a subtraction. This is just one example of how you can use Polars’ expressions.

  1. Create a column with a literal value using the pl.lit() method:
    df.with_columns(pl.lit('Titanic')).head()

    The preceding code will return the following output:

Figure 1.29 – The output with literal values

Figure 1.29 – The output with literal values

  1. Add a row count with .with_row_index():
    df.with_row_index().head()

    The preceding code will return the following output:

Figure 1.30 – The output with a row number

Figure 1.30 – The output with a row number

  1. Modify values in a column:
    df.with_columns(pl.col('Sex').str.to_titlecase()).head()

    The preceding code will return the following output:

Figure 1.31 – The output of the modified column

Figure 1.31 – The output of the modified column

We transformed the Sex column into title case .str is what gives you access to string methods in Polars, which we’ll cover in Chapter 6, Performing String Manipulations.

  1. You can delete a column with the help of the following code:
    df.drop(['Pclass', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']).head()

    The preceding code will return the following output:

Figure 1.32 – The output after dropping columns

Figure 1.32 – The output after dropping columns

  1. You can use .select() instead to choose the columns that you want to keep:
    df.select(['PassengerId', 'Survived', 'Sex', 'Age', 'Fare']).head()

    The preceding code will return the following output:

Figure 1.33 – DataFrame with selected columns

Figure 1.33 – DataFrame with selected columns

How it works...

The pl.lit() method can be used whenever you want to specify a literal or constant value. You can use not only a string value but also various data types such as integer, boolean, list, and so on.

When creating or adding a new column, there are three ways you can name it:

  • Use .alias().
  • Define the column name at the beginning of your expression, like the one you saw earlier: max_fare=pl.col('Fare').max(). You can’t use spaces in your column name.
  • Don’t specify the column name, which would replace the existing column if the new column were created based on another column. Alternatively, the column will be named literal when using pl.lit().

Both the.select() and .with_columns() methods can create and modify columns. The difference is in whether you keep the unspecified columns or drop them. Essentially, you can use the .select() method for dropping columns while adding new columns. That way, you may avoid using both the.with_columns() and .drop() methods in combination when .select() alone can do the job.

Also, note that new or modified columns don’t persist when using the .select() or .with_columns() methods. You’ll need to store the result into a variable if needed:

df = df.with_columns(
    pl.col('Fare').max()
)

There’s more...

For best practice, you should put all your expressions into one method where possible instead of using multiple .with_columns(), for example. This makes sure that expressions are executed in parallel, whereas if you use multiple .with_columns(), then Polars’s engine might not recognize that they run in parallel.

You should write your code like this:

best_practice = (
    df.with_columns(
        pl.col('Fare').max().alias('Max Fare'),
        pl.lit('Titanic'),
        pl.col('Sex').str.to_titlecase()
    )
)

Avoid writing your code like this:

not_so_good_practice = (
    df
    .with_columns(pl.col('Fare').max().alias('Max Fare'))
    .with_columns(pl.lit('Titanic'))
    .with_columns(pl.col('Sex').str.to_titlecase())
)

Both of the preceding queries produce the following output:

Figure 1.34 – The output with new columns added

Figure 1.34 – The output with new columns added

Note

You won’t be able to add a new column on top of another new column you’re trying to define in the same method (such as the .with_columns() method). The only time when you’ll need to use multiple methods is when your new column depends on another new column in your dataset that doesn’t yet exist.

See also

Please refer to these resources for more information:

You have been reading a chapter from
Polars Cookbook
Published in: Aug 2024
Publisher: Packt
ISBN-13: 9781805121152
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image