You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805121152

Length 394 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Yuki Kakegawa

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Getting Started with Python Polars FREE CHAPTER

2. Chapter 2: Reading and Writing Files

3. Chapter 3: An Introduction to Data Analysis in Python Polars

4. Chapter 4: Data Transformation Techniques

5. Chapter 5: Handling Missing Data

6. Chapter 6: Performing String Manipulations

7. Chapter 7: Working with Nested Data Structures

8. Chapter 8: Reshaping and Tidying Data

9. Chapter 9: Time Series Analysis

10. Chapter 10: Interoperability with Other Python Libraries

11. Chapter 11: Working with Common Cloud Data Sources

12. Chapter 12: Testing and Debugging in Polars

13. Index

Why subscribe?

14. Other Books You May Enjoy

The Polars LazyFrame

One of the unique features that makes Polars even faster and more efficient is its lazy API. The lazy API uses lazy evaluation, a technique that delays the evaluation of an expression until its value is needed. That means your query is only executed when it’s needed. This allows Polars to apply query optimizations because Polars can look at and execute multiple transformation steps at once by looking at the computation graph as a whole only when you tell it to do so. On the other hand, with eager evaluation (another evaluation strategy you’d use with DataFrame), you process data every time per expression. Essentially, lazy evaluation gives you more efficient ways to process your data.

You can access the Polars lazy API by using what we call LazyFrame. As explained earlier, LazyFrame allows for automatic query optimizations and larger-than-RAM processing.

LazyFrame is the proffered way of using Polars simply because it has more features and abilities to handle your data better. In this recipe, you’ll learn how to create a LazyFrame as well as how to use useful methods and functions associated with LazyFrame.

How to do it...

We’ll explore a LazyFrame by creating it first. Here are the steps:

Create a LazyFrame from scratch:

data = {'name': ['Sarah', 'Mike', 'Bob', 'Ashley']}
lf = pl.LazyFrame(data)
type(lf)

The preceding code will return the following output:

>> polars.lazyframe.frame.LazyFrame

Use the .collect() method to instruct Polars to process data:
```
lf.collect().head()
```
The preceding code will return the following output:

Figure 1.9 – LazyFrame output

Create a LazyFrame from a .csv file using the .scan_csv() method:
```
lf = pl.scan_csv('../data/titanic_dataset.csv')
lf.head().collect()
```
The preceding code will return the following output:

Figure 1.10 – The output of using .scan_csv()

Convert a LazyFrame from a DataFrame with the .lazy() method:
```
df = pl.read_csv('../data/titanic_dataset.csv')
df.lazy().head(3).collect()
```
The preceding code will return the following output:

Figure 1.11 – Convert a DataFrame into a LazyFrame

Show the schema and width of LazyFrame:

lf.collect_schema()

The preceding code will return the following output:

>> Schema([('PassengerId', Int64), ('Survived', Int64), ('Pclass', Int64), ('Name', String), ('Sex', String), ('Age', Float64), ('SibSp', Int64), ('Parch', Int64), ('Ticket', String), ('Fare', Float64), ('Cabin', String), ('Embarked', String)])

lf.collect_schema().len()

The preceding code will return the following output:

>> 12

How it works...

The structure of LazyFrame is the same as that of DataFrame, but LazyFrame doesn’t process your query until it’s told to do so using .collect(). You can use this to trigger the execution of the computation graph or query of a LazyFrame. This operation materializes a LazyFrame into a DataFrame.

Note

You should keep in mind that some operations that are available in DataFrame are not available in LazyFrame (such as .pivot()). These operations require Polars to know the whole structure of the data, which LazyFrame is not capable of handling. However, once you use .collect() to materialize a DataFrame, you’ll be able to use all the available DataFrame methods on it.

The way in which you create a LazyFrame is similar to the method for creating a DataFrame. After you have created a LazyFrame, and once it’s been materialized with .collect(), LazyFrame is converted to DataFrame. That’s why you can call .head() on it after calling .collect().

Note

You may be aware of the .fetch() method that was available until Polars version 0.20.31. While it was useful for debugging purposes, there were some gotchas that were confusing to users. Since Polars version 1.0.0, this method is deprecated. It’s still available as ._fetch() for development purposes.

You will notice that when you read a .csv file or any other file in LazyFrame, you use scan instead of read. This allows you to read files in lazy mode, whereby your column selections and filtering get pushed down to the scan level. You essentially read only the data necessary for the operations you’re performing in your code. You can see that that’s much more efficient than reading the whole dataset first and then filtering it down. Again, reading and writing files will be covered in the next chapter.

LazyFrame has similar attributes to DataFrame. However, you’ll need to access those via the .collect_schema() method. Note that the same method is also available in DataFrame.

Note

Since Polars version 1.0.0, you’ll get a performance warning when using LazyFrame attributes such as .schema, .width, .dtypes, and .columns. The .collect_schema() method replaces those methods. With recent improvements and changes made to the lazy engine, resolving the schema is no longer free and it can be relatively expensive. To solve this, the .collect_schema() method was added.

The good news is that it’s easy to go back and forth between LazyFrame and DataFrame with .lazy() and .collect(). This allows you to use LazyFrame where possible and convert to DataFrame if certain operations are not available in the lazy API or if you don’t need features such as automatic query optimization and larger-than-RAM processing for your use case.

There’s more...

One unique feature of LazyFrame is the ability to inspect the query plan of your code. You can use either the .show_graph() or the .explain() method. The .show_graph() method visualizes the query plan, whereas the .explain() method simply prints it out using .show_graph():

(
    lf
    .select(pl.col('Name', 'Age'))
    .show_graph()
)

The preceding code will return the following output:

Figure 1.12 – A query execution plan

π (pi) indicates the column selection and σ (sigma) indicates the filtering conditions.

Note

I haven’t introduced the .filter() method yet, but just know that it’s used to filter data (it’s obvious, isn’t it?). We’ll cover it in a later recipe in this chapter: Selecting columns and filtering data.

By default, .show_graph() gives you the optimized query plan. You can customize its parameters to choose which optimization to apply. You can find more information on that here: https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.show_graph.html.

For now, here’s how to display the non-optimized version:

(
    lf
    .select(pl.col('Name', 'Age'))
    .show_graph(optimized=False)
)

The preceding code will return the following output:

Figure 1.13 – An optimized query execution plan

If you look carefully at both the optimized and the non-optimized version, you’ll notice that the former indicates two columns (π 2/12) whereas the latter indicates all columns (π */12).

Let’s try calling the .explain() method:

(
    lf
    .select(pl.col('Name', 'Age'))
    .explain()
)

The preceding code will return the following output:

Figure 1.14 – A query execution plan in text

You can tweak parameters with the .explain() method as well. You can find more information here: https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.explain.html.

The output of the .explain() method can be hard to read. To make it more readable, let’s try using Python’s built-in print() function with the separator specified:

print(
    lf
    .select(pl.col('Name', 'Age'))
    .explain()
    , sep='\n'
)

The preceding code will return the following output:

Figure 1.15 – A formatted query execution plan in text

We will dive more into inspecting and optimizing the query plan in Chapter 12, Testing and Debugging in Polars

You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Table of Contents (15) Chapters

The Polars LazyFrame

How to do it...

How it works...

There’s more...

See also

Authors (1)

Personalised recommendations for you