You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805121152

Length 394 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Yuki Kakegawa

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Getting Started with Python Polars FREE CHAPTER

2. Chapter 2: Reading and Writing Files

3. Chapter 3: An Introduction to Data Analysis in Python Polars

4. Chapter 4: Data Transformation Techniques

5. Chapter 5: Handling Missing Data

6. Chapter 6: Performing String Manipulations

7. Chapter 7: Working with Nested Data Structures

8. Chapter 8: Reshaping and Tidying Data

9. Chapter 9: Time Series Analysis

10. Chapter 10: Interoperability with Other Python Libraries

11. Chapter 11: Working with Common Cloud Data Sources

12. Chapter 12: Testing and Debugging in Polars

13. Index

Why subscribe?

14. Other Books You May Enjoy

Processing larger-than-RAM datasets

One of the outstanding features of Polars is its streaming mode. It’s part of the lazy API and it allows you to process data that is larger than the memory available on your machine. With streaming mode, you let your machine handle huge data by processing it in batches. You would not be able to process such large data otherwise.

One thing to keep in mind is that not all lazy operations are supported in streaming mode, as it’s still in development. You can still use any lazy operation in your query, but ultimately, the Polars engine will determine whether the operation can be executed in streaming or not. If the answer is no, then Polars runs the query using non-streaming mode. We can expect that this feature will include more lazy operations and become more sophisticated over time.

In this recipe, we’ll demonstrate how streaming mode works by creating a simple query to read a .csv file that’s larger than the available RAM on a machine and process it using streaming mode.

Getting ready

You’d need a dataset that’s larger than the available RAM on your machine to test streaming mode. I’m using a taxi trips dataset, which has over 80 GB on disk. You can download the dataset from this website: https://data.cityofchicago.org/Transportation/Taxi-Trips-2013-2023-/wrvz-psew/about_data.

How to do it...

Here are the steps for the recipe.

Import the Polars library:
```
import polars as pl
```
Read the csv file in streaming mode by adding a streaming=True parameter inside .collect(). The file name string should specify where your file is located (mine is in my Downloads folder):
```
taxi_trips = (
    pl.scan_csv('~/Downloads/Taxi_Trips.csv')
    .collect(streaming=True)
)
```
Check the first five rows with .head() to see what the data looks like:
```
taxi_trips.head()
```
The preceding code will return the following output:

Figure 1.37 – The first five rows of the taxi trip dataset

How it works...

There are two things you should be aware of in the example code:

It uses .scan_read() instead of .read_csv()
A parameter is specified in .collect(). It becomes .collect(streaming=True).

We will enable streaming mode by setting streaming=True inside the .collect() method. In this specific example, I’m only reading a .csv file, nothing complex. I’m using the .scan_read() method to read with lazy mode.

In theory, without streaming mode, I wouldn’t be able to process this dataset. This is because my laptop has 64 GB of RAM (yes, my laptop has a decent amount of memory!), which is lower than the size of the dataset on disk, which is more than 80 GB.

It took about two minutes for my laptop to process the data in streaming mode. Without streaming mode, I would get an out-of-memory error. You can confirm this by running your code without streaming=True in the .collect() method.

There’s more...

If you’re doing other operations other than reading the data, such as aggregations and filtering, then Polars (with LazyFrame) might be able to optimize your query so that it doesn’t need to read the whole dataset in memory. This means that you might not even need to utilize streaming mode to work with data larger than your RAM. Aggregations and filtering essentially summarize the data or reduce the number of rows, which leads to not needing to read in the whole dataset.

Let’s say that you apply a simple group by and aggregation over a column like the one in the following code. You’ll see that you can run it without using streaming mode (depending on your chosen dataset and the available RAM on your machine):

trip_total_by_pay_type = (
    pl.scan_csv('~/Downloads/Taxi_Trips.csv')
    .group_by('Payment Type')
    .agg(pl.col('Trip Total').sum())
    .collect()
)
trip_total_by_pay_type.head()

The preceding code will return the following output:

Figure 1.38 – Trip total by payment type

With that said, it may still be a good idea to use streaming=True when there is a possibility that the size of the dataset goes over your available RAM or that data may grow in size over time.