You're reading from Polars Cookbook Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805121152

Length 394 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Yuki Kakegawa

Table of Contents (15) Chapters

Preface

1. Chapter 1: Getting Started with Python Polars

2. Chapter 2: Reading and Writing Files

3. Chapter 3: An Introduction to Data Analysis in Python Polars

4. Chapter 4: Data Transformation Techniques

5. Chapter 5: Handling Missing Data

6. Chapter 6: Performing String Manipulations

7. Chapter 7: Working with Nested Data Structures

8. Chapter 8: Reshaping and Tidying Data

9. Chapter 9: Time Series Analysis

10. Chapter 10: Interoperability with Other Python Libraries

11. Chapter 11: Working with Common Cloud Data Sources

12. Chapter 12: Testing and Debugging in Polars

13. Index

14. Other Books You May Enjoy

Introducing key features in Polars

Polars is a blazingly fast DataFrame library that allows you to manipulate and transform your structured data. It is designed to work on a single machine, utilizing all the available CPU cores.

There are many other DataFrame libraries in Python, including pandas and PySpark. Polars is one of the newest. It is performant, and it has been gaining popularity quickly.

A DataFrame is a two-dimensional structure that contains one or more Series. A Series is a one-dimensional structure, similar to an array or a list. You can think of a DataFrame as a table and a Series as a column. Polars offers much more than these two structures, however: several concepts and features make it a fast, high-performance DataFrame library. A basic understanding of these key features will help you learn and use Polars effectively.

At a high level, these are the key features that make Polars unique:

Speed and efficiency
Expressions
The lazy API

Speed and efficiency

We know that Polars is fast and efficient. But what has contributed to making Polars the way it is today? There are a few main components that contribute to its speed and efficiency:

The Rust programming language
The Apache Arrow columnar format
The lazy API

Polars is written in Rust, a low-level programming language that offers performance and control over memory comparable to C/C++. Because Rust has strong support for concurrency, Polars can execute many operations in parallel, utilizing all the CPU cores available on your machine without any configuration. We call that embarrassingly parallel execution.

Also, Polars is based on Apache Arrow’s columnar memory format. That means that Polars can not only take advantage of the optimizations that columnar memory enables but also share data with other Arrow-based tools at little or no cost, by passing pointers to the original data instead of copying it each time.

Finally, the lazy API makes Polars even faster and more efficient by implementing several other query optimizations. We’ll cover that shortly in the section The lazy API.

These core components have essentially made it possible to implement the features that make Polars so fast and efficient.

Expressions

Expressions are what make Polars’s syntax readable and easy to use. This expressive syntax allows you to write complex logic in an organized, efficient fashion. Simply put, an expression takes a Series as an input and gives back a Series as an output (think of a Series like a column in a table or DataFrame). You can combine multiple expressions to build complex queries, and this chaining is what makes your queries so powerful.

An expression takes a Series and gives back a Series as shown in the following diagram:

Figure 1.1 – The Polars expressions mechanism

Multiple expressions work on a Series one after another as shown in the following diagram:

Figure 1.2 – Chained Polars expressions

Context is an important concept when working with expressions. A context is essentially the environment in which an expression is evaluated. In other words, an expression does its work only once you pass it to a context. Of the contexts you have access to in Polars, these are the three main ones:

Selection
Filtering
Group by/aggregation

We’ll look at specific examples and use cases of how you can utilize expressions in these contexts throughout the book. You’ll unlock the power of Polars as you learn to understand and use expressions extensively in your code.

Expressions are part of the clean and simple Polars API. This provides you with better ergonomics and usability for building your data transformation logic in Polars.

The lazy API

The lazy API makes Polars even faster and more efficient by applying additional optimizations such as predicate pushdown and projection pushdown. It also optimizes the query plan automatically, meaning that Polars figures out the most efficient way of executing your query. You can access the lazy API by using LazyFrame, which is the lazily evaluated counterpart of DataFrame.

The lazy API uses lazy evaluation, which is a strategy that involves delaying the evaluation of an expression until the resulting value is needed. With the lazy API, Polars processes your query end-to-end instead of processing it one operation at a time. You can see the full list of optimizations available with the lazy API in the Polars user guide here: https://pola-rs.github.io/polars/user-guide/lazy/optimizations/.

One other feature that’s available in the lazy API is streaming processing or the streaming API. It allows you to process data that’s larger than the amount of memory available on your machine. For example, if you have 16 GB of RAM on your laptop, you may be able to process 50 GB of data.

However, keep in mind that there is a limitation. Although this larger-than-RAM processing is supported for many operations, not all operations support it (as of the time of authoring the book).

Note

Eager evaluation is another evaluation strategy in which an expression is evaluated as soon as it is called. The Polars DataFrame and other DataFrame libraries like pandas use it by default.
