Polars Cookbook

Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x

Product type: Paperback
Published in: Aug 2024
Publisher: Packt
ISBN-13: 9781805121152
Length: 394 pages
Edition: 1st Edition
Author: Yuki Kakegawa
Table of Contents

Preface
Chapter 1: Getting Started with Python Polars
Chapter 2: Reading and Writing Files
Chapter 3: An Introduction to Data Analysis in Python Polars
Chapter 4: Data Transformation Techniques
Chapter 5: Handling Missing Data
Chapter 6: Performing String Manipulations
Chapter 7: Working with Nested Data Structures
Chapter 8: Reshaping and Tidying Data
Chapter 9: Time Series Analysis
Chapter 10: Interoperability with Other Python Libraries
Chapter 11: Working with Common Cloud Data Sources
Chapter 12: Testing and Debugging in Polars
Index
Other Books You May Enjoy

The Polars DataFrame

The DataFrame is the base component of Polars, so it is worth learning its basics as you begin your journey in Polars. A DataFrame is like a table with rows and columns. It’s the fundamental structure that other Polars components are closely tied to.

If you’ve used the pandas library before, you might be surprised to learn that Polars doesn’t have the concept of an index. In pandas, an index is a series of labels that identify each row, and it helps you select and align rows of your DataFrame. It also differs from the indexes you might see in SQL databases, in that a pandas index is not meant to speed up data retrieval.

You might have found the index in pandas useful, but I bet that it also gave you some headaches. Polars avoids the complexity that comes with an index. If you’d like to learn more about the differences in concepts between pandas and Polars, you can look at this page in the Polars documentation: https://pola-rs.github.io/polars/user-guide/migration/pandas.

In this recipe, we’ll cover some ways to create a Polars DataFrame, as well as useful methods to extract DataFrame attributes.

Getting ready

We’ll use a dataset stored in this GitHub repo: https://github.com/PacktPublishing/Polars-Cookbook/blob/main/data/titanic_dataset.csv. Also, make sure that you import the Polars library at the beginning of your code:

import polars as pl

How to do it...

We’ll start by creating a DataFrame and exploring its attributes:

  1. Create a DataFrame from scratch with a Python dictionary as the input:
    df = pl.DataFrame({
        'nums': [1,2,3,4,5],
        'letters': ['a','b','c','d','e']
    })
    df.head()

    The preceding code will return the following output:

Figure 1.3 – The output of an example DataFrame


  2. Create a DataFrame by reading a .csv file. Then take a peek at the dataset:
    df = pl.read_csv('../data/titanic_dataset.csv')
    df.head()

    The preceding code will return the following output:

Figure 1.4 – The first few rows of the titanic dataset


  3. Explore DataFrame attributes. .schema gives you a dictionary-like mapping of each column name to its data type. You can get column names and data types in separate lists with .columns and .dtypes:

df.schema

The preceding code will return the following output:

>> Schema([('PassengerId', Int64), ('Survived', Int64), ('Pclass', Int64), ('Name', String), ('Sex', String), ('Age', Float64), ('SibSp', Int64), ('Parch', Int64), ('Ticket', String), ('Fare', Float64), ('Cabin', String), ('Embarked', String)])

df.columns

The preceding code will return the following output:

>> ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

df.dtypes

The preceding code will return the following output:

>> [Int64, Int64, Int64, String, String, Float64, Int64, Int64, String, Float64, String, String]

  4. You can get the height and width of your DataFrame together as a tuple with .shape. You can also get them individually with .height and .width:

df.shape

The preceding code will return the following output:

>> (891, 12)

df.height

The preceding code will return the following output:

>> 891

df.width

The preceding code will return the following output:

>> 12

You can also inspect the sortedness flags that Polars tracks for each column with .flags:

df.flags

The preceding code will return the following output:

>> {'PassengerId': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Survived': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Pclass': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Name': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Sex': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Age': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'SibSp': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Parch': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Ticket': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Fare': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Cabin': {'SORTED_ASC': False, 'SORTED_DESC': False},
 'Embarked': {'SORTED_ASC': False, 'SORTED_DESC': False}}

How it works...

Within pl.DataFrame(), I passed a Python dictionary as the data source. Its keys are strings, which become the column names, and its values are lists, which become the column values. Data types are auto-inferred unless you specify the schema.

The .head() method is handy in your analysis workflow. It shows the first n rows, where n is the number of rows you specify. The default value of n is set to 5.

pl.read_csv() is one of the common ways to read data into a DataFrame. It involves specifying the path of the file you want to read. It has many parameters that help you load data efficiently, tailored to your use case. We’ll cover the topic of reading and writing files in detail in the next chapter.

There’s more...

A Polars DataFrame can be built from many kinds of source data, such as Python dictionaries, Polars Series, NumPy arrays, pandas DataFrames, and so on. You can also use functions such as pl.from_numpy() and pl.from_pandas() to import data directly from other structures instead of using pl.DataFrame().

Also, there are several parameters you can set when creating a DataFrame, including the schema. You can preset the schema of your dataset, or else it will be auto-inferred by Polars’s engine:

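import numpy as np

# With orient='col', each inner array becomes one column.
numpy_arr = np.array([[1,1,1], [2,2,2]])
# The schema sets both the column names and their data types.
df = pl.from_numpy(numpy_arr, schema={'ones': pl.Float32, 'twos': pl.Int8}, orient='col')
df.head()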

The preceding code will return the following output:

Figure 1.5 – A DataFrame created from a NumPy array


You can both read data into a DataFrame and convert a DataFrame to other structures, such as a pandas DataFrame or a pyarrow.Table. We’ll cover that in Chapter 10, Interoperability with Other Python Libraries.

You can broadly group the data types in Polars into five categories:

  • Numeric
  • String/categorical
  • Date/time
  • Nested
  • Other (Boolean, Binary, and so forth)

We’ll look at working with specific types of data throughout this book, but it’s good to know what data types exist early on in the journey of learning about Polars.

You can see a complete list of data types on this Polars documentation page: https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html.

See also

Please refer to each section of the Polars documentation for additional information:
