The DataFrame is the core component of Polars, so it is worth learning its basics as you begin your journey in Polars. A DataFrame is like a table with rows and columns. It’s the fundamental structure that other Polars components are built around.
If you’ve used the pandas library before, you might be surprised to learn that Polars doesn’t have the concept of an index. In pandas, an index is a series of labels that identify each row, and it helps you select and align the rows of your DataFrame. This differs from the indexes you might see in SQL databases: an index in pandas is not meant to speed up data retrieval.
You might’ve found indexes in pandas useful, but I bet that they also gave you some headaches. Polars avoids the complexity that comes with an index. If you’d like to learn more about the differences in concepts between pandas and Polars, you can look at this page in the Polars documentation: https://pola-rs.github.io/polars/user-guide/migration/pandas.
In this recipe, we’ll cover some ways to create a Polars DataFrame, as well as useful methods to extract DataFrame attributes.
Getting ready
We’ll use a dataset stored in this GitHub repo: https://github.com/PacktPublishing/Polars-Cookbook/blob/main/data/titanic_dataset.csv. Also, make sure that you import the Polars library at the beginning of your code:
import polars as pl
How to do it...
We’ll start by creating a DataFrame and exploring its attributes:
- Create a DataFrame from scratch with a Python dictionary as the input:
df = pl.DataFrame({
'nums': [1,2,3,4,5],
'letters': ['a','b','c','d','e']
})
df.head()
The preceding code will return the following output:
Figure 1.3 – The output of an example DataFrame
- Create a DataFrame by reading a .csv file. Then take a peek at the dataset:
df = pl.read_csv('../data/titanic_dataset.csv')
df.head()
The preceding code will return the following output:
Figure 1.4 – The first few rows of the titanic dataset
- Explore DataFrame attributes. .schema gives you each column name paired with its data type in a dictionary-like object. You can get column names and data types in separate lists with .columns and .dtypes:
df.schema
The preceding code will return the following output:
>> Schema([('PassengerId', Int64), ('Survived', Int64), ('Pclass', Int64), ('Name', String), ('Sex', String), ('Age', Float64), ('SibSp', Int64), ('Parch', Int64), ('Ticket', String), ('Fare', Float64), ('Cabin', String), ('Embarked', String)])
df.columns
The preceding code will return the following output:
>> ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
df.dtypes
The preceding code will return the following output:
>> [Int64, Int64, Int64, String, String, Float64, Int64, Int64, String, Float64, String, String]
- You can get the height and width of your DataFrame together with .shape. You can also get them individually with .height and .width:
df.shape
The preceding code will return the following output:
>> (891, 12)
df.height
The preceding code will return the following output:
>> 891
df.width
The preceding code will return the following output:
>> 12
- Check the flags of each column with .flags, which show whether a column is marked as sorted:
df.flags
The preceding code will return the following output:
>> {'PassengerId': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Survived': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Pclass': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Name': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Sex': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Age': {'SORTED_ASC': False, 'SORTED_DESC': False},
'SibSp': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Parch': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Ticket': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Fare': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Cabin': {'SORTED_ASC': False, 'SORTED_DESC': False},
'Embarked': {'SORTED_ASC': False, 'SORTED_DESC': False}}
How it works...
Within pl.DataFrame(), I have added a Python dictionary as the data source. Its keys are strings, and its values are lists. Data types are auto-inferred unless you specify the schema.
The .head() method is handy in your analysis workflow. It shows the first n rows, where n is the number of rows you specify. The default value of n is 5.
pl.read_csv() is one of the common ways to read data into a DataFrame. It involves specifying the path of the file you want to read. It has many parameters that help you load data efficiently, tailored to your use case. We’ll cover the topic of reading and writing files in detail in the next chapter.
There’s more...
The Polars DataFrame can take many forms of data as its source, such as Python dictionaries, Polars Series, NumPy arrays, pandas DataFrames, and so on. You can even use functions such as pl.from_numpy() and pl.from_pandas() to import data directly from other structures instead of using pl.DataFrame().
Also, there are several parameters you can set when creating a DataFrame, including the schema. You can preset the schema of your dataset, or else it will be auto-inferred by Polars’s engine:
import numpy as np
numpy_arr = np.array([[1,1,1], [2,2,2]])
df = pl.from_numpy(numpy_arr, schema={'ones': pl.Float32, 'twos': pl.Int8}, orient='col')
df.head()
The preceding code will return the following output:
Figure 1.5 – A DataFrame created from a NumPy array
You can both read data into a Polars DataFrame from other structures, such as a pandas DataFrame or pyarrow.Table, and output it to them. We’ll cover that in Chapter 10, Interoperability with Other Python Libraries.
You can broadly group the data types in Polars into five categories:
- Numeric
- String/categorical
- Date/time
- Nested
- Other (Boolean, Binary, and so forth)
We’ll look at working with specific types of data throughout this book, but it’s good to know what data types exist early on in the journey of learning about Polars.
You can see a complete list of data types on this Polars documentation page: https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html.
See also
Please refer to each section of the Polars documentation for additional information: