DataFrame
While pd.Series is the building block, pd.DataFrame is the primary and most commonly used object in pandas. When people think of pandas, they typically envision working with a pd.DataFrame.
In most analysis workflows, you will import your data from another source, but for now, we will show you how to construct a pd.DataFrame directly (input/output is covered in Chapter 4, The pandas I/O System).
How to do it
The most basic way to construct a pd.DataFrame is from a two-dimensional sequence, like a list of lists:
pd.DataFrame([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
])
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
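To make the automatic numbering concrete, the following minimal sketch rebuilds the same pd.DataFrame and inspects its labels. The variable names df, row_labels, and column_labels are chosen here for illustration and are not part of the recipe.

```python
import pandas as pd

# Same list of lists as in the recipe; no labels are supplied.
df = pd.DataFrame([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
])

# Both axes receive default integer labels that start at 0.
row_labels = list(df.index)
column_labels = list(df.columns)

print(row_labels)     # [0, 1, 2]
print(column_labels)  # [0, 1, 2]
print(df.shape)       # (3, 3)
```

Because both axes are numbered the same way, an expression like df[0] selects the column labeled 0 rather than the first row, which is one reason explicit column labels tend to be easier to read.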
With a list of lists, pandas will automatically number the row and column labels for you. Typically, users of pandas will at least provide labels for columns, as doing so makes indexing and selecting from a pd.DataFrame much more intuitive (see Chapter 2, Selection and Assignment, for an introduction to indexing and selecting). To label your columns when constructing a pd.DataFrame from a list of lists, you can provide a columns= argument to the constructor:
pd.DataFrame([
    [1, 2],
    [4, 8],
], columns=["col_a", "col_b"])
   col_a  col_b
0      1      2
1      4      8
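As a brief illustration of why column labels help, the sketch below selects a column by name from the pd.DataFrame built above. Selection itself is covered properly in Chapter 2; the variable names df and col_b_values are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2],
    [4, 8],
], columns=["col_a", "col_b"])

# With labels in place, a column can be requested by name
# instead of by its integer position.
col_b_values = df["col_b"].tolist()

print(list(df.columns))  # ['col_a', 'col_b']
print(col_b_values)      # [2, 8]
```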
Instead of using a list of lists, you could also provide a dictionary. The keys of the dictionary will be used as column labels, and each value of the dictionary will supply the data placed in that column of the pd.DataFrame:
pd.DataFrame({
    "first_name": ["Jane", "John"],
    "last_name": ["Doe", "Smith"],
})
  first_name last_name
0       Jane       Doe
1       John     Smith
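The mapping from dictionary keys to column labels can be checked directly. This is a minimal sketch using the same data; the names people and column_order are introduced only for this example.

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["Jane", "John"],
    "last_name": ["Doe", "Smith"],
})

# Columns appear in the order the keys were inserted into the dictionary.
column_order = list(people.columns)
print(column_order)  # ['first_name', 'last_name']

# Each dictionary value fills its column from top to bottom.
print(people["last_name"].tolist())  # ['Doe', 'Smith']
```

Note that the row labels are still numbered automatically, since the dictionary only describes the columns.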
In the above example, our dictionary values were lists of strings, but the pd.DataFrame does not strictly require lists. Any sequence will work, including a pd.Series:
ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
pd.DataFrame({ser1.name: ser1, ser2.name: ser2})
   int8_col  int16_col
0         0          0
1         1          1
2         2          2
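One consequence worth seeing is that each column keeps the data type of the pd.Series it came from. The sketch below checks this through the dtypes attribute and also builds a second pd.DataFrame from a tuple and a NumPy array, as examples of other sequences. The names df, dtype_names, and mixed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
df = pd.DataFrame({ser1.name: ser1, ser2.name: ser2})

# Each column retains the dtype of its source pd.Series.
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)  # {'int8_col': 'int8', 'int16_col': 'int16'}

# Other sequences, such as a tuple or a NumPy array, also work as values.
mixed = pd.DataFrame({
    "from_tuple": (0, 1, 2),
    "from_array": np.arange(3),
})
print(mixed.shape)  # (3, 2)
```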