A Series is a single column of data from a DataFrame. It is a single dimension of data, composed of just an index and the data.
Selecting a single column of data as a Series
Getting ready
This recipe examines two different syntaxes to select a Series, one with the indexing operator and the other using dot notation.
How to do it...
- Pass a column name as a string to the indexing operator to select a Series of data:
>>> movie = pd.read_csv('data/movie.csv')
>>> movie['director_name']
- Alternatively, you may use the dot notation to accomplish the same task:
>>> movie.director_name
- Inspect the Series anatomy:
- Verify that the output is a Series:
>>> type(movie['director_name'])
pandas.core.series.Series
How it works...
Python has several built-in objects for containing data, such as lists, tuples, and dictionaries. All three of these objects use the indexing operator to select their data. DataFrames are more powerful and complex containers of data, but they too use the indexing operator as the primary means to select data. Passing a single string to the DataFrame indexing operator returns a Series.
The visual output of the Series is less stylized than the DataFrame. It represents a single column of data. Along with the index and values, the output displays the name, length, and data type of the Series.
Alternatively, while not recommended and subject to error, a column of data may be accessed using the dot notation with the column name as an attribute. Although it works with this particular example, it is not best practice and is prone to error and misuse. Column names with spaces or special characters cannot be accessed in this manner. This operation would have failed if the column name was director name. Column names that collide with DataFrame methods, such as count, also fail to be selected correctly using the dot notation. Assigning new values or deleting columns with the dot notation might give unexpected results. Because of this, using the dot notation to access columns should be avoided with production code.
There's more...
Why would anyone ever use the dot notation syntax if it causes trouble? Programmers are lazy, and there are fewer characters to type. But mainly, it is extremely handy when you want to have the autocomplete intelligence available. For this reason, column selection by dot notation will sometimes be used in this book. The autocomplete intelligence is fantastic for helping you become aware of all the possible attributes and methods available to an object.
The intelligence will fail to work when attempting to chain an operation after use of the indexing operator from step 1 but will continue to work with the dot notation from step 2. The following screenshot shows the pop-up window that appears after the selection of the director_name with the dot notation. All the possible attributes and methods will appear in a list after pressing Tab following the dot:
Yet another reason to be aware of the dot notation is the proliferation of its use online at the popular question and answer site Stack Overflow. Also, notice that the old column name is now the name of the Series and has actually become an attribute:
>>> director = movie['director_name']
>>> director.name
'director_name'
It is possible to turn this Series into a one-column DataFrame with the to_frame method. This method will use the Series name as the new column name:
>>> director.to_frame()
See also
- To understand how Python objects gain the capability to use the indexing operator, see the Python documentation on the __getitem__ special method (http://bit.ly/2u5ISN6)
- Refer to the Selecting multiple DataFrame columns recipe from Chapter 2, Essential DataFrame operations