Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Pandas 1.x Cookbook

You're reading from   Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Arrow left icon
Product type Paperback
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Length 626 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Theodore Petrou Theodore Petrou
Author Profile Icon Theodore Petrou
Theodore Petrou
Matthew Harrison Matthew Harrison
Author Profile Icon Matthew Harrison
Matthew Harrison
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Pandas Foundations 2. Essential DataFrame Operations FREE CHAPTER 3. Creating and Persisting DataFrames 4. Beginning Data Analysis 5. Exploratory Data Analysis 6. Selecting Subsets of Data 7. Filtering Rows 8. Index Alignment 9. Grouping for Aggregation, Filtration, and Transformation 10. Restructuring Data into a Tidy Form 11. Combining Pandas Objects 12. Time Series Analysis 13. Visualization with Matplotlib, Pandas, and Seaborn 14. Debugging and Testing Pandas 15. Other Books You May Enjoy
16. Index

Selecting columns with methods

Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes and .filter methods are two useful methods to do this.

If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.

How to do it...

  1. Read in the movie dataset. Shorten the column names for display. Use the .get_dtype_counts method to output the number of columns with each specific data type:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return (
    ...         str(col)
    ...         .replace("facebook_likes", "fb")
    ...         .replace("_for_reviews", "")
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.dtypes.value_counts()
    float64    13
    int64       3
    object     12
    dtype: int64
    
  2. Use the .select_dtypes method to select only the integer columns:
    >>> movies.select_dtypes(include="int").head()
       num_voted_users  cast_total_fb  movie_fb
    0           886204           4834     33000
    1           471220          48350         0
    2           275868          11700     85000
    3          1144337         106759    164000
    4                8            143         0
    
  3. If you would like to select all the numeric columns, you may pass the string number to the include parameter:
    >>> movies.select_dtypes(include="number").head()
       num_critics  duration  ...  aspect_ratio  movie_fb
    0        723.0     178.0  ...         1.78      33000
    1        302.0     169.0  ...         2.35          0
    2        602.0     148.0  ...         2.35      85000
    3        813.0     164.0  ...         2.35     164000
    4          NaN       NaN  ...          NaN          0
    
  4. If we wanted integer and string columns we could do the following:
    >>> movies.select_dtypes(include=["int", "object"]).head()
       color        direc/_name  ... conte/ating movie_fb
    0  Color      James Cameron  ...       PG-13    33000
    1  Color     Gore Verbinski  ...       PG-13        0
    2  Color         Sam Mendes  ...       PG-13    85000
    3  Color  Christopher Nolan  ...       PG-13   164000
    4    NaN        Doug Walker  ...         NaN        0
    
  5. To exclude only floating-point columns, do the following:
    >>> movies.select_dtypes(exclude="float").head()
       color director_name  ... content_rating movie_fb
    0  Color  James Ca...   ...        PG-13      33000
    1  Color  Gore Ver...   ...        PG-13          0
    2  Color   Sam Mendes   ...        PG-13      85000
    3  Color  Christop...   ...        PG-13     164000
    4    NaN  Doug Walker   ...          NaN          0
    
  6. An alternative method to select columns is with the .filter method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use the like parameter to search for all the Facebook columns or the names that contain the exact string, fb. The like parameter is checking for substrings in column names:
    >>> movies.filter(like="fb").head()
       director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
    0          0.0       855.0  ...       936.0     33000
    1        563.0      1000.0  ...      5000.0         0
    2          0.0       161.0  ...       393.0     85000
    3      22000.0     23000.0  ...     23000.0    164000
    4        131.0         NaN  ...        12.0         0
    
  7. The .filter method has more tricks (or parameters) up its sleeve. If you use the items parameters, you can pass in a list of column names:
    >>> cols = [
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ...     "director_name",
    ... ]
    >>> movies.filter(items=cols).head()
          actor_1_name  ...      director_name
    0      CCH Pounder  ...      James Cameron
    1      Johnny Depp  ...     Gore Verbinski
    2  Christoph Waltz  ...         Sam Mendes
    3        Tom Hardy  ...  Christopher Nolan
    4      Doug Walker  ...        Doug Walker
    
  8. The .filter method allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:
    >>> movies.filter(regex=r"\d").head()
       actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
    0       855.0  Joel Dav...  ...    Wes Studi       936.0
    1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
    2       161.0  Rory Kin...  ...  Stephani...       393.0
    3     23000.0  Christia...  ...  Joseph G...     23000.0
    4         NaN   Rob Walker  ...          NaN        12.0
    

How it works...

Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes attribute to get the exact data type for each column. The .select_dtypes method accepts either a list or single data type in its include or exclude parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.

The .filter method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items, like, and regex, only one of which can be used at a time.

The like parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex parameter instead to select column names through a regular expression. This particular regular expression, r'\d', represents all digits from zero to nine and matches any string with at least a single digit in it.

The filter method comes with another parameter, items, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf']) runs without error and returns a single column DataFrame.

There's more...

One confusing aspect of .select_dtypes is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:

  • np.number, 'number' – Selects both integers and floats regardless of size
  • np.float64, np.float_, float, 'float64', 'float_', 'float' – Selects only 64-bit floats
  • np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – Respectively selects exactly 16, 32, and 128-bit floats
  • np.floating, 'floating' – Selects all floats regardless of size
  • np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – Selects only 64-bit integers
  • np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – Respectively selects exactly 8, 16, and 32-bit integers
  • np.integer, 'integer' – Selects all integers regardless of size
  • 'Int64' – Selects nullable integer; no NumPy equivalent
  • np.object, 'object', 'O' – Select all object data types
  • np.datetime64, 'datetime64', 'datetime' – All datetimes are 64 bits
  • np.timedelta64, 'timedelta64', 'timedelta' – All timedeltas are 64 bits
  • pd.Categorical, 'category' – Unique to pandas; no NumPy equivalent

Because all integers and floats default to 64 bits, you may select them by using the string 'int' or 'float' as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'.

You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020
Publisher: Packt
ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime