Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Time Series Analysis with Python Cookbook
Time Series Analysis with Python Cookbook

Time Series Analysis with Python Cookbook: Practical recipes for exploratory data analysis, data preparation, forecasting, and model evaluation

eBook
€21.99 €31.99
Paperback
€38.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

Time Series Analysis with Python Cookbook

Reading Time Series Data from Files

In this chapter, we will use pandas, a popular Python library with a rich set of I/O tools, data wrangling, and date/time functionality to streamline working with time series data. In addition, you will explore several reader functions available in pandas to ingest data from different file types, such as Comma-Separated Value (CSV), Excel, and SAS. You will explore reading from files, whether they are stored locally on your drive or remotely on the cloud, such as an AWS S3 bucket.

Time series data is complex and can be in different shapes and formats. Conveniently, the pandas reader functions offer a vast number of arguments (parameters) to help handle such variety in the data.

The pandas library provides two fundamental data structures, Series and DataFrame, implemented as classes. The DataFrame class is a distinct data structure for working with tabular data (think rows and columns in a spreadsheet). The main difference between the two data structures is that a Series is one-dimensional (single column), and a DataFrame is two-dimensional (multiple columns). The relationship between the two is that you get a Series when you slice out a column from a DataFrame. You can think of a DataFrame as a side-by-side concatenation of two or more Series objects.

A particular feature of the Series and DataFrames data structures is that they both have a labeled axis called index. A specific type of index that you will often see with time series data is the DatetimeIndex which you will explore further in this chapter. Generally, the index makes slicing and dicing operations very intuitive. For example, to make a DataFrame ready for time series analysis, you will learn how to create DataFrames with an index of type DatetimeIndex.

We will cover the following recipes on how to ingest data into a pandas DataFrame:

  • Reading data from CSVs and other delimited files
  • Reading data from an Excel file
  • Reading data from URLs
  • Reading data from a SAS dataset

    Why DatetimeIndex?

    A pandas DataFrame with an index of type DatetimeIndex unlocks a large set of features and useful functions needed when working with time series data. You can think of it as adding a layer of intelligence or awareness to pandas to treat the DataFrame as a time series DataFrame.

Technical requirements

In this chapter and forward, we will be using pandas 1.4.2 (released April 2, 2022) extensively.

Throughout our journey, you will be installing additional Python libraries to use in conjunction with pandas. You can download the Jupyter notebooks from the GitHub repository (https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/code/Ch2/Chapter%202.ipynb) to follow along.

You can download the datasets used in this chapter from the GitHub repository using this link: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./tree/main/datasets/Ch2.

Reading data from CSVs and other delimited files

In this recipe, you will use the pandas.read_csv() function, which offers a large set of parameters that you will explore to ensure the data is properly read into a time series DataFrame. In addition, you will learn how to specify an index column, parse the index to be of the type DatetimeIndex, and parse string columns that contain dates into datetime objects.

Generally, using Python, data read from a CSV file will be in string format (text). When using the read_csv method in pandas, it will try and infer the appropriate data types (dtype), and, in most cases, it does a great job at that. However, there are situations where you will need to explicitly indicate which columns to cast to a specific data type. For example, you will specify which column(s) to parse as dates using the parse_dates parameter in this recipe.

Getting ready

You will be reading a CSV file that contains hypothetical box office numbers for a movie. The file is provided in the GitHub repository for this book. The data file is in datasets/Ch2/movieboxoffice.csv.

How to do it…

You will ingest our CSV file using pandas and leverage some of the available parameters in read_csv:

  1. First, let's load the libraries:
    import pandas as pd
    from pathlib import Path
  2. Create a Path object for the file location:
    filepath =\
     Path('../../datasets/Ch2/movieboxoffice.csv')
  3. Read the CSV file into a DataFrame using the read_csv function and passing the filepath with additional parameters.

The first column in the CSV file contains movie release dates, and it needs to be set as an index of type DatetimeIndex (index_col=0 and parse_dates=['Date']). Specify which columns you want to include by providing a list of column names to usecols. The default behavior is that the first row includes the header (header=0):

ts = pd.read_csv(filepath,
                 header=0,
                 parse_dates=['Date'],
                 index_col=0,
                 infer_datetime_format=True,
                 usecols=['Date',
                          'DOW',
                          'Daily',
                          'Forecast',
                          'Percent Diff'])
ts.head(5)

This will output the following first five rows:

Figure 2.1 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.1 – The first five rows of the ts DataFrame using JupyterLab

  1. Print a summary of the DataFrame to check the index and column data types:
    ts.info()
    >> <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 128 entries, 2021-04-26 to 2021-08-31
    Data columns (total 4 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   DOW           128 non-null    object
     1   Daily         128 non-null    object
     2   Forecast      128 non-null    object
     3   Percent Diff  128 non-null    object
    dtypes: object(4)
    memory usage: 5.0+ KB
  2. Notice that the Date column is now an index (not a column) of type DatetimeIndex. Additionally, both Daily and Forecast columns have the wrong dtype inference. You would expect them to be of type float. The issue is due to the source CSV file containing dollar signs ($) and thousand separators (,) in both columns. The presence of non-numeric characters will cause the columns to be interpreted as strings. A column with the dtype object indicates either a string column or a column with mixed dtypes (not homogeneous).

To fix this, you need to remove both the dollar sign ($) and thousand separators (,) or any non-numeric character. You can accomplish this using str.replace(), which can take a regular expression to remove all non-numeric characters but exclude the period (.) for the decimal place. Removing these characters does not convert the dtype, so you will need to cast those two columns as a float dtype using .astype(float):

clean = lambda x: x.str.replace('[^\d]','', regex=True)
c_df = ts[['Daily', 'Forecast']].apply(clean, axis=1)
ts[['Daily', 'Forecast']] = c_df.astype(float)

Print a summary of the updated DataFrame:

ts.info()
>> <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 128 entries, 2021-04-26 to 2021-08-31
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   DOW           128 non-null    object
 1   Daily         128 non-null    float64
 2   Forecast      128 non-null    float64
 3   Percent Diff  128 non-null    object
dtypes: float64(2), object(2)
memory usage: 5.0+ KB

Now, you have a DataFrame with DatetimeIndex and both Daily and Forecast columns are of dtype float64 (numeric fields).

How it works…

Using pandas for data transformation is fast since it loads the data into memory. For example, the read_csv method reads and loads the entire data into a DataFrame in memory. When requesting a DataFrame summary with the info() method, in addition to column and index data types, the output will display memory usage for the entire DataFrame. To get the exact memory usage for each column, including the index, you can use the memory_usage() method:

ts.memory_usage()
>> 
Index           1024
DOW             1024
Daily           1024
Forecast        1024
Percent Diff    1024
dtype: int64

The total will match what was provided in the DataFrame summary:

ts.memory_usage().sum()
>> 5120

So far, you have used a few of the available parameters when reading a CSV file using read_csv. The more familiar you become with the different options available in any of the pandas reader functions, the more upfront preprocessing you can do during data ingestion (reading).

You leveraged the built-in parse_dates argument, which takes in a list of columns (either specified by name or position).The combination of index_col=0 and parse_dates=[0] produced a DataFrame with an index of type DatetimeIndex.

Let's inspect the parameters used in this recipe as defined in the official pandas.read_csv() documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

  • filepath_or_buffer: This is the first positional argument and the only required field needed (at a minimum) to read a CSV file. Here, you passed the Python path object named filepath. This can also be a string that represents a valid file path such as '../../datasets/Ch2/movieboxoffice.csv' or a URL that points to a remote file location, such as an AWS S3 bucket (we will examine this later in the Reading data from URLs recipe in this chapter).
  • sep: This takes a string to specify which delimiter to use. The default is a comma delimiter (,) which assumes a CSV file. If the file is separated by another delimiter, such as a pipe (|) or semicolon (;), then the argument can be updated, such as sep="|" or sep=";".

Another alias to sep is delimiter, which can be used as well as a parameter name.

  • header: In this case, you specified that the first row (0) value contains the header information. The default value is infer, which usually works as-is in most cases. If the CSV does not contain a header, then you specify header=None. If the CSV has a header but you prefer to supply custom column names, then you need to specify header=0 and overwrite it by providing a list of new column names to the names argument.

Recall that you specified which columns to include by passing a list of column names to the usecols parameter. These names are based on the file header (the first row of the CSV file).

If you decide to provide custom header names, you cannot reference the original names in the usecols parameter; this will produce the following error: ValueError: Usecols do not match columns.

  • parse_dates: In the recipe, you provided a list of column positions using [0], which specified only the first column (by position) should be parsed. The parse_dates argument can take a list of column names, such as ["Date"], or a list of column positions, such as [0, 3], indicating the first and the fourth columns. If you only intend to parse the index column(s) specified in the index_col parameter, you only need to pass True (Boolean).
  • index_col: Here, you specified that the first column by position (index_col=0) will be used as the DataFrame index. Alternatively, you could provide the column name as a string (index_col='Date'). The parameter also takes in a list of integers (positional indices) or strings (column names), which would create a MultiIndex object.
  • usecols: The default value is None, which includes all the columns in the dataset. Limiting the number of columns to only those that are required results in faster parsing and overall lower memory usage, since you only bring in what is needed. The usecols arguments can take a list of column names, such as ['Date', 'DOW', 'Daily', 'Percent Diff', 'Forecast'] or a list of positional indices, such as [0, 1, 3, 7, 6], which would produce the same result.

There's more…

There are situations where parse_dates may not work (it just cannot parse the date). In such cases, the column(s) will be returned unchanged, and no error will be thrown. This is where the date_parser parameter can be useful.

For example, you can pass a lambda function that uses the to_datetime function in pandas to date_parser. You can specify the string representation for the date format inside to_datetime(), as demonstrated in the following code:

date_parser =lambda x: pd.to_datetime(x, format="%d-%b-%y")
ts = pd.read_csv(filepath,
                 parse_dates=[0],
                 index_col=0,
                 date_parser=date_parser,
                 usecols=[0,1,3, 7, 6])
ts.head()

The preceding code will print out the first five rows of the ts DataFrame displaying a properly parsed Date index.

Figure 2.2 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.2 – The first five rows of the ts DataFrame using JupyterLab

Let's break it down. In the preceding code, you passed two arguments to the to_datetime function: the object to convert to datetime and an explicit format string. Since the date is stored as a string in the form 26-Apr-21, you passed "%d-%b-%y" to reflect that:

  • %d represents the day of the month, such as 01 or 02.
  • %b represents the abbreviated month name, such as Apr or May.
  • %y represents a two-digit year, such as 19 or 20.

Other common string codes include the following:

  • %Y represents the year as a four-digit number, such as 2020 or 2021.
  • %B represents the month's full name, such as January or February.
  • %m represents the month as a two-digit number, such as 01 or 02.

For more information on Python's string formats for representing dates, visit https://strftime.org.

See also

According to the pandas documentation, the infer_datetime_format parameter in read_csv() function can speed up the parsing by 5–10x. This is how you can add this to our original script:

ts = pd.read_csv(filepath,
                 header=0,
                 parse_dates=[0],
                 index_col=0,
                 infer_datetime_format= True,
                 usecols=['Date',
                          'DOW',
                          'Daily',
                          'Forecast',
                          'Percent Diff'])

Note that given the dataset is small, the speed improvement may be insignificant.

For more information, please refer to the pandas.read_csv documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.

Reading data from an Excel file

To read data from an Excel file, you will need to use a different reader function from pandas. Generally, working with Excel files can be a challenge since the file can contain formatted multi-line headers, merged header cells, and images. They may also contain multiple worksheets with custom names (labels). Therefore, it is vital that you always inspect the Excel file first. The most common scenario is reading from an Excel file that contains data partitioned into multiple sheets, which is the focus of this recipe.

In this recipe, you will be using the pandas.read_excel() function and examining the various parameters available to ensure the data is read properly as a DataFrame with a DatetimeIndex for time series analysis. In addition, you will explore different options to read Excel files with multiple sheets.

Getting ready

To use pandas.read_excel(), you will need to install an additional library for reading and writing Excel files. In the read_excel() function, you will use the engine parameter to specify which library (engine) to use for processing an Excel file. Depending on the Excel file extension you are working with (for example, .xls or .xlsx), you may need to specify a different engine that may require installing an additional library.

The supported libraries (engines) for reading and writing Excel include xlrd, openpyxl, odf, and pyxlsb. When working with Excel files, the two most common libraries are usually xlrd and openpyxl.

The xlrd library only supports .xls files. So, if you are working with an older Excel format, such as .xls, then xlrd will do just fine. For newer Excel formats, such as .xlsx, we will need a different engine, and in this case, openpyxl would be the recommendation to go with.

To install openpyxl using conda, run the following command in the terminal:

>>> conda install openpyxl

To install using pip, run the following command:

>>> pip install openpyxl

We will be using the sales_trx_data.xlsx file, which you can download from the book's GitHub repository. See the Technical requirements section of this chapter. The file contains sales data split by year into two sheets (2017 and 2018), respectively.

How to do it…

You will ingest the Excel file (.xlsx) using pandas and openpyxl, and leverage some of the available parameters in read_excel():

  1. Import the libraries for this recipe:
    import pandas as pd
    from pathlib import Path
    filepath = \
    Path('../../datasets/Ch2/sales_trx_data.xlsx')
  2. Read the Excel (.xlxs) file using the read_excel()function. By default, pandas will only read from the first sheet. This is specified under the sheet_name parameter, which is set to 0 as the default value. Before passing a new argument, you can use pandas.ExcelFile first to inspect the file and determine the number of sheets available. The ExcelFile class will provide additional methods and properties, such as sheet_name, which returns a list of sheet names:
    excelfile = pd.ExcelFile(filepath)
    excelfile.sheet_names
    >> ['2017', '2018']

If you have multiple sheets, you can specify which sheets you want to ingest by passing a list to the sheet_name parameter in read_excel. The list can either be positional arguments, such as first, second, and fifth sheets with [0, 1, 4], sheet names with ["Sheet1", "Sheet2", "Sheet5"], or a combination of both, such as first sheet, second sheet, and a sheet named "Revenue" [0, 1, "Revenue"].

In the following code, you will use sheet positions to read both the first and second sheets (0 and 1 indexes). This will return a Python dictionary object with two DataFrames. Notet hat the returned dictionary (key-value pair) has numeric keys (0 and 1) representing the first and second sheets (positional index), respectively:

ts = pd.read_excel(filepath,
                   engine='openpyxl',
                   index_col=1,
                   sheet_name=[0,1],
                   parse_dates=True)
ts.keys()
>> dict_keys([0, 1])
  1. Alternatively, you can pass a list of sheet names. Notice that the returned dictionary keys are now strings and represent the sheet names as shown in the following code:
    ts = pd.read_excel(filepath,
                       engine='openpyxl',
                       index_col=1,
                       sheet_name=['2017','2018'],
                       parse_dates=True)
    ts.keys()
    >> dict_keys(['2017', '2018'])
  2. If you want to read from all the available sheets, you will pass None instead. The keys for the dictionary, in this case, will represent sheet names:
    ts = pd.read_excel(filepath,
                       engine='openpyxl',
                       index_col=1,
                       sheet_name=None,
                       parse_dates=True)
    ts.keys()
    >> dict_keys(['2017', '2018'])

The two DataFrames within the dictionary are identical (homogeneous-typed) in terms of their schema (column names and data types). You can inspect each DataFrame with ts['2017'].info() and ts['2018'].info().

They both have a DatetimeIndex object, which you specified in the index_col parameter. The 2017 DataFrame consists of 36,764 rows and the 2018 DataFrame consists of 37,360. In this scenario, you want to stack (combine) the two (think UNION in SQL) into a single DataFrame that contains all 74,124 rows and a DatetimeIndex that spans from 2017-01-01 to 2018-12-31.

To combine the two DataFrames along the index axis (stacked one on top of the other), you will use the pandas.concat() function. The default behavior of the concat() function is to concatenate along the index axis (axis=0). In the following code, you will explicitly specify which DataFrames to concatenate:

ts_combined = pd.concat([ts['2017'],ts['2018']])
ts_combined.info()
>> <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 74124 entries, 2017-01-01 to 2018-12-31
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Line_Item_ID        74124 non-null  int64
 1   Credit_Card_Number  74124 non-null  int64
 2   Quantity            74124 non-null  int64
 3   Menu_Item           74124 non-null  object
dtypes: int64(3), object(1)
memory usage: 2.8+ MB
  1. When you have multiple DataFrames returned (think multiple sheets), you can use the concat() function on the returned dictionary. In other words, you can combine the concat() and read_excel() functions in one statement. In this case, you will end up with a MultiIndex DataFrame where the first level is the sheet name (or number) and the second level is the DatetimeIndex. For example, using the ts dictionary, you will get a two-level index: MultiIndex([('2017', '2017-01-01'), ..., ('2018', '2018-12-31')], names=[None, 'Date'], length=74124).

To reduce the number of levels, you can use the droplevel(level=0) method to drop the first level after pandas .concat() shown as follows:

ts_combined = pd.concat(ts).droplevel(level=0)
  1. If you are only reading one sheet, the behavior is slightly different. By default, sheet_name is set to 0, which means it reads the first sheet. You can modify this and pass a different value (single value), either the sheet name (string) or sheet position (integer). When passing a single value, the returned object will be a pandas DataFrame and not a dictionary:
    ts = pd.read_excel(filepath,
                       index_col=1,
                       sheet_name='2018',
                       parse_dates=True)
    type(ts)
    >> pandas.core.frame.DataFrame

Do note though that if you pass a single value inside two brackets ([1]), then pandas will interpret this differently and the returned object will be a dictionary that contains one DataFrame.

Lastly, note that you did not need to specify the engine in the last example. The read_csv function will determine which engine to use based on the file extension. So, for example, suppose the library for that engine is not installed. In that case, it will throw an ImportError message, indicating that the library (dependency) is missing.

How it works…

The pandas.read_excel() function has many common parameters with the pandas.read_csv() function that you used earlier. The read_excel function can either return a DataFrame object or a dictionary of DataFrames. The dependency here is whether you are passing a single value (scalar) or a list to sheet_name.

In the sales_trx_data.xlsx file, both sheets had the same schema (homogeneous- typed). The sales data was partitioned (split) by year, where each sheet contained sales for a particular year. In this case, concatenating the two DataFrames was a natural choice. The pandas.concat() function is like the DataFrame.append() function, in which the second DataFrame was added (appended) to the end of the first DataFrame. This should be similar in behavior to the UNION clause for those coming from a SQL background.

There's more…

An alternative method to reading an Excel file is with the pandas.ExcelFile() class, which returns a pandas ExcelFile object. Earlier in this recipe, you used ExcelFile() to inspect the number of sheets in the Excel file through the sheet_name property.

The ExcelFile class has several useful methods, including the parse() method to parse the Excel file into a DataFrame, similar to the pandas.read_excel() function.

In the following example, you will use the ExcelFile class to parse the first sheet, assign the first column as an index, and print the first five rows:

excelfile = pd.ExcelFile(filepath)
excelfile.parse(sheet_name='2017',
                index_col=1,
                parse_dates=True).head()

You should see similar results for the first five rows of the DataFrame:

Figure 2.3 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.3 – The first five rows of the DataFrame using JupyterLab

From Figure 2.3, it should become clear that ExcelFile.parse() is equivalent to pandas.read_excel().

See also

For more information on pandas.read_excel() and pandas.ExcelFile(), please refer to the official documentation:

Reading data from URLs

Files can be downloaded and stored locally on your machine, or stored on a remote server or cloud location. In the earlier two recipes, Reading from CSVs and other delimited files, and Reading data from an Excel file, both files were stored locally.

Many of the pandas reader functions can read data from remote locations by passing a URL path. For example, both read_csv() and read_excel() can take a URL to read a file that is accessible via the internet. In this recipe, you will read a CSV file using pandas.read_csv() and Excel files using pandas.read_excel() from remote locations, such as GitHub and AWS S3 (private and public buckets). You will also read data directly from an HTML page into a pandas DataFrame.

Getting ready

You will need to install the AWS SDK for Python (Boto3) for reading files from S3 buckets. Additionally, you will learn how to use the storage_options parameter available in many of the reader functions in pandas to read from S3 without the Boto3 library.

To use an S3 URL (for example, s3://bucket_name/path-to-file) in pandas, you will need to install the s3fs library. You will also need to install an HTML parser for when we use read_html(). For example, for the parsing engine (the HTML parser), you can install either lxml or html5lib; pandas will pick whichever is installed (it will first look for lxml, and if that fails, then for html5lib). If you plan to use html5lib you will need to install Beautiful Soup (beautifulsoup4).

To install using pip, you can use the following command:

>>> pip install boto3 s3fs lxml

To install using Conda, you can use:

>>> conda install boto3 s3fs lxml -y

How to do it…

This recipe will present you with different scenarios when reading data from online (remote) sources. Let's import pandas upfront since you will be using it throughout this recipe:

import pandas as pd

Reading data from GitHub

Sometimes, you may find useful public data on GitHub that you want to use and read directly (without downloading). One of the most common file formats on GitHub are CSV files. Let's start with the following steps:

  1. To read a CSV file from GitHub, you will need the URL to the raw content. If you copy the file's GitHub URL from the browser and use it as the file path, you will get a URL that looks like this: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv. This URL is a pointer to the web page in GitHub and not the data itself; hence when using pd.read_csv(), it will throw an error:
    url = 'https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv'
    pd.read_csv(url)
    ParserError: Error tokenizing data. C error: Expected 1 fields in line 62, saw 2
  2. Instead, you will need the raw content, which will give you a URL that looks like this: https://raw.githubusercontent.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv:
Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

  1. In Figure 2.4, notice that the values are not comma-separated (not a comma-delimited file); instead, the file uses semicolon (;) to separate the values.

The first column in the file is the Date column. You will need to parse (parse_date parameter) and convert it to DatetimeIndex (index_col parameter).

Pass the new URL to pandas.read_csv():

url = 'https://media.githubusercontent.com/media/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv'
date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y")
df = pd.read_csv(url,
                 delimiter=';',
                 index_col='Date',
                 date_parser=date_parser)
df.iloc[:3,1:4]
>>
            CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-03-10     2.6      1360.00       150
2004-03-10     2.0      1292.25       112
2004-03-10     2.2      1402.00        88

We successfully ingested the data from the CSV file in GitHub into a DataFrame and printed the first three rows of select columns.

Reading data from a public S3 bucket

AWS supports virtual-hosted-style URLs such as https://bucket-name.s3.Region.amazonaws.com/keyname, path-style URLs such as https://s3.Region.amazonaws.com/bucket-name/keyname, and using S3://bucket/keyname. Here are examples of how these different URLs may look for our file:

In this example, you will be reading the AirQualityUCI.xlsx file, which has only one sheet. It contains the same data as AirQualityUCI.csv, which we read earlier from GitHub.

Note that in the URL, you do not need to specify the region as us-east-1. The us-east-1 region, which represents US East (North Virginia), is an exception and will not be the case for other regions:

url = 'https://tscookbook.s3.amazonaws.com/AirQualityUCI.xlsx'
df = pd.read_excel(url,
                   index_col='Date',
                   parse_dates=True)

Read the same file using the S3:// URL:

s3uri = 's3://tscookbook/AirQualityUCI.xlsx'
df = pd.read_excel(s3uri,
                   index_col='Date',
                   parse_dates=True)

You may get an error such as the following:

ImportError: Install s3fs to access S3

This indicates that either you do not have the s3fs library installed or possibly you are not using the right Python/Conda environment.

Reading data from a private S3 bucket

When reading files from a private S3 bucket, you will need to pass your credentials to authenticate. A convenient parameter in many of the I/O functions in pandas is storage_options, which allows you to send additional content with the request, such as a custom header or required credentials to a cloud service.

You will need to pass a dictionary (key-value pair) to provide the additional information along with the request, such as username, password, access keys, and secret keys to storage_options as in {"username": username, "password": password}.

Now, you will read the AirQualityUCI.csv file, located in a private S3 bucket:

  1. You will start by storing your AWS credentials in a config .cfg file outside your Python script. Then, use configparser to read the values and store them in Python variables. You do not want your credentials exposed or hardcoded in your code:
    # Example aws.cfg file
    [AWS]
    aws_access_key=your_access_key
    aws_secret_key=your_secret_key

You can load the aws.cfg file using config.read():

import configparser
config = configparser.ConfigParser()
config.read('aws.cfg')
AWS_ACCESS_KEY = config['AWS']['aws_access_key']
AWS_SECRET_KEY = config['AWS']['aws_secret_key']
  1. The AWS Access Key ID and Secret Access Key are now stored in AWS_ACCESS_KEY and AWS_SECRET_KEY. Use pandas.read_csv() to read the CSV file and update the storage_options parameter by passing your credentials, as shown in the following code:
    s3uri = "s3://tscookbook-private/AirQuality.csv"
    df = pd.read_csv(s3uri,
                     index_col='Date',
                     parse_dates=True,
                     storage_options= {
                             'key': AWS_ACCESS_KEY,
                             'secret': AWS_SECRET_KEY
                         })
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03      2.6       1360.0     150.0
    2004-10-03      2.0       1292.0     112.0
    2004-10-03      2.2       1402.0      88.0
  2. Alternatively, you can use the AWS SDK for Python (Boto3) to achieve similar results. The boto3 Python library gives you more control and additional capabilities (beyond just reading and writing to S3). You will pass the same credentials stored earlier in AWS_ACCESS_KEY and AWS_SECRET_KEY and pass them to AWS, using boto3 to authenticate:
    import boto3
    bucket = "tscookbook-private"
    client = boto3.client("s3",
                      aws_access_key_id =AWS_ACCESS_KEY,
                      aws_secret_access_key = AWS_SECRET_KEY)

Now, the client object has access to many methods specific to the AWS S3 service for creating, deleting, and retrieving bucket information, and more. In addition, Boto3 offers two levels of APIs: client and resource. In the preceding example, you used the client API.

The client is a low-level service access interface that gives you more granular control, for example, boto3.client("s3"). The resource is a higher-level object-oriented interface (an abstraction layer), for example, boto3.resource("s3").

In Chapter 4, Persisting Time Series Data to Files, you will explore the resource API interface when writing to S3. For now, you will use the client interface.

  1. You will use the get_object method to retrieve the data. Just provide the bucket name and a key. The key here is the actual filename:
    data = client.get_object(Bucket=bucket, Key='AirQuality.csv')
    df = pd.read_csv(data['Body'],
                     index_col='Date',
                     parse_dates=True)
         
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03    2,6       1360.0     150.0
    2004-10-03      2       1292.0     112.0
    2004-10-03    2,2       1402.0      88.0

When calling the client.get_object() method, a dictionary (key-value pair) is returned, as shown in the following example:

{'ResponseMetadata': {
'RequestId':'MM0CR3XX5QFBQTSG',
'HostId':'vq8iRCJfuA4eWPgHBGhdjir1x52Tdp80ADaSxWrL4Xzsr
VpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
'HTTPStatusCode':200,
'HTTPHeaders': {'x-amz-id-2': 'vq8iRCJfuA4eWPgHBGhdjir1x52
Tdp80ADaSxWrL4XzsrVpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
   'x-amz-request-id': 'MM0CR3XX5QFBQTSG',
   'date': 'Tue, 06 Jul 2021 01:08:36 GMT',
   'last-modified': 'Mon, 14 Jun 2021 01:13:05 GMT',
   'etag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
   'accept-ranges': 'bytes',
   'content-type': 'binary/octet-stream',
   'server': 'AmazonS3',
   'content-length': '1012427'},
   'RetryAttempts': 0},
   'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2021, 6, 14, 1, 13, 5, tzinfo=tzutc()),
 'ContentLength': 1012427,
 'ETag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
 'ContentType': 'binary/octet-stream',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x7fe9c16b55b0>}

The content you are interested in is in the response body under the Body key. You passed data['Body'] to the read_csv() function, which loads the response stream (StreamingBody) into a DataFrame.

Reading data from HTML

pandas offers an elegant way to read HTML tables and convert the content into a pandas DataFrame using the pandas.read_html() function:

  1. In the following recipe, we will extract HTML tables from Wikipedia for COVID-19 pandemic tracking cases by country and by territory (https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory):
    url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
    results = pd.read_html(url)
    print(len(results))
    >> 71
  2. pandas.read_html() returned a list of DataFrames, one for each HTML table found in the URL. Keep in mind that the website's content is dynamic and gets updated regularly, and the results may vary. In our case, it returned 71 DataFrames. The DataFrame at index 15 contains summary on COVID-19 cases and deaths by region. Grab the DataFrame (at index 15) and assign it to the df variable, and print the returned columns:
    df = results[15]
    df.columns
    >> Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
           'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
           'Population millions', 'Vaccinated %[29]'],
          dtype='object')
  3. Display the first five rows for Total cases, Total deaths, and the Cases per million columns.
    df[['Total cases', 'Total deaths', 'Cases per million']].head()
    >>
         Total cases    Total deaths    Cases per million
    0    139300788      1083815         311412
    1    85476396       1035884         231765
    2    51507114       477420          220454
    3    56804073       1270477         132141
    4    21971862       417507          92789

How it works…

Most of the pandas reader functions accept a URL as a path. Examples include the following:

  • pandas.read_csv()
  • pandas.read_excel()
  • pandas.read_parquet()
  • pandas.read_table()
  • pandas.read_pickle()
  • pandas.read_orc()
  • pandas.read_stata()
  • pandas.read_sas()
  • pandas.read_json()

The URL needs to be one of the valid URL schemes that pandas supports, which includes http and https, ftp, s3, gs, or the file protocol.

The read_html() function is great for scraping websites that contain data in HTML tables. It inspects the HTML and searches for all the <table> elements within the HTML. In HTML, table rows are defined with the <tr> </tr> tags and headers with the <th></th> tags. The actual data (cell) is contained within the <td> </td> tags. The read_html() function looks for <table>, <tr>, <th>, and <td> tags and converts the content into a DataFrame, and assigns the columns and rows as they were defined in the HTML. If an HTML page contains more than one <table></table> tag, read_html will return them all and you will get a list of DataFrames.

The following code demonstrates how pandas.read_html()works:

import pandas as pd
html = """
 <table>
   <tr>
     <th>Ticker</th>
     <th>Price</th>
   </tr>
   <tr>
     <td>MSFT</td>
     <td>230</td>
   </tr>
   <tr>
     <td>APPL</td>
     <td>300</td>
   </tr>
     <tr>
     <td>MSTR</td>
     <td>120</td>
   </tr>
 </table>
 </body>
 </html>
 """
   
df = pd.read_html(html)
df[0]
>>
  Ticker  Price
0   MSFT    230
1   APPL    300
2   MSTR    120

In the preceding code, the read_html() function parsed the HTML code and converted the HTML table into a pandas DataFrame. The headers between the <th> and </th> tags represent the column names of the DataFrame, and the content between the <tr><td> and </td></tr> tags represent the row data of the DataFrame. Note that if you go ahead and delete the <table> and </table> table tags, you will get the ValueError: No tables found error.

There's more…

The read_html() function has an optional attr argument, which takes a dictionary of valid HTML <table> attributes, such as id or class. For example, you can use the attr parameter to narrow down the tables returned to those that match the class attribute sortable as in <table class="sortable">. The read_html function will inspect the entire HTML page to ensure you target the right set of attributes.

In the previous exercise, you used the read_html function on the COVID-19 Wikipedia page, and it returned 71 tables (DataFrames). The number of tables will probably increase as time goes by as Wikipedia gets updated. You can narrow down the result set and guarantee some consistency by using the attr option. First, start by inspecting the HTML code using your browser. You will see that several of the <table> elements have multiple classes listed, such as sortable. You can look for other unique identifiers.

<table class="wikitable sortable mw-datatable covid19-countrynames jquery-tablesorter" id="thetable" style="text-align:right;">

Note, if you get the error html5lib not found, please install it you will need to install both html5lib and beautifulSoup4.

To install using conda, use the following:

conda install html5lib beautifulSoup4

To install using pip, use the following:

pip install html5lib beautifulSoup4

Now, let's use the sortable class and request the data again:

url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
df = pd.read_html(url, attrs={'class': 'sortable'})
len(df)
>>  7
df[3].columns
>>
Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
       'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
       'Population millions', 'Vaccinated %[29]'],
      dtype='object')

The list returned a smaller subset of tables (from 71 down to 7).

See also

For more information, please refer to the official pandas.read_html documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html.

Reading data from a SAS dataset

In this recipe, you will read a SAS data file and, more specifically, a file with the SAS7BDAT extension. SAS is commercial statistical software that provides data mining, business intelligence, and advanced analytics capabilities. Many large organizations in various industries rely on SAS, so it is very common to encounter the need to read from a SAS dataset.

Getting ready

In this recipe, you will be using pandas to read a .sas7bdat file. These files can be extremely large, and you will be introduced to different ways to read such files more efficiently.

To get ready, you can download the SAS sample dataset from http://support.sas.com/kb/61/960.html. You will be reading the DCSKINPRODUCT.sas7bdat file.

The SAS data file is also provided in the GitHub repository for this book.

How to do it…

You will use the pandas.read_sas() function, which can be used to read both SAS XPORT (.xpt) and SAS7BDAT file formats. However, there is no SAS writer function in pandas:

  1. Start by importing pandas and creating the path variable to the file. This file is not large (14.7 MB) compared to a typical SAS file, which can be 100+ GB:
    import pandas as pd
    path = '../../datasets/Ch2/DCSKINPRODUCT.sas7bdat'
  2. One of the advantages of using pandas is that it provides data structures for in-memory analysis, hence the performance advantage when analyzing data. On the other hand, this can also be a constraint when loading large datasets into memory. Generally, the amount of data you can load is limited by the amount of memory available. However, this can be an issue if the dataset is too large and exceeds the amount of memory.

One way to tackle this issue is by using the chunksize parameter. The chunksize parameter is available in many reader and writer functions, including read_sas. The DCSKINPRODUCT.sas7bdat file contains 152130 records, so you will use a chunksize parameter to read 10000 records at a time:

df = pd.read_sas(path, chunksize=10000)
type(df)
>> pandas.io.sas.sas7bdat.SAS7BDATReader
  1. The returned object is not a DataFrame but a SAS7BDATReader object. You can think of this as an iterator object that you can iterate through. At each iteration or chunk, you get a DataFrame of 10,000 rows at a time. You can retrieve the first chunk using the next() method that is, df.next(). Every time you use the next() method, it will retrieve the next batch or chunk (the next 10,000 rows). You can also loop through the chunks, for example, to do some computations. This can be helpful when the dataset is too large to fit in memory, allowing you to iterate through manageable chunks to do some heavy aggregations. The following code demonstrates this concept:
    results = []
    for chunk in df:
        results.append(
            chunk)
    len(results)
    >> 16
    df = pd.concat(results)
    df.shape
    >> (152130, 5)

There were 16 chunks (DataFrames) in total; each chunk or DataFrame contained 10000 records. Using the concat function, you can combine all 16 DataDrames into a large DataFrame of 152130 records.

  1. Reread the data in chunks, and this time group by DATE and aggregate using sum and count, as shown in the following:
    df = pd.read_sas(path, chunksize=10000)
    results = []
    for chunk in df:
        results.append(
            chunk.groupby('DATE')['Revenue']
                 .agg(['sum', 'count']))
  2. The results object is now a list of DataFrames. Now, let's examine the result set:
    results[0].loc['2013-02-10']
    >>
    sum      923903.0
    count        91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[1].loc['2013-02-10']
    >>
    sum      8186392.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[2].loc['2013-02-10']
    >>
    sum      5881396.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
  3. From the preceding output, you can observe that we have another issue to solve. Notice that the observations for 2013-02-10 got split. This is a common issue when chunking since it splits the data disregarding their order or sequence.

You can resolve this by combining the results in a meaningful way. For example, you can use the reduce function in Python. The reduce function allows you to perform a rolling computation (also known as folding or reducing) based on some function you provide. The following code demonstrates how this can be implemented:

from functools import reduce
final = reduce(lambda x1, x2: x1.add(x2, fill_value=0), results)
type(final)
>> pandas.core.frame.DataFrame
final.loc['2013-02-10']
>>
sum      43104420.0
count        1383.0
Name: 2013-02-10 00:00:00, dtype: float64
final.shape
>> (110, 2)

From the preceding output, the 16 chunks or DataFrames were reduced to a single value per row (index). We leveraged the pandas.DataFrame.add() function to add the values and use zero (0) as a fill value when the data is missing.

How it works…

Using the chunksize parameter in the read_sas() function will not return a DataFrame but rather an iterator (a SAS7BDATReader object). The chunksize parameter is available in most reader functions in pandas, such as read_csv, read_hdf, and read_sql, to name a few. Similarly, using the chunkize parameter with those functions will also return an iterator.

If chunksize is not specified, the returned object would be a DataFrame of the entire dataset. This is because the default value is None in all the reader functions.

Chunking is great when the operation or workflow is simple and not sequential. An operation such as groupby can be complex and tricky when chunking, which is why we added two extra steps:

  • Stored the resulting DataFrame to a list.
  • Used Python's reduce() function, which takes two arguments, a function and an iterator. It then applies the function element-wise and does it cumulatively from left to right to reduce down to a one result set. We also leveraged the DataFrame's add() method, which matches DataFrame indices to perform an element-wise addition.

There's more…

There are better options when working with large files than using pandas, especially if you have memory constraints and cannot fit the entire data into the memory. Chunking is a great option, but it still has an overhead and relies on memory. The pandas library is a single-core framework and does not offer parallel computing capabilities. Instead, there are specialized libraries and frameworks for parallel processing designed to work with big data. Such frameworks do not rely on loading everything into memory and instead can utilize multiple CPU cores, disk usage, or expand into multiple worker nodes (think multiple machines). For example, Dask chunks your data, creates a computation graph, and parallelizes the smaller tasks (chunks) behind the scenes, thus speeding the overall processing time and reducing memory overhead.

These frameworks are great but will require you to spend time learning the framework and rewriting your code to leverage these capabilities. So, there is a steep learning curve initially. Luckily, this is where the Modin project comes into play. For example, the Modin library acts as a wrapper or, more specifically, an abstraction on top of Dask or Ray that uses a similar API to pandas. Modin makes optimizing your pandas' code much more straightforward without learning another framework, and all it takes is a single line of code.

Before installing any library, it is highly advised that you create a separate virtual environment, for example, using conda. The concept and purpose behind creating virtual environments were discussed in detail in Chapter 1, Getting Started with Time Series Analysis, with multiple examples.

To install Modin using Conda (with a Dask backend), run the following:

>> conda install -c conda-forge modin-dask

To install with Pip, use the following:

>> pip install modin[dask]

You will measure the time and memory usage using pandas and again using Modin. To measure memory usage, you will need to install the memory_profiler library.

>> pip install memory_profiler

The memory_profiler library provides IPython and Jupyter magics such as %memit and %mprun, similar to known magics such as %timeit and %time.

Start by loading the required libraries:

import memory_profiler 
import pandas as pd
%load_ext memory_profiler
path = '../../datasets/Ch2/large_file.csv'

You will start by using pandas to read the file large_file.csv:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should output something similar to the following:

peak memory: 161.35 MiB, increment: 67.34 MiB
CPU times: user 364 ms, sys: 95.2 ms, total: 459 ms
Wall time: 1.03 s

Now, you will load Modin and specify Dask as the engine:

from modin.config import Engine
Engine.put("dask")  # Modin will use Dask
import modin.pandas as pd
from distributed import Client
client = Client()

Notice that in the preceding code that Modin has a pandas implementation. This way, you can leverage your existing code without modification. You will now rerun the same code:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should produce an output similar to the following:

peak memory: 137.12 MiB, increment: 9.34 MiB
CPU times: user 899 ms, sys: 214 ms, total: 1.11 s
Wall time: 1.91 s

Observe how the peak memory was reduced from 160 MiB to 137.12 MiB using Modin (Dask). Most importantly, notice how the memory increment went down from 67 MiB to 9 MiB with Modin. Overall, with Modin, you got lower memory usage. However, Modin (Dask) will show more significant advantages with more extensive operations on larger datasets.

See also

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Explore forecasting and anomaly detection techniques using statistical, machine learning, and deep learning algorithms
  • Learn different techniques for evaluating, diagnosing, and optimizing your models
  • Work with a variety of complex data with trends, multiple seasonal patterns, and irregularities

Description

Time series data is everywhere, available at a high frequency and volume. It is complex and can contain noise, irregularities, and multiple patterns, making it crucial to be well-versed with the techniques covered in this book for data preparation, analysis, and forecasting. This book covers practical techniques for working with time series data, starting with ingesting time series data from various sources and formats, whether in private cloud storage, relational databases, non-relational databases, or specialized time series databases such as InfluxDB. Next, you’ll learn strategies for handling missing data, dealing with time zones and custom business days, and detecting anomalies using intuitive statistical methods, followed by more advanced unsupervised ML models. The book will also explore forecasting using classical statistical models such as Holt-Winters, SARIMA, and VAR. The recipes will present practical techniques for handling non-stationary data, using power transforms, ACF and PACF plots, and decomposing time series data with multiple seasonal patterns. Later, you’ll work with ML and DL models using TensorFlow and PyTorch. Finally, you’ll learn how to evaluate, compare, optimize models, and more using the recipes covered in the book.

Who is this book for?

This book is for data analysts, business analysts, data scientists, data engineers, or Python developers who want practical Python recipes for time series analysis and forecasting techniques. Fundamental knowledge of Python programming is required. Although having a basic math and statistics background will be beneficial, it is not necessary. Prior experience working with time series data to solve business problems will also help you to better utilize and apply the different recipes in this book.

What you will learn

  • Understand what makes time series data different from other data
  • Apply various imputation and interpolation strategies for missing data
  • Implement different models for univariate and multivariate time series
  • Use different deep learning libraries such as TensorFlow, Keras, and PyTorch
  • Plot interactive time series visualizations using hvPlot
  • Explore state-space models and the unobserved components model (UCM)
  • Detect anomalies using statistical and machine learning methods
  • Forecast complex time series with multiple seasonal patterns

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 30, 2022
Length: 630 pages
Edition : 1st
Language : English
ISBN-13 : 9781801071260
Vendor :
Google
Category :
Languages :
Concepts :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning

Product Details

Publication date : Jun 30, 2022
Length: 630 pages
Edition : 1st
Language : English
ISBN-13 : 9781801071260
Vendor :
Google
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 120.97
Machine Learning with PyTorch and Scikit-Learn
€41.99
Modern Time Series Forecasting with Python
€39.99
Time Series Analysis with Python Cookbook
€38.99
Total 120.97 Stars icon

Table of Contents

17 Chapters
Chapter 1: Getting Started with Time Series Analysis Chevron down icon Chevron up icon
Chapter 2: Reading Time Series Data from Files Chevron down icon Chevron up icon
Chapter 3: Reading Time Series Data from Databases Chevron down icon Chevron up icon
Chapter 4: Persisting Time Series Data to Files Chevron down icon Chevron up icon
Chapter 5: Persisting Time Series Data to Databases Chevron down icon Chevron up icon
Chapter 6: Working with Date and Time in Python Chevron down icon Chevron up icon
Chapter 7: Handling Missing Data Chevron down icon Chevron up icon
Chapter 8: Outlier Detection Using Statistical Methods Chevron down icon Chevron up icon
Chapter 9: Exploratory Data Analysis and Diagnosis Chevron down icon Chevron up icon
Chapter 10: Building Univariate Time Series Models Using Statistical Methods Chevron down icon Chevron up icon
Chapter 11: Additional Statistical Modeling Techniques for Time Series Chevron down icon Chevron up icon
Chapter 12: Forecasting Using Supervised Machine Learning Chevron down icon Chevron up icon
Chapter 13: Deep Learning for Time Series Forecasting Chevron down icon Chevron up icon
Chapter 14: Outlier Detection Using Unsupervised Machine Learning Chevron down icon Chevron up icon
Chapter 15: Advanced Techniques for Complex Time Series Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Most Recent
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.8
(11 Ratings)
5 star 81.8%
4 star 18.2%
3 star 0%
2 star 0%
1 star 0%
Filter icon Filter
Most Recent

Filter reviews by




David Knickerbocker Oct 23, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I'm currently about halfway through the book, and up to this point, it's been mostly about getting data, cleaning it, dealing with missing data, and handling outliers. The fact that halfway through the book, they author is still discussing wrangling and outlier analysis is a good sign, to me. GIGO. Don't skip the fundamentals. The author goes slow, shows his work, and explains things very well.I'm finding the format and style of writing to be very clean and concise. The example code is very clean as well, which is not always the case in data science books.My favorite part so far has been the chapter on outlier analysis. With a background in security, that'll always be of interest to me.This is all very well explained. The title include the word "cookbook". I tend to have less use for cookbooks (and cheatsheets) as they usually cover things that can be found on StackOverflow. However, this book discusses approaches and feels more like a series of well-organized lessons on various important topics relating to time series analysis.I'm only about halfway through and am definitely going to keep going. I'll always have use for time series analysis, and there is always more to learn. This is a winner. I'm definitely enjoying this and have use for it in my work.
Amazon Verified review Amazon
Dr Abed M Hammoud Aug 23, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I have read several good books on time series analysis; I find this book to have an excellent mix of theory and code. Very readable and full of ready-to-use code snippets. It is highly recommended.
Amazon Verified review Amazon
Maribel Aug 12, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I've read a few ML time-series books. One thing I look for is practicality as well as explaining difficult and advanced concepts. This book does a pretty good job for walking somebody through the process. I appreciate the visuals, the code, as well as all the technical depth in all the different models. I would definitely recommend this to an experienced practicioner to increase their skills.
Amazon Verified review Amazon
E. Leonard Aug 04, 2022
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
I reviewed the electronic version of this book, I cannot comment on the printed version, but the last few Packtpub printed books I have had have been without issue. I use python for work and hobby but my time-series experience before reading this book was in completely different stack, KX and Oracle.This book uses Jupyter to work through the tutorials and projects, JupyterLab specifically, this brings familiarity as well as the same limitations as with other Jupyter based walkthroughs. Reading time-series data from DBs and writing to DBs covers a multitude of engines. This leads to skipping sections of the book for persistence engines you don’t work with. However on the plus side this does actually add to the reference value of the title should your situation change.The book is structured around the basics up to chapter 7 where the depth and complexity of the book increases considerably. Make no mistake time series is not a trivial subject and here you will jump to quite a bit of depth on using your datasets for data analysis, diagnostics, forecasting, utilising statistical methods, outlier detection as well as topics like VAR and GARCH. Thankfully the title follows a template of ‘what is it’, ‘how to do it’ formula. This helps for compartmentalising a lot of the information being presented.If you have experience with math notation and ML, more specifically Scikit-learn or PyTorch then you’ll have an easier ride on the latter, meatier sections of the title. Many will assume the title is based around the construction and consumption of time series datasets however this is a ML book at the end of the day. If you have no ML experience you should expect to do a LOT of supplementary learning/reading to get the most of the title as there’s no way all the concepts covered could be taken from zero-to-depth without the book becoming several thousand pages.All in I spent a lot of challenging hours with this title across a month. There’s an awful lot to digest here and you won’t get much of anything by hurrying. I would expect to re-read this title or revisit sections to truly get the most out of it myself. At times I was awkward and frustrated but I feel I did learn a fair bit about the subject in the context of using a python stack for time series projects. By taking on this book it's worth remembering this is not an introductory text and shouldn’t be your first dance with python or databases. Intermediate and upward developers can add/refresh key skills with the aid of the title. Some chapters I would give 3.5 stars but as a whole body of work I would edge up to 3.75-4 stars. Ambitious stuff.
Amazon Verified review Amazon
Vandita Bothra Jul 27, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The author has done an amazing job with this book. When I use the terms comprehensive and hands on to describe this book, I 100% mean it. It does presume you have basic python knowledge but nonetheless provides you with a crash course link to brush up your skills. It has the code snippets required for smooth learning and links to dataset on GitHub to play around.As a Data analyst, I am familiar with stats/python/modeling to an extent but this book has really helped me to deliver on my project the past few months. In a way, this book is written to guide you through a project step by step. Starts with how to read the data, handle missing values, Outlier detection, EDA, multiple modeling techniques depending on your data and complexity. The mathematics are presented briefly and appropriately for each topics, not overloading your brain with information.And that is why I would recommend this book as it’s a widely used topics and this book can be used not only as a starting point but also as a reference point while working on a time series project.PS- It explains how to set up a virtual environment
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.