You're reading from The Data Analysis Workshop Solve business problems with state-of-the-art data analysis models, developing expert data analysis skills along the way

Product type Paperback

Published in Jul 2020

Publisher Packt

ISBN-13 9781839211386

Length 626 pages

Edition 1st Edition

Languages

Python

Tools

Jupyter

Concepts

Data Analysis

Authors (3):

Konstantin Palagachev

Gururajan Govindan

Shubhangi Hora


Table of Contents (12) Chapters

Preface

1. Bike Sharing Analysis

2. Absenteeism at Work

3. Analyzing Bank Marketing Campaign Data

4. Tackling Company Bankruptcy

5. Analyzing the Online Shopper's Purchasing Intention

6. Analysis of Credit Card Defaulters

7. Analyzing the Heart Disease Dataset

8. Analyzing Online Retail II Dataset

9. Analysis of the Energy Consumed by Appliances

10. Analyzing Air Quality

Appendix

Understanding the Data

In this first part, we load the data and perform an initial exploration of it.

Note

You can download the data either from the original source (https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#) or from the GitHub repository of this book (https://packt.live/2XpHW81).

The goal of these steps is to gain a basic understanding of the data: how the various features are distributed and whether any values are missing.

First, import the relevant Python libraries and the data for the analysis. Note that we are using Python 3.7, and that we load the data directly from the GitHub repository of the book:

# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# load hourly data
hourly_data = pd.read_csv('https://raw.githubusercontent.com/'\
                          'PacktWorkshops/'\
                          'The-Data-Analysis-Workshop/'\
                          'master/Chapter01/data/hour.csv')

Note

The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic. Also, watch out for the slashes in the string above. The backslashes ( \ ) are used to split the code across multiple lines, while the forward slashes ( / ) are part of the URL.
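
As a side note on the line splitting described above, the following is a minimal sketch (not taken from the book) showing that adjacent string literals inside parentheses are joined by Python even without backslashes. The variable names data_url and data_url_backslash are hypothetical and used only for illustration:

```python
# Adjacent string literals inside parentheses are concatenated by Python,
# so no backslashes are needed to continue the expression.
data_url = ('https://raw.githubusercontent.com/'
            'PacktWorkshops/'
            'The-Data-Analysis-Workshop/'
            'master/Chapter01/data/hour.csv')

# The same URL written with explicit backslash line continuations.
data_url_backslash = 'https://raw.githubusercontent.com/'\
                     'PacktWorkshops/'\
                     'The-Data-Analysis-Workshop/'\
                     'master/Chapter01/data/hour.csv'

# Both spellings produce the same string.
print(data_url == data_url_backslash)
```

Either form works; the backslash-free version is often easier to edit, because a stray space after a backslash is a syntax error.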

A good practice is to check the size of the data we are loading, the number of missing values it contains, and some general statistics about the numerical columns:

# print some generic statistics about the data
print(f"Shape of data: {hourly_data.shape}")
print(f"Number of missing values in the data: \
{hourly_data.isnull().sum().sum()}")

Note

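The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python drops the backslash and the line break that follows it, and treats the next line as a direct continuation of the current one. Because this backslash sits inside a string, the continuation line is not indented; any leading spaces would become part of the printed text.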

The output is as follows:

Shape of data: (17379, 17)
Number of missing values in the data: 0
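
The check above reports a single total for the whole dataset. When the total is not zero, it usually helps to see which columns are affected. The following is a small sketch on a hypothetical toy DataFrame (the values are made up and are not part of the bike sharing data); calling isnull().sum() once gives the count per column, and summing again gives the total:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries (not the real dataset)
toy = pd.DataFrame({
    "temp": [0.24, np.nan, 0.22, 0.20],
    "hum": [0.81, 0.80, np.nan, np.nan],
    "cnt": [16, 40, 32, 13],
})

# One sum() gives the number of missing values in each column
missing_per_column = toy.isnull().sum()

# A second sum() collapses the per-column counts into a single total
total_missing = missing_per_column.sum()

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```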

To get some simple statistics on the numerical columns, such as the mean, standard deviation, minimum and maximum values, and percentiles, we can call the describe() method directly on a pandas.DataFrame object:

# get statistics on the numerical columns
hourly_data.describe().T

The output should be as follows:

Figure 1.1: Output of the describe() method

Note that the T attribute after the describe() call returns the transpose of the resulting DataFrame, so the columns become rows and vice versa.
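
To make the effect of the transpose concrete, here is a minimal sketch on a hypothetical two-column DataFrame (made-up values, not the bike sharing data). Without the transpose, each statistic is a row; with it, each original column becomes a row and the statistics become columns:

```python
import pandas as pd

# Hypothetical toy data with two numerical columns
toy = pd.DataFrame({"temp": [0.2, 0.4, 0.6], "cnt": [10, 20, 30]})

summary = toy.describe()   # statistics as rows, original columns as columns
summary_t = summary.T      # original columns as rows, statistics as columns

print(summary.shape)
print(summary_t.shape)
print(list(summary_t.columns))
```

The transposed layout tends to be easier to read when a dataset has many columns, since each feature then occupies a single row.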

According to the description of the original data, provided in the Readme.txt file, we can split the columns into three main groups:

temporal features: These contain information about the time at which the record was registered. This group contains the dteday, season, yr, mnth, hr, holiday, weekday, and workingday columns.
weather-related features: These contain information about the weather conditions. The weathersit, temp, atemp, hum, and windspeed columns are included in this group.
record-related features: These contain information about the number of records for the specific hour and date. This group includes the casual, registered, and cnt columns.

Note that we did not include the first column, instant, in any of these groups. It is an index column that carries no relevant information, so we exclude it from our analysis.
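
The grouping described above can be written down as plain Python lists, which is convenient for selecting a group of columns later. The following is a sketch under stated assumptions: the list names (temporal_features, weather_features, record_features) are hypothetical, the one-row DataFrame is a stand-in with made-up values that only mirrors the column names, and dropping instant with drop() is one possible way to exclude it, not necessarily the one used in the rest of the chapter:

```python
import pandas as pd

# Column groups as described in the text (list names are hypothetical)
temporal_features = ["dteday", "season", "yr", "mnth", "hr",
                     "holiday", "weekday", "workingday"]
weather_features = ["weathersit", "temp", "atemp", "hum", "windspeed"]
record_features = ["casual", "registered", "cnt"]

# One-row stand-in with the same column names; the values are made up
columns = ["instant"] + temporal_features + weather_features + record_features
toy = pd.DataFrame([[0] * len(columns)], columns=columns)

# Exclude the index column from the data used for analysis
analysis_data = toy.drop(columns=["instant"])

# Check that the three groups cover every remaining column
grouped = set(temporal_features + weather_features + record_features)
print(grouped == set(analysis_data.columns))
```

With the real data, the same lists could be used as hourly_data[weather_features] to select a single group of columns.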
