Packt+ | Advance your knowledge in tech

You're reading from Data Science for Marketing Analytics Achieve your marketing goals with the data analytics power of Python

Product type Paperback

Published in Mar 2019

Publisher

ISBN-13 9781789959413

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Science

Authors (3):

Tommy Blanchard

Debasish Behera

Pranshu Bhatnagar

View More author details

Table of Contents (12) Chapters

Data Science for Marketing Analytics

Preface

1. Data Preparation and Cleaning

2. Data Exploration and Visualization FREE CHAPTER

3. Unsupervised Learning: Customer Segmentation

4. Choosing the Best Segmentation Approach

5. Predicting Customer Revenue Using Linear Regression

6. Other Regression Techniques and Tools for Evaluation

7. Supervised Learning: Predicting Customer Churn

8. Fine-Tuning Classification Algorithms

9. Modeling Customer Choice

Appendix

Chapter 1: Data Preparation and Cleaning

Activity 1: Addressing Data Spilling

Import pandas and copy into the console, as follows:
```
import pandas as pd
import copy
```
Use the read_excel function to read the xlsx file and the head function to look at the first few rows:
```
sales = pd.read_excel("sales.xlsx")
sales.head()
```
Look at the data types of sales and see if they make more sense:
```
sales.dtypes
```
You should get the following output:
Figure 1.57: Looking at the datatype of sales.xlsx
We can iterate through the DataFrame rows to understand how the data is in the first row and how it should be:
```
forlabel,content in sales.iteritems():
    print label, content[1]
```
This gives the following output:
Figure 1.58: Iterating through the first row

From the preceding output, you can infer that the data should actually contain column names starting with a capital letter. Add the following code to get an overview of what the data should actually look like:

d = {"Year": 2004, "Product line": "Camping Equipment", "Product type":"Cooking Gear", "Product":"TrailChef Water Bag", 
     "Order method type":"Telephone", "Retailer Country":"Canada", "Revenue":13444.68, "Planned revenue":14313.48, 
     "Product cost":6298.8, "Quantity":2172, "Unit cost":2.9, "Unit price":6.59, "Gross Profit":7145.88, "Unit sale price":6.19}

You should get the following output:

Figure 1.59: Looking at how the data should be structured

The Year column seems to be in its right place, so we can start by creating the Product line column. Let's see how many values it has leaked into:
```
sales.groupby(['Product','line'])['Year'].count()
```
Your output will be as follows:
Figure 1.60: Seeing how values are distributed across the columns
We are not sure whether the spillage is only restricted to these two columns, so let's look at the next column too, to be sure:
```
sales.groupby(['Product','line', 'Product.1'])['Year'].count()
```
Your output will be as follows:
Figure 1.61: Looking at data to get the number of columns required
Let's resolve the Product line column and see if we were able to do anything:
```
sales['Product line'] = sales.apply(lambda x: x['Product'] +''+ x['line'], axis = 1)
sales = sales.drop(['Product', 'line'], axis = 1)
sales.head()
```
The DataFrame should now look as follows:
Figure 1.62: Collecting data from multiple columns into the correct field
Instead of directly changing the sales DataFrame, let's create a copy of it, called tmp, and make changes to it. Once we have finalized the filtering procedure, we don't want to lose the original DataFrame. Let's continue looking at the next column of interest, that is, Product type:
```
tmp = copy.deepcopy(sales)
tmp.groupby(['Product.1','type', 'Product.2'])['Year'].count()
```
This gives the following output:
Figure 1.63: Seeing variation in number of words required to represent product type
We can see that some fields are not the fields we are interested in. Let's limit our view to only those that we are interested in.
As we are only interested in the Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags product types, so let's filter them out. Using a similar logic as before, we store these columns in a new DataFrame, tmp1:
```
tmp1 = copy.deepcopy(tmp[tmp['Product.1'].isin(['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])])
tmp1.head()
```
This gives the following output:
Figure 1.64: Filtering out the categories we need
Then perform the following groupby operations as well:
```
tmp1.groupby(['Product.1')['Year'].count()
tmp1.groupby(['Product.1', 'type', 'Product.2'])['Year'].count()
```
This gives the following output:
Figure 1.65: Looking for the variation in the required product types categories
We can see that—because of our choice of fields—luckily, we only have to append two columns to get the attribute we need.
Now that we know tmp1 requires only two words for Product type, we are done:
```
tmp1['Product type'] = tmp1['Product.1'] + ''+ tmp1['type']
tmp1 = tmp1.drop(['Product.1', 'type'], axis = 1)
tmp1.head()
```
This gives the following output:
Figure 1.66: Joining data with these required categories
The next column we have to worry about is Product. Let's see how many columns we need for that:
```
tmp1.groupby(['Product.2', 'Order'])['Year'].count()
```
This gives the following output:
Figure 1.67: Looking at the variation in the product category
This column has some values that are one word and some that are more than that.
Let's create another variable, tmp2, for values containing more than one word:
```
tmp2 = copy.deepcopy(tmp1[~tmp1['Order'].isin(['E-mail', 'Fax', 'Mail', 'Sales', 'Special', 'Telephone', 'Web'])])
tmp2.head()
```
This gives the following output:
Figure 1.68: Filtering out observations with values containing more than one word
We look at the variation in tmp2 and see that, while most of the columns have two words, some columns have more than two words:
```
tmp2.groupby(['Product.2','Order', 'method'])['Year'].count()
```
This gives the following output:
Figure 1.69: Distribution in the next few fields
We keep on performing this procedure of repeatedly storing values that have more leakage into another variable, until we have exhausted all the columns and got a structured dataset by the last step:
```
tmp8 = copy.deepcopy(tmp7[tmp7['Product.3'].isin(['Kingdom', 'States'])])
tmp8.head()
```
This gives the following output:
Figure 1.70: Distribution after structuring the longest parts
After structuring the fields with the longest names for Products, Order method type, and Retailer country, we can see that there is no spillage in the columns containing numerical values. Let's handle the remaining cases directly.

We finish structuring the last part of the data directly and store the final structured data separately:

tmp8['Retail country'] = tmp8['revenue'] + '' + tmp8['Product.3']
tmp8 = tmp8.drop(['revenue', 'Product.3'], axis = 1)
tmp8["Revenue"] = tmp8['cost']
tmp8 = tmp8.drop(['cost'], axis = 1)
tmp8["Planned revenue"] = tmp8['Quantity'] 
tmp8 = tmp8.drop(['Quantity'], axis = 1)
tmp8["Product cost"] = tmp8['Unit'] 
tmp8 = tmp8.drop(['Unit'], axis = 1)
tmp8["Quantity"] = tmp8['cost.1'] 
tmp8 = tmp8.drop(['cost.1'], axis = 1)
tmp8["Unit cost"] = tmp8['Unit.1']
tmp8 = tmp8.drop(['Unit.1'], axis = 1)
tmp8["Unit price"] = tmp8['price']
tmp8 = tmp8.drop(['price'], axis = 1)
tmp8["Gross profit"] = tmp8['Gross']
tmp8 = tmp8.drop(['Gross'], axis = 1)
tmp8["Unit sale price"] = tmp8['profit']
tmp8 = tmp8.drop(['profit', 'Unit.2', 'sale', 'price.1'], axis = 1)
tmp8.head()

This gives the following output:

Figure 1.71: Fully structuring the data for the longest data

We store the structured dataset separately as str1:

str1 = tmp8

Once we structure the last layer of the data completely like this, it is easier for us to structure the layer just before it, as we can use the knowledge of the previous layer and reduce structuring the current layer to a problem we have already solved before:
```
temp = copy.deepcopy(tmp7[~tmp7.index.isin(tmp8.index.values)])
temp.head()
```
You should get the following output:
Figure 1.72: Structuring data with the second longest data fields

Now, we keep on going backward and structure the entire dataset with the help of the correctly structured preceding layers:

temp1['Retailer country'] = temp1['method']
temp1 = temp1.drop(['method'], axis = 1)
\temp1["Planned revenue"] = temp1['Retailer'] 
temp1 = temp1.drop(['Retailer'], axis = 1)
temp1["Product cost"] = temp1['country'] 
temp1 = temp1.drop(['country'], axis = 1)
temp1["Unit cost"] = temp1['Planned'] 
temp1 = temp1.drop(['Planned'], axis = 1)
temp1["Unit price"] = temp1['revenue'] 
temp1 = temp1.drop(['revenue'], axis = 1)
temp1["Gross profit"] = temp1['Product.3']
temp1 = temp1.drop(['Product.3'], axis = 1)
temp1["Unit sale price"] = temp1['cost']
temp1 = temp1.drop(['cost'], axis = 1)
temp1["Quantity"]  = temp1['Revenue']
temp1 = temp1.drop(['Revenue'], axis = 1)
temp1['Revenue'] = temp1['type.1']
temp1 = temp1.drop('type.1', axis = 1)
temp1 = temp1[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 
'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]

temp1.head()

This gives the following output:

Figure 1.73: Iteratively going backward while stabilizing the longest fields

Finally, we just make sure to combine our data and we are done:

df = pd.concat([str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, 
               str15, str16, str17, str18, str19, str20, str21, str22], sort = True)[['Year', 'Product line',
'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue',
'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
df.groupby('Product type').count()

This gives the following output: