Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python

Paperback | Published March 2019 | ISBN-13: 9781789959413 | 420 pages | 1st Edition
Authors (3): Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar
Table of Contents (12)

Data Science for Marketing Analytics
Preface
1. Data Preparation and Cleaning
2. Data Exploration and Visualization (FREE CHAPTER)
3. Unsupervised Learning: Customer Segmentation
4. Choosing the Best Segmentation Approach
5. Predicting Customer Revenue Using Linear Regression
6. Other Regression Techniques and Tools for Evaluation
7. Supervised Learning: Predicting Customer Churn
8. Fine-Tuning Classification Algorithms
9. Modeling Customer Choice
Appendix

Chapter 1: Data Preparation and Cleaning


Activity 1: Addressing Data Spilling

  1. Import the pandas and copy modules in the console, as follows:

    import pandas as pd
    import copy
  2. Use the read_excel function to read the xlsx file and the head function to look at the first few rows:

    sales = pd.read_excel("sales.xlsx")
    sales.head()
  3. Look at the data types of sales to check whether they make sense:

    sales.dtypes

    You should get the following output:

    Figure 1.57: Looking at the datatype of sales.xlsx

  4. We can iterate through the DataFrame's columns and print the value each one holds in a single row, to see how the data currently looks and how it should look:

    for label, content in sales.iteritems():  # in pandas >= 2.0, use sales.items()
        print(label, content[1])

    This gives the following output:

    Figure 1.58: Iterating through the first row

  5. From the preceding output, you can infer that the real column names start with a capital letter; the lowercase headers (such as line and type) are fragments of spilled values. Add the following code to lay out one correctly structured row as a reference for what the data should look like:

    d = {"Year": 2004, "Product line": "Camping Equipment", "Product type":"Cooking Gear", "Product":"TrailChef Water Bag", 
         "Order method type":"Telephone", "Retailer Country":"Canada", "Revenue":13444.68, "Planned revenue":14313.48, 
         "Product cost":6298.8, "Quantity":2172, "Unit cost":2.9, "Unit price":6.59, "Gross Profit":7145.88, "Unit sale price":6.19}

    You should get the following output:

    Figure 1.59: Looking at how the data should be structured
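
    If you want to view this reference row as a table rather than a raw dictionary (a minimal sketch; the book's figure may have been produced differently), you can wrap it in a one-row DataFrame:

    pd.DataFrame([d])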

  6. The Year column seems to be in its right place, so we can start by reconstructing the Product line column. Let's see how many columns its values have leaked into:

    sales.groupby(['Product','line'])['Year'].count()

    Your output will be as follows:

    Figure 1.60: Seeing how values are distributed across the columns

  7. We are not sure whether the spillage is only restricted to these two columns, so let's look at the next column too, to be sure:

    sales.groupby(['Product','line', 'Product.1'])['Year'].count()

    Your output will be as follows:

    Figure 1.61: Looking at data to get the number of columns required

  8. Let's reassemble the Product line column and check the result:

    sales['Product line'] = sales.apply(lambda x: x['Product'] + ' ' + x['line'], axis = 1)
    sales = sales.drop(['Product', 'line'], axis = 1)
    sales.head()

    The DataFrame should now look as follows:

    Figure 1.62: Collecting data from multiple columns into the correct field
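
    As an aside, the same join could have been written as a vectorized string concatenation, which is usually faster than apply on large DataFrames (a sketch assuming both columns hold strings):

    sales['Product line'] = sales['Product'] + ' ' + sales['line']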

  9. Instead of changing the sales DataFrame directly, let's create a copy of it, called tmp, and make our changes there; that way, we won't lose the original DataFrame if the filtering procedure goes wrong. Let's continue with the next column of interest, that is, Product type:

    tmp = copy.deepcopy(sales)
    tmp.groupby(['Product.1','type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.63: Seeing variation in number of words required to represent product type

    We can see that some of these product types are not ones we are interested in. Let's limit our view to the relevant ones.

  10. As we are only interested in the Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags product types, let's keep only those rows. Using similar logic as before, we store them in a new DataFrame, tmp1:

    tmp1 = copy.deepcopy(tmp[tmp['Product.1'].isin(['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])])
    tmp1.head()

    This gives the following output:

    Figure 1.64: Filtering out the categories we need

    Then perform the following groupby operations as well:

    tmp1.groupby(['Product.1'])['Year'].count()
    tmp1.groupby(['Product.1', 'type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.65: Looking for the variation in the required product types categories

    We can see that, thanks to our choice of fields, we luckily only have to join two columns to get the attribute we need.

  11. Now that we know every Product type value in tmp1 spans exactly two words, we can join the fragments and drop them:

    tmp1['Product type'] = tmp1['Product.1'] + ' ' + tmp1['type']
    tmp1 = tmp1.drop(['Product.1', 'type'], axis = 1)
    tmp1.head()

    This gives the following output:

    Figure 1.66: Joining data with these required categories

  12. The next column we have to worry about is Product. Let's see how many columns we need for that:

    tmp1.groupby(['Product.2', 'Order'])['Year'].count()

    This gives the following output:

    Figure 1.67: Looking at the variation in the product category

    Some values in this column are a single word, while others span more than one.

  13. Let's create another variable, tmp2, for values containing more than one word:

    tmp2 = copy.deepcopy(tmp1[~tmp1['Order'].isin(['E-mail', 'Fax', 'Mail', 'Sales', 'Special', 'Telephone', 'Web'])])
    tmp2.head()

    This gives the following output:

    Figure 1.68: Filtering out observations with values containing more than one word

  14. We look at the variation in tmp2 and see that, while most of the values span two words, some span more than two:

    tmp2.groupby(['Product.2','Order', 'method'])['Year'].count()

    This gives the following output:

    Figure 1.69: Distribution in the next few fields

  15. We keep repeating this procedure, each time splitting off the rows with more leakage into another variable, until we have exhausted all the columns and obtained a structured dataset in the last step:

    tmp8 = copy.deepcopy(tmp7[tmp7['Product.3'].isin(['Kingdom', 'States'])])
    tmp8.head()

    This gives the following output:

    Figure 1.70: Distribution after structuring the longest parts

    After structuring the fields with the longest names for Product, Order method type, and Retailer country, we can see that there is no spillage in the columns containing numerical values; we will handle the remaining cases directly in the next step.
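
    Before moving on, here is the generic shape of one pass of this repeated procedure, as a self-contained sketch (the column names and token set here are illustrative, not the exact ones from sales.xlsx):

    import pandas as pd

    # Toy frame mimicking one layer of spillage: one field has leaked into two columns
    tmp = pd.DataFrame({'part1': ['Cooking', 'Golf', 'TrailChef'],
                        'part2': ['Gear', 'Accessories', 'Water'],
                        'next_col': ['Telephone', 'Fax', 'Bag']})
    # 1. Rows whose next column already holds a known token stopped spilling here
    mask = tmp['next_col'].isin(['Telephone', 'Fax', 'Web'])
    stopped, spilled = tmp[mask].copy(), tmp[~mask].copy()
    # 2. Join the fragments back into one field and drop the leftover columns
    stopped['Field'] = stopped['part1'] + ' ' + stopped['part2']
    stopped = stopped.drop(['part1', 'part2'], axis = 1)
    print(stopped)  # the 'spilled' rows would go through the next, longer pass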

  16. We finish structuring the last part of the data directly and store the final structured data separately:

    tmp8['Retailer country'] = tmp8['revenue'] + ' ' + tmp8['Product.3']
    tmp8 = tmp8.drop(['revenue', 'Product.3'], axis = 1)
    tmp8["Revenue"] = tmp8['cost']
    tmp8 = tmp8.drop(['cost'], axis = 1)
    tmp8["Planned revenue"] = tmp8['Quantity'] 
    tmp8 = tmp8.drop(['Quantity'], axis = 1)
    tmp8["Product cost"] = tmp8['Unit'] 
    tmp8 = tmp8.drop(['Unit'], axis = 1)
    tmp8["Quantity"] = tmp8['cost.1'] 
    tmp8 = tmp8.drop(['cost.1'], axis = 1)
    tmp8["Unit cost"] = tmp8['Unit.1']
    tmp8 = tmp8.drop(['Unit.1'], axis = 1)
    tmp8["Unit price"] = tmp8['price']
    tmp8 = tmp8.drop(['price'], axis = 1)
    tmp8["Gross profit"] = tmp8['Gross']
    tmp8 = tmp8.drop(['Gross'], axis = 1)
    tmp8["Unit sale price"] = tmp8['profit']
    tmp8 = tmp8.drop(['profit', 'Unit.2', 'sale', 'price.1'], axis = 1)
    tmp8.head()

    This gives the following output:

    Figure 1.71: Fully structuring the data for the longest data

    We store the structured dataset separately as str1:

    str1 = tmp8
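
    Note that, apart from Retailer country (which needed two fragments joined), every assignment above is effectively a column rename. Under the assumption that none of the target names already exist as columns at this point, the same shifts could be expressed in a single rename call; a sketch equivalent to the step-by-step version:

    # Simultaneous rename; 'Quantity' -> 'Planned revenue' and 'cost.1' -> 'Quantity'
    # do not collide because the mapping is applied in one pass
    tmp8 = tmp8.rename(columns={'cost': 'Revenue', 'Quantity': 'Planned revenue',
                                'Unit': 'Product cost', 'cost.1': 'Quantity',
                                'Unit.1': 'Unit cost', 'price': 'Unit price',
                                'Gross': 'Gross profit', 'profit': 'Unit sale price'})
    tmp8 = tmp8.drop(['Unit.2', 'sale', 'price.1'], axis = 1)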
  17. Once we have structured the last layer of the data completely like this, structuring the layer just before it becomes easier: the knowledge from the previous layer reduces the current layer to a problem we have already solved. We select the rows of tmp7 that were not already structured in tmp8:

    temp1 = copy.deepcopy(tmp7[~tmp7.index.isin(tmp8.index.values)])
    temp1.head()

    You should get the following output:

    Figure 1.72: Structuring data with the second longest data fields
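
    Selecting the complement of tmp8's rows can also be expressed by dropping those index labels from tmp7 directly; a small equivalent sketch:

    temp1 = tmp7.drop(tmp8.index).copy()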

  18. Now we keep going backward, structuring each layer of the dataset with the help of the correctly structured layers that came after it:

    temp1['Retailer country'] = temp1['method']
    temp1 = temp1.drop(['method'], axis = 1)
    temp1["Planned revenue"] = temp1['Retailer']
    temp1 = temp1.drop(['Retailer'], axis = 1)
    temp1["Product cost"] = temp1['country'] 
    temp1 = temp1.drop(['country'], axis = 1)
    temp1["Unit cost"] = temp1['Planned'] 
    temp1 = temp1.drop(['Planned'], axis = 1)
    temp1["Unit price"] = temp1['revenue'] 
    temp1 = temp1.drop(['revenue'], axis = 1)
    temp1["Gross profit"] = temp1['Product.3']
    temp1 = temp1.drop(['Product.3'], axis = 1)
    temp1["Unit sale price"] = temp1['cost']
    temp1 = temp1.drop(['cost'], axis = 1)
    temp1["Quantity"]  = temp1['Revenue']
    temp1 = temp1.drop(['Revenue'], axis = 1)
    temp1['Revenue'] = temp1['type.1']
    temp1 = temp1.drop('type.1', axis = 1)
    temp1 = temp1[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 
    'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    
    temp1.head()

    This gives the following output:

    Figure 1.73: Iteratively going backward while stabilizing the longest fields

  19. Finally, we concatenate all the structured pieces (str1 through str22), reorder the columns, and check the result:

    df = pd.concat([str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, 
                   str15, str16, str17, str18, str19, str20, str21, str22], sort = True)[['Year', 'Product line',
    'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue',
    'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    df.groupby('Product type').count()

    This gives the following output:

    Figure 1.74: Combining Data Sources
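
    As a final sanity check (a small sketch, assuming every row of sales ended up in exactly one strN piece; sales here is the raw DataFrame loaded in step 2), you can confirm that no rows were lost while splitting and recombining the layers:

    assert len(df) == len(sales)  # every original row should appear exactly once
    df.head()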
