Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Data Science for Marketing Analytics
Data Science for Marketing Analytics

Data Science for Marketing Analytics: A practical guide to forming a killer marketing strategy through data analysis with Python , Second Edition

Arrow left icon
Profile Icon Mirza Rahim Baig Profile Icon Gururajan Govindan Profile Icon Vishwesh Ravi Shrimali
Arrow right icon
₱579.99 ₱1591.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (203 Ratings)
eBook Sep 2021 636 pages 2nd Edition
eBook
₱579.99 ₱1591.99
Paperback
₱1989.99
Subscription
Free Trial
Arrow left icon
Profile Icon Mirza Rahim Baig Profile Icon Gururajan Govindan Profile Icon Vishwesh Ravi Shrimali
Arrow right icon
₱579.99 ₱1591.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (203 Ratings)
eBook Sep 2021 636 pages 2nd Edition
eBook
₱579.99 ₱1591.99
Paperback
₱1989.99
Subscription
Free Trial
eBook
₱579.99 ₱1591.99
Paperback
₱1989.99
Subscription
Free Trial

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Data Science for Marketing Analytics

2. Data Exploration and Visualization

Objectives

In this chapter, you will learn to explore, analyze, and reshape your data so that you can shed light on the attributes of your data that are important to the business – a key skill in a marketing analyst's repertoire. You will discover functions that will help you derive summary and descriptive statistics from your data. You will build pivot tables and perform comparative tests and analyses to discover hidden relationships between various data points. Later, you will create impactful visualizations by using two of the most popular Python packages, Matplotlib and seaborn.

Introduction

"How does this data make sense to the business?" It's a critical question you'll need to ask every time you start working with a new, raw dataset. Even after you clean and prepare raw data, you won't be able to derive actionable insights from it just by scanning through thousands of rows and columns. To be able to present the data in a way that it provides value to the business, you may need group similar rows, re-arrange the columns, generate detailed charts, and more. Manipulating and visualizing the data to uncover insights that stakeholders can easily understand and implement is a key skill in a marketing analyst's toolbox. This chapter is all about learning that skill.

In the last chapter, you learned how you can transform raw data with the help of pandas. You saw how to clean the data and handle the missing values after which the data can be structured into a tabular form. The structured data can be further analyzed so that meaningful information can be extracted from it.

In this chapter, you'll discover the functions and libraries that help you explore and visualize your data in greater detail. You will go through techniques to explore and analyze data through solving some problems critical for businesses, such as identifying attributes useful for marketing, analyzing key performance indicators, performing comparative analyses, and generating insights and visualizations. You will use the pandas, Matplotlib, and seaborn libraries in Python to solve these problems.

Let us begin by first understanding how we can identify the attributes that will help us derive insights from our data.

Identifying and Focusing on the Right Attributes

Whether you're trying to meet your marketing goals or solve problems in an existing marketing campaign, a structured dataset may comprise numerous rows and columns. In such cases, deriving actionable insights by studying each attribute might prove challenging and time-consuming. The better your data analysis skills are, the more valuable the insights you'll be able to generate in a shorter amount of time. A prudent marketing analyst would use their experience to determine the attributes that are not needed for the study; then, using their data analysis skills, they'd remove such attributes from the dataset. It is these data analysis skills that you'll be building in this chapter.

Learning to identify the right attributes is one aspect of the problem, though. Marketing goals and campaigns often involve target metrics. These metrics may depend on domain knowledge and business acumen and are known as key performance indicators (KPIs). As a marketing analyst, you'll need to be able to analyze the relationships between the attributes of your data to quantify your data in line with these metrics. For example, you may be asked to derive any of the following commonly used KPIs from a dataset comprising 30 odd columns:

  • Revenue: What are the total sales the company is generating in a given time frame?
  • Profit: What is the money a company makes from its total sales after deducting its expenses?
  • Customer Life Cycle Value: What is the metric that indicates the total revenue the company can expect from a customer throughout his association?
  • Customer Acquisition cost: What is the amount of money a company spends to convert a prospective lead to a customer?
  • Conversion rate: How many of all the people you have targeted for a marketing campaign have actually bought the product or used the company's services?
  • Churn: How many customers have stopped using the company's services or have stopped buying the product?

    Note

    These are only the commonly used KPIs. You'll find that there are numerous more in the marketing domain.

You will of course have to get rid of columns that are not needed, and then use your data analysis tools to calculate the needed metrics. pandas, the library we encountered in the previous chapter, provides quite a lot of such tools in the form of functions and collections that help generate insights and summaries. One such useful function is groupby.

The groupby(  ) Function

Take a look at the following data stored in an Excel sheet:

Figure 2.1: Sample sales data in an Excel sheet

Figure 2.1: Sample sales data in an Excel sheet

You don't need to study all the attributes of the data as you'll be working with it later on. But suppose, using this dataset, you wanted to individually calculate the sums of values stored in the following columns: Revenue, Planned Revenue, Product Cost, Quantity (all four columns annotated by 2), and Gross profit (annotated by 3). Then, you wanted to segregate the sums of values for each column by the countries listed in column Retailer_country (annotated by 1). Before making any changes to it, your first task would be to visually analyze the data and find patterns in it.

Looking at the data, you can see that country names are repeated quite a lot. Now instead of calculating how many uniquely occurring countries there are, you can simply group these rows together using pivot tables as follows:

Figure 2.2: Pivot of sales data in Excel

Figure 2.2: Pivot of sales data in Excel

In the preceding figure, not only is the data grouped by various countries, the values for Planned Revenue, Revenue, Product cost, and Gross profit are summed as well. That solves our problem. But what if this data was stored in a DataFrame?

This is where the groupby function comes in handy.

groupby(col)[cols].agg_func

This function groups rows with similar values in a column denoted by col and applies the aggregator function agg_func to the column of the DataFrame denoted by cols.

For example, think of the sample sales data as a DataFrame df. Now, you want to find out the total planned revenue by country. For that, you will need to group the countries and focusing on the Planned revenue column, sum all the values that correspond to the grouped countries. In the following command, we are grouping the values present in the Retailer country column and adding the corresponding values of the Planned revenue column.

df.groupby('Retailer country')['Planned revenue'].sum()

You'd get an output similar to the following:

Figure 2.3: Calculating total Planned Revenue Using groupby()

Figure 2.3: Calculating total Planned Revenue Using groupby()

The preceding output shows the sum of planned revenue grouped by country. To help us better compare the results we got here with the results of our pivot data in Figure 2.2, we can also chain another command as follows:

df.groupby('Retailer country')['Planned revenue'].sum().round()

You should get the following output:

Figure 2.4: Chaining multiple functions to groupby

Figure 2.4: Chaining multiple functions to groupby

If we compare the preceding output to what we got in Figure 2.2, we can see that we have achieved the same results as the pivot functionality in Excel but using a built-in pandas function.

Some of the most common aggregator functions that can be used alongside groupby function are as follows:

  • count(): This function returns the total number of values present in the column.
  • min(): This function returns the minimum value present in the column.
  • max(): This function returns the maximum value present in the column.
  • mean(): This function returns the mean of all values in the column.
  • median(): This function returns the median value in the column.
  • mode(): This function returns the most frequently occurring value in the column.

Next, let us look into another pandas function that can help us derive useful insights.

The unique(  ) function

While you know how to group data based on certain attributes, there are times when you don't know which attribute to choose to group the data by (or even make other analyses). For example, suppose you wanted to focus on customer life cycle value to design targeted marketing campaigns. Your dataset has attributes such as recency, frequency, and monetary values. Just looking at the dataset, you won't be able to understand how the attributes in each column vary. For example, one column might comprise lots of rows with numerous unique values, making it difficult to use the groupby() function on it. The other might be replete with duplicate values and have just two unique values. Thus, examining how much your values vary in a column is essential before doing further analysis. pandas provides a very handy function for doing that, and it's called unique().

For example, let's say you have a dataset called customerdetails that contains the names of customers, their IDs, and their countries. The data is stored in a DataFrame df and you would like to see a list of all the countries you have customers in.

Figure 2.5: DataFrame df

Figure 2.5: DataFrame df

Just by looking at the data, you see that there are a lot of countries in the column titled Country. Some country names are repeated multiple times. To know where your customers are located around the globe, you'll have to filter out all the unique values in this column. To do that, you can use the unique function as follows:

df['Country'].unique()

You should get the following output:

Figure 2.6: Different country names

Figure 2.6: Different country names

You can see from the preceding output that the customers are from the following countries – Australia, Canada, Spain, India, Austria, China, Japan, Switzerland, the UK, New Zealand, and the USA.

Now that you know which countries your customers belong to, you may also want to know how many customers are present in each of these countries. This can be taken care of by the value_counts() function provided by pandas.

The value_counts(  ) function

Sometimes, apart from just seeing what the unique values in a column are, you also want to know the count of each unique value in that column. For example, if, after running the unique() function, you find out that Product 1, Product 2, and Product 3 are the only three values that are repeated in a column comprising 1,000 rows; you may want to know how many entries of each product there are in the column. In such cases, you can use the value_counts() function. This function displays the unique values of the categories along with their counts.

Let's revisit the customerdetails dataset we encountered in the previous section. To find out the number of customers present in each country, we can use the value_counts() function as follows:

df['Country'].value_counts()

You will get output similar to the following:

Figure 2.7: Output of value_counts function

Figure 2.7: Output of value_counts function

From the preceding output, you can see that the count of each country is provided against the country name. Now that you have gotten a hang of these methods, let us implement them to derive some insights from sales data.

Exercise 2.01: Exploring the Attributes in Sales Data

You and your team are creating a marketing campaign for a client. All they've handed you is a file called sales.csv, which as they explained, contains the company's historical sales records. Apart from that, you know nothing about this dataset.

Using this data, you'll need to derive insights that will be used to create a comprehensive marketing campaign. Not all insights may be useful to the business, but since you will be presenting your findings to various teams first, an insight that's useful for one team may not matter much for the other team. So, your approach would be to gather as many actionable insights as possible and present those to the stakeholders.

You neither know the time period of these historical sales nor do you know which products the company sells. Download the file sales.csv from GitHub and create as many actionable insights as possible.

Note

You can find sales.csv at the following link: https://packt.link/Z2gRS.

  1. Open a new Jupyter Notebook to implement this exercise. Save the file as Exercise2-01.ipnyb. In a new Jupyter Notebook cell, import the pandas library as follows

    import pandas as pd

  2. Create a new pandas DataFrame named sales and read the sales.csv file into it. Examine if your data is properly loaded by checking the first few values in the DataFrame by using the head() command:

    sales = pd.read_csv('sales.csv')

    sales.head()

    Note

    Make sure you change the path (highlighted) to the CSV file based on its location on your system. If you're running the Jupyter Notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.

    Your output should look as follows:

    Figure 2.8: The first five rows of sales.csv

    Figure 2.8: The first five rows of sales.csv

  3. Examine the columns of the DataFrame using the following code:

    sales.columns

    This produces the following output:

    Figure 2.9: The columns in sales.csv

    Figure 2.9: The columns in sales.csv

  4. Use the info function to print the datatypes of columns of sales DataFrame using the following code:

    sales.info()

    This should give you the following output:

    Figure 2.10: Information about the sales DataFrame

    Figure 2.10: Information about the sales DataFrame

    From the preceding output, you can see that Year is of int data type and columns such as Product line, Product type, etc. are of the object types.

  5. To check the time frame of the data, use the unique function on the Year column:

    sales['Year'].unique()

    You will get the following output:

    Figure 2.11: The number of years the data is spread over

    Figure 2.11: The number of years the data is spread over

    You can see that we have the data for the years 20042007.

  6. Use the unique function again to find out the types of products that the company is selling:

    sales['Product line'].unique()

    You should get the following output:

    Figure 2.12: The different product lines the data covers

    Figure 2.12: The different product lines the data covers

    You can see that company is selling four different types of products.

  7. Now, check the Product type column:

    sales['Product type'].unique()

    You will get the following output:

    Figure 2.13: The different types of products the data covers

    Figure 2.13: The different types of products the data covers

    From the above output, you can see that you have six different product types namely Golf Accessories, Sleeping Bags, Cooking Gear, First Aid, Insect Repellents, and Climbing Accessories.

  8. Check the Product column to find the unique categories present in it:

    sales['Product'].unique()

    You will get the following output:

    Figure 2.14: Different products covered in the dataset

    Figure 2.14: Different products covered in the dataset

    The above output shows the different products that are sold.

  9. Now, check the Order method type column to find out the ways through which the customer can place an order:

    sales['Order method type'].unique()

    You will get the following output:

    Figure 2.15: Different ways in which people making purchases have ordered

    Figure 2.15: Different ways in which people making purchases have ordered

    As you can see from the preceding figure, there are seven different order methods through which a customer can place an order.

  10. Use the same function again on the Retailer country column to find out the countries where the client has a presence:

    sales['Retailer country'].unique()

    You will get the following output:

    Figure 2.16: The countries in which products have been sold

    Figure 2.16: The countries in which products have been sold

    The preceding output shows the geographical presence of the company.

  11. Now that you have analyzed the categorical values, get a quick summary of the numerical fields using the describe function:

    sales.describe()

    This gives the following output:

    Figure 2.17: Description of the numerical columns in sales.csv

    Figure 2.17: Description of the numerical columns in sales.csv

    You can observe that the mean revenue the company is earning is around $103,846. The describe function is used to give us an overall idea about the range of the data present in the DataFrame.

  12. Analyze the spread of the categorical columns in the data using the value_counts() function. This would shed light on how the data is distributed. Start with the Year column:

    sales['Year'].value_counts()

    This gives the following output:

    Figure 2.18: Frequency table of the Year column

    Figure 2.18: Frequency table of the Year column

    From the above result, you can see that you have around 5451 records in the years 2004, 2005, and 2006 and the number of records in the year 2007 is 1470.

  13. Use the same function on the Product line column:

    sales['Product line'].value_counts()

    This gives the following output:

    Figure 2.19: Frequency table of the Product line column

    Figure 2.19: Frequency table of the Product line column

    As you can see from the preceding output, Camping Equipment has the highest number of observations in the dataset followed by Outdoor Protection.

  14. Now, check for the Product type column:

    sales['Product type'].value_counts()

    This gives the following output:

    Figure 2.20: Frequency table of the Product line column

    Figure 2.20: Frequency table of the Product line column

    Cooking gear followed by climbing accessories has the highest number of observations in the dataset which means that these product types are quite popular among the customers.

  15. Now, find out the most popular order method using the following code:

    sales['Order method type'].value_counts()

    This gives the following output:

    Figure 2.21: Frequency table of the Product line column

    Figure 2.21: Frequency table of the Product line column

    Almost all the order methods are equally represented in the dataset which means customers have an equal probability of ordering through any of these methods.

  16. Finally, check for the Retailer country column along with their respective counts:

    sales['Retailer country'].value_counts()

    You should get the following output:

    Figure 2.22: Frequency table of the Product line column

    Figure 2.22: Frequency table of the Product line column

    The preceding result shows that data points are evenly distributed among all the countries showing no bias.

  17. Get insights into country-wide statistics now. Group attributes such as Revenue, Planned revenue, Product cost, Quantity, and Gross profit by their countries, and sum their corresponding values. Use the following code:

    sales.groupby('Retailer country')[['Revenue',\

                                       'Planned revenue',\

                                       'Product cost',\

                                       'Quantity',\

                                       'Gross profit']].sum()

    You should get the following output:

    Figure 2.23: Total revenue, cost, quantities sold, and profit in each country 
in the past four years

    Figure 2.23: Total revenue, cost, quantities sold, and profit in each country in the past four years

    From the preceding figure, you can infer that Denmark had the least revenue and the US had the highest revenue in the past four years. Most countries generated revenue of around 20,000,000 USD and almost reached their revenue targets.

  18. Now find out the country whose product performance was affected the worst when sales dipped. Use the following code to group data by Retailer country:

    sales.dropna().groupby('Retailer country')\

                          [['Revenue',\

                            'Planned revenue',\

                            'Product cost',\

                            'Quantity',\

                            'Unit cost',\

                            'Unit price',\

                            'Gross profit',\

                            'Unit sale price']].min()

    You should get the following output:

    Figure 2.24: The lowest price, quantity, cost prices, and so on for each country

    Figure 2.24: The lowest price, quantity, cost prices, and so on for each country

    Since most of the values in the gross profit column are negative, you can infer that almost every product has at some point made a loss in the target markets. Brazil, Spain, Finland, and Canada are some exceptions.

  19. Similarly, generate statistics for other categorical variables, such as Year, Product line, Product type, and Product. Use the following code for the Year variable:

    sales.groupby('Year')[['Revenue',\

                           'Planned revenue',\

                           'Product cost',\

                           'Quantity',\

                           'Unit cost',\

                           'Unit price',\

                           'Gross profit',\

                           'Unit sale price']].sum()

    This gives the following output:

    Figure 2.25: Total revenue, cost, quantities, and so on sold every year

    Figure 2.25: Total revenue, cost, quantities, and so on sold every year

    From the above figure, it appears that revenue, profits, and quantities have dipped in the year 2007.

  20. Use the following code for the Product line variable:

    sales.groupby('Product line')[['Revenue',\

                                   'Planned revenue',\

                                   'Product cost',\

                                   'Quantity',\

                                   'Unit cost',\

                                   'Unit price',\

                                   'Gross profit',\

                                   'Unit sale price']].sum()

    You should get the following output:

    Figure 2.26: Total revenue, cost, quantities, and so on, generated by each product division

    Figure 2.26: Total revenue, cost, quantities, and so on, generated by each product division

    The preceding figure indicates that the sale of Camping Equipment contributes the highest to the overall revenue of the company.

  21. Now, find out which order method contributes to the maximum revenue:

    sales.groupby('Order method type')[['Revenue',\

                                        'Planned revenue',\

                                        'Product cost',\

                                        'Quantity',\

                                        'Gross profit']].sum()

    You should get the following output:

    Figure 2.27: Average revenue, cost, quantities, and so on generated 
by each method of ordering

Figure 2.27: Average revenue, cost, quantities, and so on generated by each method of ordering

Observe that the highest sales were generated through the internet (more than all the other sources combined).

Now that you've generated the insights, it's time to select the most useful ones, summarize them, and present them to the business as follows:

Figure 2.28: Summary of the derived insights

Figure 2.28: Summary of the derived insights

In this exercise, you have successfully explored the attributes in a dataset. In the next section, you will learn how to generate targeted insights from the prepared data.

Note

You'll be working with the sales.csv data in the upcoming section as well. It's recommended that you keep the Jupyter Notebook open and continue there.

Fine Tuning Generated Insights

Now that you've generated a few insights, it's important to fine-tune your results to cater to the business. This can involve small changes like renaming columns or relatively big ones like turning a set of rows into columns. pandas provides ample tools that help you fine-tune your data so that you can extract insights from it that are more comprehensible and valuable to the business. Let's examine a few of them in the section that follows.

Selecting and Renaming Attributes

At times, you might notice variances in certain attributes of your dataset. You may want to isolate those attributes so that you can examine them in greater detail. The following functions come in handy for performing such tasks:

  • loc[label]: This method selects rows and columns by a label or by a boolean condition.
  • loc[row_labels, cols]: This method selects rows present in row_labels and their values in the cols columns.
  • iloc[location]: This method selects rows by integer location. It can be used to pass a list of row indices, slices, and so on.

    In the sales DataFrame you can use the loc method to select, based on a certain condition, the values of Revenue, Quantity, and Gross Profit columns:

    sales.loc[sales['Retailer country']=='United States', \

              ['Revenue', 'Quantity', 'Gross profit']].head()

    This code selects values from the Revenue, Quantity, and Gross profit columns only if the Retailer country is the United States. The output of the first five rows should be:

    Figure 2.29: Sub-selecting observations and attributes in pandas

Figure 2.29: Sub-selecting observations and attributes in pandas

Let's also learn about iloc. Even though you won't be using it predominantly in your analysis, it is good to know how it works. For example, in the sales DataFrame, you can select the first two rows (0,1) and the first 3 columns (0,1,2) using the iloc method as follows.

sales.iloc[[0,1],[0,1,2]]

This should give you the following output:

Figure 2.30: Sub-selecting observations using iloc

Figure 2.30: Sub-selecting observations using iloc

At times, you may find that the attributes of your interest might not have meaningful names, or, in some cases, they might be incorrectly named as well. For example, Profit may be named as P&L, making it harder to interpret. To address such issues, you can use the rename function in pandas to rename columns as well as the indexes.

The rename function takes a dictionary as an input, which contains the current column name as a key and the desired renamed attribute name as value. It also takes the axis as a parameter, which denotes whether the index or the column should be renamed.

The syntax of the function is as follows.

df.rename({'old column name':'new column name'},axis=1)

The preceding code will change the 'old column name' to 'new column name'. Here, axis=1 is used to rename across columns. The following code renames the Revenue column to Earnings:

sales=sales.rename({'Revenue':'Earnings'}, axis = 1)

sales.head()

Since we're using the parameter axis=1, we are renaming across columns. The resulting DataFrame will now be:

Figure 2.31: Output after using the rename function on sales

Figure 2.31: Output after using the rename function on sales

Note

The preceding image does not show all the rows of the output.

From the preceding output, you can observe that the Revenue column is now represented as Earnings. In the next section, we will understand how to transform values numerical values into categorical ones and vice versa.

Up until this chapter, we've been using the sales.csv dataset and deriving insights from it. Now, in the upcoming section and exercise, we will be working on another dataset (we will revisit sales.csv in Exercise 2.03, Visualizing Data with pandas).

In the next section, we will be looking at how to reshape data as well as its implementation.

Reshaping the Data

Sometimes, changing how certain attributes and observations are arranged can help understand the data better, focus on the relevant parts, and extract more information.

Let's consider the data present in CTA_comparison.csv. The dataset consists of Click to Action data of various variants of ads for a mobile phone along with the corresponding views and sales. Click to Action (CTA) is a marketing terminology that prompts the user for some kind of response.

Note

You can find the CTA_comparison.csv file here: https://packt.link/RHxDI.

Let's store the data in a DataFrame named cta as follows:

cta = pd.read_csv('CTA_comparison.csv')

cta

Note

Make sure you change the path (highlighted) to the CSV file based on its location on your system. If you're running the Jupyter Notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.

You should get the following DataFrame:

Figure 2.32: Snapshot of the data in CTA_comparison.csv

Figure 2.32: Snapshot of the data in CTA_comparison.csv

Notes

The outputs in Figure 2.32, Figure 2.33, and Figure 2.34 are truncated. These are for demonstration purposes only.

The DataFrame consists of CTA data for various variants of the ad along with the corresponding timestamps.

But, since we need to analyze the variants in detail, the CTA Variant column should be the index of the DataFrame and not time. For that, we'll need to reshape the data frame so that CTA Variant becomes the index. This can be done in pandas using the set_index function.

You can set the index of the cta DataFrame to CTA Variant using the following code:

cta.set_index('CTA Variant')

The DataFrame will now appear as follows:

Figure 2.33: Changing the index with the help of set_index

Figure 2.33: Changing the index with the help of set_index

You can also reshape data by creating multiple indexes. This can be done by passing multiple columns to the set_index function. For instance, you can set the CTA Variant and views as the indexes of the DataFrame as follows:

cta.set_index(['CTA Variant', 'views'])

The DataFrame will now appear as follows:

Figure 2.34: Hierarchical Data in pandas through set_index

Figure 2.34: Hierarchical Data in pandas through set_index

In the preceding figure, you can see that CTA Variant and views are set as indexes. The same hierarchy can also be created, by passing multiple columns to the groupby function:

cta_views = cta.groupby(['CTA Variant', 'views']).count()

cta_views

This gives the following output:

Figure 2.35: Grouping by multiple columns to generate hierarchies

Figure 2.35: Grouping by multiple columns to generate hierarchies

Using this hierarchy, you can see that CTA Variant B with 800 views is present thrice in the data.

Sometimes, switching the indices from rows to columns and vice versa can reveal greater insights into the data. This reshape transformation is achieved in pandas by using the unstack and stack functions:

  • unstack(level): This function moves the row index with the name or integral location level to the innermost column index. By default, it moves the innermost row:

    h1 = cta_views.unstack(level = 'CTA Variant')

    h1

    This gives the following output:

    Figure 2.36: Example of unstacking DataFrames

Figure 2.36: Example of unstacking DataFrames

You can see that the row index has changed to only views while the column has got the additional CTA Variant attribute as an index along with the existing time and sales columns.

  • stack(level): This function moves the column index with the name or integral location level to the innermost row index. By default, it moves the innermost column:

    h1.stack(0)

    This gives the following output:

    Figure 2.37: Example of stacking a DataFrame

Figure 2.37: Example of stacking a DataFrame

Now the stack function has taken the other sales and time column values to the row index and only the CTA Variant feature has become the column index.

In the next exercise, you will implement stacking and unstacking with the help of a dataset.

Exercise 2.02: Calculating Conversion Ratios for Website Ads.

You are the owner of a website that randomly shows advertisements A or B to users each time a page is loaded. The performance of these advertisements is captured in a simple file called conversion_rates.csv. The file contains two columns: converted and group. If an advertisement succeeds in getting a user to click on it, the converted field gets the value 1, otherwise, it gets 0 by default; the group field denotes which ad was clicked – A or B.

As you can see, comparing the performance of these two ads is not that easy when the data is stored in this format. Use the skills you've learned so far, store this data in a data frame and modify it to show, in one table, information about:

  1. The number of views ads in each group got.
  2. The number of ads converted in each group.
  3. The conversion ratio for each group.

    Note

    You can find the conversion_rates.csv file here: https://packt.link/JyEic.

  1. Open a new Jupyter Notebook to implement this exercise. Save the file as Exercise2-02.ipnyb. Import the pandas library using the import command, as follows:

    import pandas as pd

  2. Create a new pandas DataFrame named data and read the conversion_rates.csv file into it. Examine if your data is properly loaded by checking the first few values in the DataFrame by using the head() command:

    data = pd.read_csv('conversion_rates.csv')

    data.head()

    Note

    Make sure you change the path (highlighted) to the CSV file based on its location on your system. If you're running the Jupyter Notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.

    You should get the following output:

    Figure 2.38: The first few rows of conversion_rates.csv

    Figure 2.38: The first few rows of conversion_rates.csv

  3. Group the data by the group column and count the number of conversions. Store the result in a DataFrame named converted_df:

    converted_df = data.groupby('group').sum()

    converted_df

    You will get the following output:

    Figure 2.39: Count of converted displays

    Figure 2.39: Count of converted displays

    From the preceding output, you can see that group A has 90 people who have converted. This means that advertisement A is quite successful compared to B.

  4. We would like to find out how many people have viewed the advertisement. For that use the groupby function to group the data and the count() function to count the number of times each advertisement was displayed. Store the result in a DataFrame viewed_df. Also, make sure you change the column name from converted to viewed:

    viewed_df = data.groupby('group').count()\

                    .rename({'converted':'viewed'}, \

                            axis = 'columns')

    viewed_df

    You will get the following output:

    Figure 2.40: Count of number of views

    Figure 2.40: Count of number of views

    You can see that around 1030 people viewed advertisement A and 970 people viewed advertisement B.

  5. Combine the converted_df and viewed_df datasets in a new DataFrame, named stats using the following commands:

    stats = converted_df.merge(viewed_df, on = 'group')

    stats

    This gives the following output:

    Figure 2.41: Combined dataset

    Figure 2.41: Combined dataset

    From the preceding figure, you can see that group A has 1030 people who viewed the advertisement, and 90 of them got converted.

  6. Create a new column called conversion_ratio that displays the ratio of converted ads to the number of views the ads received:

    stats['conversion_ratio'] = stats['converted']\

                                /stats['viewed']

    stats

    This gives the following output:

    Figure 2.42: Adding a column to stats

    Figure 2.42: Adding a column to stats

    From the preceding figure, you can see that group A has a better conversion factor when compared with group B.

  7. Create a DataFrame df where group A's conversion ratio is accessed as df['A'] ['conversion_ratio']. Use the stack function for this operation:

    df = stats.stack()

    df

    This gives the following output:

    Figure 2.43: Understanding the different levels of your dataset

    Figure 2.43: Understanding the different levels of your dataset

    From the preceding figure, you can see the group-wise details which are easier to understand.

  8. Check the conversion ratio of group A using the following code:

    df['A']['conversion_ratio']

    You should get a value of 0.08738.

  9. To bring back the data to its original form we can reverse the rows with the columns in the stats DataFrame with the unstack() function twice:

    stats.unstack().unstack()

    This gives the following output:

    Figure 2.44: Reversing rows with columns

Figure 2.44: Reversing rows with columns

This is exactly the information we needed in the goal of the exercise, stored in a neat, tabulated format. It is clear by now that ad A is performing much better than ad B given its better conversion ratio.

In this exercise, you have reshaped the data in a readable format. pandas also provides a simpler way to reshape the data that allows making comparisons while analyzing data very easy. Let's have a look at it in the next section.

Pivot Tables

A pivot table can be used to summarize, sort, or group data stored in a table. You've already seen an example of the pivot table functionality in excel; this one's similar to that. With a pivot table, you can transform rows to columns and columns to rows. The pivot function is used to create a pivot table. It creates a new table, whose row and column indices are the unique values of the respective parameters.

For example, consider the data DataFrame you saw in Step 2 of the preceding exercise:

Figure 2.45: The first few rows of the dataset being considered

Figure 2.45: The first few rows of the dataset being considered

You can use the pivot function to change the columns to rows as follows:

data=data.pivot(columns = 'group', values='converted')

data.head()

This gives the following output:

Figure 2.46: Data after being passed through the pivot command

Figure 2.46: Data after being passed through the pivot command

In the preceding figure, note that the columns and indices have changed but the observations individually have not. You can see that the data that had either a 0 or 1 value remains as it is, but the groups that were not considered have their remaining values filled in as NaN.

There is also a function called pivot_table, which aggregates fields together using the function specified in the aggfunc parameter and creates outputs accordingly. It is an alternative to aggregations such as groupby functions. For instance, if we apply the pivot_table function to the same DataFrame to aggregate the data:

data.pivot_table(index='group', columns='converted', aggfunc=len)

This gives the following output:

Figure 2.47: Applying pivot_table to data

Figure 2.47: Applying pivot_table to data

Note that the use of the len argument results in columns 0 and 1 that show how many times each of these values appeared in each group. Remember that, unlike pivot, it is essential to pass the aggfunc function when using the pivot_table function.

In the next section, you will be learning how to understand your data even better by visually representing it.

Visualizing Data

There are a lot of benefits to presenting data visually. Visualized data is easy to understand, and it can help reveal hidden trends and patterns in data that might not be so conspicuous compared to when the data is presented in numeric format. Furthermore, it is much quicker to interpret visual data. That is why you'll notice that many businesses rely on dashboards comprising multiple charts. In this section, you will learn the functions that will help you visualize numeric data by generating engaging plots. Once again, pandas comes to our rescue with its built-in plot function. This function has a parameter called kind that lets you choose between different types of plots. Let us look at some common types of plots you'll be able to create.

Density plots:

A density plot helps us to find the distribution of a variable. In a density plot, the peaks display where the values are concentrated.

Here's a sample density plot drawn for the Product cost column in a sales DataFrame:

Figure 2.48: Sample density plot

Figure 2.48: Sample density plot

In this plot, the Product cost is distributed with a peak very close to 0. In pandas, you can create density plots using the following command:

df['Column'].plot(kind = 'kde',color='gray')

Note

The value gray for the attribute color is used to generate graphs in grayscale. You can use other colors like darkgreen, maroon, etc. as values of color parameters to get the colored graphs.

Bar Charts:

Bar charts are used with categorical variables to find their distribution. Here is a sample bar chart:

Figure 2.49: Sample bar chart

Figure 2.49: Sample bar chart

In this plot, you can see the distribution of revenue of the product via different order methods. In pandas, you can create bar plots by passing bar as value to the kind parameter.

df['Column'].plot(kind = 'bar', color='gray')

Box Plot:

A box plot is used to depict the distribution of numerical data and is primarily used for comparisons. Here is a sample box plot:

Figure 2.50: Sample box plot

Figure 2.50: Sample box plot

The line inside the box represents the median values for each numerical variable. In pandas, you can create a box plot by passing box as a value to the kind parameter:

df['Column'].plot(kind = 'box', color='gray')

Scatter Plot:

Scatter plots are used to represent the values of two numerical variables. Scatter plots help you to determine the relationship between the variables.

Here is a sample scatter plot:

Figure 2.51: Sample scatter plot

Figure 2.51: Sample scatter plot

In this plot, you can observe the relationship between the two variables. In pandas, you can create scatter plots by passing scatter as a value to the kind parameter.

df['Column'].plot(kind = 'scatter', color='gray')

Let's implement these concepts in the exercise that follows.

Exercise 2.03: Visualizing Data With pandas

In this exercise, you'll be revisiting the sales.csv file you worked on in Exercise 2.01, Exploring the Attributes in Sales Data. This time, you'll need to visualize the sales data to answer the following two questions:

  1. Which mode of order generates the most revenue?
  2. How have the following parameters varied over four years: Revenue, Planned revenue, and Gross profit?

    Note

    You can find the sales.csv file here: https://packt.link/dhAbB.

You will make use of bar plots and box plots to explore the distribution of the Revenue column.

  1. Open a new Jupyter Notebook to implement this exercise. Save the file as Exercise2-03.ipnyb.
  2. Import the pandas library using the import command as follows:

    import pandas as pd

  3. Create a new panda DataFrame named sales and load the sales.csv file into it. Examine if your data is properly loaded by checking the first few values in the DataFrame by using the head() command:

    sales = pd.read_csv("sales.csv")

    sales.head()

    Note

    Make sure you change the path (highlighted) to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.

    You will get the following output:

    Figure 2.52: Output of sales.head()

    Figure 2.52: Output of sales.head()

  4. Group the Revenue by Order method type and create a bar plot:

    sales.groupby('Order method type').sum()\

         .plot(kind = 'bar', y = 'Revenue', color='gray')

    This gives the following output:

    Figure 2.53: Revenue generated through each order method type in sales.csv

    Figure 2.53: Revenue generated through each order method type in sales.csv

    From the preceding image, you can infer that web orders generate the maximum revenue.

  5. Now group the columns by year and create boxplots to get an idea on a relative scale:

    sales.groupby('Year')[['Revenue', 'Planned revenue', \

                           'Gross profit']].plot(kind= 'box',\

                                                color='gray')

    Note

    In Steps 4 and 5, the value gray for the attribute color (emboldened) is used to generate graphs in grayscale. You can use other colors like darkgreen, maroon, etc. as values of color parameter to get the colored graphs. You can also refer to the following document to get the code for the colored plot and the colored output: https://packt.link/NOjgT.

    You should get the following plots:

    The first plot represents the year 2004, the second plot represents the year 2005, the third plot represents the year 2006 and the final one represents 2007.

    Figure 2.54: Boxplot for Revenue, Planned revenue and Gross profit for year 2004

Figure 2.54: Boxplot for Revenue, Planned revenue and Gross profit for year 2004

Figure 2.55: Boxplot for Revenue, Planned revenue and Gross profit for year 2005

Figure 2.55: Boxplot for Revenue, Planned revenue and Gross profit for year 2005

Figure 2.56: Boxplot for Revenue, Planned revenue and Gross profit for year 2006

Figure 2.56: Boxplot for Revenue, Planned revenue and Gross profit for year 2006

Figure 2.57: Boxplot for Revenue, Planned revenue and Gross profit for year 2007

Figure 2.57: Boxplot for Revenue, Planned revenue and Gross profit for year 2007

The bubbles in the plots represent outliers. Outliers are extreme values in the data. They are caused either due to mistakes in measurement or due to the real behavior of the data. Outlier treatment depends entirely on the business use case. In some of the scenarios, outliers are dropped or are capped at a certain value based on the inputs from the business. It is not always advisable to drop the outliers as they can give us a lot of hidden information in the data.

From the above plots, we can infer that Revenue and Planned revenue have a higher median than Gross profit (the median is represented by the line inside the box).

Even though pandas provides you with the basic plots, it does not give you a lot of control over the look and feel of the visualizations.

Python has alternate packages such as seaborn which allow you to generate more fine-tuned and customized plots. Let's learn about this package in the next section.

Visualization through Seaborn

Even though pandas provides us with many of the most common plots required for analysis, Python also has a useful visualization library, seaborn. It provides a high-level API to easily generate top-quality plots with a lot of customization options.

You can change the environment from regular pandas/Matplotlib to seaborn directly through the set function of seaborn. Seaborn also supports a displot function, which plots the actual distribution of the pandas series passed to. To generate histograms through seaborn, you can use the following code:

import seaborn as sns

sns.set()

sns.displot(sales['Gross profit'].dropna(),color='gray')

The preceding code plots a histogram of the values of the Gross profit column. We have set the parameter dropna() which tells the plotting function to ignore null values if present in the data. The sns.set() function changes the environment from regular pandas/Matplotlib to seaborn.

The color attribute is used to provide colors to the graphs. In the preceding code, gray color is used to generate grayscale graphs. You can use other colors like darkgreen, maroon, etc. as values of color parameters to get the colored graphs.

This gives the following output:

Figure 2.58: Histogram for Gross Profit through Seaborn

Figure 2.58: Histogram for Gross Profit through Seaborn

From the preceding plot, you can infer that most of the gross profit is around $1,000.

Pair Plots:

Pair plots are one of the most effective tools for exploratory data analysis. They can be considered as a collection of scatter plots between the variables present in the dataset. With a pair plot, one can easily study the distribution of a variable and its relationship with the other variables. These plots also reveal trends that may need further exploration.

For example, if your dataset has four variables, a pair plot would generate 16 charts that show the relationship of all the combinations of variables.

To generate a pair plot through seaborn, you can use the following code:

import seaborn as sns

sns.pairplot(dataframe, palette='gray')

The palette attribute is used to define the color of the pair plot. In the preceding code, gray color is used to generate grayscale graphs.

An example pair plot generated using seaborn would look like this:

Figure 2.59: Sample pair plot

Figure 2.59: Sample pair plot

The following inferences can be made from the above plot.

  1. Revenue and Gross profit have a linear relationship; that is, when Revenue increases the Gross Profit increases
  2. Quantity and Revenue show no trend; that is, there is no relationship.

    Note

    You can refer to the following link for more details about the seaborn library: https://seaborn.pydata.org/tutorial.html.

In the next section, we will understand how to visualize insights using the matplotlib library.

Visualization with Matplotlib

Python's default visualization library is matplotlib. matplotlib was originally developed to bring visualization capabilities from the MATLAB academic tool into an open-source programming language, Python. matplotlib provides low-level additional features that can be added to plots made from any other visualization library like pandas or seaborn.

To start using matplotlib, you need to first import the matplotlib.pyplot object as plt. This plt object becomes the basis for generating figures in matplotlib.

import matplotlib.pyplot as plt

We can then run any functions on this object as follows:

plt.<function name>

Some of the functions that we can call on this plt object for these options are as follows:

Figure 2.60: Functions that can be used on plt

Figure 2.60: Functions that can be used on plt

For example, on the sales DataFrame, you can plot a bar graph between products and revenue using the following code.

# Importing the matplotlib library

import matplotlib.pyplot as plt

#Declaring the color of the plot as gray

plt.bar(sales['Product line'], sales['Revenue'], color='gray')

# Giving the title for the plot

plt.title("Products with their corresponding revenue")

# Naming the x and y axis

plt.xlabel('Products')

plt.ylabel('Revenue')

# Rotates X-Axis labels by 45-degrees

plt.xticks(rotation = 45)

# Displaying the bar plot

plt.show()

This gives the following output:

Figure 2.61: Sample bar graph

Figure 2.61: Sample bar graph

The color of the plot can be altered with the color parameter. We can use different colors such as blue, black, red, and cyan.

Note

Feel free to explore some of the things you can do directly with Matplotlib by reading up the official documentation at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html.

For a complete course on data visualization in general, you can refer to the Data Visualization Workshop: https://www.packtpub.com/product/the-data-visualization-workshop/9781800568846.

Before we head to the activity, the following table shows some of the key plots along with their usage:

Figure 2.62: Key plots and their usage

Figure 2.62: Key plots and their usage

With that, it's time to put everything you've learned so far to test in the activity that follows.

Activity 2.01: Analyzing Advertisements

Your company has collated data on the advertisement views through various mediums in a file called Advertising.csv. The advert campaign ran through radio, TV, web, and newspaper and you need to mine the data to answer the following questions:

  1. What are the unique values present in the Products column?
  2. How many data points belong to each category in the Products column?
  3. What are the total views across each category in the Products column?
  4. Which product has the highest viewership on TV?
  5. Which product has the lowest viewership on the web?

To do that, you will need to examine the dataset with the help of the functions you have learned, along with charts wherever needed.

Note

You can find the Advertising.csv file here: https://packt.link/q1c34.

Follow the following steps to achieve the aim of this activity:

  1. Open a new Jupyter Notebook and load pandas and the visualization libraries that you will need.
  2. Load the data into a pandas DataFrame named ads and look at the first few rows. Your DataFrame should look as follows:Figure 2.63: The first few rows of Advertising.csv

    Figure 2.63: The first few rows of Advertising.csv

  3. Understand the distribution of numerical variables in the dataset using the describe function.
  4. Plot the relationship between the variables in the dataset with the help of pair plots. You can use the hue parameter as Products. The hue parameter determines which column can be used for color encoding. Using Products as a hue parameter will show the different products in various shades of gray.

    You should get the below output:

    Figure 2.64: Expected output of Activity 2.01

Figure 2.64: Expected output of Activity 2.01

Note

The solution to this activity can be found via this link.

Summary

In this chapter, you have looked at exploratory data analysis. You learned how to leverage pandas to help you focus on the attributes that are critical to your business goals. Later, you learned how to use the tools that helped you fine-tuned these insights to make them more comprehensible to your stakeholders. Toward the end of the chapter, you used visualization libraries like seaborn and matplotlib to visually present your insights. Effective data visualization will also help reveal hidden insights in your data. The skills you've learned so far should help you create a strong foundation for your journey toward mastering marketing analytics.

In the next chapter, you will build upon these skills by learning some of the practical applications of exploratory data analysis through examples you may encounter in practice as a marketing analyst.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Use data analytics and machine learning in a sales and marketing context
  • Gain insights from data to make better business decisions
  • Build your experience and confidence with realistic hands-on practice

Description

Unleash the power of data to reach your marketing goals with this practical guide to data science for business. This book will help you get started on your journey to becoming a master of marketing analytics with Python. You'll work with relevant datasets and build your practical skills by tackling engaging exercises and activities that simulate real-world market analysis projects. You'll learn to think like a data scientist, build your problem-solving skills, and discover how to look at data in new ways to deliver business insights and make intelligent data-driven decisions. As well as learning how to clean, explore, and visualize data, you'll implement machine learning algorithms and build models to make predictions. As you work through the book, you'll use Python tools to analyze sales, visualize advertising data, predict revenue, address customer churn, and implement customer segmentation to understand behavior. By the end of this book, you'll have the knowledge, skills, and confidence to implement data science and machine learning techniques to better understand your marketing data and improve your decision-making.

Who is this book for?

This marketing book is for anyone who wants to learn how to use Python for cutting-edge marketing analytics. Whether you're a developer who wants to move into marketing, or a marketing analyst who wants to learn more sophisticated tools and techniques, this book will get you on the right path. Basic prior knowledge of Python and experience working with data will help you access this book more easily.

What you will learn

  • Load, clean, and explore sales and marketing data using pandas
  • Form and test hypotheses using real data sets and analytics tools
  • Visualize patterns in customer behavior using Matplotlib
  • Use advanced machine learning models like random forest and SVM
  • Use various unsupervised learning algorithms for customer segmentation
  • Use supervised learning techniques for sales prediction
  • Evaluate and compare different models to get the best outcomes
  • Optimize models with hyperparameter tuning and SMOTE

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Sep 07, 2021
Length: 636 pages
Edition : 2nd
Language : English
ISBN-13 : 9781800563889
Category :
Languages :
Concepts :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Sep 07, 2021
Length: 636 pages
Edition : 2nd
Language : English
ISBN-13 : 9781800563889
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just ₱260 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just ₱260 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 6,786.97
Machine Learning for Time-Series with Python
₱2806.99
Data Science Projects with Python
₱1989.99
Data Science for Marketing Analytics
₱1989.99
Total 6,786.97 Stars icon
Banner background image

Table of Contents

9 Chapters
1. Data Preparation and Cleaning Chevron down icon Chevron up icon
2. Data Exploration and Visualization Chevron down icon Chevron up icon
3. Unsupervised Learning and Customer Segmentation Chevron down icon Chevron up icon
4. Evaluating and Choosing the Best Segmentation Approach Chevron down icon Chevron up icon
5. Predicting Customer Revenue Using Linear Regression Chevron down icon Chevron up icon
6. More Tools and Techniques for Evaluating Regression Models Chevron down icon Chevron up icon
7. Supervised Learning: Predicting Customer Churn Chevron down icon Chevron up icon
8. Fine-Tuning Classification Algorithms Chevron down icon Chevron up icon
9. Multiclass Classification Algorithms Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3
(203 Ratings)
5 star 57.6%
4 star 24.6%
3 star 12.8%
2 star 3.4%
1 star 1.5%
Filter icon Filter
Top Reviews

Filter reviews by




Joao falcao Dec 15, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This is a great book for marketers who have started the journey into data science in marketing. It is really enjoyable and very pratical. i would recommend this book.
Amazon Verified review Amazon
Gaurav Khare Oct 06, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
topic is explained clearly. Mirza already has teaching experience in IIIT-B so has good experience in this field.
Amazon Verified review Amazon
Steph Oct 15, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book does an excellent job of introducing the common tasks of a data analyst in a marketing or e-commerce environment. It takes you from data preparation and cleaning, to exploration and visualisation, then builds on this to introduce common modelling techniques. It brings together technical programming skills in Python, understanding of the methodologies behind techniques introduced and, crucially, the business context for all of this. All examples and metrics are relevant to business and/or marketing and there are helpful tips throughout relating to how techniques are applied in the industry. It also addresses common pitfalls and problems which is really helpful.There is a lot of detail into how to run, evaluate and interpret models, such as k-means clustering for customer segmentation. There are plenty of visuals alongside code to help illustrate what to look out for. The book can also be used alongside a git repository of datasets, to aid learning.If you're new to data analysis or want to build on your skills in marketing or business analytics, I would recommend this book.
Amazon Verified review Amazon
Devanshu Oct 22, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
If you ask me personally, this book is for those who want to start a career in data science from scratch. This book covers data analysis from the very basics and takes you to the expert level. I would recommend the book to all. This book shows end-to-end financial analysis and also includes predicting, forecasting financial data by using Anaconda which is a distribution of Python.I can honestly say this book is quite beginner-friendly furthermore every detail is explained carefully. As an upcoming data scientist, I can definitely say that this is an awesome book to refer to... I would highly recommend everyone who wants to learn from the basics to the end PLEASE go with this amazing book.
Amazon Verified review Amazon
Eric Sep 13, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
All seven chapters of this book contain a TON of helpful information and useful code examples. Here’s a quick overview of each chapter:CHAPTER 1Starts with a brief introduction to the current state of marketing and data science, and then dives right into code. I liked that the code example shows a few of the parameters that can be used with pd.read_csv(). I think a lot of people just use the function but never realize that it has a lot of extra capabilities, like controlling which specific columns are imported, adding column names, etc.If you go through these exercises and try out each of the functions shown, you'll be well on your way to getting comfortable with data manipulation. In one of the examples, the data imports incorrectly by default, and you have to go back and fix it. That's very true to reality, so I really appreciated that being included.CHAPTER 2Starts off by introducing the concept of Key Performance Indicators (KPIs) and gives six examples of the most common. It's very common to be asked to derive any of those KPIs from a dataset, so this introduction is very relevant.This chapter had lots of great examples of using groupby(), which is very helpful for summarizing data. I wish I would have had this example when I was learning! Also, the whole section on reshaping data with stacking/unstacking and pivot tables is very interesting and potentially useful if you are working with data that is too big for Excel but would like to leverage the helpful features available in Excel.One thing that could be confusing to a new Python user is the authors' frequent use of method chaining. It's not complex, but it can be intimidating if you are new to it. Just don't be intimidated when you see df.groupby('Column Name').sum().plot() all at once instead of broken out into 2-3 separate code cells.CHAPTER 3It feels like you make quick progress. By chapter 3, you're starting exercises already using Numpy, pandas, Matplotlib, and Seaborn. I like that the author shows how to quickly segment customers with simple visualization as well as how to use unsupervised machine learning. Both approaches have practical benefit.I love principal components analysis (PCA), so I'm really glad this was included. Dimension reduction is fascinating, and I didn't expect it to get a nod in this book. I would have liked for there to be a bit more explanation about how it works because it's helpful, but I also recognize that there is only so much that can be included in a chapter.CHAPTER 4This chapter introduces some great clustering methods, including k-modes and k-prototypes that can be used with categorical data. I was really excited to see that! It also provided a good introduction to visual and sum of squared error methods as well as silhouette scores for evaluating clustering solutions. The final segmentation exercise in the chapter was also very interesting and relevant.CHAPTERS 5 & 6These chapters covered feature engineering using the agg() groupby method, residuals analysis, and even Recursive Feature Elimination. A lot of good analysis info here.One important thing to note: Some of the code syntax in the feature engineering exercise is not explained, so don't feel dumb if you're new to Python. Just plan to spend some extra time understanding what you're doing.CHAPTER 7This chapter is primarily about logistic regression. The explanation is definitely the most math-heavy of the book. I feel like the authors could have spent more time explaining the basic concept of the logit transformation, but instead they focused on the math. Take your time when reading through it, and be prepared to think hard if the concepts are new. You may need to supplement with some YouTube videos or blog posts.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.