Chapter 1: Data Preparation and Cleaning
Activity 1: Addressing Data Spilling
Import pandas and copy into the console, as follows:
import pandas as pd import copy
Use the read_excel function to read the xlsx file and the head function to look at the first few rows:
sales = pd.read_excel("sales.xlsx") sales.head()
Look at the data types of sales and see if they make more sense:
sales.dtypes
You should get the following output:
We can iterate through the DataFrame rows to understand how the data is in the first row and how it should be:
forlabel,content in sales.iteritems(): print label, content[1]
This gives the following output:
From the preceding output, you can infer that the data should actually contain column names starting with a capital letter. Add the following code to get an overview of what the data should actually look like:
d = {"Year": 2004, "Product line": "Camping Equipment", "Product type":"Cooking Gear", "Product":"TrailChef Water Bag", "Order method type":"Telephone", "Retailer Country":"Canada", "Revenue":13444.68, "Planned revenue":14313.48, "Product cost":6298.8, "Quantity":2172, "Unit cost":2.9, "Unit price":6.59, "Gross Profit":7145.88, "Unit sale price":6.19}
You should get the following output:
The Year column seems to be in its right place, so we can start by creating the Product line column. Let's see how many values it has leaked into:
sales.groupby(['Product','line'])['Year'].count()
Your output will be as follows:
We are not sure whether the spillage is only restricted to these two columns, so let's look at the next column too, to be sure:
sales.groupby(['Product','line', 'Product.1'])['Year'].count()
Your output will be as follows:
Let's resolve the Product line column and see if we were able to do anything:
sales['Product line'] = sales.apply(lambda x: x['Product'] +''+ x['line'], axis = 1) sales = sales.drop(['Product', 'line'], axis = 1) sales.head()
The DataFrame should now look as follows:
Instead of directly changing the sales DataFrame, let's create a copy of it, called tmp, and make changes to it. Once we have finalized the filtering procedure, we don't want to lose the original DataFrame. Let's continue looking at the next column of interest, that is, Product type:
tmp = copy.deepcopy(sales) tmp.groupby(['Product.1','type', 'Product.2'])['Year'].count()
This gives the following output:
We can see that some fields are not the fields we are interested in. Let's limit our view to only those that we are interested in.
As we are only interested in the Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags product types, so let's filter them out. Using a similar logic as before, we store these columns in a new DataFrame, tmp1:
tmp1 = copy.deepcopy(tmp[tmp['Product.1'].isin(['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])]) tmp1.head()
This gives the following output:
Then perform the following groupby operations as well:
tmp1.groupby(['Product.1')['Year'].count() tmp1.groupby(['Product.1', 'type', 'Product.2'])['Year'].count()
This gives the following output:
We can see that—because of our choice of fields—luckily, we only have to append two columns to get the attribute we need.
Now that we know tmp1 requires only two words for Product type, we are done:
tmp1['Product type'] = tmp1['Product.1'] + ''+ tmp1['type'] tmp1 = tmp1.drop(['Product.1', 'type'], axis = 1) tmp1.head()
This gives the following output:
The next column we have to worry about is Product. Let's see how many columns we need for that:
tmp1.groupby(['Product.2', 'Order'])['Year'].count()
This gives the following output:
This column has some values that are one word and some that are more than that.
Let's create another variable, tmp2, for values containing more than one word:
tmp2 = copy.deepcopy(tmp1[~tmp1['Order'].isin(['E-mail', 'Fax', 'Mail', 'Sales', 'Special', 'Telephone', 'Web'])]) tmp2.head()
This gives the following output:
We look at the variation in tmp2 and see that, while most of the columns have two words, some columns have more than two words:
tmp2.groupby(['Product.2','Order', 'method'])['Year'].count()
This gives the following output:
We keep on performing this procedure of repeatedly storing values that have more leakage into another variable, until we have exhausted all the columns and got a structured dataset by the last step:
tmp8 = copy.deepcopy(tmp7[tmp7['Product.3'].isin(['Kingdom', 'States'])]) tmp8.head()
This gives the following output:
After structuring the fields with the longest names for Products, Order method type, and Retailer country, we can see that there is no spillage in the columns containing numerical values. Let's handle the remaining cases directly.
We finish structuring the last part of the data directly and store the final structured data separately:
tmp8['Retail country'] = tmp8['revenue'] + '' + tmp8['Product.3'] tmp8 = tmp8.drop(['revenue', 'Product.3'], axis = 1) tmp8["Revenue"] = tmp8['cost'] tmp8 = tmp8.drop(['cost'], axis = 1) tmp8["Planned revenue"] = tmp8['Quantity'] tmp8 = tmp8.drop(['Quantity'], axis = 1) tmp8["Product cost"] = tmp8['Unit'] tmp8 = tmp8.drop(['Unit'], axis = 1) tmp8["Quantity"] = tmp8['cost.1'] tmp8 = tmp8.drop(['cost.1'], axis = 1) tmp8["Unit cost"] = tmp8['Unit.1'] tmp8 = tmp8.drop(['Unit.1'], axis = 1) tmp8["Unit price"] = tmp8['price'] tmp8 = tmp8.drop(['price'], axis = 1) tmp8["Gross profit"] = tmp8['Gross'] tmp8 = tmp8.drop(['Gross'], axis = 1) tmp8["Unit sale price"] = tmp8['profit'] tmp8 = tmp8.drop(['profit', 'Unit.2', 'sale', 'price.1'], axis = 1) tmp8.head()
This gives the following output:
We store the structured dataset separately as str1:
str1 = tmp8
Once we structure the last layer of the data completely like this, it is easier for us to structure the layer just before it, as we can use the knowledge of the previous layer and reduce structuring the current layer to a problem we have already solved before:
temp = copy.deepcopy(tmp7[~tmp7.index.isin(tmp8.index.values)]) temp.head()
You should get the following output:
Now, we keep on going backward and structure the entire dataset with the help of the correctly structured preceding layers:
temp1['Retailer country'] = temp1['method'] temp1 = temp1.drop(['method'], axis = 1) \temp1["Planned revenue"] = temp1['Retailer'] temp1 = temp1.drop(['Retailer'], axis = 1) temp1["Product cost"] = temp1['country'] temp1 = temp1.drop(['country'], axis = 1) temp1["Unit cost"] = temp1['Planned'] temp1 = temp1.drop(['Planned'], axis = 1) temp1["Unit price"] = temp1['revenue'] temp1 = temp1.drop(['revenue'], axis = 1) temp1["Gross profit"] = temp1['Product.3'] temp1 = temp1.drop(['Product.3'], axis = 1) temp1["Unit sale price"] = temp1['cost'] temp1 = temp1.drop(['cost'], axis = 1) temp1["Quantity"] = temp1['Revenue'] temp1 = temp1.drop(['Revenue'], axis = 1) temp1['Revenue'] = temp1['type.1'] temp1 = temp1.drop('type.1', axis = 1) temp1 = temp1[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']] temp1.head()
This gives the following output:
Finally, we just make sure to combine our data and we are done:
df = pd.concat([str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, str15, str16, str17, str18, str19, str20, str21, str22], sort = True)[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']] df.groupby('Product type').count()
This gives the following output: