Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Labeling in Machine Learning with Python

You're reading from   Data Labeling in Machine Learning with Python Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models

Arrow left icon
Product type Paperback
Published in Jan 2024
Publisher Packt
ISBN-13 9781804610541
Length 398 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Vijaya Kumar Suda Vijaya Kumar Suda
Author Profile Icon Vijaya Kumar Suda
Vijaya Kumar Suda
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Part 1: Labeling Tabular Data
2. Chapter 1: Exploring Data for Machine Learning FREE CHAPTER 3. Chapter 2: Labeling Data for Classification 4. Chapter 3: Labeling Data for Regression 5. Part 2: Labeling Image Data
6. Chapter 4: Exploring Image Data 7. Chapter 5: Labeling Image Data Using Rules 8. Chapter 6: Labeling Image Data Using Data Augmentation 9. Part 3: Labeling Text, Audio, and Video Data
10. Chapter 7: Labeling Text Data 11. Chapter 8: Exploring Video Data 12. Chapter 9: Labeling Video Data 13. Chapter 10: Exploring Audio Data 14. Chapter 11: Labeling Audio Data 15. Chapter 12: Hands-On Exploring Data Labeling Tools 16. Index 17. Other Books You May Enjoy

Summary statistics and data aggregates

In this section, we will derive the summary statistics for numerical columns.

Before generating summary statistics, we will identify the categorical columns and numerical columns in the dataset. Then, we will calculate the summary statistics for all numerical columns.

We will also calculate the mean value of each numerical column for the target class. Summary statistics are useful to gain insights about each feature’s mean values and their effect on the target label class.

Let’s print the categorical columns using the following code snippet:

#categorical column
catogrical_column = [column for column in df.columns if df[column].
dtypes=='object']
print(catogrical_column)

We will get the following result:

Figure 1.8 – Categorical columns

Figure 1.8 – Categorical columns

Now, let’s print the numerical columns using the following code snippet:

#numerical_column
numerical_column = [column for column in df.columns if df[column].dtypes !='object']
print(numerical_column)

We will get the following output:

Figure 1.9 – Numerical columns

Figure 1.9 – Numerical columns

Summary statistics

Now, let’s generate summary statistics (i.e., mean, standard deviation, minimum value, maximum value, and lower (25%), middle (50%), and higher (75%) percentiles) using the following code snippet:

df.describe().T

We will get the following results:

Figure 1.10 – Summary statistics

Figure 1.10 – Summary statistics

As shown in the results, the mean value of age is 38.5 years, the minimum age is 17 years, and the maximum age is 90 years in the dataset. As we have only five numerical columns in the dataset, we can only see five rows in this summary statistics table.

Data aggregates of the feature for each target class

Now, let’s calculate the average age of the people for each income group range using the following code snippet:

df.groupby("income")["age"].mean()

We will see the following output:

Figure 1.11 – Average age by income group

Figure 1.11 – Average age by income group

As shown in the results, we have used the groupby clause on the target variable and calculated the mean of the age in each group. The mean age is 36.78 for people with an income group of less than or equal to $50K. Similarly, the mean age is 44.2 for the income group greater than $50K.

Now, let’s calculate the average hours per week of the people for each income group range using the following code snippet:

df.groupby("income")["hours.per.week"]. mean()

We will get the following output:

Figure 1.12 – Average hours per week by income group

Figure 1.12 – Average hours per week by income group

As shown in the results, the average hours per week for the income group =< $50K is 38.8 hours. Similarly, the average hours per week for the income group > $50K is 45.47 hours.

Alternatively, we can write a generic reusable function for calculating the mean of any numerical column group by the categorical column as follows:

def get_groupby_stats(categorical, numerical):
    groupby_df = df[[categorical, numerical]].groupby(categorical). 
        mean().dropna()
    print(groupby_df.head)

If we want to get aggregations of multiple columns for each target income group, then we can calculate aggregations as follows:

columns_to_show = ["age", "hours.per.week"]
df.groupby(["income"])[columns_to_show].agg(['mean', 'std', 'max', 'min'])

We get the following results:

Figure 1.13 – Aggregations for multiple columns

Figure 1.13 – Aggregations for multiple columns

As shown in the results, we have calculated the summary statistics for age and hours per week for each income group.

We learned how to calculate the aggregate values of features for the target group using reusable functions. This aggregate value gives us a correlation of those features for the target label value.

You have been reading a chapter from
Data Labeling in Machine Learning with Python
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781804610541
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image