Data Labeling in Machine Learning with Python

Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models

Author: Vijaya Kumar Suda
Product type: Paperback
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781804610541
Length: 398 pages
Edition: 1st Edition
Table of Contents

Preface

Part 1: Labeling Tabular Data
  • Chapter 1: Exploring Data for Machine Learning
  • Chapter 2: Labeling Data for Classification
  • Chapter 3: Labeling Data for Regression

Part 2: Labeling Image Data
  • Chapter 4: Exploring Image Data
  • Chapter 5: Labeling Image Data Using Rules
  • Chapter 6: Labeling Image Data Using Data Augmentation

Part 3: Labeling Text, Audio, and Video Data
  • Chapter 7: Labeling Text Data
  • Chapter 8: Exploring Video Data
  • Chapter 9: Labeling Video Data
  • Chapter 10: Exploring Audio Data
  • Chapter 11: Labeling Audio Data
  • Chapter 12: Hands-On Exploring Data Labeling Tools

Index
Other Books You May Enjoy

Exploring Data for Machine Learning

Imagine setting out across an expansive ocean of data, full of untold stories, patterns, and insights waiting to be discovered. Welcome to the world of data exploration in machine learning (ML). In this chapter, I encourage you to put on your analytical lenses as we dig into the heart of your data, armed with practical techniques and heuristics. Beneath the surface of raw numbers and statistics lies a treasure trove of patterns that, once revealed, can turn your data into a valuable asset. The journey begins with exploratory data analysis (EDA), a crucial phase in which we come to understand the data, laying the foundation for automated labeling and, ultimately, for building smarter and more accurate ML models.

In this age of generative AI, quality training data is also essential for fine-tuning domain-specific large language models (LLMs). Fine-tuning involves curating additional domain-specific labeled data and using it to further train publicly available LLMs. So, fasten your seatbelts for a voyage into the art and science of data exploration for data labeling.

First, let’s start with the question: What is data exploration? It is the initial phase of data analysis, where raw data is examined, visualized, and summarized to uncover patterns, trends, and insights. It serves as a crucial step in understanding the nature of the data before applying advanced analytics or ML techniques.

In this chapter, we will explore tabular data using Python libraries including Pandas, NumPy, and Seaborn. We will plot bar charts and histograms to visualize the data and find relationships between features, which is useful for labeling. We will work with the Income dataset, available in this book’s GitHub repository (linked in the Technical requirements section). A good understanding of the data is necessary to define business rules, identify matching patterns, and, subsequently, label the data using Python labeling functions.
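To make the idea of a Python labeling function concrete, here is a minimal sketch of a business rule expressed as code. Everything in it is an assumption for illustration only: the field names `education_years` and `hours_per_week`, the thresholds, and the label constants are hypothetical and are not taken from the Income dataset or from the book's own functions.

```python
# Hypothetical label values; the encoding used later in the book may differ.
HIGH_INCOME = 1
LOW_INCOME = 0
ABSTAIN = -1  # the rule has no opinion about this row


def label_by_education_and_hours(row: dict) -> int:
    """Illustrative business rule with invented thresholds."""
    if row["education_years"] >= 16 and row["hours_per_week"] >= 40:
        return HIGH_INCOME
    if row["education_years"] <= 9 and row["hours_per_week"] < 30:
        return LOW_INCOME
    return ABSTAIN


# A few made-up rows to exercise each branch of the rule.
rows = [
    {"education_years": 16, "hours_per_week": 45},
    {"education_years": 8, "hours_per_week": 20},
    {"education_years": 12, "hours_per_week": 40},
]
labels = [label_by_education_and_hours(r) for r in rows]
print(labels)  # [1, 0, -1]
```

In practice, thresholds like these would come out of the exploration itself: the patterns we find in the data suggest which rules are worth writing.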

By the end of this chapter, we will be able to generate summary statistics for the given dataset. We will derive aggregates of the features for each target group. We will also learn how to perform univariate and bivariate analyses of the features in the given dataset. We will create a report using the ydata-profiling library.
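As a small preview of the first two goals, the following sketch generates summary statistics and per-target-group aggregates with Pandas. The DataFrame is a tiny synthetic stand-in, not the actual Income dataset: the column names `age`, `hours_per_week`, and `income`, and all of the values, are assumed purely for illustration.

```python
import pandas as pd

# Synthetic stand-in for the Income dataset; columns and values are invented.
df = pd.DataFrame(
    {
        "age": [25, 38, 52, 45, 29, 60],
        "hours_per_week": [40, 50, 45, 55, 35, 21],
        "income": ["<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"],
    }
)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric features.
summary = df.describe()

# Univariate view of the target: how many rows fall into each class.
target_counts = df["income"].value_counts()

# Aggregates of the features for each target group.
group_means = df.groupby("income")[["age", "hours_per_week"]].mean()

print(summary)
print(target_counts)
print(group_means)
```

Comparing `group_means` across the two target groups is the kind of observation that can later be turned into a labeling rule.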

We’re going to cover the following main topics:

  • EDA and data labeling
  • Summary statistics and data aggregates with Pandas
  • Data visualization with Seaborn for univariate and bivariate analysis
  • Profiling data using the ydata-profiling library
  • Unlocking insights from data with OpenAI and LangChain