Summary
In this chapter, we learned how to use Pandas and matplotlib to analyze a dataset and to understand both the data and the correlations between its features. This understanding of the data and its patterns is required to build the rules for labeling raw data before it is used to train ML models and fine-tune LLMs.
We also went through various examples of aggregating columns and categorical values using groupby and mean. Then, we created reusable functions that return the aggregates of one or more columns when called with the column names.
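As a reminder of the pattern, the following is a minimal sketch of such a reusable aggregation function. The function name mean_by_group, the column names, and the toy data are hypothetical and chosen only for illustration; they are not the dataset or the exact functions used in the chapter.

```python
import pandas as pd


def mean_by_group(df, group_col, value_cols):
    """Return the mean of one or more columns for each category in group_col."""
    # Accept a single column name as well as a list of names
    if isinstance(value_cols, str):
        value_cols = [value_cols]
    return df.groupby(group_col)[value_cols].mean()


# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "education": ["Bachelors", "Masters", "Bachelors", "Masters"],
    "age": [30, 40, 34, 44],
    "hours_per_week": [40, 50, 44, 46],
})

# One row per education level, one column per aggregated feature
summary = mean_by_group(df, "education", ["age", "hours_per_week"])
print(summary)
```

Passing the column names as arguments keeps the aggregation logic in one place, so the same function can be called for any categorical column and any set of numeric columns.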
Finally, we saw a fast and easy way to explore data using the ydata-profiling library with a single line of Python code. With this library, we do not need to remember many Pandas functions. One call produces a detailed report with statistics for each variable, including missing values, correlations, interactions, and duplicate rows.
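The profiling step can be sketched roughly as follows, assuming the ydata-profiling package is installed. The wrapper name build_profile_report, the default output path, and the report title are hypothetical; ProfileReport and its to_file method come from the library itself.

```python
import pandas as pd


def build_profile_report(df, output_path="eda_report.html", title="EDA report"):
    """Write an HTML profiling report for df and return the output path."""
    # Imported here so that the library is only required when a report is built
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    report.to_file(output_path)
    return output_path
```

Calling build_profile_report(df) on a loaded DataFrame writes the HTML report to the given path. On large datasets the report can take a while to generate, so profiling a sample of the rows first may be a reasonable starting point.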
Once we get a good sense of our data using EDA, we will be able to build the rules for creating labels for the dataset.
In the next chapter, we will see how to build these rules using Python libraries such as snorkel and compose to label an unlabeled dataset. We will also explore other methods, such as pseudo-labeling and K-means clustering, for data labeling.