It is a well-established fact that your machine-learning model is only as good as the data it is fed. An ML model trained on bad-quality data usually suffers from a number of issues. Here are a few ways that bad data might affect machine-learning models -
1. Errors, missing values, or other irregularities in low-quality data can lead to wrong predictions. If the data used to train the model is unreliable, its predictions are likely to be inaccurate.
2. Bad data can also bias the model. If the data is not representative of real-world situations, the ML model can learn and reinforce these biases, which can result in discriminatory predictions.
3. Poor data also hampers the ability of the ML model to generalize to fresh data, as it may not effectively depict the underlying patterns and relationships in the data.
4. Models trained on bad-quality data might need more retraining and maintenance. The overall cost and complexity of model deployment could rise as a result.
As a result, it is critical to devote time and effort to data preprocessing and cleaning in order to decrease the impact of bad data on ML models. Furthermore, to ensure the model's dependability and performance, it is often necessary to use domain knowledge to recognize and address data quality issues.
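To make the preprocessing point concrete, here is a minimal pandas sketch of the kind of routine cleaning meant here; the DataFrame and its columns are purely hypothetical:
import pandas as pd
# Hypothetical raw data with the usual quality problems: a missing value,
# an exact duplicate, and stray whitespace
df = pd.DataFrame({
    'text': ['Win a prize now!', None, 'Win a prize now!', '  meeting at 5pm '],
    'label': ['spam', 'ham', 'spam', 'ham'],
})
df = df.dropna(subset=['text'])                         # drop rows with missing text
df['text'] = df['text'].str.strip()                     # normalize stray whitespace
df = df.drop_duplicates(subset=['text'], keep='first')  # remove exact duplicates
print(df)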
It might come as a surprise, but gold-standard datasets like ImageNet, CIFAR, MNIST, 20News, and more also contain labeling issues. I have put in some examples below for reference -
The above snippet is from the Amazon sentiment review dataset, where the original label was Neutral in both cases, whereas Cleanlab and Mechanical Turk labeled it as Positive (which is correct).
The above snippet is from the MNIST dataset, where the original labels were 8 and 0 respectively, which is incorrect. Instead, both Cleanlab and Mechanical Turk labeled them as 9 and 6 (which is correct).
Feel free to check out labelerrors to explore more such cases in similar datasets.
This is where Cleanlab comes in handy as your best bet. It automatically identifies problems in your ML dataset and assists you in cleaning both data and labels. This data-centric AI software uses your existing models to estimate dataset problems that can be fixed to train even better models. The graphic below depicts the typical data-centric AI model development cycle:
Apart from the standard way of coding all the way through finding data issues, it also offers Cleanlab Studio - a no-code platform for fixing all your data errors. For the purpose of this blog, we will go the former way on our sample use case.
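Before we get to our dataset, here is a minimal sketch of the core idea on toy data: you take out-of-sample predicted probabilities from whatever model you already have and let cleanlab flag the examples whose given labels look suspicious. The toy dataset and model below are stand-ins for illustration, not part of the use case:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues
# Toy data standing in for "your dataset with possibly noisy labels"
X, noisy_labels = make_classification(n_samples=500, n_classes=2, random_state=0)
# Out-of-sample predicted probabilities from any scikit-learn compatible model
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, noisy_labels,
                               cv=5, method='predict_proba')
# Indices of likely mislabeled examples, ranked from most to least suspicious
issue_indices = find_label_issues(labels=noisy_labels, pred_probs=pred_probs,
                                  return_indices_ranked_by='self_confidence')
print(issue_indices[:10])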
Installing cleanlab is as easy as doing a pip install. I recommend installing the optional dependencies as well; you never know what you will need and when. I also installed sentence-transformers, as I will be using them for vectorizing the text. Sentence transformers come with a bag of many amazing models; we pick 'all-mpnet-base-v2' as our choice of sentence-transformer for vectorizing text sequences. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Feel free to check out this for the list of all models and their comparisons.
pip install 'cleanlab[all]'
pip install sentence-transformers
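As a quick sanity check of the encoder, a short snippet like the one below shows the 768-dimensional output mentioned above (the two example sentences are made up):
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = st_model.encode(['Free entry in a weekly competition!', 'See you at lunch?'])
print(embeddings.shape)  # (2, 768) - each sentence becomes a 768-dimensional vector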
We picked the SMS Spam Detection dataset for our experimentation. It is a public set of labeled SMS messages collected for mobile phone spam research, with roughly 5.5k instances in total. The graphic below gives a sneak peek of some samples from the dataset.
Data Preview
Let's now delve into the code. For demonstration purposes, we inject 5% noise into the dataset and see whether we are able to detect it and eventually train a better model.
Note: I have also annotated every segment of the code wherever necessary for better understanding.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from cleanlab.classification import CleanLearning
from sklearn.metrics import f1_score
# Reading and renaming data. We set sep='\t' because the data is tab-separated,
# and header=None because the file has no header row.
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.rename({0:'label', 1:'text'}, inplace=True, axis=1)
# Dropping any instance of duplicates that could exist
data.drop_duplicates(subset=['text'], keep=False, inplace=True)
# Original data distribution for spam and not spam (ham) categories
print (data['label'].value_counts(normalize=True))
ham 0.865937
spam 0.134063
# Adding noise. Switching 5% of ham data to 'spam' label
tmp_df = data[data['label']=='ham']
examples_to_change = int(tmp_df.shape[0]*0.05)
print (f'Changing examples: {examples_to_change}')
examples_text_to_change = tmp_df.head(examples_to_change)['text'].tolist()
changed_df = pd.DataFrame([[i, 'spam'] for i in examples_text_to_change])
changed_df.rename({0:'text', 1:'label'}, axis=1, inplace=True)
left_data = data[~data['text'].isin(examples_text_to_change)]
final_df = pd.concat([left_data, changed_df])
final_df.reset_index(drop=True, inplace=True)
Changing examples: 216
# Modified data distribution for spam and not spam (ham) categories
print (final_df['label'].value_counts(normalize=True))
ham 0.840016
spam 0.159984
raw_texts, raw_labels = final_df["text"].values, final_df["label"].values
# Splitting into train and test sets (an 80/20 split is assumed here)
raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(
    raw_texts, raw_labels, test_size=0.2, random_state=42)
# Converting labels into integers
encoder = LabelEncoder()
encoder.fit(raw_train_labels)
train_labels = encoder.transform(raw_train_labels)
test_labels = encoder.transform(raw_test_labels)
# Vectorizing text sequences using sentence-transformers
transformer = SentenceTransformer('all-mpnet-base-v2')
train_texts = transformer.encode(raw_train_texts)
test_texts = transformer.encode(raw_test_texts)
# Instantiating the model
model = LogisticRegression(max_iter=200)
# Wrapping the scikit-learn model with CleanLearning
cl = CleanLearning(model)
# Finding label issues in the train set
label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)
# Keeping the flagged issues and picking the 50 samples with the lowest label-quality scores
identified_issues = label_issues[label_issues["is_label_issue"] == True]
lowest_quality_labels = label_issues["label_quality"].argsort()[:50].to_numpy()
# Pretty-print the label issues detected by Cleanlab
def print_as_df(index):
    return pd.DataFrame(
        {
            "text": raw_train_texts,
            "given_label": raw_train_labels,
            "predicted_label": encoder.inverse_transform(label_issues["predicted_label"]),
        },
    ).iloc[index]
print_as_df(lowest_quality_labels[:5])
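To close the loop, here is a minimal sketch of retraining and comparing against a model with the same parameters, reusing the variables defined above and the already-imported f1_score; treating a plain LogisticRegression as the baseline and passing the previously found label_issues back into CleanLearning are assumptions made for illustration:
# Baseline: the same LogisticRegression trained directly on the noisy labels
baseline = LogisticRegression(max_iter=200)
baseline.fit(train_texts, train_labels)
baseline_f1 = f1_score(test_labels, baseline.predict(test_texts))
# CleanLearning: refits the wrapped model after pruning the detected label issues;
# passing label_issues reuses the issues found above instead of recomputing them
cl.fit(X=train_texts, labels=train_labels, label_issues=label_issues)
cleanlab_f1 = f1_score(test_labels, cl.predict(test_texts))
print(f'Baseline F1:      {baseline_f1:.3f}')
print(f'CleanLearning F1: {cleanlab_f1:.3f}')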
As we can see, Cleanlab assisted us in automatically removing the incorrect labels and training a better model with the same parameters and settings. In my experience, people frequently ignore data concerns in favor of building more sophisticated models to increase accuracy numbers. Improving data, on the other hand, is a pretty simple performance win. And, thanks to products like Cleanlab, it's become really simple and convenient.
Feel free to access and play around with the above code in the Colab notebook here
In conclusion, Cleanlab offers a straightforward solution to enhance data quality by addressing label inconsistencies, a crucial step in building more reliable and accurate machine learning models. By focusing on data integrity, Cleanlab simplifies the path to better performance and underscores the significance of clean data in the ever-evolving landscape of AI. Elevate your model's accuracy by investing in data quality, and explore the provided code to see the impact for yourself.
Prakhar has a Master's in Data Science with over 4 years of industry experience across various sectors like Retail, Healthcare, Consumer Analytics, etc. His research interests include Natural Language Understanding and Generation, and he has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn.