Text classification tasks such as natural language inference (NLI) are a central part of modern natural language processing (NLP). In this article, we present an application of unsupervised machine learning techniques to detect anomalies in the MultiNLI dataset.
Our aim is to use unsupervised Large Language Models (LLMs) to create embeddings and discover patterns and relationships within the data. We'll preprocess the data, generate sentence pair embeddings, and use the Out-Of-Distribution (OOD) module from the cleanlab
Python package to get outlier scores.
The following block of code is essentially the initial setup phase of our data processing and analysis script. Here, we import all the necessary libraries and packages that will be used throughout the code. First, we need to install some of the necessary libraries:
!pip install cleanlab datasets hdbscan nltk matplotlib numpy torch transformers umap-learn
It is highly recommended to use Google Colab with GPUs or TPUs so that the embeddings can be created in a reasonable amount of time.
Now we can import the required libraries and set up the environment:
import cleanlab
import datasets
import hdbscan
import nltk
import matplotlib.pyplot as plt
import numpy as np
import re
import torch
from cleanlab.outlier import OutOfDistribution
from datasets import load_dataset, concatenate_datasets
from IPython.display import display
from sklearn.metrics import precision_recall_curve
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel
from umap import UMAP
SEED = 42  # fixed random seed (any value works) used for dataset shuffling and CUDA seeding below
nltk.download('stopwords')
datasets.logging.set_verbosity_error()
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
Here's what each imported library/package does:

- cleanlab: A package used for finding label errors in datasets and learning with noisy labels.
- datasets: Provides easy-to-use, high-level APIs for downloading and preparing datasets for modeling.
- hdbscan: A clustering algorithm that combines the benefits of hierarchical clustering and density-based spatial clustering of applications with noise (DBSCAN).
- nltk: Short for Natural Language Toolkit, a leading platform for building Python programs to work with human language data.
- torch: PyTorch is an open-source machine learning library based on the Torch library, used for applications such as natural language processing.

This part of the code also downloads the NLTK (Natural Language Toolkit) stopwords. Stopwords are words like 'a', 'an', and 'the', which are not typically useful for modeling and are often removed during pre-processing. The datasets.logging.set_verbosity_error() call sets the logging level to error, which means that only messages with the level error or above will be displayed.
The code also sets some additional properties for CUDA operations (if a CUDA-compatible GPU is available), which can help ensure consistency across different executions of the code.
The following block of code represents the next major phase: preprocessing and loading the datasets. This is where we clean and prepare our data so that it can be fed into our LLM models:
def preprocess_datasets(
    *datasets,
    sample_sizes = [5000, 450, 450],
    columns_to_remove = ['premise_binary_parse', 'premise_parse', 'hypothesis_binary_parse', 'hypothesis_parse', 'promptID', 'pairID', 'label'],
):
    # Remove -1 labels (no gold label)
    f = lambda ex: ex["label"] != -1
    datasets = [dataset.filter(f) for dataset in datasets]
    # Sample a subset of the data
    assert len(sample_sizes) == len(datasets), "Number of datasets and sample sizes must match"
    datasets = [
        dataset.shuffle(seed=SEED).select([idx for idx in range(sample_size)])
        for dataset, sample_size in zip(datasets, sample_sizes)
    ]
    # Remove columns
    datasets = [data.remove_columns(columns_to_remove) for data in datasets]
    return datasets
This is the definition of preprocess_datasets, which takes any number of datasets (with their sample sizes and columns to remove specified as lists). The function does three main things: it filters out examples without a gold label (label -1), shuffles each dataset and samples a fixed-size subset, and removes the columns we don't need for creating embeddings.
train_data = load_dataset("multi_nli", split="train")
val_matched_data = load_dataset("multi_nli", split="validation_matched")
val_mismatched_data = load_dataset("multi_nli", split="validation_mismatched")
train_data, val_matched_data, val_mismatched_data = preprocess_datasets(
train_data, val_matched_data, val_mismatched_data
)
The above lines load the train and validation datasets from multi_nli
(a multi-genre natural language inference corpus) and then preprocess them using the function we just defined.
Finally, we print the genres available in each dataset and display the first few records as a pandas DataFrame. This is useful to confirm that our datasets have been loaded and preprocessed correctly:
print("Training data")
print(f"Genres: {np.unique(train_data['genre'])}")
display(train_data.to_pandas().head())
print("Validation matched data")
print(f"Genres: {np.unique(val_matched_data['genre'])}")
display(val_matched_data.to_pandas().head())
print("Validation mismatched data")
print(f"Genres: {np.unique(val_mismatched_data['genre'])}")
display(val_mismatched_data.to_pandas().head())
With the help of this block, we have our datasets loaded and preprocessed, ready to be transformed into vector embeddings.
Now, we proceed to the next crucial step, transforming our textual data into numerical vectors. This is where text or sentence embeddings come into play.
In simple terms, sentence embeddings are the numerical representations of sentences. Just as words can be represented by dense vectors (a process known as word embeddings), entire sentences can also be encoded into vectors. This transformation process facilitates mathematical operations on text, making it possible for machine learning algorithms to perform tasks like text classification, sentence similarity, sentiment analysis, and more.
To produce high-quality sentence embeddings, the context of each word in the sentence and the semantics should be considered. Transformer-based models, like BERT, DistilBERT, or RoBERTa, are very effective in creating these contextual sentence embeddings.
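For illustration, here is a minimal sketch of what such an embedding looks like in practice. It assumes the sentence-transformers package (not included in the install command above) and the same all-MiniLM-L6-v2 model used later in this article; the example sentences, variable names, and the similarity computation are purely illustrative:

# Minimal illustration (assumes: pip install sentence-transformers)
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = st_model.encode(["A man is playing a guitar.", "Someone is making music."])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence
# Cosine similarity between the two sentence vectors
cos_sim = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(round(float(cos_sim), 2))  # semantically related sentences score well above unrelated ones

In the rest of the article, we build equivalent embeddings manually with the transformers library, so that we can control the pooling and combine premise and hypothesis vectors ourselves.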
Now, let's explain the next block of code:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
This function mean_pooling
is used to calculate the mean of all token embeddings that belong to a single sentence. The function receives the model_output
(containing the token embeddings) and an attention_mask
(indicating where actual tokens are and where padding tokens are in the sentence). The mask is used to correctly compute the average over the length of each sentence, ignoring the padding tokens.
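To see how the mask removes padding from the average, here is a small toy example (hypothetical numbers, not real model output) run through the function above:

# Toy example: one sentence, four token positions, 3-dimensional embeddings,
# where the last position is padding (attention mask = 0)
token_embeddings = torch.tensor([[[1.0, 2.0, 3.0],
                                  [3.0, 4.0, 5.0],
                                  [5.0, 6.0, 7.0],
                                  [9.0, 9.0, 9.0]]])  # padding row, should be ignored
attention_mask = torch.tensor([[1, 1, 1, 0]])
# model_output[0] is expected to hold the token embeddings, so we pass a one-element tuple
pooled = mean_pooling((token_embeddings,), attention_mask)
print(pooled)  # tensor([[3., 4., 5.]]): the mean of the three real tokens only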
The function embed_sentence_pairs
processes the sentence pairs, creates their embeddings, and stores them. It uses a data loader (which loads data in batches), a tokenizer (to convert sentences into model-understandable format), and a pre-trained language model (to create the embeddings).
It is a vital part of the sentence embedding process, using a language model to convert pairs of sentences into high-dimensional vectors that represent their combined semantics. Here's an annotated walkthrough:
def embed_sentence_pairs(dataloader, tokenizer, model, disable_tqdm=False):
    # Empty lists are created to store the embeddings of premises and hypotheses
    premise_embeddings = []
    hypothesis_embeddings = []
    feature_embeddings = []
    # The device (CPU or GPU) to be used for computations is determined
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    # The model is moved to the chosen device and set to evaluation mode
    model.to(device)
    model.eval()
    # A loop is set up to iterate over the data in the dataloader
    loop = tqdm(dataloader, desc="Embedding sentences...", disable=disable_tqdm)
    for data in loop:
        # The premise and hypothesis sentences are extracted from the data
        premise, hypothesis = data['premise'], data['hypothesis']
        # The premise and hypothesis sentences are encoded into a format that the model can understand
        encoded_premise, encoded_hypothesis = (
            tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
            for sentences in (premise, hypothesis)
        )
        # The model computes token embeddings for the encoded sentences
        with torch.no_grad():
            encoded_premise = encoded_premise.to(device)
            encoded_hypothesis = encoded_hypothesis.to(device)
            model_premise_output = model(**encoded_premise)
            model_hypothesis_output = model(**encoded_hypothesis)
        # Mean pooling is performed on the token embeddings to create sentence embeddings
        pooled_premise = mean_pooling(model_premise_output, encoded_premise['attention_mask']).cpu().numpy()
        pooled_hypothesis = mean_pooling(model_hypothesis_output, encoded_hypothesis['attention_mask']).cpu().numpy()
        # The sentence embeddings are added to the corresponding lists
        premise_embeddings.extend(pooled_premise)
        hypothesis_embeddings.extend(pooled_hypothesis)
    # The embeddings of the premises and hypotheses are concatenated along with their absolute difference
    feature_embeddings = np.concatenate(
        [
            np.array(premise_embeddings),
            np.array(hypothesis_embeddings),
            np.abs(np.array(premise_embeddings) - np.array(hypothesis_embeddings)),
        ],
        axis=1,
    )
    return feature_embeddings
This function does all the heavy lifting of turning raw textual data into dense vectors that machine learning algorithms can use. It takes in a dataloader
, which feeds batches of sentence pairs into the function, a tokenizer
to prepare the input for the language model, and the model itself to create the embeddings.
The embedding process involves first tokenizing each sentence pair and then feeding the tokenized sentences into the language model. This yields a sequence of token embeddings for each sentence. To reduce each sequence to a single vector per sentence, we apply a mean pooling operation, which averages the token vectors in a sentence while using the attention mask to exclude padding positions.
Finally, the function concatenates the embeddings of the premise and hypothesis of each pair, along with the absolute difference between these two embeddings. This results in a single vector that represents both the individual meanings of the sentences and the semantic relationship between them. The absolute difference between the premise and hypothesis embeddings helps to capture the semantic contrast in the sentence pair.
These concatenated embeddings, returned by the function, serve as the final input features for further machine-learning tasks.
The function begins by setting the device to GPU if it's available. It sets the model to evaluation mode using model.eval()
. Then, it loops over the data loader, retrieving batches of sentence pairs.
For each sentence pair, it tokenizes the premise and hypothesis using the provided tokenizer. The tokenized sentences are then passed to the model to generate the model outputs. Using these outputs, mean pooling is performed to generate sentence-level embeddings.
Finally, the premise and hypothesis embeddings are concatenated along with their absolute difference, resulting in our final sentence pair embeddings. These combined embeddings capture the information from both sentences and the relational information between them, and are stored in feature_embeddings.
These feature embeddings are critical and are used as input features for the downstream tasks. Their high-dimensional nature contains valuable semantic information which can help in various NLP tasks such as text classification, information extraction, and more.
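As a concrete illustration of this feature construction, here is what the concatenation looks like for a pair of tiny, purely hypothetical two-dimensional "embeddings":

# Hypothetical 2-dimensional "embeddings" for one premise/hypothesis pair
premise_vec = np.array([[0.2, 0.8]])
hypothesis_vec = np.array([[0.6, 0.4]])
pair_features = np.concatenate(
    [premise_vec, hypothesis_vec, np.abs(premise_vec - hypothesis_vec)],
    axis=1
)
print(pair_features)        # [[0.2 0.8 0.6 0.4 0.4 0.4]]
print(pair_features.shape)  # (1, 6): three times the embedding dimension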
This block of code takes care of model loading, data preparation, and finally, the embedding process for each sentence pair in our datasets. Here's an annotated walkthrough:
# Pretrained SentenceTransformers handle this task better than regular Transformers
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
# Uncomment the following line to try a regular Transformers model trained on MultiNLI
# model_name = 'sileod/roberta-base-mnli'
# Instantiate the tokenizer and model from the pretrained transformers on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
batch_size = 128
# Prepare the PyTorch DataLoaders for each of the train, validation matched, and validation mismatched datasets
trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=False)
valmatchedloader = DataLoader(val_matched_data, batch_size=batch_size, shuffle=False)
valmismatchedloader = DataLoader(val_mismatched_data, batch_size=batch_size, shuffle=False)
# Use the embed_sentence_pairs function to create embeddings for each dataset
train_embeddings = embed_sentence_pairs(trainloader, tokenizer, model, disable_tqdm=True)
val_matched_embeddings = embed_sentence_pairs(valmatchedloader, tokenizer, model, disable_tqdm=True)
val_mismatched_embeddings = embed_sentence_pairs(valmismatchedloader, tokenizer, model, disable_tqdm=True)
This block begins by setting the model_name
variable to the identifier of a pretrained SentenceTransformers model available on the Hugging Face Model Hub. SentenceTransformers models are transformer-based models specifically trained for generating sentence embeddings, so they are generally more suitable for this task than regular transformer models. The MiniLM model was chosen for its relatively small size and fast inference times, while still providing performance comparable to that of much larger models. If you wish to experiment with a different model, you can simply change the identifier.
Next, the tokenizer and model corresponding to the model_name
are loaded using the from_pretrained method, which fetches the necessary components from the Hugging Face Model Hub and initializes them for use.
The DataLoader utility from the PyTorch library is then used to wrap our Hugging Face datasets. The DataLoader handles the batching of the data and provides an iterable over the dataset, which will be used by our embed_sentence_pairs
function. The batch size is set to 128, which means that the model processes 128 sentence pairs at a time.
Finally, the embed_sentence_pairs
function is called for each of our data loaders (train, validation matched, and validation mismatched), returning the corresponding embeddings for each sentence pair in these datasets. These embeddings will be used as input features for our downstream tasks.
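As an optional sanity check, we can inspect the shapes of the returned arrays. With the MiniLM model above, each sentence embedding has 384 dimensions, so the concatenated features should have 3 * 384 = 1152 columns, and the row counts should match the sample sizes chosen during preprocessing:

print(train_embeddings.shape)           # expected: (5000, 1152)
print(val_matched_embeddings.shape)     # expected: (450, 1152)
print(val_mismatched_embeddings.shape)  # expected: (450, 1152)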
Outlier Detection in Datasets
In the realm of machine learning, outliers often pose a significant challenge. These unusual or extreme values can cause the model to make erroneous decisions based on data points that don't represent the general trend or norm in the data. Therefore, an essential step in data preprocessing for machine learning is identifying and handling these outliers effectively.
In our project, we make use of the OutOfDistribution
object from the cleanlab
Python package to conduct outlier detection. The OutOfDistribution
method computes an outlier score for each data point based on how well it fits within the overall distribution of the data. The lower the outlier score, the more anomalous the data point is considered to be.
Let's take a detailed look at how this is achieved in the code:
ood = OutOfDistribution()
train_outlier_scores = ood.fit_score(features=train_embeddings)
In the first step, we instantiate the OutOfDistribution
object. Then, we fit this object to our training data embeddings and calculate outlier scores for each data point in the training data:
top_train_outlier_idxs = (train_outlier_scores).argsort()[:15]
top_train_outlier_subset = train_data.select(top_train_outlier_idxs)
top_train_outlier_subset.to_pandas().head()
Next, we select the 15 training data points with the lowest outlier scores, i.e. the most anomalous examples. These data points are then displayed for manual inspection, helping us understand the nature of these outliers.
We then apply a similar process to our validation data:
test_feature_embeddings = np.concatenate([val_matched_embeddings, val_mismatched_embeddings], axis=0)
test_outlier_scores = ood.score(features=test_feature_embeddings)
test_data = concatenate_datasets([val_matched_data, val_mismatched_data])
First, we concatenate the matched and mismatched validation embeddings. Then, we calculate the outlier scores for each data point in this combined validation dataset using the previously fitted OutOfDistribution
object:
top_outlier_idxs = (test_outlier_scores).argsort()[:20]
top_outlier_subset = test_data.select(top_outlier_idxs)
top_outlier_subset.to_pandas()
Lastly, we identify the 20 validation data points with the lowest outlier scores. Similar to our approach with the training data, these potential outliers are selected and displayed for inspection.
By conducting this outlier analysis, we gain valuable insights into our data. These insights can inform our decisions on data preprocessing steps, such as outlier removal or modification, to potentially enhance the performance of our machine learning model.
Once we have determined the outlier scores for each data point, the next step is to set a threshold for what we will consider an "outlier." While there are various statistical methods to determine this threshold, one simple and commonly used approach is to use percentiles.
In this project, we choose to set the threshold at the 2.5th percentile of the outlier scores in the training data. This choice implies that we consider the bottom 2.5% of our data (in terms of their fit to the overall distribution) as outliers. Let's look at how this is implemented in the code:
threshold = np.percentile(train_outlier_scores, 2.5)
The code above calculates the 2.5th percentile of the outlier scores in the training data and sets this value as our threshold; test examples scoring below it will be flagged as outliers.
Next, we visualize the distribution of outlier scores for both the training and test data:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt_range = [min(train_outlier_scores.min(), test_outlier_scores.min()),
             max(train_outlier_scores.max(), test_outlier_scores.max())]
axes[0].hist(train_outlier_scores, range=plt_range, bins=50)
axes[0].set(title='train_outlier_scores distribution', ylabel='Frequency')
axes[0].axvline(x=threshold, color='red', linewidth=2)
axes[1].hist(test_outlier_scores, range=plt_range, bins=50)
axes[1].set(title='test_outlier_scores distribution', ylabel='Frequency')
axes[1].axvline(x=threshold, color='red', linewidth=2)
In the histograms, the red vertical line represents the threshold value. By observing the distributions and where the threshold falls, we get a visual sense of what proportion of our data is considered outlying.
Finally, we select the outliers from our test data based on this threshold:
sorted_ids = test_outlier_scores.argsort()
outlier_scores = test_outlier_scores[sorted_ids]
outlier_ids = sorted_ids[outlier_scores < threshold]
selected_outlier_subset = test_data.select(outlier_ids)
selected_outlier_subset.to_pandas().tail(15)
This piece of code arranges the outlier scores in ascending order, determines which data points fall below the threshold (and are hence considered outliers), and selects these data points from our test data. The last 15 rows of this selected outlier subset are then displayed.
By setting and applying this threshold, we can objectively identify and handle outliers in our data. This process helps improve the quality of the data and, in turn, the reliability of the models trained on it.
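How to act on these outliers depends on the application, but as a minimal sketch, one straightforward option is to drop the flagged examples and keep the rest of the test set (the names kept_ids and cleaned_test_data below are just illustrative):

# Keep only the test examples whose outlier scores are at or above the threshold
kept_ids = sorted_ids[outlier_scores >= threshold]
cleaned_test_data = test_data.select(kept_ids)
print(f"Kept {len(cleaned_test_data)} of {len(test_data)} examples "
      f"({len(test_data) - len(cleaned_test_data)} flagged as outliers)")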
This article focuses on detecting anomalies in multi-genre NLI datasets using advanced tools and techniques, from preprocessing with transformers to outlier detection. The MultiNLI dataset was loaded and streamlined using Hugging Face's datasets library, making it easy to manage. For sentence embeddings, the transformers library was used to generate robust representations by averaging token embeddings with mean_pooling. Outliers were identified with the cleanlab library and visualized via plots and tables, revealing the distribution and characteristics of the data.
A threshold based on the 2.5th percentile of the training outlier scores was then used to flag anomalies in the test dataset. The study showcases the potential of Large Language Models in NLP, offering efficient solutions to complex tasks. This exploration enriches our understanding of the dataset and highlights the impressive capabilities of LLMs, underlining their impact on previously daunting challenges. The methods and libraries employed demonstrate the prowess of current LLM technology in providing potent solutions. By continuously advancing these approaches, the boundaries of NLP are pushed further, paving the way for diverse research and applications in the future.
Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.