NLP evasion attacks with BERT using TextAttack
While evasion attacks initially focused on image classification tasks, their underlying principles can be adapted to NLP. TextAttack is a popular Python framework for generating adversarial text inputs. We will demonstrate its use to stage adversarial NLP attacks in two scenarios: sentiment analysis and natural language inference.
Let’s start with sentiment analysis.
Attack scenario – sentiment analysis
In NLP, linear classifiers, such as logistic regression or linear support vector machines (SVMs), or language models such as BERT, are often used for tasks such as sentiment analysis or spam detection. These classifiers work by learning a decision boundary that separates the classes in feature space. Adversarial samples in NLP might involve changing words or phrases in a text snippet to change its classification from positive to negative sentiment, or from non-spam to spam, with the smallest change possible. This would allow, for instance, positive reviews to be misclassified as negative with barely detectable changes. Our attack example will demonstrate this by attacking a sentiment analysis model trained on the IMDb dataset.
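To make the decision-boundary idea concrete, here is a minimal, self-contained sketch. It uses scikit-learn with a tiny hypothetical review set (not TextAttack, BERT, or the IMDb data) to show how swapping a single word shifts a review’s score toward the other side of a linear classifier’s decision boundary:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hypothetical training data for illustration only
reviews = [
    "a wonderful, moving film with great acting",
    "great story and a wonderful cast",
    "a dull, boring mess",
    "terrible plot and awful, dull pacing",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

original = "a wonderful film with great acting"
perturbed = "a dull film with great acting"  # one word swapped

# The substitution lowers the positive-class probability and may cross
# the decision boundary, flipping the predicted sentiment
print(clf.predict_proba([original])[0, 1], clf.predict([original]))
print(clf.predict_proba([perturbed])[0, 1], clf.predict([perturbed]))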
Attack example
We’ll use TextAttack with BERT to carry out an attack on the sentiment analysis process. TextAttack is a Python framework designed explicitly for generating adversarial examples in NLP. It offers a variety of pre-built attack recipes, transformations, and goal functions tailored to text data, making it an ideal tool for testing and strengthening NLP models against evasion attacks.
The steps are documented in the Python code as comments. The attack will involve altering words in the input text to change the model’s classification.
First, ensure that you have the necessary packages installed. In addition to TextAttack, we will install the transformers library so that we can access and use models from Hugging Face:
pip install textattack transformers
Now, we can proceed with the implementation using the pre-built attack recipe TextFoolerJin2019:
import transformers

from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019

# Load the target pre-trained model for sentiment analysis and its tokenizer
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-imdb")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-imdb")

# Wrap the model so TextAttack can query it
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Choose the attack method
attack = TextFoolerJin2019.build(model_wrapper)

# Test the attack with your own simple text
input_text = "I really enjoyed the new movie that came out last month."
label = 1  # Positive
attack_result = attack.attack(input_text, label)
print(attack_result)
The code downloads and instantiates an IMDb-fine-tuned version of BERT from Hugging Face, along with the appropriate tokenizer. It then uses the TextAttack TextFoolerJin2019 attack recipe to implement the attack by perturbing the input text. The following are the results of this simple test, showing how changing a single word flips the classification from positive to negative:
1 (99%) --> 0 (97%)

I really enjoyed the new movie that came out last month.

I really rained the new movie that came out last month.
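Under the hood, a recipe such as TextFoolerJin2019 bundles a goal function, a word-level transformation, constraints, and a search method. The following is a rough, simplified sketch of how a comparable attack could be assembled manually from TextAttack components; the real TextFooler recipe adds further constraints, such as part-of-speech and sentence-similarity checks, so treat the exact parameters here as illustrative:

from textattack import Attack
from textattack.goal_functions import UntargetedClassification
from textattack.transformations import WordSwapEmbedding
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.constraints.semantics import WordEmbeddingDistance
from textattack.search_methods import GreedyWordSwapWIR

# Succeed when the predicted class changes, whatever the new class is
goal_function = UntargetedClassification(model_wrapper)

# Replace words with nearest neighbors in a counter-fitted embedding space
transformation = WordSwapEmbedding(max_candidates=50)

# Do not modify stopwords or the same word twice, and keep swaps semantically close
constraints = [
    RepeatModification(),
    StopwordModification(),
    WordEmbeddingDistance(min_cos_sim=0.5),
]

# Greedily swap words, ordered by their importance to the prediction
search_method = GreedyWordSwapWIR(wir_method="delete")

custom_attack = Attack(goal_function, constraints, transformation, search_method)
print(custom_attack.attack(input_text, label))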
This is a very simple example. The TextAttack - NLP Evasion Attacks on Bert notebook contains this example and tests on portions of the IMDb dataset.
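If you want to reproduce dataset-level tests outside the notebook, TextAttack can iterate the same attack over samples loaded through its Hugging Face dataset wrapper. A minimal sketch, assuming a recent TextAttack release (the Attacker/AttackArgs API) and reusing the attack object built earlier, might look like this:

from textattack import Attacker, AttackArgs
from textattack.datasets import HuggingFaceDataset

# Load a slice of the IMDb test split through TextAttack's dataset wrapper
imdb_dataset = HuggingFaceDataset("imdb", split="test")

# Attack a handful of examples and log the results to a CSV file
attack_args = AttackArgs(num_examples=10, log_to_csv="imdb_attack_results.csv")
attacker = Attacker(attack, imdb_dataset, attack_args)
results = attacker.attack_dataset()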
We used TextAttack to demonstrate adversarial perturbations in sentiment analysis and how such perturbations can lead to misclassification in language models. In the next section, we will see how the same approach can power a more sophisticated attack against natural language inference.
Attack scenario – natural language inference
NLP models are particularly susceptible to word-level perturbations that maintain the semantic meaning of a text but alter its classification. For instance, a spam detection model might correctly flag a spam email, but by changing certain words or phrases, an attacker could cause it to slip past the filter. TextAttack can automate the process of identifying and applying such perturbations to test the resilience of these models. Natural Language Inference (NLI) goes beyond simple spam classification and deals with the meaning of sentences in relation to each other. We will use TextAttack and BERT with the Stanford Natural Language Inference (SNLI) dataset.
Attack example
The SNLI dataset is a collection of sentence pairs annotated with one of three labels: entailment, contradiction, or neutral. These labels represent the relationship between a premise and a hypothesis sentence:
- Entailment: The hypothesis is a true statement given the premise
- Contradiction: The hypothesis is a false statement given the premise
- Neutral: The truth of the hypothesis is undetermined given the premise
The SNLI-fine-tuned version of BERT supports this task. We provide a pair of sentences, and it returns one of the following classifications:
- 0 – contradiction
- 1 – neutral
- 2 – entailment
In our example, we will demonstrate how to use TextAttack and the same attack recipe to subtly change either the premise or the hypothesis and manipulate the inference.
The attack is similar to the previous one; we just use a different model and tokenizer, fine-tuned on SNLI:
# Load the SNLI-fine-tuned model and tokenizer
snli_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-snli")
snli_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-snli")
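Before wrapping the model for TextAttack, it is worth sanity-checking it with a direct query. Here is a minimal sketch, using the standard transformers API and the label mapping listed earlier, that classifies a premise/hypothesis pair:

import torch

premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is sleeping"

# Encode the sentence pair and run a forward pass
inputs = snli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = snli_model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# 0 - contradiction, 1 - neutral, 2 - entailment
labels = ["contradiction", "neutral", "entailment"]
print({name: round(p.item(), 3) for name, p in zip(labels, probs)})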
We can wrap the model and build the attack:
# Wrap the model with TextAttack's HuggingFaceModelWrapper
snli_model_wrapper = HuggingFaceModelWrapper(snli_model, snli_tokenizer)

# Build the attack object
snli_attack = TextFoolerJin2019.build(snli_model_wrapper)
Here is a simple test attack we can run to flip a contradictory inference to entailment:
from collections import OrderedDict

# The input is a premise/hypothesis pair
input_text_pair = OrderedDict([
    ("premise", "A man inspects the uniform of a figure in some East Asian country."),
    ("hypothesis", "The man is sleeping")
])
label = 0  # 0 - contradiction, 1 - neutral, 2 - entailment

attack_result = snli_attack.attack(input_text_pair, label)
print(attack_result)
The example successfully produces an entailment classification by rewording the hypothesis. Here is the output of the attack:
0 (100%) --> 2 (57%)

Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping

Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The comrade is dream
This is a simple attack, and the sample notebook (TextAttack - NLP Evasion Attacks on Bert) contains the code and tests against a test portion of the SNLI dataset.
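As with the IMDb example, dataset-level tests can be reproduced with TextAttack's Attacker. A rough sketch, again assuming the Attacker/AttackArgs API, and noting that SNLI contains unlabeled examples (label -1) that may need to be skipped:

from textattack import Attacker, AttackArgs
from textattack.datasets import HuggingFaceDataset

# Load a slice of the SNLI test split; the input is the (premise, hypothesis) pair
snli_dataset = HuggingFaceDataset(
    "snli", split="test",
    dataset_columns=(("premise", "hypothesis"), "label"),
)

attack_args = AttackArgs(num_examples=10, log_to_csv="snli_attack_results.csv")
attacker = Attacker(snli_attack, snli_dataset, attack_args)
attacker.attack_dataset()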
This concludes our exploration of NLP-based evasion attacks.