NLP evasion attacks with BERT using TextAttack
While evasion attacks initially focused on image classification tasks, their underlying principles can be adapted to NLP. TextAttack is a popular Python framework for generating adversarial text inputs. We will demonstrate its use to stage adversarial NLP attacks in two scenarios: sentiment analysis and natural language inference.
Let’s start with sentiment analysis.
Attack scenario – sentiment analysis
In NLP, linear classifiers, such as logistic regression or linear support vector machines (SVMs), or language models such as BERT, are often used for tasks such as sentiment analysis or spam detection. These classifiers work by learning a decision boundary that separates the classes in feature space. Adversarial samples in NLP might involve changing words or phrases in a text snippet to change its classification from positive to negative sentiment, or from non-spam to spam, with the smallest change possible. This would allow, for instance, positive reviews to be misclassified as negative with barely detectable changes. Our attack example will demonstrate this by attacking a sentiment analysis model trained on the IMDb dataset.
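To make the decision-boundary idea concrete, here is a minimal, self-contained sketch. It uses scikit-learn with a tiny hypothetical review set (not TextAttack, BERT, or the IMDb data) to show how swapping a single word shifts a review’s score toward the other side of a linear classifier’s decision boundary:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hypothetical training data for illustration only
reviews = [
    "a wonderful, moving film with great acting",
    "great story and a wonderful cast",
    "a dull, boring mess",
    "terrible plot and awful, dull pacing",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

original = "a wonderful film with great acting"
perturbed = "a dull film with great acting"  # one word swapped

# The substitution lowers the positive-class probability and may cross
# the decision boundary, flipping the predicted sentiment
print(clf.predict_proba([original])[0, 1], clf.predict([original]))
print(clf.predict_proba([perturbed])[0, 1], clf.predict([perturbed]))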
Attack example
We’ll use TextAttack with BERT to carry out an attack on the sentiment analysis process. TextAttack is a Python framework designed explicitly for generating adversarial examples in NLP. It offers a variety of pre-built attack recipes, transformations, and goal functions tailored to text data, making it an ideal tool for testing and strengthening NLP models against evasion attacks.
The steps are documented in the Python code as comments. The attack will involve altering words in the input text to change the model’s classification.
First, ensure that you have the necessary packages installed. In addition to TextAttack, we will install the transformers library so that we can access and use models from Hugging Face:
pip install textattack transformers
Now, we can proceed with the implementation using the pre-built attack recipe TextFoolerJin2019:
import transformers

from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019

# Load the target pre-trained model for sentiment analysis and its tokenizer
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-imdb")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-imdb")

# Wrap the model so TextAttack can query it
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Choose the attack method
attack = TextFoolerJin2019.build(model_wrapper)

# Test the attack with your own simple text
input_text = "I really enjoyed the new movie that came out last month."
label = 1  # Positive
attack_result = attack.attack(input_text, label)
print(attack_result)
The code downloads and instantiates an IMDb-fine-tuned version of BERT from Hugging Face, along with the appropriate tokenizer. It then uses the TextAttack TextFoolerJin2019 attack recipe to implement the attack by perturbing the input text. The following are the results of this simple test, showing how changing a single word flips the classification from positive to negative:
1 (99%) --> 0 (97%)

I really enjoyed the new movie that came out last month.

I really rained the new movie that came out last month.
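Under the hood, a recipe such as TextFoolerJin2019 bundles a goal function, a word-level transformation, constraints, and a search method. The following is a rough, simplified sketch of how a comparable attack could be assembled manually from TextAttack components; the real TextFooler recipe adds further constraints, such as part-of-speech and sentence-similarity checks, so treat the exact parameters here as illustrative:

from textattack import Attack
from textattack.goal_functions import UntargetedClassification
from textattack.transformations import WordSwapEmbedding
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.constraints.semantics import WordEmbeddingDistance
from textattack.search_methods import GreedyWordSwapWIR

# Succeed when the predicted class changes, whatever the new class is
goal_function = UntargetedClassification(model_wrapper)

# Replace words with nearest neighbors in a counter-fitted embedding space
transformation = WordSwapEmbedding(max_candidates=50)

# Do not modify stopwords or the same word twice, and keep swaps semantically close
constraints = [
    RepeatModification(),
    StopwordModification(),
    WordEmbeddingDistance(min_cos_sim=0.5),
]

# Greedily swap words, ordered by their importance to the prediction
search_method = GreedyWordSwapWIR(wir_method="delete")

custom_attack = Attack(goal_function, constraints, transformation, search_method)
print(custom_attack.attack(input_text, label))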
This is a very simple example. The TextAttack - NLP Evasion Attacks on Bert notebook contains this example and tests on portions of the IMDb dataset.
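If you want to reproduce dataset-level tests outside the notebook, TextAttack can iterate the same attack over samples loaded through its Hugging Face dataset wrapper. A minimal sketch, assuming a recent TextAttack release (the Attacker/AttackArgs API) and reusing the attack object built earlier, might look like this:

from textattack import Attacker, AttackArgs
from textattack.datasets import HuggingFaceDataset

# Load a slice of the IMDb test split through TextAttack's dataset wrapper
imdb_dataset = HuggingFaceDataset("imdb", split="test")

# Attack a handful of examples and log the results to a CSV file
attack_args = AttackArgs(num_examples=10, log_to_csv="imdb_attack_results.csv")
attacker = Attacker(attack, imdb_dataset, attack_args)
results = attacker.attack_dataset()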
We used TextAttack to demonstrate adversarial perturbations in sentiment analysis and how such perturbations can lead to misclassification in language models. In the next section, we will see how the same approach can power a more sophisticated attack against natural language inference.
Attack scenario – natural language inference
NLP models are particularly susceptible to word-level perturbations that maintain the semantic meaning of a text but alter its classification. For instance, a spam detection model might correctly flag a spam email, but by changing certain words or phrases, an attacker could cause it to slip past the filter. TextAttack can automate the process of identifying and applying such perturbations to test the resilience of these models. Natural Language Inference (NLI) goes beyond simple spam classification and deals with the meaning of sentences in relation to each other. We will use TextAttack and BERT with the Stanford Natural Language Inference (SNLI) dataset.
Attack example
The SNLI dataset is a collection of sentence pairs annotated with one of three labels: entailment, contradiction, or neutral. These labels represent the relationship between a premise and a hypothesis sentence:
- Entailment: The hypothesis is a true statement given the premise
- Contradiction: The hypothesis is a false statement given the premise
- Neutral: The truth of the hypothesis is undetermined given the premise
The SNLI-fine-tuned version of BERT supports this task. We provide a pair of sentences, and it returns one of the following classifications:
- 0 – contradiction
- 1 – neutral
- 2 – entailment
In our example, we will demonstrate how to use TextAttack and the same attack recipe to subtly change either the premise or the hypothesis and manipulate the inference.
The attack is similar to the previous one; we just use a different model and tokenizer, fine-tuned on SNLI:
# Load the SNLI-fine-tuned model and tokenizer
snli_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-snli")
snli_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-snli")
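Before wrapping the model for TextAttack, it is worth sanity-checking it with a direct query. Here is a minimal sketch, using the standard transformers API and the label mapping listed earlier, that classifies a premise/hypothesis pair:

import torch

premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is sleeping"

# Encode the sentence pair and run a forward pass
inputs = snli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = snli_model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# 0 - contradiction, 1 - neutral, 2 - entailment
labels = ["contradiction", "neutral", "entailment"]
print({name: round(p.item(), 3) for name, p in zip(labels, probs)})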
We can wrap the model and build the attack:
# Wrap the model with TextAttack's HuggingFaceModelWrapper
snli_model_wrapper = HuggingFaceModelWrapper(snli_model, snli_tokenizer)

# Build the attack object
snli_attack = TextFoolerJin2019.build(snli_model_wrapper)
Here is a simple test attack we can run to flip a contradictory inference to entailment:
from collections import OrderedDict

# The input is a premise/hypothesis pair
input_text_pair = OrderedDict([
    ("premise", "A man inspects the uniform of a figure in some East Asian country."),
    ("hypothesis", "The man is sleeping")
])
label = 0  # 0 - contradiction, 1 - neutral, 2 - entailment

attack_result = snli_attack.attack(input_text_pair, label)
print(attack_result)
The example successfully produces an entailment classification by rewording the hypothesis. Here is the output of the attack:
0 (100%) --> 2 (57%)

Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping

Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The comrade is dream
This is a simple attack, and the sample notebook (TextAttack - NLP Evasion Attacks on Bert) contains the code and tests against a test portion of the SNLI dataset.
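As with the IMDb example, dataset-level tests can be reproduced with TextAttack's Attacker. A rough sketch, again assuming the Attacker/AttackArgs API, and noting that SNLI contains unlabeled examples (label -1) that may need to be skipped:

from textattack import Attacker, AttackArgs
from textattack.datasets import HuggingFaceDataset

# Load a slice of the SNLI test split; the input is the (premise, hypothesis) pair
snli_dataset = HuggingFaceDataset(
    "snli", split="test",
    dataset_columns=(("premise", "hypothesis"), "label"),
)

attack_args = AttackArgs(num_examples=10, log_to_csv="snli_attack_results.csv")
attacker = Attacker(snli_attack, snli_dataset, attack_args)
attacker.attack_dataset()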
This concludes our exploration of NLP-based evasion attacks.