NLP evasion attacks with BERT using TextAttack

While evasion attacks initially focused on image classification tasks, their underlying principles can be adapted to NLP. TextAttack is a popular Python framework for generating adversarial text inputs. We will demonstrate its use to stage adversarial NLP attacks in two scenarios: sentiment analysis and natural language inference.

Let’s start with sentiment analysis.

Attack scenario – sentiment analysis

In NLP, linear classifiers, such as logistic regression or linear support vector machines (SVMs), and language models such as BERT are often used for tasks such as sentiment analysis or spam detection. These classifiers work by learning a decision boundary that separates the different classes in feature space. Adversarial samples in NLP might involve changing words or phrases in a text snippet to flip its classification from positive to negative sentiment, or from non-spam to spam, with the smallest change possible. This would allow, for instance, positive reviews to be misclassified as negative with barely detectable changes. Our attack example will demonstrate this by attacking sentiment analysis on IMDb.
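Before staging the attack, it helps to confirm what the clean model predicts. The following is a minimal sketch (not part of the book's notebook) that queries the same IMDb-fine-tuned BERT model we attack below through the transformers pipeline; the assumption that label index 1 corresponds to positive sentiment matches the attack code that follows:

from transformers import pipeline

# Baseline check: query the IMDb-fine-tuned BERT model before any perturbation
sentiment = pipeline(
    "text-classification",
    model="textattack/bert-base-uncased-imdb",
    tokenizer="textattack/bert-base-uncased-imdb",
)
print(sentiment("I really enjoyed the new movie that came out last month."))
# Expect a high-confidence prediction for the positive class (index 1)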

Attack example

We’ll use TextAttack with BERT to carry out an attack on sentiment analysis. TextAttack is a Python framework designed explicitly for generating adversarial examples in NLP. It offers a variety of pre-built attack recipes, transformations, and goal functions tailored to text data, making it an ideal tool for testing and strengthening NLP models against evasion attacks.

The steps are documented in the Python code as comments. The attack will involve altering words in the input text to change the model’s classification.

First, ensure that you have the necessary packages installed. In addition to TextAttack, we will install the transformers library so that we can access and use models from Hugging Face:

pip install textattack transformers

Now, we can proceed with the implementation using a pre-built attack, TextFoolerJin2019:

import transformers
from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019
# Load the target pre-trained model for sentiment analysis and its tokenizer
model = transformers.AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-imdb")
tokenizer = transformers.AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")
# Wrap the model so TextAttack can query it
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
# Choose the attack method
attack = TextFoolerJin2019.build(model_wrapper)
# Test the attack with your own simple text
input_text = "I really enjoyed the new movie that came out last month."
label = 1  # Positive
attack_result = attack.attack(input_text, label)
print(attack_result)

The code downloads and instantiates an IMDb-fine-tuned version of BERT from Hugging Face, along with the appropriate tokenizer. It then uses the TextAttack TextFoolerJin2019 attack recipe to implement the attack by perturbing the input text. The following are the results of this simple test, showing how, by changing a single word, the attack flips the classification from positive to negative:

1 (99%) --> 0 (97%)
I really enjoyed the new movie that came out last month.
I really rained the new movie that came out last month.

This is a very simple example. The TextAttack - NLP Evasion Attacks on Bert notebook contains this example, as well as tests against portions of the IMDb dataset.
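To go beyond a single sentence, the same attack object can be run against a slice of a dataset. The following is a minimal sketch, assuming a recent TextAttack release that provides Attacker and AttackArgs; the log file name imdb_attack_log.csv is just an illustrative choice:

import textattack
from textattack.datasets import HuggingFaceDataset

# Load the IMDb test split from Hugging Face and attack the first 10 examples
dataset = HuggingFaceDataset("imdb", split="test")
attack_args = textattack.AttackArgs(
    num_examples=10,                    # keep the run short
    log_to_csv="imdb_attack_log.csv",   # illustrative log file name
)
attacker = textattack.Attacker(attack, dataset, attack_args)
results = attacker.attack_dataset()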

We used TextAttack to demonstrate adversarial perturbations in sentiment analysis and how they can lead to misclassification in language models. In the next section, we will see how they can also be used in more sophisticated attacks by targeting natural language inference.

Attack scenario – natural language inference

NLP models are particularly susceptible to word-level perturbations that maintain the semantic meaning of a text but alter its classification. For instance, a spam detection model might classify an email as non-spam, but by changing certain words or phrases, an attacker could cause it to be filtered incorrectly. TextAttack can automate the process of identifying and applying such perturbations to test the resilience of these models. We will now go beyond simple spam-style classification to the classification of meaning in Natural Language Inference (NLI), using TextAttack and BERT with the Stanford Natural Language Inference (SNLI) dataset.

Attack example

The SNLI dataset is a collection of sentence pairs annotated with one of three labels: entailment, contradiction, or neutral. These labels represent the relationship between a premise sentence and a hypothesis sentence:

  • Entailment: The hypothesis is a true statement given the premise
  • Contradiction: The hypothesis is a false statement given the premise
  • Neutral: The truth of the hypothesis is undetermined given the premise

The SNLI-fine-tuned version of BERT supports this. We provide a pair of sentences, and it returns one of the following classifications:

  • 0 – contradiction
  • 1 – neutral
  • 2 – entailment

In our example, we will demonstrate how to use TextAttack and the same attack recipe to subtly change either the premise or the hypothesis and manipulate the inference.

The attack is similar to the previous one; we simply use a different model and tokenizer, fine-tuned on SNLI:

# Load model and tokenizer
snli_model = transformers.AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-snli")
snli_tokenizer = transformers.AutoTokenizer.from_pretrained("textattack/bert-base-uncased-snli")
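Before attacking, we can sanity-check the model on a premise/hypothesis pair. This is a minimal sketch (not from the book's notebook) that encodes the pair with the tokenizer and reads off the predicted class using the label mapping listed above:

import torch

premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is sleeping"

# Encode the sentence pair and run a forward pass
inputs = snli_tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = snli_model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)  # 0 - contradiction, 1 - neutral, 2 - entailment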

We can wrap the model and build the attack:

# Wrap the model with TextAttack's HuggingFaceModelWrapper
snli_model_wrapper = HuggingFaceModelWrapper(snli_model, snli_tokenizer)
# Build the attack object
snli_attack = TextFoolerJin2019.build(snli_model_wrapper)

Here is a simple test attack we can run to flip a contradictory inference to entailment:

from collections import OrderedDict
input_text_pair = OrderedDict([
    ("premise", "A man inspects the uniform of a figure in some East Asian country."),
    ("hypothesis", "The man is sleeping")
])
label = 0  # 0 - contradiction, 1 - neutral, 2 - entailment
attack_result = snli_attack.attack(input_text_pair, label)
print(attack_result)

The example successfully produces an entailment classification by rewording the hypothesis. Here is the output of the attack:

0 (100%) --> 2 (57%)
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The comrade is dream

This is a simple attack, and the sample notebook (TextAttack - NLP Evasion Attacks on Bert) contains the code and tests against a test portion of the SNLI dataset.
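When scripting such tests, you may want to inspect the outcome programmatically rather than relying on the printed summary. The following is a minimal sketch, assuming TextAttack's standard result classes, where a successful attack exposes the original and perturbed text:

from textattack.attack_results import SuccessfulAttackResult

# Check whether the attack flipped the prediction and, if so, show the texts
if isinstance(attack_result, SuccessfulAttackResult):
    print("Original :", attack_result.original_text())
    print("Perturbed:", attack_result.perturbed_text())
    print("New label:", attack_result.perturbed_result.output)
else:
    print("The attack did not change the model's prediction.")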

This concludes our exploration of NLP-based evasion attacks.
