Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Machine Learning for Cybersecurity Cookbook

You're reading from   Machine Learning for Cybersecurity Cookbook Over 80 recipes on how to implement machine learning algorithms for building security systems using Python

Arrow left icon
Product type Paperback
Published in Nov 2019
Publisher Packt
ISBN-13 9781789614671
Length 346 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Emmanuel Tsukerman Emmanuel Tsukerman
Author Profile Icon Emmanuel Tsukerman
Emmanuel Tsukerman
Arrow right icon
View More author details
Toc

Table of Contents (11) Chapters Close

Preface 1. Machine Learning for Cybersecurity FREE CHAPTER 2. Machine Learning-Based Malware Detection 3. Advanced Malware Detection 4. Machine Learning for Social Engineering 5. Penetration Testing Using Machine Learning 6. Automatic Intrusion Detection 7. Securing and Attacking Data with Machine Learning 8. Secure and Private AI 9. Other Books You May Enjoy Appendix

Generating text using Markov chains

Markov chains are simple stochastic models in which a system can exist in a number of states. To know the probability distribution of where the system will be next, it suffices to know where it currently is. This is in contrast with a system in which the probability distribution of the subsequent state may depend on the past history of the system. This simplifying assumption allows Markov chains to be easily applied in many domains, surprisingly fruitfully.

In this recipe, we will utilize Markov chains to generate fake reviews, which is useful for pen-testing a review system's spam detector. In a later recipe, you will upgrade the technology from Markov chains to RNNs.

Getting ready

Preparation for this recipe consists of installing the markovify and pandas packages in pip. The command for this is as follows:

pip install markovify pandas

In addition, the directory in the repository for this chapter includes a CSV dataset, airport_reviews.csv, which should be placed alongside the code for the chapter.

How to do it...

Let's see how to generate text using Markov chains by performing the following steps:

  1. Start by importing the markovify library and a text file whose style we would like to imitate:
import markovify
import pandas as pd

df = pd.read_csv("airport_reviews.csv")

As an illustration, I have chosen a collection of airport reviews as my text:

"The airport is certainly tiny! ..."
  1. Next, join the individual reviews into one large text string and build a Markov chain model using the airport review text:
from itertools import chain

N = 100
review_subset = df["content"][0:N]
text = "".join(chain.from_iterable(review_subset))
markov_chain_model = markovify.Text(text)

Behind the scenes, the library computes the transition word probabilities from the text.

  1. Generate five sentences using the Markov chain model:
for i in range(5):
print(markov_chain_model.make_sentence())

  1. Since we are using airport reviews, we will have the following as the output after executing the previous code:
On the positive side it's a clean airport transfer from A to C gates and outgoing gates is truly enormous - but why when we arrived at about 7.30 am for our connecting flight to Venice on TAROM.
The only really bother: you may have to wait in a polite manner.
Why not have bus after a short wait to check-in there were a lots of shops and less seating.
Very inefficient and hostile airport. This is one of the time easy to access at low price from city center by train.
The distance between the incoming gates and ending with dirty and always blocked by never ending roadworks.

Surprisingly realistic! Although the reviews would have to be filtered down to the best ones.

  1. Generate 3 sentences with a length of no more than 140 characters:
for i in range(3):
print(markov_chain_model.make_short_sentence(140))

With our running example, we will see the following output:

However airport staff member told us that we were put on a connecting code share flight.
Confusing in the check-in agent was friendly.
I am definitely not keen on coming to the lack of staff . Lack of staff . Lack of staff at boarding pass at check-in.

How it works...

We begin the recipe by importing the Markovify library, a library for Markov chain computations, and reading in text, which will inform our Markov model (step 1). In step 2, we create a Markov chain model using the text. The following is a relevant snippet from the text object's initialization code:

class Text(object):

reject_pat = re.compile(r"(^')|('$)|\s'|'\s|[\"(\(\)\[\])]")

def __init__(self, input_text, state_size=2, chain=None, parsed_sentences=None, retain_original=True, well_formed=True, reject_reg=''):
"""
input_text: A string.
state_size: An integer, indicating the number of words in the model's state.
chain: A trained markovify.Chain instance for this text, if pre-processed.
parsed_sentences: A list of lists, where each outer list is a "run"
of the process (e.g. a single sentence), and each inner list
contains the steps (e.g. words) in the run. If you want to simulate
an infinite process, you can come very close by passing just one, very
long run.
retain_original: Indicates whether to keep the original corpus.
well_formed: Indicates whether sentences should be well-formed, preventing
unmatched quotes, parenthesis by default, or a custom regular expression
can be provided.
reject_reg: If well_formed is True, this can be provided to override the
standard rejection pattern.
"""

The most important parameter to understand is state_size = 2, which means that the Markov chains will be computing transitions between consecutive pairs of words. For more realistic sentences, this parameter can be increased, at the cost of making sentences appear less original. Next, we apply the Markov chains we have trained to generate a few example sentences (steps 3 and 4). We can see clearly that the Markov chains have captured the tone and style of the text. Finally, in step 5, we create a few tweets in the style of the airport reviews using our Markov chains.

You have been reading a chapter from
Machine Learning for Cybersecurity Cookbook
Published in: Nov 2019
Publisher: Packt
ISBN-13: 9781789614671
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image