50 Algorithms Every Programmer Should Know
Tackle computer science challenges with classic to modern algorithms in machine learning, software design, data systems, and cryptography

Author: Imran Ahmad
Publisher: Packt
Published: Sep 2023
Edition: 2nd
ISBN-13: 9781803247762
Length: 538 pages
Table of Contents (22 chapters)

Preface
1. Section 1: Fundamentals and Core Algorithms
2. Overview of Algorithms
3. Data Structures Used in Algorithms
4. Sorting and Searching Algorithms
5. Designing Algorithms
6. Graph Algorithms
7. Section 2: Machine Learning Algorithms
8. Unsupervised Machine Learning Algorithms
9. Traditional Supervised Learning Algorithms
10. Neural Network Algorithms
11. Algorithms for Natural Language Processing
12. Understanding Sequential Models
13. Advanced Sequential Modeling Algorithms
14. Section 3: Advanced Topics
15. Recommendation Engines
16. Algorithmic Strategies for Data Handling
17. Cryptography
18. Large-Scale Algorithms
19. Practical Considerations
20. Other Books You May Enjoy
21. Index

Understanding NLP terminology

NLP is a vast field of study. In this section, we will investigate some of the basic terminology related to NLP:

  • Corpus: A corpus is a large and structured collection of text or speech data that serves as a resource for NLP algorithms. It can consist of various types of textual data, such as written text, spoken language, transcribed conversations, and social media posts. A corpus is created by intentionally gathering and organizing data from various online and offline sources, including the internet. While the internet can be a rich source for acquiring data, deciding what data to include in a corpus requires a purposeful selection and alignment with the goals of the particular study or analysis being conducted.

    Corpora, the plural of corpus, can be annotated, meaning they may contain extra details about the texts, such as part-of-speech tags and named entities. These annotated corpora offer specific information that enhances the training and evaluation of NLP algorithms, making them especially valuable resources in the field.

  • Normalization: This process involves converting text into a standard form, such as converting all characters to lowercase or removing punctuation, making it more amenable to analysis.
  • Tokenization: Tokenization breaks down text into smaller parts called tokens, usually words or subwords, enabling a more structured analysis.
  • Named Entity Recognition (NER): NER identifies and classifies named entities within the text, such as people’s names, locations, organizations, etc. (a short sketch of this appears after this list).
  • Stop words: These are commonly used words such as and, the, and is, which are often filtered out during text processing as they may not contribute significant meaning.
  • Stemming and lemmatization: Stemming involves reducing words to their root form, while lemmatization involves converting words to their base or dictionary form. Both techniques help in analyzing the core meaning of words.
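
To make a few of these terms concrete, here is a minimal, illustrative sketch of tokenization, part-of-speech tagging, and NER using Python’s nltk library (introduced later in this chapter). The example sentence and the one-time nltk.download calls are assumptions for illustration, not code from the book:

import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# One-time downloads of the models nltk needs for tagging and NER
for resource in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(resource)

sentence = "Alan Turing worked at the University of Manchester in England."
tokens = word_tokenize(sentence)   # tokenization
tagged = pos_tag(tokens)           # part-of-speech tagging
print(ne_chunk(tagged))            # named entities appear as labeled subtrees (e.g., PERSON, GPE)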

Next, let us look at some common NLP techniques and tasks:

  • Word embeddings: This is a method used to translate words into numerical form, where each word is represented as a vector in a space that may have many dimensions. In this context, a “high-dimensional vector” refers to an array of numbers where the number of dimensions, or individual components, is quite large—often in the hundreds or even thousands. The idea behind using high-dimensional vectors is to capture the complex relationships between words, allowing words with similar meanings to be positioned closer together in this multi-dimensional space. The more dimensions the vector has, the more nuanced the relationships it can capture. As a result, semantically related words end up close to each other in this high-dimensional space, making it easier for algorithms to process language in a way that reflects human understanding (a minimal code sketch follows this list).
  • Language modeling: Language modeling is the process of developing statistical models that can predict or generate sequences of words or characters based on the patterns and structures found in a given text corpus.
  • Machine translation: The process of automatically translating text from one language to another using NLP techniques and models.
  • Sentiment analysis: The process of determining the attitude or sentiment expressed in a piece of text, often by analyzing the words and phrases used and their context.
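
The following minimal sketch illustrates word embeddings using the third-party gensim library (installable with pip install gensim), which is not otherwise used in this chapter; the tiny training corpus and the chosen parameters are purely illustrative, so the resulting vectors will not be meaningful:

from gensim.models import Word2Vec

# A tiny, illustrative corpus: each document is a list of tokens
sentences = [
    ['algorithms', 'are', 'fun'],
    ['algorithms', 'are', 'useful'],
    ['books', 'about', 'algorithms', 'are', 'useful'],
]

# Train a small Word2Vec model; each word becomes a 10-dimensional vector
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=42)

print(model.wv['algorithms'])               # the embedding vector for one word
print(model.wv.similarity('fun', 'useful')) # cosine similarity between two word vectors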

Text preprocessing in NLP

Text preprocessing is a vital stage in NLP, where raw text data undergoes a transformation to become suitable for machine learning algorithms. This transformation involves converting the unorganized and often messy text into what is known as a “structured format.” A structured format means that the data is organized into a more systematic and predictable pattern, often involving techniques like tokenization, stemming, and removing unwanted characters. These steps help in cleaning the text, reducing irrelevant information or “noise,” and arranging the data in a manner that makes it easier for the machine learning models to understand. By following this approach, the raw text, which may contain inconsistencies and irregularities, is molded into a form that enhances the accuracy, performance, and efficiency of subsequent NLP tasks. In this section, we will explore various techniques used in text preprocessing to achieve this structured format.

Tokenization

As a reminder, tokenization is the crucial process of dividing text into smaller units, known as tokens. These tokens can be as small as individual words or even subwords. In NLP, tokenization is often considered the first step in preparing text data for further analysis. The reason for this foundational role lies in the very nature of language, where understanding and processing text requires breaking it down into manageable parts. By transforming a continuous stream of text into individual tokens, we create a structured format that mirrors the way humans naturally read and understand language. This structuring provides the machine learning models with a clear and systematic way to analyze the text, allowing them to recognize patterns and relationships within the data. As we delve deeper into NLP techniques, this tokenized format becomes the basis upon which many other preprocessing and analysis steps are built.

The following code snippet tokenizes the given text using the Natural Language Toolkit (nltk) library in Python. nltk is a widely used Python library designed specifically for working with human language data. It provides easy-to-use interfaces and tools for tasks such as classification, tokenization, stemming, tagging, and parsing, making it a valuable asset for NLP. The library can be installed directly from the Python Package Index (PyPI) with the command pip install nltk. By incorporating nltk into your code, you gain access to a rich set of functions and resources that streamline the development and execution of various NLP tasks, which is why it is a popular choice among researchers, educators, and developers in computational linguistics. Let us start by importing the relevant functions and using them:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # fetch the tokenizer models on first use
corpus = 'This is a book about algorithms.'
tokens = word_tokenize(corpus)
print(tokens)

The output will be a list that looks like this:

['This', 'is', 'a', 'book', 'about', 'algorithms', '.']

In this example, each token is a word. The granularity of the resulting tokens will vary based on the objective—for example, each token can consist of a word, a sentence, or a paragraph.

To tokenize text based on sentences, you can use the sent_tokenize function from the nltk.tokenize module:

from nltk.tokenize import sent_tokenize
corpus = 'This is a book about algorithms. It covers various topics in depth.'
sentences = sent_tokenize(corpus)
print(sentences)

In this example, the corpus variable contains two sentences. The sent_tokenize function takes the corpus as input and returns a list of sentences. When you run the modified code, you will get the following output:

['This is a book about algorithms.', 'It covers various topics in depth.']

Sometimes we may need to break down large texts into paragraph-level chunks, a capability that can be particularly useful in applications such as document summarization, where understanding the structure at the paragraph level may be crucial. Tokenizing text into paragraphs might seem straightforward, but it can be complex depending on the structure and format of the text. A simple approach is to split the text on two consecutive newline characters, which often separate paragraphs in plain text documents.

Here’s a basic example:

def tokenize_paragraphs(text):
    # Split on blank lines (two consecutive newline characters)
    paragraphs = text.split('\n\n')
    # Strip surrounding white space and drop empty chunks
    return [p.strip() for p in paragraphs if p.strip()]
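
For instance, assuming a plain-text string in which paragraphs are separated by a blank line (the sample text below is illustrative), the function can be used as follows:

text = "First paragraph of the document.\n\nSecond paragraph, separated by a blank line."
print(tokenize_paragraphs(text))

This prints a list containing the two paragraphs as separate strings.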

Next, let us look into how we can clean the data.

Cleaning data

Cleaning data is an essential step in NLP, as raw text data often contains noise and irrelevant information that can hinder the performance of NLP models. The goal is to remove this noise and transform the text into a format that is suitable for analysis using NLP techniques. Note that data cleaning is done after the text is tokenized, because cleaning might involve operations that depend on the structure revealed by tokenization. For instance, removing specific words or altering word forms can be done more accurately after the text is split into individual terms.

Let us study some techniques used to clean data and prepare it for machine learning tasks:

Case conversion

Case conversion is a technique in NLP where text is transformed from one case format to another, such as from uppercase to lowercase, or from title case to uppercase.

For example, the text “Natural Language Processing” in title case could be converted to lowercase to be “natural language processing.”

This simple yet effective step helps in standardizing the text, which in turn simplifies its processing for various NLP algorithms. By ensuring that the text is in a uniform case, it aids in eliminating inconsistencies that might otherwise arise from variations in capitalization.
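
In Python, this conversion is a single call to the built-in str.lower method, as the following small sketch shows:

text = "Natural Language Processing"
normalized = text.lower()   # convert every character to lowercase
print(normalized)           # natural language processing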

Punctuation removal

Punctuation removal in NLP refers to the process of removing punctuation marks from raw text data before analysis. Punctuation marks are symbols such as periods (.), commas (,), question marks (?), and exclamation marks (!) that are used in written language to indicate pauses, emphasis, or intonation. While they are essential in written language, they can add noise and complexity to raw text data, which can hinder the performance of NLP models.

It’s a reasonable concern to wonder how the removal of punctuation might affect the meaning of sentences. Consider the following examples:

"She's a cat."

"She's a cat??"

Without punctuation, both lines become “She’s a cat,” potentially losing the distinct emphasis conveyed by the question marks.

However, it’s worth noting that in many NLP tasks, such as topic classification or sentiment analysis, punctuation might not significantly impact the overall understanding. Additionally, models can rely on other cues from the text’s structure, content, or context to derive meaning. In cases where the nuances of punctuation are critical, specialized models and preprocessing techniques may be employed to retain the required information.
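
One common way to strip punctuation in Python is to combine the standard library’s string.punctuation constant with str.translate; the following sketch applies it to the earlier example (note that the apostrophe in “She’s” is removed as well):

import string

text = "She's a cat??"

# Map every punctuation character to None and apply the mapping
translator = str.maketrans('', '', string.punctuation)
print(text.translate(translator))   # Shes a cat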

Handling numbers in NLP

Numbers within text data can pose challenges in NLP. Here’s a look at two main strategies for handling numbers in text analysis, considering both the traditional approach of removal and an alternative option of standardization.

In some NLP tasks, numbers may be considered noise, particularly when the focus is on aspects like word frequency or sentiment analysis. Here’s why some analysts might choose to remove numbers:

  • Lack of relevance: Numeric characters may not carry significant meaning in specific text analysis scenarios.
  • Skewing frequency counts: Numbers can distort word frequency counts, especially in models like topic modeling.
  • Reducing complexity: Removing numbers may simplify the text data, potentially enhancing the performance of NLP models.

However, an alternative approach is to convert all numbers to a standard representation rather than discarding them. This method acknowledges that numbers can carry essential information and ensures that their value is retained in a consistent format. It can be particularly useful in contexts where numerical data plays a vital role in the meaning of the text.

Deciding whether to remove or retain numbers requires an understanding of the problem being solved. An algorithm may need customization to distinguish whether a number is significant based on the context of the text and the specific NLP task. Analyzing the role of numbers within the domain of the text and the goals of the analysis can guide this decision-making process.

Handling numbers in NLP is not a one-size-fits-all approach. Whether to remove, standardize, or carefully analyze numbers depends on the unique requirements of the task at hand. Understanding these options and their implications helps in making informed decisions that align with the goals of the text analysis.
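
The sketch below illustrates both options using Python’s re module; the sample sentence and the <NUM> placeholder token are illustrative choices, not a fixed convention:

import re

text = "The model was trained on 15000 documents over 3 weeks."

# Option 1: remove digits entirely
print(re.sub(r'\d+', '', text))

# Option 2: standardize every number to a placeholder token
print(re.sub(r'\d+', '<NUM>', text))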

White space removal

White space in the context of text data is not merely the space between words; it also includes other “invisible” characters, such as multiple consecutive spaces and tab characters, that create spacing within text. In NLP, white space removal refers to the process of eliminating these unnecessary white space characters. Removing them can reduce the size of the text data and make it easier to process and analyze.

Here’s a simple example to illustrate white space removal:

  • Input text: "The quick   brown fox \tjumps over the  lazy dog."
  • Processed text: "The quick brown fox jumps over the lazy dog."

In the above example, extra spaces and a tab character (denoted by \t) are removed to create a cleaner and more standardized text string.
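
A compact way to do this in Python is to collapse every run of white space characters into a single space with the re module, as in this small sketch:

import re

text = "The quick   brown fox \tjumps over the  lazy dog."

# Replace any run of white space (spaces, tabs, newlines) with one space
cleaned = re.sub(r'\s+', ' ', text).strip()
print(cleaned)   # The quick brown fox jumps over the lazy dog.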

Stop word removal

Stop word removal is the process of eliminating common words, known as stop words, from a text corpus. Stop words are words that occur frequently in a language but do not carry significant meaning or contribute to the overall understanding of the text. Examples of stop words in English include the, and, is, in, and for. Removing them helps reduce the dimensionality of the data; by filtering out words that don’t contribute meaningfully to the analysis, computational resources can be focused on the words that do matter, improving the efficiency of various NLP algorithms.

Note that stop word removal is more than a mere reduction in text size; it’s about focusing on the words that truly matter for the analysis at hand. While stop words play a vital role in language structure, their removal in NLP can enhance the efficiency and focus of the analysis, particularly in tasks like sentiment analysis where the primary concern is understanding the underlying emotion or opinion.
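
The following sketch removes English stop words using nltk’s built-in stop word list; the nltk.download calls fetch the required data the first time, and the example sentence is illustrative:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')   # stop word lists
nltk.download('punkt')       # tokenizer models

corpus = 'This is a book about algorithms.'
tokens = word_tokenize(corpus)
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['book', 'algorithms', '.']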

Stemming and lemmatization

In textual data, most words are likely to be present in slightly different forms. Reducing each word to its origin or stem in a family of words is called stemming. It is used to group words based on their similar meanings to reduce the total number of words that need to be analyzed. Essentially, stemming reduces the overall dimensionality of the problem. The most common algorithm for stemming English is the Porter algorithm.

Let us look at a couple of examples:

  • Example 1: {use, used, using, uses} => use
  • Example 2: {easily, easier, easiest} => easi

It’s important to note that stemming can sometimes result in misspelled words, as seen in example 2 where easi was produced.

Stemming is a simple and quick process, but it may not always produce correct results. For cases where correct spelling is required, lemmatization is a more appropriate method. Lemmatization considers the context and reduces words to their base form. The base form of a word, also known as the lemma, is its most simple and meaningful version: it is the way the word would appear in a dictionary, devoid of any inflectional endings, and it is always a valid English word. This results in more accurate and meaningful word roots.
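
The sketch below contrasts the two approaches using nltk’s PorterStemmer and WordNetLemmatizer; the word list mirrors the examples above, and the nltk.download call fetches the WordNet data the lemmatizer needs:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')   # dictionary data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['use', 'used', 'using', 'uses']
print([stemmer.stem(w) for w in words])                   # rule-based stems
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # dictionary (lemma) forms, treated as verbs

Passing pos='v' tells the lemmatizer to treat each word as a verb; without a part-of-speech hint it defaults to nouns and leaves most verb forms unchanged.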

The process of guiding algorithms to recognize similarities is a precise and thoughtful task. Unlike humans, algorithms need explicit rules and criteria to make connections that might seem obvious to us. Understanding this distinction and knowing how to provide the necessary guidance is a vital skill in the development and tuning of algorithms for various applications.
