Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
The TensorFlow Workshop

You're reading from   The TensorFlow Workshop A hands-on guide to building deep learning models from scratch using real-world datasets

Arrow left icon
Product type Paperback
Published in Dec 2021
Publisher Packt
ISBN-13 9781800205253
Length 600 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (4):
Arrow left icon
Matthew Moocarme Matthew Moocarme
Author Profile Icon Matthew Moocarme
Matthew Moocarme
Abhranshu Bagchi Abhranshu Bagchi
Author Profile Icon Abhranshu Bagchi
Abhranshu Bagchi
Anthony Maddalone Anthony Maddalone
Author Profile Icon Anthony Maddalone
Anthony Maddalone
Anthony So Anthony So
Author Profile Icon Anthony So
Anthony So
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface
1. Introduction to Machine Learning with TensorFlow 2. Loading and Processing Data FREE CHAPTER 3. TensorFlow Development 4. Regression and Classification Models 5. Classification Models 6. Regularization and Hyperparameter Tuning 7. Convolutional Neural Networks 8. Pre-Trained Networks 9. Recurrent Neural Networks 10. Custom TensorFlow Components 11. Generative Models Appendix

Text Processing

Text data represents a large class of raw data that is readily available. For example, text data can be from web pages such as Wikipedia, transcribed speech, or social media conversations—all of which are increasing at a massive scale and must be processed before they can be used for training machine learning models.

Working with text data can be challenging for several different reasons, including the following:

  • Thousands of different words exist.
  • Different languages present challenges.
  • Text data often varies in size.

There are many ways to convert text data into a numerical representation. One way is to one-hot encode the words, much like you did with the date field in Exercise 2.02, Preprocessing Non-Numerical Data. However, this presents issues when training models since large datasets with many unique words will result in a sparse dataset and can lead to slow training speeds and potentially inaccurate models. Moreover, if a new word is encountered that was not in the training data, the model cannot use that word.

One popular method that's used to represent text data is to convert the entire piece of text into embedding vectors. Pretrained models exist to convert raw text into vectors. These models are usually trained on large volumes of text. Using word embedding vectors from pretrained models has some distinct advantages:

  • The resulting vectors have a fixed size.
  • The vectors maintain contextual information, so they benefit from transfer learning.
  • No further preprocessing of the data needs to be done and the results of the embedding can be fed directly into an ANN.

While TensorFlow Hub will be covered in more depth in the next chapter, the following is an example of how to use pretrained models as a preprocessing step. To load in the pretrained model, you need to import the tensorflow_hub library. By doing this, the URL of the model can be loaded. Then, the model can be loaded into the environment by calling the KerasLayer class, which wraps the model so that it can be used like any other TensorFlow model. It can be created as follows:

import tensorflow_hub as hub
model_url = "url_of_model"
hub_layer = hub.KerasLayer(model_url, \
                           input_shape=[], dtype=tf.string, \
                           trainable=True)

The data type of the input data, indicated by the dtype parameter, should be used as input for the KerasLayer class, as well as a Boolean argument indicating whether the weights are trainable. Once the model has been loaded using the tensorflow_hub library, it can be called on text data, as follows:

hub_layer(data)

This will run the data through the pretrained model. The output will be based on the architecture and weights of the pretrained model.

In the following exercise, you will explore how to load in data that includes a text field, batch the dataset, and apply a pretrained model to the text field to convert the field into embedded vectors.

Note

The pretrained model can be found here: https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1.

The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29.

Exercise 2.04: Loading Text Data for TensorFlow Models

The dataset, drugsComTrain_raw.tsv, contains information related to patient reviews on specific drugs, along with their related conditions and a rating indicating the patient's satisfaction with the drug. In this exercise, you will load in text data for batch processing. You will apply a pretrained model from TensorFlow Hub to perform a word embedding on the patient reviews. You are required to work on the review field only as that contains text data.

Perform the following steps:

  1. Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-04.ipnyb.
  2. In a new Jupyter Notebook cell, import the TensorFlow library:
    import tensorflow as tf
  3. Create a TensorFlow dataset object using the library's make_csv_dataset function. Set the batch_size argument equal to 1 and the field_delim argument to '\t' since the dataset is tab-delimited:
    df = tf.data.experimental.make_csv_dataset\
         ('../Datasets/drugsComTest_raw.tsv', \
          batch_size=1, field_delim='\t')
  4. Create a function that takes a dataset object as input and shuffles, repeats, and batches the dataset:
    def prep_ds(ds, shuffle_buffer_size=1024, \
                batch_size=32):
        # Shuffle the dataset
        ds = ds.shuffle(buffer_size=shuffle_buffer_size)
        # Repeat the dataset
        ds = ds.repeat()
        # Batch the dataset
        ds = ds.batch(batch_size)
        return ds
  5. Apply the function to the dataset object you created in Step 3, setting batch_size equal to 5:
    ds = prep_ds(df, batch_size=5)
  6. Take the first batch and print it out:
    for x in ds.take(1):\
        print(x)

    You should get output similar to the following:

    Figure 2.14: A batch from the dataset object

    Figure 2.14: A batch from the dataset object

    The output represents the input data in tensor format.

  7. Import the pretrained word embedding model from TensorFlow Hub and create a Keras layer:
    import tensorflow_hub as hub
    embedding = "https://tfhub.dev/google/tf2-preview"\
                "/gnews-swivel-20dim/1"
    hub_layer = hub.KerasLayer(embedding, input_shape=[], \
                               dtype=tf.string, \
                               trainable=True)
  8. Take one batch from the dataset, flatten the tensor corresponding to the review field, apply the pretrained layer, and print it out:
    for x in ds.take(1):\
        print(hub_layer(tf.reshape(x['review'],[-1])))

    This will display the following output:

    Figure 2.15: A batch of the review column after the pretrained model 
has been applied to the text

Figure 2.15: A batch of the review column after the pretrained model has been applied to the text

The preceding output represents the embedding vectors for the first batch of drug reviews. The specific values may not mean much at first glance but encoded within the embeddings is contextual information based on the dataset that the embedding model was trained upon. The batch size is equal to 5 and the embedding vector size is 20, which means the resulting size, after applying the pretrained layer, is 5x20.

In this exercise, you learned how to import tabular data that might contain a variety of data types. You took the review field and applied a pretrained word embedding model to convert the text into a numerical tensor. Ultimately, you preprocessed and batched the text data so that it was appropriate for large-scale training. This is one way to represent text so that it can be input into machine learning models in TensorFlow. In fact, other pretrained word embedding models can be used and are available on TensorFlow Hub. You will learn more about how to utilize TensorFlow Hub in the next chapter.

In this section, you learned about one way to preprocess text data for use in machine learning models. There are a number of different methods you could have used to generate a numerical tensor from the text. For example, you could have one-hot encoded the words, removed the stop words, stemmed and lemmatized the words, or even done something as simple as counting the number of words in each review. The method demonstrated in this section is advantageous as it is simple to implement. Also, the word embedding incorporates contextual information in the text that is difficult to encode in other methods, such as one-hot encoding.

Ultimately, it is up to the practitioner to apply any domain knowledge to the preprocessing step to retain as much contextual information as possible. This will allow any subsequent models to learn the underlying function between the features and the target variable.

In the next section, you will learn how to load and process audio data so that the data can be used for TensorFlow models.

You have been reading a chapter from
The TensorFlow Workshop
Published in: Dec 2021
Publisher: Packt
ISBN-13: 9781800205253
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image