Text Processing
Text data represents a large class of raw data that is readily available. For example, text data can be from web pages such as Wikipedia, transcribed speech, or social media conversations—all of which are increasing at a massive scale and must be processed before they can be used for training machine learning models.
Working with text data can be challenging for several different reasons, including the following:
- Thousands of different words exist.
- Different languages present challenges.
- Text data often varies in size.
There are many ways to convert text data into a numerical representation. One way is to one-hot encode the words, much like you did with the date field in Exercise 2.02, Preprocessing Non-Numerical Data. However, this presents issues when training models since large datasets with many unique words will result in a sparse dataset and can lead to slow training speeds and potentially inaccurate models. Moreover, if a new word is encountered that was not in the training data, the model cannot use that word.
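The sparsity and unseen-word problems can be seen in a small, framework-free sketch. The sentences and the `one_hot` helper below are made up for illustration and are not part of the exercise code:

```python
# Toy corpus; the sentences are invented for illustration.
train_texts = ["the drug worked well", "the drug caused headaches"]

# Build the vocabulary from the training text only.
vocab = sorted({word for text in train_texts for word in text.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a one-hot list for `word`, or all zeros if the word is unseen."""
    vector = [0] * len(vocab)
    index = word_to_index.get(word)
    if index is not None:
        vector[index] = 1
    return vector


print(vocab)
print(one_hot("drug"))    # a single 1; every other entry is 0
print(one_hot("nausea"))  # unseen word: all zeros, so nothing is encoded
```

Each vector is as long as the vocabulary, so a corpus with many thousands of unique words yields vectors that are almost entirely zeros.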
One popular method that's used to represent text data is to convert the entire piece of text into embedding vectors. Pretrained models exist to convert raw text into vectors. These models are usually trained on large volumes of text. Using word embedding vectors from pretrained models has some distinct advantages:
- The resulting vectors have a fixed size.
- The vectors maintain contextual information, so they benefit from transfer learning.
- No further preprocessing of the data needs to be done and the results of the embedding can be fed directly into an ANN.
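To illustrate the fixed-size property, here is a minimal sketch that assumes a hypothetical three-dimensional lookup table and combines word vectors by simple averaging. Real pretrained models learn their vectors from large corpora and may combine them differently, so treat this only as an intuition aid:

```python
# Hypothetical 3-dimensional word vectors; the values are invented.
toy_embeddings = {
    "drug": [0.2, 0.1, 0.7],
    "worked": [0.9, 0.3, 0.1],
    "well": [0.8, 0.4, 0.0],
    "headaches": [0.1, 0.9, 0.2],
}
EMBEDDING_DIM = 3


def embed_text(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [toy_embeddings[w] for w in text.lower().split()
               if w in toy_embeddings]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short_review = embed_text("worked well")
long_review = embed_text("the drug worked well but gave me headaches")
# Both reviews map to vectors of the same length despite differing in size.
print(len(short_review), len(long_review))
```

Because the output length does not depend on the length of the text, reviews of any size can be stacked into a single tensor.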
While TensorFlow Hub will be covered in more depth in the next chapter, the following is an example of how to use pretrained models as a preprocessing step. To load a pretrained model, import the tensorflow_hub library and specify the URL of the model. The model can then be loaded into the environment by calling the KerasLayer class, which wraps the model so that it can be used like any other TensorFlow model. It can be created as follows:
import tensorflow as tf
import tensorflow_hub as hub

# Placeholder: replace with the URL of a TensorFlow Hub model
model_url = "url_of_model"
hub_layer = hub.KerasLayer(model_url,
                           input_shape=[], dtype=tf.string,
                           trainable=True)
The KerasLayer class takes the data type of the input data, specified by the dtype parameter, as well as a Boolean trainable argument indicating whether the weights are trainable. Once the model has been loaded using the tensorflow_hub library, it can be called on text data, as follows:
hub_layer(data)
This will run the data through the pretrained model. The output will be based on the architecture and weights of the pretrained model.
In the following exercise, you will explore how to load in data that includes a text field, batch the dataset, and apply a pretrained model to the text field to convert the field into embedded vectors.
Note
The pretrained model can be found here: https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1.
The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29.
Exercise 2.04: Loading Text Data for TensorFlow Models
The dataset, drugsComTrain_raw.tsv, contains information related to patient reviews on specific drugs, along with their related conditions and a rating indicating the patient's satisfaction with the drug. In this exercise, you will load in text data for batch processing. You will apply a pretrained model from TensorFlow Hub to perform a word embedding on the patient reviews. You are required to work on the review field only as that contains text data.
Perform the following steps:
- Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-04.ipynb.
- In a new Jupyter Notebook cell, import the TensorFlow library:
import tensorflow as tf
- Create a TensorFlow dataset object using the library's make_csv_dataset function. Set the batch_size argument equal to 1 and the field_delim argument to '\t' since the dataset is tab-delimited:
df = tf.data.experimental.make_csv_dataset(
    '../Datasets/drugsComTest_raw.tsv',
    batch_size=1, field_delim='\t')
- Create a function that takes a dataset object as input and shuffles, repeats, and batches the dataset:
def prep_ds(ds, shuffle_buffer_size=1024, batch_size=32):
    # Shuffle the dataset
    ds = ds.shuffle(buffer_size=shuffle_buffer_size)
    # Repeat the dataset
    ds = ds.repeat()
    # Batch the dataset
    ds = ds.batch(batch_size)
    return ds
- Apply the function to the dataset object you created in Step 3, setting batch_size equal to 5:
ds = prep_ds(df, batch_size=5)
- Take the first batch and print it out:
for x in ds.take(1):
    print(x)
The printed output is the first batch of the input data, with each field represented in tensor format.
- Import the pretrained word embedding model from TensorFlow Hub and create a Keras layer:
import tensorflow_hub as hub

embedding = ("https://tfhub.dev/google/tf2-preview"
             "/gnews-swivel-20dim/1")
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string,
                           trainable=True)
- Take one batch from the dataset, flatten the tensor corresponding to the review field, apply the pretrained layer, and print it out:
for x in ds.take(1):
    print(hub_layer(tf.reshape(x['review'], [-1])))
This displays a tensor of embedding vectors for the first batch of drug reviews. The specific values may not mean much at first glance, but encoded within the embeddings is contextual information based on the dataset that the embedding model was trained upon. The batch size is equal to 5 and the embedding vector size is 20, which means the resulting size, after applying the pretrained layer, is 5x20.
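The shape arithmetic in the exercise can be traced with plain Python lists. This is a sketch under the assumption that loading with a batch size of 1 and then batching by 5 yields a nested 5x1 batch of reviews, which is why the review field is flattened before the embedding layer is applied. The review strings and the `fake_embed` stand-in are hypothetical:

```python
# Placeholder reviews; the strings are invented for illustration.
nested_batch = [["review a"], ["review b"], ["review c"],
                ["review d"], ["review e"]]

# Similar in spirit to tf.reshape(x, [-1]): collapse 5x1 into a flat list of 5.
flat_batch = [review for row in nested_batch for review in row]

EMBEDDING_DIM = 20


def fake_embed(review):
    """Stand-in for the pretrained layer: returns a 20-value vector of zeros."""
    return [0.0] * EMBEDDING_DIM


embedded = [fake_embed(review) for review in flat_batch]
print(len(nested_batch), len(nested_batch[0]))  # 5 1
print(len(flat_batch))                          # 5
print(len(embedded), len(embedded[0]))          # 5 20
```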
In this exercise, you learned how to import tabular data that might contain a variety of data types. You took the review field and applied a pretrained word embedding model to convert the text into a numerical tensor. Ultimately, you preprocessed and batched the text data so that it was appropriate for large-scale training. This is one way to represent text so that it can be input into machine learning models in TensorFlow. In fact, other pretrained word embedding models can be used and are available on TensorFlow Hub. You will learn more about how to utilize TensorFlow Hub in the next chapter.
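For intuition about what the shuffle, repeat, and batch steps do to a stream of rows, the following is a rough pure-Python approximation. It is not how TensorFlow implements these operations; the helper names `shuffle`, `repeat`, and `batch` are defined here only for illustration:

```python
import itertools
import random


def shuffle(items, buffer_size, rng):
    """Yield items in random order, sampling from a fixed-size buffer."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left once the input is exhausted.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


def repeat(make_epoch):
    """Call make_epoch() forever, chaining one pass after another."""
    while True:
        yield from make_epoch()


def batch(items, batch_size):
    """Group consecutive items into lists of length batch_size."""
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


rng = random.Random(0)
rows = list(range(7))
stream = batch(repeat(lambda: shuffle(rows, 4, rng)), batch_size=5)
first_three = list(itertools.islice(stream, 3))
print(first_three)
```

Because repeating happens before batching, every batch is full and a batch can span the boundary between two passes over the data.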
In this section, you learned about one way to preprocess text data for use in machine learning models. There are a number of different methods you could have used to generate a numerical tensor from the text. For example, you could have one-hot encoded the words, removed the stop words, stemmed and lemmatized the words, or even done something as simple as counting the number of words in each review. The method demonstrated in this section is advantageous as it is simple to implement. Also, the word embedding incorporates contextual information in the text that is difficult to encode in other methods, such as one-hot encoding.
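As a point of comparison, two of the simpler alternatives mentioned above, removing stop words and counting words, can be sketched in a few lines. The stop-word list and the review text here are tiny, hypothetical examples:

```python
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "and", "it", "my", "was"}


def word_counts(review):
    """Lowercase, strip basic punctuation, drop stop words, and count."""
    tokens = [w.strip(".,!?").lower() for w in review.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)


review = "The drug worked, and the headaches stopped."
counts = word_counts(review)
print(counts)
print(sum(counts.values()))  # number of remaining words
```

Counts like these are easy to compute, but they discard word order and any notion of similarity between words.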
Ultimately, it is up to the practitioner to apply any domain knowledge to the preprocessing step to retain as much contextual information as possible. This will allow any subsequent models to learn the underlying function between the features and the target variable.
In the next section, you will learn how to load and process audio data so that the data can be used for TensorFlow models.