Tokenizing the text in your dataset
The components inside a transformer model have no intrinsic knowledge of the words they process. Instead, the model operates on the token identifiers (IDs) that a tokenizer produces from the raw text. In this recipe, we will learn how to transform the text in your dataset into a representation that the models can use for downstream tasks.
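To illustrate the idea, here is a toy word-to-ID mapping (purely illustrative; it is not any real model's vocabulary, and real tokenizers learn subword vocabularies with tens of thousands of entries):

```python
# A toy vocabulary mapping words to integer IDs; [UNK] stands in for
# any word the vocabulary does not contain.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

def encode(text):
    """Map each whitespace-separated word to its ID, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(encode("the movie was great"))
print(encode("the plot"))  # "plot" is unknown, so it maps to 0
```

The model only ever sees the integer sequences, which is why the tokenizer's vocabulary must match the one the model was trained with.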
Getting ready
As part of this recipe, we will use the AutoTokenizer
class from the transformers package. You can use the 8.2_Basic_Tokenization.ipynb
notebook from the code site if you need to work from an existing notebook.
How to do it...
In this recipe, we will continue from the previous example, using the RottenTomatoes
dataset and sampling a few sentences from it. We will then encode the sampled sentences into tokens and their respective token IDs.
The recipe does the following things:
- Loads a few sentences into memory
- Instantiates a tokenizer and tokenizes the...
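The steps above can be sketched as follows. This is a minimal example, not the recipe's exact code: the sentences are illustrative placeholders for rows sampled from the dataset, and bert-base-uncased is an assumed checkpoint, so substitute the one used in your notebook.

```python
from transformers import AutoTokenizer

# A few sentences loaded into memory, standing in for samples
# drawn from the RottenTomatoes dataset (placeholders, not real rows).
sentences = [
    "the movie was a delight from start to finish.",
    "a tedious, overlong mess.",
]

# Instantiate a tokenizer; bert-base-uncased is an assumed checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize each sentence into subword tokens...
tokens = [tokenizer.tokenize(s) for s in sentences]
print(tokens[0])

# ...and encode the batch into padded token-ID sequences, the
# representation the model actually consumes.
encoded = tokenizer(sentences, padding=True, truncation=True)
print(encoded["input_ids"][0])
```

Note that the encoded IDs include special tokens (such as BERT's [CLS] and [SEP]) that `tokenize` alone does not add, which is why the ID sequences are longer than the token lists.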