Working with language models and tokenizers
In this section, we will look at using the Transformers
library with language models, along with their related tokenizers. To use a specific language model, we first need to import it. We will start with the BERT model provided by Google and use its pretrained version, as follows:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
The first line of the preceding code snippet imports the BERT tokenizer, and the second line downloads a pretrained tokenizer for the BERT base version. Note that the uncased version was trained on lowercased text, so the tokenizer lowercases its input and it does not matter whether the letters appear in upper- or lowercase. To test the tokenizer and see its output, run the following code:
text = "Using Transformers is easy!" tokenizer(text)
This will be the output:
{'input_ids': [101, 2478, 19081, 2003, 3733, 999, 102...
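The raw IDs are hard to read on their own. As a quick sanity check, we can map them back to string tokens with the tokenizer's convert_ids_to_tokens method and confirm the case-insensitivity noted earlier. The following is a minimal sketch; the token list in the comment assumes the output shown previously:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer("Using Transformers is easy!")

# Map the IDs back to their tokens; BERT adds the special
# [CLS] and [SEP] tokens at the start and end of the sequence.
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# Expected: ['[CLS]', 'using', 'transformers', 'is', 'easy', '!', '[SEP]']

# The uncased tokenizer lowercases its input, so changing the
# casing of the text should produce the same input_ids.
shouting = tokenizer("USING TRANSFORMERS IS EASY!")
assert shouting['input_ids'] == encoding['input_ids']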