Understanding BERT
Looking at the transformer’s encoder/decoder architecture discussed in the Introducing transformers section of Chapter 7, Summarizing Wikipedia Articles, we can observe a clear separation of tasks. The encoder is responsible for extracting features from an input sentence, such as syntax, grammar, and context, while the decoder maps this representation to a target sequence – for example, translating it into another language. This separation makes the two components self-contained; therefore, they can be used independently.
This section introduces a state-of-the-art transformer-based technique for generating language representation models, named Bidirectional Encoder Representations from Transformers (BERT). BERT incorporates a stack of transformer encoders to understand language better.
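To make the idea of an encoder-only model more concrete, the following sketch shows how a pretrained BERT encoder stack can be used on its own to produce contextual representations for a sentence. It is not the code from this chapter; it assumes the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint purely for illustration.

```python
# A minimal sketch (assumed libraries: transformers, torch) of using BERT's
# encoder stack on its own to extract contextual token representations.
import torch
from transformers import BertTokenizer, BertModel

# Load the WordPiece tokenizer and the 12-layer encoder stack of BERT base.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a sample sentence; the tokenizer adds the special [CLS] and [SEP] tokens.
inputs = tokenizer("Transformers separate encoding from decoding.",
                   return_tensors="pt")

# Run the encoders in inference mode (no gradient computation needed).
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per input token: [1, seq_len, 768].
print(outputs.last_hidden_state.shape)
```

Because no decoder is involved, the output is not a target sequence but a set of feature vectors, which downstream tasks such as classification or summarization can build on.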
Similar to word embeddings, the method belongs to the self-supervised learning family because it does not require human-annotated labels. Therefore, BERT can...