In the preceding chapters, we learned how BERT is pre-trained on the general Wikipedia corpus and how we can fine-tune it and use it for downstream tasks. Instead of using BERT pre-trained on this general corpus, we can also train BERT from scratch on a domain-specific corpus. This helps the model learn embeddings specific to the domain, and it also lets the model pick up domain-specific vocabulary that may not be present in the general Wikipedia corpus. In this section, we will look into two interesting domain-specific BERT models:
- ClinicalBERT
- BioBERT
We will learn how these models are pre-trained and how to fine-tune them for downstream tasks.
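Before looking at each model in detail, here is a minimal sketch of how a domain-specific checkpoint can be loaded in the same way as vanilla BERT, using the Hugging Face transformers library. The model ID emilyalsentzer/Bio_ClinicalBERT is one publicly available ClinicalBERT variant and is used here purely for illustration; any domain-specific checkpoint can be substituted:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative model ID for a publicly available ClinicalBERT variant;
# swap in any other domain-specific BERT checkpoint as needed.
model_id = "emilyalsentzer/Bio_ClinicalBERT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Tokenize a sample clinical sentence and obtain contextual embeddings
inputs = tokenizer("The patient reports shortness of breath", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds one embedding vector per token:
# shape is [batch_size, sequence_length, hidden_size]
print(outputs.last_hidden_state.shape)
```

Loading the checkpoint this way gives us the pre-trained encoder; fine-tuning for a downstream task only requires adding a task-specific head on top of these representations, exactly as we did with vanilla BERT.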
ClinicalBERT
ClinicalBERT is a clinical domain-specific BERT model pre-trained on a large clinical corpus. Clinical notes, also known as progress notes, contain very useful information about a patient. They include a record of patient visits, their symptoms, diagnoses, daily activities, observations...