Introduction to NLP
Within biotechnology, we often turn to NLP to organize data and to develop models that help answer scientific questions. Unlike the other areas we have investigated so far, NLP focuses on a single type of data: text. Text data within the realm of NLP falls into two general categories: structured data and unstructured data. Structured data consists of text fields living within tables and databases, such as a SQL or DynamoDB database, in which items are organized, labeled, and linked together for easier retrieval. Unstructured data, on the other hand, includes documents, PDFs, and images, which can contain static content that is neither searchable nor easily accessible. An example of this can be seen in the following diagram: