Summary
In this chapter, we made an adventurous attempt to cover a wide range of NLP topics. We explored introductory topics such as NER, tokenization, and part-of-speech tagging using the NLTK and spaCy libraries. We then explored NLP through the lens of structured datasets, using the pymed
library as a source of scientific literature and analyzing and cleaning the data in our preprocessing steps. Next, we developed a word cloud to visualize word frequencies in a given dataset. Finally, we developed a clustering model to group our abstracts and a topic model to identify prominent topics.
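As a brief refresher on those introductory steps, the following is a minimal sketch of a spaCy pipeline for tokenization, part-of-speech tagging, and NER. The model name (en_core_web_sm) and the example sentence are illustrative assumptions, not code taken from the chapter:

```python
import spacy

# Load a small pretrained English pipeline (assumes en_core_web_sm is installed)
nlp = spacy.load("en_core_web_sm")

doc = nlp("BRCA1 mutations were analyzed at Johns Hopkins in 2021.")

# Tokenization and part-of-speech tags
for token in doc:
    print(token.text, token.pos_)

# Named entities recognized by the pretrained model
for ent in doc.ents:
    print(ent.text, ent.label_)
```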
We then explored NLP through the lens of unstructured data, looking at two common AWS NLP products. We used Textract to convert PDFs and images into searchable, structured text, and Comprehend to analyze the extracted text and provide insights. Finally, we learned how to develop a semantic search engine using deep learning transformers to find pertinent information.
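To illustrate the semantic search idea, here is a minimal sketch using the sentence-transformers library; the model choice (all-MiniLM-L6-v2), the example abstracts, and the query are assumptions for illustration and may differ from the chapter's implementation:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; the chapter's exact model may differ
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical corpus of abstracts to search over
abstracts = [
    "CRISPR-based gene editing in hematopoietic stem cells.",
    "Deep learning approaches for protein structure prediction.",
    "Clinical outcomes of mRNA vaccine trials.",
]

# Encode the corpus once, then encode each incoming query
corpus_embeddings = model.encode(abstracts, convert_to_tensor=True)
query_embedding = model.encode("machine learning for proteins", convert_to_tensor=True)

# Rank documents by cosine similarity to the query and print the top matches
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(abstracts[hit["corpus_id"]], hit["score"])
```

The key design point is that documents and queries are embedded into the same vector space, so relevance becomes a nearest-neighbor lookup rather than a keyword match.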