Building our pipeline
In an NLP pipeline, preparation generally begins with a pre-processing step where we clean and normalize the data. Following that, a feature representation step translates the language into input that can be consumed by our chosen models. Once this is completed, we are ready to build, train, and evaluate the model. The following sections work through each of these steps in turn.
Preparation
Language comes in many variations. There are formatting nuances, such as capitalization and punctuation; words that serve as linguistic aids without carrying much semantic meaning, such as prepositions; and special characters, such as emojis. To work with this data, we must transform raw text into a dataset that meets criteria similar to those we apply to numeric datasets. This cleaning process lets us eliminate outliers, reduce noise, manage vocabulary size, and optimize the data for ingestion by NLP models.
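As a rough illustration of the cleaning ideas above, here is a minimal sketch using only the Python standard library. The function name `clean_text` and the tiny `STOP_WORDS` set are hypothetical, chosen for this example; they are not necessarily the approach used later in the pipeline, and a real project would likely rely on a fuller stop-word list and a proper tokenizer.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "is"}


def clean_text(text: str) -> list[str]:
    """Normalize raw text and return a list of cleaned tokens."""
    # Normalize capitalization so "The" and "the" map to the same token.
    text = text.lower()
    # Drop non-ASCII characters such as emojis.
    # Assumption: English-only input; this would also strip accented letters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove punctuation characters entirely (so "don't" becomes "dont").
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and discard stop words to reduce noise
    # and keep the vocabulary small.
    return [token for token in text.split() if token not in STOP_WORDS]


print(clean_text("The cat sat on the mat! 😀"))
# ['cat', 'sat', 'mat']
```

Each step maps to one of the variations described above: lowercasing handles capitalization, the translation table handles punctuation, the ASCII filter handles special characters, and the stop-word filter handles words with little semantic weight.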
A basic flow diagram of...