Creating repositories and pipelines
In this section, we will create all the pipelines that we reviewed in the previous section. The six-step workflow will clean the data, apply POS tagging, perform NER, train a new custom model based on the provided data, run the improved pipeline, and output the results to the final repo.
The first step is to create the data cleaning pipeline, which strips out the elements of the text that we won't need for further processing.
Important note
You need to download all files for this example from https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter08-End-to-End-Machine-Learning-Workflow. The Docker image is stored at https://hub.docker.com/repository/docker/svekars/nlp-example.
Creating the data cleaning pipeline
Data cleaning is typically performed before any other type of task. For this pipeline, we have created a Python script that uses the Natural Language Toolkit (NLTK) platform to perform...
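To make the idea concrete, here is a minimal sketch of what such an NLTK-based cleaning script might look like. It assumes the Pachyderm convention of reading input files from /pfs/<input_repo> and writing results to /pfs/out; the repo name data and the specific cleaning steps (lowercasing, stripping digits and punctuation, removing stop words) are illustrative assumptions, not the exact contents of the script in the example repository:

```python
# Illustrative sketch of a data cleaning step for a Pachyderm pipeline.
# Assumes the input repo is mounted at /pfs/data (hypothetical name) and
# output goes to /pfs/out, which is the standard Pachyderm convention.
import os
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stop word list on first run
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

STOP_WORDS = set(stopwords.words('english'))

def clean(text: str) -> str:
    """Lowercase the text, strip digits and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

if __name__ == '__main__':
    in_dir, out_dir = '/pfs/data', '/pfs/out'  # repo name is a placeholder
    for name in os.listdir(in_dir):
        with open(os.path.join(in_dir, name)) as f:
            cleaned = clean(f.read())
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(cleaned)
```

A script like this would be packaged into the pipeline's Docker image so that Pachyderm can run it against every file committed to the input repo.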