Creating repositories and pipelines
In this section, we will create all the pipelines that we reviewed in the previous section. The six-step workflow will clean the data, apply POS tagging, perform NER, train a new custom model based on the provided data, run the improved pipeline, and output the results to the final repo.
The first step is to create the data cleaning pipeline, which strips out the elements of the text that we won't need for further processing.
Important note
You need to download all files for this example from https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter08-End-to-End-Machine-Learning-Workflow. The Docker image is stored at https://hub.docker.com/repository/docker/svekars/nlp-example.
Creating the data cleaning pipeline
Data cleaning is typically performed before any other type of task. For this pipeline, we have created a Python script that uses the Natural Language Toolkit (NLTK) platform to perform...
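To make the idea concrete, here is a minimal sketch of what such an NLTK-based cleaning script might look like. It assumes the Pachyderm convention of reading input files from /pfs/<input_repo> and writing results to /pfs/out; the repo name data and the specific cleaning steps (lowercasing, stripping digits and punctuation, removing stop words) are illustrative assumptions, not the exact contents of the script in the example repository:

```python
# Illustrative sketch of a data cleaning step for a Pachyderm pipeline.
# Assumes the input repo is mounted at /pfs/data (hypothetical name) and
# output goes to /pfs/out, which is the standard Pachyderm convention.
import os
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stop word list on first run
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

STOP_WORDS = set(stopwords.words('english'))

def clean(text: str) -> str:
    """Lowercase the text, strip digits and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

if __name__ == '__main__':
    in_dir, out_dir = '/pfs/data', '/pfs/out'  # repo name is a placeholder
    for name in os.listdir(in_dir):
        with open(os.path.join(in_dir, name)) as f:
            cleaned = clean(f.read())
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(cleaned)
```

A script like this would be packaged into the pipeline's Docker image so that Pachyderm can run it against every file committed to the input repo.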