You're reading from Machine Learning With Go: Implement Regression, Classification, Clustering, Time-series Models, Neural Networks, and More using the Go Programming Language

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781785882104

Length 304 pages

Edition 1st Edition

Author (1):

Joseph Langstaff Whitenack

Data versioning

As mentioned, machine learning models can produce very different results depending on the training data you use, the choice of parameters, and the input data. It is essential to be able to reproduce results for collaborative, creative, and compliance reasons:

Collaboration: Despite what you see on social media, there are no data science and machine learning unicorns (that is, people with knowledge and capabilities in every area of data science and machine learning). We need our colleagues to review and improve on our work, and this is impossible if they aren't able to reproduce our model results and analyses.
Creativity: I don't know about you, but I have trouble remembering even what I did yesterday. We can't trust ourselves to always remember our reasoning and logic, especially when we are dealing with machine learning workflows. We need to track exactly what data we are using, what results we created, and how we created them. This is the only way we will be able to continually improve our models and techniques.
Compliance: Finally, very soon we may not have a choice regarding data versioning and reproducibility in machine learning. Laws are being passed around the world (for example, the General Data Protection Regulation (GDPR) in the European Union) that give users a right to an explanation for algorithmically made decisions. We simply cannot hope to comply with these regulations if we don't have a robust way of tracking what data we are processing and what results we are producing.

There are multiple open source data versioning projects. Some of these are focused on security and peer-to-peer distributed storage of data. Others are focused on data science workflows. In this book, we will focus on and utilize Pachyderm (http://pachyderm.io/), an open source framework for data versioning and data pipelining. Some of the reasons for this will be clear later in the book when we talk about production deploys and managing ML pipelines. For now, I will just summarize some of the features of Pachyderm that make it an attractive choice for data versioning in Go-based (and other) ML projects:

A convenient Go client, github.com/pachyderm/pachyderm/src/client
The ability to version any type and format of data
A flexible object store backing for the versioned data
Integration with a data pipelining system for driving versioned ML workflows

Pachyderm jargon

Think about versioning data in Pachyderm kind of like versioning code in Git. The primitives are similar:

Repositories: These are versioned collections of data, similar to having versioned collections of code in Git repositories
Commits: Data is versioned in Pachyderm by making commits of that data into data repositories
Branches: These are lightweight pointers to certain commits or sets of commits (for example, master points to the latest HEAD commit)
Files: Data is versioned at the file level in Pachyderm, and Pachyderm automatically employs strategies, such as de-duplication, to keep your versioned data space efficient

Even though versioning data with Pachyderm feels similar to versioning code with Git, there are some major differences. For example, merging data doesn't exactly make sense. If there are merge conflicts on petabytes of data, no human could resolve these. Furthermore, the Git protocol would not be space efficient in general for large sets of data. Pachyderm uses its own internal logic to perform the versioning and work with versioned data, and the logic is both space efficient and processing efficient in terms of caching.
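To make these primitives a little more concrete, here is a toy in-memory model of a repository with commits, branches, and file-level de-duplication. It is only a sketch under my own assumptions: the Repo and Commit types and their methods are hypothetical names invented for this illustration, and storing content under its SHA-256 hash is one common de-duplication strategy, not a description of Pachyderm's internal logic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Commit is an immutable snapshot that maps file paths to content hashes.
// (Hypothetical type for illustration; not part of the Pachyderm client.)
type Commit struct {
	ID     int
	Parent int // -1 when the commit has no parent
	Files  map[string]string
}

// Repo is a toy versioned collection of data.
type Repo struct {
	objects  map[string][]byte // content hash -> bytes, stored only once
	commits  []Commit
	branches map[string]int // branch name -> ID of its latest commit
}

// NewRepo returns an empty repository.
func NewRepo() *Repo {
	return &Repo{objects: map[string][]byte{}, branches: map[string]int{}}
}

// PutFile makes a new commit on branch that adds or replaces a single file.
func (r *Repo) PutFile(branch, path string, data []byte) int {
	sum := sha256.Sum256(data)
	hash := hex.EncodeToString(sum[:])
	if _, ok := r.objects[hash]; !ok {
		// Only previously unseen content takes up new space.
		r.objects[hash] = append([]byte(nil), data...)
	}

	// Start from the files of the branch head, if there is one.
	files := map[string]string{}
	parent := -1
	if head, ok := r.branches[branch]; ok {
		parent = head
		for p, h := range r.commits[head].Files {
			files[p] = h
		}
	}
	files[path] = hash

	id := len(r.commits)
	r.commits = append(r.commits, Commit{ID: id, Parent: parent, Files: files})
	r.branches[branch] = id // the branch now points at the new commit
	return id
}

// GetFile returns the contents of path as of the given commit.
func (r *Repo) GetFile(commitID int, path string) ([]byte, bool) {
	if commitID < 0 || commitID >= len(r.commits) {
		return nil, false
	}
	hash, ok := r.commits[commitID].Files[path]
	if !ok {
		return nil, false
	}
	return r.objects[hash], true
}

// ObjectCount reports how many distinct pieces of content are stored.
func (r *Repo) ObjectCount() int { return len(r.objects) }

func main() {
	repo := NewRepo()
	first := repo.PutFile("master", "a.txt", []byte("same bytes\n"))
	repo.PutFile("master", "b.txt", []byte("same bytes\n"))
	latest := repo.PutFile("master", "a.txt", []byte("new bytes\n"))

	old, _ := repo.GetFile(first, "a.txt")
	cur, _ := repo.GetFile(latest, "a.txt")
	fmt.Printf("old=%q current=%q objects=%d\n", old, cur, repo.ObjectCount())
}
```

In this sketch, three commits reference only two stored objects, because a.txt and b.txt initially share the same content, and the first version of a.txt remains readable after it has been replaced on master.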

Deploying/installing Pachyderm

We will be using Pachyderm in various other places in the book to both version data and create distributed ML workflows. Pachyderm itself is an app that runs on top of Kubernetes (https://kubernetes.io/), and is backed by an object store of your choice. For the purposes of this book, development, and experimentation, you can easily install and run Pachyderm locally. It should take 5-10 minutes to install and doesn't require much effort. The instructions for the local installation can be found in the Pachyderm documentation at http://docs.pachyderm.io.

When you are ready to run your workflows in production or deploy your model, you can easily deploy a production-ready Pachyderm cluster that will behave exactly the same way as your local installation. Pachyderm can be deployed to any cloud, or even on premises.

As mentioned, Pachyderm is an open source project and has an active group of users. If you have questions or need help, you can join the public Pachyderm Slack channel by visiting http://slack.pachyderm.io/. The active Pachyderm users and the Pachyderm team itself will be able to respond very quickly to your questions there.

Creating data repositories for data versioning

If you followed the local installation of Pachyderm specified in the Pachyderm documentation, you should have the following:

Kubernetes running in a Minikube VM on your machine
The pachctl command line tool installed and connected to your Pachyderm cluster

Of course, if you have a production cluster running in a cloud, the following steps still apply. Your pachctl would just be connected to the remote cluster.

Below, we will demonstrate the data versioning functionality with the pachctl command-line interface (CLI) tool, which is itself a Go program. However, as mentioned above, Pachyderm has a full-fledged Go client. You can create repositories, commit data, and much more directly from your Go programs. This functionality will be demonstrated later in Chapter 9, Deploying and Distributing Analyses and Models.

To create a repository of data called myrepo, you can run the following command:

$ pachctl create-repo myrepo

You can then confirm that the repository exists with list-repo:

$ pachctl list-repo
NAME     CREATED         SIZE
myrepo   2 seconds ago   0 B

This myrepo repository is a collection of data that we have defined and that is ready to house versioned data. Right now, the repository is empty, because we haven't put any data there yet.

Putting data into data repositories

Let's say that we have a simple text file:

$ cat blah.txt 
This is an example file.

If this file is part of the data we are utilizing in our ML workflow, we should version it. To version this file in our repository, myrepo, we just need to commit it into that repository:

$ pachctl put-file myrepo master -c -f blah.txt

The -c flag specifies that we want Pachyderm to open a new commit, insert the file we are referencing, and close the commit all in one shot. The -f flag specifies that we are providing a file.

Note that we are committing a single file to the master branch of a single repository here. However, the Pachyderm API is very flexible. We can commit, delete, or otherwise modify many versioned files in a single commit or over multiple commits. Further, the files to be versioned could come from a URL, an object store link, a database dump, and so on.

As a sanity check, we can confirm that our file was versioned in the repository:

$ pachctl list-repo
NAME     CREATED          SIZE
myrepo   10 minutes ago   25 B
$ pachctl list-file myrepo master
NAME       TYPE   SIZE
blah.txt   file   25 B

Getting data out of versioned data repositories

Now that we have versioned data in Pachyderm, we probably want to know how to interact with that data. The primary way is via Pachyderm data pipelines (which will be discussed later in this book). When using pipelines, the mechanism for interacting with versioned data is simple file I/O.
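As a rough illustration of what file I/O against versioned data can look like, the following sketch reads every file in an input directory and writes a transformed copy to an output directory. I am assuming the usual Pachyderm convention in which a pipeline sees its input repository under a path such as /pfs/myrepo and writes its results to /pfs/out. The processDir and demo functions are hypothetical names, the upper-casing step is just a stand-in for real processing, and demo uses temporary directories so that the example runs without a cluster.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// processDir reads each regular file in inDir and writes an upper-cased
// copy with the same name into outDir.
func processDir(inDir, outDir string) error {
	entries, err := os.ReadDir(inDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(inDir, e.Name()))
		if err != nil {
			return err
		}
		out := strings.ToUpper(string(data))
		if err := os.WriteFile(filepath.Join(outDir, e.Name()), []byte(out), 0o644); err != nil {
			return err
		}
	}
	return nil
}

// demo runs processDir on temporary directories that stand in for the
// assumed /pfs/myrepo (input) and /pfs/out (output) paths.
func demo() (string, error) {
	inDir, err := os.MkdirTemp("", "pfs-in")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(inDir)

	outDir, err := os.MkdirTemp("", "pfs-out")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(outDir)

	in := filepath.Join(inDir, "blah.txt")
	if err := os.WriteFile(in, []byte("This is an example file.\n"), 0o644); err != nil {
		return "", err
	}
	if err := processDir(inDir, outDir); err != nil {
		return "", err
	}
	out, err := os.ReadFile(filepath.Join(outDir, "blah.txt"))
	return string(out), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Nothing in processDir refers to Pachyderm at all, which is the point: under the assumption above, the same function could be pointed at the pipeline's input and output paths without any changes.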

However, if we want to manually pull certain sets of versioned data out of Pachyderm and analyze them interactively, we can use the pachctl CLI to get the data:

$ pachctl get-file myrepo master blah.txt
This is an example file.
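If we want the same data inside a Go program before getting to the Go client, one simple option is to shell out to pachctl. The following is a minimal sketch that assumes pachctl is on the PATH, is connected to the cluster, and accepts the get-file syntax shown above. The getFileArgs and getFile helpers are hypothetical names of my own, not part of any Pachyderm package.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// getFileArgs builds the arguments for the get-file invocation used in the
// text: pachctl get-file <repo> <branch> <path>.
func getFileArgs(repo, branch, path string) []string {
	return []string{"get-file", repo, branch, path}
}

// getFile runs pachctl and returns whatever it writes to standard output,
// which for get-file is the content of the requested file.
func getFile(repo, branch, path string) ([]byte, error) {
	return exec.Command("pachctl", getFileArgs(repo, branch, path)...).Output()
}

func main() {
	data, err := getFile("myrepo", "master", "blah.txt")
	if err != nil {
		// Typically means pachctl is missing or the cluster is unreachable.
		fmt.Fprintln(os.Stderr, "pachctl get-file failed:", err)
		return
	}
	fmt.Printf("read %d bytes: %s", len(data), data)
}
```

Shelling out keeps the sketch independent of any particular version of the Go client, at the cost of depending on the CLI being installed wherever the program runs.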
