Reproducible Data Science with Pachyderm

Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0

Product type Paperback
Published in Mar 2022
Publisher Packt
ISBN-13 9781801074483
Length 364 pages
Edition 1st Edition
Author: Svetlana Karslioglu
Table of Contents

Preface
Section 1: Introduction to Pachyderm and Reproducible Data Science
  Chapter 1: The Problem of Data Reproducibility
  Chapter 2: Pachyderm Basics
  Chapter 3: Pachyderm Pipeline Specification
Section 2: Getting Started with Pachyderm
  Chapter 4: Installing Pachyderm Locally
  Chapter 5: Installing Pachyderm on a Cloud Platform
  Chapter 6: Creating Your First Pipeline
  Chapter 7: Pachyderm Operations
  Chapter 8: Creating an End-to-End Machine Learning Workflow
  Chapter 9: Distributed Hyperparameter Tuning with Pachyderm
Section 3: Pachyderm Clients and Tools
  Chapter 10: Pachyderm Language Clients
  Chapter 11: Using Pachyderm Notebooks
Other Books You May Enjoy

Demystifying MLOps

This section defines Machine Learning Operations (MLOps) and describes why it is crucial to establish a reliable MLOps process within your data science department.

In many organizations, data science departments were created only in the last few years, and the profession of data scientist is fairly new as well. As a result, many of these departments still have to find a way to integrate into existing corporate processes and to ensure the reliability and scalability of data science deliverables.

In many cases, the burden of building a suitable infrastructure falls on the data scientists themselves, who are often less familiar with the latest infrastructure trends. Another problem is making it all work across different languages, platforms, and environments. In the end, data scientists spend more time building the infrastructure than working on the model itself. This is where a new discipline has emerged to help bridge the gap between data science and infrastructure.

MLOps is a set of practices that defines the machine learning development process as a lifecycle: it identifies the stages of machine learning operations and ensures the reliability of the data science process. Although the term was coined fairly recently, most data scientists agree that a successful MLOps process should adhere to the following principles:

  • Collaboration: This principle implies that everything that goes into developing an ML model must be shared among data scientists to preserve knowledge.
  • Reproducibility: This principle implies that not only the code but also the datasets, metadata, and parameters should be versioned, so that every production model can be reproduced.
  • Continuity: This principle implies that the lifecycle of a model is a continuous process in which the lifecycle stages repeat and the model improves with each iteration.
  • Testability: This principle implies that the organization implements ML testing and monitoring practices to ensure the model's quality.
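
To make the reproducibility principle more concrete, here is a minimal Python sketch of one possible way to tie a model run to the exact code, dataset, and parameters that produced it. The function name fingerprint_run and the sample inputs are hypothetical and used only for illustration; this is not how Pachyderm itself versions data.

```python
import hashlib
import json


def fingerprint_run(code: bytes, dataset: bytes, params: dict) -> str:
    """Return one SHA-256 digest that covers code, data, and parameters."""
    digest = hashlib.sha256()
    # Serialize parameters with sorted keys so key order does not matter.
    encoded_params = json.dumps(params, sort_keys=True).encode("utf-8")
    for part in (code, dataset, encoded_params):
        # Hash each part on its own first so part boundaries stay unambiguous.
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()


# Hypothetical inputs standing in for a real code base and dataset.
code = b"def predict(x): return 2 * x"
dataset = b"x,y\n1,2\n2,4\n"
params = {"learning_rate": 0.01, "epochs": 10}

baseline = fingerprint_run(code, dataset, params)
# Same inputs give the same fingerprint, regardless of parameter key order.
same = fingerprint_run(code, dataset, {"epochs": 10, "learning_rate": 0.01})
# Adding a single data row changes the fingerprint.
changed = fingerprint_run(code, dataset + b"3,6\n", params)

print(baseline == same, baseline == changed)  # prints: True False
```

Storing such a fingerprint next to each trained model would let a team check later whether a result was produced from the same inputs. A real system would also need to keep the inputs themselves, not only their hashes.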

Before we dive into the MLOps process stages, let's take a look at more established software development practices. DevOps is a software development practice that is used in many enterprise-level software projects. A typical DevOps lifecycle includes the following stages that continuously repeat, ensuring product improvement:

  • Planning: In this stage, the overall vision for the software is developed, and a more detailed design is devised.
  • Development: In this stage, the code is written, and the planned functionality is implemented. The code is shared through version control systems, such as Git, which ensures collaboration between software developers.
  • Testing: In this stage, the developed code is tested for defects through an automated or manual process.
  • Deployment: In this stage, the code is released to production servers, and the users have a chance to test it and provide feedback.
  • Monitoring: In this stage, the DevOps engineers focus on software performance and causes of outages, identifying possible areas of improvement.
  • Operations: This stage ensures the automated release of software updates.

The following diagram illustrates the DevOps lifecycle:

Figure 1.5 – DevOps Lifecycle

All these phases are continuously repeated, enabling communication between departments and a customer feedback loop. This practice has brought enterprises such benefits as a faster development cycle, better products, and continuous innovation. Better teamwork enabled by the close relationships between departments is one of the key factors that make this process efficient.

Data scientists deserve a process that brings the same level of reliability. One of the biggest problems of enterprise data science is that very few machine learning models make it to production. Many companies are just starting to adopt data science, and the new departments face unprecedented challenges. Often, the teams lack an understanding of the workflows that need to be implemented in order to make enterprise-level data science work.

Another important challenge is that, unlike in traditional software development, data scientists work not only with code but also with data and parameters. Data comes from the real world, whereas code is carefully developed in a controlled environment. The two meet only when they are combined in a model.

The challenges that all data science departments face include the following:

  • Inconsistent or totally absent data science processes
  • No way to track data changes and reproduce past results
  • Slow performance

In many enterprises, data science departments are still small and struggle to create a reliable workflow. Building such a process requires specific expertise: an understanding of traditional software practices, such as DevOps, combined with an understanding of data science challenges. That is where MLOps started to emerge, combining data science with the best practices of software development.

If we try to apply similar DevOps practices to data science, here is what we might see:

  • Design: In this stage, data scientists work on acquiring the data and designing a data pipeline, often implemented as an Extract, Transform, Load (ETL) pipeline. A data pipeline is a sequence of transformation steps that data goes through and that ends with an output result.
  • Development: In this stage, data scientists write the algorithmic code for the previously designed data pipeline.
  • Training: In this stage, the model is trained with the selected or autogenerated data. During this stage, such techniques as hyperparameter tuning can be used.
  • Validation: In this stage, the trained model is validated to work with the rest of the data pipeline.
  • Deployment: In this stage, the trained and validated model is deployed into production.
  • Monitoring: In this stage, the model is constantly monitored for performance and possible flaws, and feedback is delivered directly to the data scientist for further improvement.
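
The stages above can be sketched as a tiny Python pipeline. This is only an illustration under simplifying assumptions: the data is made up, the "model" is a single threshold, and all function names (extract, transform, train, validate, run_pipeline) are hypothetical rather than part of any MLOps tool.

```python
from statistics import mean


def extract():
    # Design: acquire raw records as (feature, label); one row is incomplete.
    return [(1.0, 0), (2.0, 0), (None, 1), (4.0, 1), (5.0, 1)]


def transform(rows):
    # Design: a transformation step that drops rows with a missing feature.
    return [(x, y) for x, y in rows if x is not None]


def train(rows):
    # Training: the "model" is the midpoint between the two class means.
    negative = mean(x for x, y in rows if y == 0)
    positive = mean(x for x, y in rows if y == 1)
    return (negative + positive) / 2


def validate(threshold, holdout):
    # Validation: accuracy of the threshold model on data not used in training.
    correct = sum((x > threshold) == bool(y) for x, y in holdout)
    return correct / len(holdout)


def run_pipeline(min_accuracy=0.9):
    rows = transform(extract())
    threshold = train(rows)
    holdout = [(1.5, 0), (4.5, 1)]  # hypothetical held-out examples
    accuracy = validate(threshold, holdout)
    # Deployment: promote the model only if it passes validation.
    return {
        "threshold": threshold,
        "accuracy": accuracy,
        "deployed": accuracy >= min_accuracy,
    }


result = run_pipeline()
print(result)
```

Each function maps to one stage, so a failure in validation stops the model before deployment instead of surfacing in production.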

Similar to DevOps, the stages of MLOps are constantly repeated. The following diagram shows the stages of MLOps:

Figure 1.6 – MLOps Lifecycle

As you can see, the two practices are very similar, and the latter borrows the main concepts from the former. Using MLOps in practice has brought the following advantages to enterprise-level data science:

  • Faster go-to-market delivery: A data science model only has value when it is successfully deployed in production. With so many companies struggling to implement a proper process in their data science departments, an MLOps solution can genuinely make a difference.
  • Cross-team collaboration and communication: Software-development practices applied to data science create a common ground for developers, data scientists, and IT operations to work together and speak the same language.
  • Reproducibility and knowledge transfer: Keeping the code, the datasets, and the history of changes plays a big role in the improvement of overall model quality and enables data scientists to learn from each other's examples, contributing to innovation and feature development.
  • Automation: Automating a data pipeline helps to keep the process consistent across multiple releases and speeds up the promotion of a Proof of Concept (POC) model to a production-grade pipeline.
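
As a rough illustration of how automation and continuous repetition of the stages fit together, the following Python sketch monitors a deployed model on new batches of data and retrains it when accuracy drops. The batches, the 0.9 accuracy limit, and the function names are assumptions made for this example only.

```python
def train(rows):
    # Hypothetical "model": the midpoint between the two class means.
    negative = [x for x, y in rows if y == 0]
    positive = [x for x, y in rows if y == 1]
    return (sum(negative) / len(negative) + sum(positive) / len(positive)) / 2


def accuracy(threshold, rows):
    # Fraction of rows where the threshold rule matches the label.
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


# Each batch stands for data observed in production during one iteration.
batches = [
    [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)],
    [(1.5, 0), (2.5, 0), (4.5, 1), (5.5, 1)],
    [(4.0, 0), (5.0, 0), (7.0, 1), (8.0, 1)],  # the data has drifted
]

model = train(batches[0])
history = []
for iteration, batch in enumerate(batches[1:], start=1):
    score = accuracy(model, batch)  # monitoring
    retrained = score < 0.9
    if retrained:
        model = train(batch)  # feedback triggers a new training stage
    history.append((iteration, score, retrained))

print(history)  # prints: [(1, 1.0, False), (2, 0.5, True)]
```

In the second iteration the drifted batch halves the accuracy, so the loop retrains without manual intervention. A production setup would typically add alerting and validation before redeploying.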

In this section, we've learned about the important stages of the MLOps process. In the next section, we will learn more about the types of data science platforms that can help you implement MLOps in your organization.
