You're reading from Practical Machine Learning on Databricks Seamlessly transition ML models and MLOps on Databricks

Product type Paperback

Published in Nov 2023

Publisher Packt

ISBN-13 9781801812030

Length 244 pages

Edition 1st Edition

Languages

Python

Tools

MLOps

Concepts

Data Science

Author (1):

Debu Sinha

Table of Contents (16) Chapters

Preface

1. Part 1: Introduction

2. Chapter 1: The ML Process and Its Challenges

3. Chapter 2: Overview of ML on Databricks

4. Part 2: ML Pipeline Components and Implementation

5. Chapter 3: Utilizing the Feature Store

6. Chapter 4: Understanding MLflow Components on Databricks

7. Chapter 5: Create a Baseline Model Using Databricks AutoML

8. Part 3: ML Governance and Deployment

9. Chapter 6: Model Versioning and Webhooks

10. Chapter 7: Model Deployment Approaches

11. Chapter 8: Automating ML Workflows Using Databricks Jobs

12. Chapter 9: Model Drift Detection and Retraining

13. Chapter 10: Using CI/CD to Automate Model Retraining and Redeployment

14. Index

15. Other Books You May Enjoy

Challenges with productionizing machine learning use cases in organizations

At this point, we understand what a typical ML project life cycle looks like in an organization and the different personas involved in the ML process. The process looks intuitive, yet many enterprises still struggle to deliver business value from their data science projects.

In 2017, Gartner analyst Nick Heudecker stated that 85% of data science projects fail. A report published by Dimensional Research (https://dimensionalresearch.com/) found that only 4% of companies have succeeded in deploying ML use cases to production. A 2021 study by Rackspace Technology found that only 20% of the 1,870 organizations surveyed across various industries have mature AI and ML practices.

Sources

See the Further reading section for more details on these statistics.

Most enterprises face some common technical challenges in successfully delivering business value from data science projects:

Unintended data silos and messy data: Data silos are groups of data in an organization that are governed by, and accessible only to, specific users or groups. There are valid reasons to have data silos, such as compliance with privacy regulations like the General Data Protection Regulation (GDPR) in Europe or the California Consumer Privacy Act (CCPA). These cases are usually the exception rather than the norm. Gartner stated that almost 87% of organizations have low analytics and business intelligence maturity, meaning that data is not being fully utilized.
Data silos generally arise because different departments within an organization use different technology stacks to manage and process data.
The following figure highlights this challenge:

Figure 1.3 – The tools used by the different teams in an organization and the different silos

The different personas work with different sets of tools and have different work environments. Data analysts, data engineers, data scientists, and ML engineers utilize different tools and development environments due to their distinct roles and objectives. Data analysts rely on SQL, spreadsheets, and visualization tools for insights and reporting. Data engineers work with programming languages and platforms such as Apache Spark to build and manage data infrastructure. Data scientists use statistical programming languages, ML frameworks, and data visualization libraries to develop predictive models. ML engineers combine ML expertise with software engineering skills to deploy models into production systems. These divergent toolsets can pose challenges in terms of data consistency, tool compatibility, and collaboration. Standardized processes and knowledge sharing can help mitigate these challenges and foster effective teamwork. Traditionally, there is little to no collaboration between these teams. As a result, a data science use case with a validated business value may not be developed at the required pace, negatively impacting the growth and effective management of the business.

When data lakes emerged in the past decade, they promised a scalable and cheap way to store both structured and unstructured data. The goal was to enable effective, organization-wide use of data and collaboration around it. In reality, most data lakes ended up becoming data swamps, with little to no governance over data quality.

This inherently made ML very difficult since an ML model is only as good as the data it’s trained on.

Building and managing an effective ML production environment is challenging: The ML teams at Google have done a lot of research on the technical challenges of setting up an ML development environment. A research paper from Google published at NeurIPS 2015, Hidden Technical Debt in Machine Learning Systems (https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf), documented that writing ML code is only a small piece of the whole ML development life cycle. To develop an effective ML practice in an organization, many tools, configurations, and monitoring aspects need to be integrated into the overall architecture. One of the critical components is monitoring drift in model performance, feeding the results back, and retraining the model:

Figure 1.4 – Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015
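
To make the idea of drift monitoring concrete, here is a minimal sketch in plain Python. It is not the book's method or a Databricks API: it assumes drift is measured on a single numeric feature with the Population Stability Index (PSI), and the function names, the 10-bin layout, and the 0.2 threshold (a commonly quoted rule of thumb) are all illustrative assumptions.

```python
import math
import random


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a recent sample of one numeric feature.

    Bin edges are derived from the baseline (training-time) sample only.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cut-off; tune per use case


def needs_retraining(expected, actual, threshold=DRIFT_THRESHOLD):
    """Hypothetical feedback hook: flag the model for retraining on high PSI."""
    return population_stability_index(expected, actual) > threshold


# Synthetic data for illustration only.
rng = random.Random(42)
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # training-time feature values
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # production data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]   # production data, mean has moved

print("no drift  ->", needs_retraining(baseline, same))
print("with drift ->", needs_retraining(baseline, shifted))
```

In a real platform, a check like this would run on a schedule against fresh production data, and a positive result would trigger an alert or a retraining job rather than a print statement.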

Let’s understand the requirements of an enterprise-grade ML platform a bit more.
