Why is reproducibility important?
First of all, let's define AI, ML, and data science.
Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.
AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine learning is a subset of AI based on the idea that an algorithm can learn from past experience.
Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. And although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in data science.
Not only is a reproducible experiment more likely to be free of errors, but it also allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.
It's no secret that data science has become one of the hottest topics of the last 10 years. Many big tech companies have opened dozens of high-paying data scientist, data engineer, and data analyst positions. With that, interest in joining the profession has been rising exponentially. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.
Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.
The number of AI conferences has been steadily growing as well. Even during the pandemic, when in-person events became impossible, the AI community continued to meet in a virtual format. Flagship conferences such as Neural Information Processing Systems (NeurIPS) and the International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors, took place online with significant attendance.
According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.
The following graph shows the growth of AI-related start-ups in recent years:
The total private investment in AI start-ups grew by more than 30 times in the last 10 years.
And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:
The United States leads other countries in the number of published patents.
These trends boost the economy and industry but inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to ensure the validation of data science models. The replication of experiments is an important part of a data science model's quality control.
Next, let's learn what a model is.
What is a model?
Let's define what a model is. A data science or AI model is a simplified representation of a process that also suggests possible results. Whether it is a weather-prediction algorithm or a website attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When a data scientist creates a model, they need to decide which critical parameters to include in it, because they cannot include everything. A model is therefore a simplified version of a process, and that is where trade-offs are made based on the data scientist's or organization's definition of success.
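To make this concrete, here is a minimal sketch (not taken from the original example) of such a simplified model: a toy website attendance calculator built with scikit-learn on made-up numbers. It deliberately keeps only two parameters, the day of the week and whether a marketing campaign is running, and ignores everything else about the real process.

```python
# A minimal sketch of a model as a simplified representation of a process.
# The only parameters kept are the day of the week and whether a campaign
# is running; everything else about real website traffic is left out.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: [day_of_week (0=Mon .. 6=Sun), campaign_running (0/1)]
X = np.array([[0, 0], [1, 0], [2, 1], [3, 1], [4, 0], [5, 1], [6, 0]])
y = np.array([1200, 1150, 2400, 2500, 1100, 2700, 900])  # daily visits (made up)

model = LinearRegression().fit(X, y)

# The model suggests the most probable outcome for a new situation:
# expected visits on a Wednesday (day 2) with a campaign running.
print(model.predict(np.array([[2, 1]])))
```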
The following diagram demonstrates a data model:
Every model needs a continuous data flow to improve and perform correctly. Consider the Amazon Go stores, where shoppers' behavior is analyzed by multiple cameras inside the store. The models that ensure safety in the store are trained continuously on real-life customer behavior. These models had to learn that shoppers sometimes pick up an item, change their mind, and put it back; sometimes they drop an item on the floor and damage the product, and so on. The Amazon Go store model is likely accurate because it has access to a lot of real data and improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.
A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.
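As an illustration (not part of the original text), here is a minimal sketch of how a synthetic dataset might be generated with scikit-learn. Every distributional choice below is an assumption baked into the generator, which is exactly why such data can misrepresent the real world:

```python
# A minimal sketch of generating a synthetic classification dataset.
# The data is only as good as the assumptions of the generator: any
# real-world structure not modeled here is simply missing.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,    # number of artificial examples
    n_features=20,     # total number of features
    n_informative=5,   # features that actually carry signal
    n_redundant=2,     # linear combinations of the informative features
    class_sep=1.0,     # how easily separable the classes are (our assumption)
    random_state=42,   # fixed seed so the generated "dataset" is reproducible
)
print(X.shape, y.mean())  # (1000, 20) and the class balance
```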
IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients based on a provided list of symptoms in a matter of seconds. Such an invention could greatly speed up the diagnosis process, and in parts of the world where people have no access to healthcare, a system like that could save many lives. Unfortunately, although the original promise was to replace doctors, Watson is a recommendation system that can assist in diagnosis, but nothing more than that. One of the reasons is that Watson was trained on a synthetic dataset rather than on real data.
There are cases when detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed at the University of Washington that was built to identify whether an image showed a husky or a wolf. The model seemed to work really well, predicting the correct result with almost 90% accuracy. However, when the scientists dug a bit deeper into the algorithm and the data, they learned that the model was basing its predictions on the background: the majority of images with huskies had grass in the background, while the majority of images with wolves had snow in the background.
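The researchers uncovered this by explaining individual predictions rather than trusting the overall accuracy number. The following is a hedged sketch of how one might probe an image classifier in a similar way with the LIME package; the image and the classifier function below are dummy placeholders standing in for a real model and a real photo:

```python
# A sketch of probing an image classifier with LIME to see which parts of
# an image drive a prediction. The image and classifier below are dummy
# placeholders; plug in your own photo and model.predict_proba instead.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

image = np.random.rand(64, 64, 3)  # placeholder for a husky/wolf photo

def classifier_fn(batch):
    # Placeholder: a real classifier would return per-class probabilities.
    return np.tile([0.7, 0.3], (len(batch), 1))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,
    classifier_fn,
    top_labels=2,
    num_samples=200,  # number of perturbed images LIME generates
)

# Highlight the superpixels that contributed most to the top prediction.
# If the highlighted region is the snowy background rather than the animal,
# the model has learned a spurious correlation.
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False
)
highlighted = mark_boundaries(img, mask)
```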
The main principles of reproducibility
How can you ensure that a data science process in your company adheres to the principles of reproducibility? Here is a list of the main principles of reproducibility:
- Use open data: The data that is used for training models should not be a black box. It has to be available to other data scientists in an unmodified state.
- Train the model on many examples: The information about the experiments and about how many examples the model was trained on must be available for review.
- Rigorously document the process: Data modifications, statistical failures, and experiment performance must be thoroughly documented so that the author and other data scientists can reproduce the experiment in the future (a minimal sketch of such a record follows this list).
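As a hedged illustration of the last point, here is a minimal sketch, using only the Python standard library, of what such an experiment record might look like in practice. The file path, parameters, and metric values are hypothetical:

```python
# A minimal sketch of documenting an experiment so it can be reproduced
# later: the dataset fingerprint, training parameters, random seed, and
# observed results are written down alongside the run.
import hashlib
import json
import platform
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    """Fingerprint of the exact, unmodified dataset used for training."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

experiment_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "dataset": {"path": "data/train.csv", "sha256": file_sha256("data/train.csv")},
    "n_training_examples": 50_000,           # how many examples the model saw
    "params": {"model": "LogisticRegression", "C": 1.0, "random_seed": 42},
    "metrics": {"accuracy": 0.87},           # observed experiment performance
    "environment": {"python": platform.python_version()},
}

with open("experiment_record.json", "w") as f:
    json.dump(experiment_record, f, indent=2)
```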
Let's consider a few examples where reproducibility, collaboration, and open data principles were not part of the experiment process.
A few years ago, a group of scientists at Duke University gained wide attention when they made the ambitious claim that they could predict the course of lung cancer based on data collected from patients. The medical community was very excited about the prospect of such a discovery. However, a group of other scientists at the MD Anderson Cancer Center in Houston found severe errors in that research when they tried to reproduce the original result. They discovered mislabeling in the chemotherapy prediction model, mismatches between genes and gene-expression data, and other issues that made it significantly less likely that treatment prescribed on the basis of the model's calculations would be correct. While the flaws were eventually uncovered, it took almost 3 years and more than 2,000 working hours for the researchers to get to the bottom of the problem, which could have been easily avoided if proper research practices had been established in the first place.
Now let's look at how AI can go wrong, based on a chatbot example. You might remember the infamous Microsoft chatbot called Tay. Tay was a bot that could learn from its conversations with internet users. When Tay went live, its first conversations were friendly, but overnight its language changed, and it started to post harmful, racist, and overall inappropriate responses. It learned from the users who taught it to be rude, and since the bot was designed to mirror human behavior, it did what it was created for. Why was it not racist from the very beginning, you might ask? The answer is that it was trained on clean, cherry-picked data that did not include vulgar and abusive language. But we cannot control the web and what people post, and the bot did not have any sense of morality programmed into it. This experiment raised many questions about AI ethics and how we can ensure that the AI we build does not turn on us one day.
The new generation of chatbots is built on the recently released GPT-3 model. These chatbots rely on neural networks that form associations during training that cannot easily be broken. Although the technology behind them seems more advanced than that of their predecessors, they can still easily become racist and hateful depending on the data they are trained on. If a bot is trained on misogynistic and hateful conversations, it will be offensive and will likely reply inappropriately.
As you can see, data science, AI, and machine learning are powerful technologies that help us solve many difficult problems, but at the same time, they can endanger their users and have devastating consequences. The data science community needs to work on devising better ways of minimizing adverse outcomes by establishing proper standards and processes to ensure the quality of data science experiments and AI software.
Now that we've seen why reproducibility is so important, let's look at what consequences it has for the scientific community and data science.