Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0

Chapter 1: The Problem of Data Reproducibility

Today, machine learning algorithms are used everywhere. They are integrated into our day-to-day lives, and we use them without noticing. While we are rushing to work, planning a vacation, or visiting a doctor's office, the models are at work, at times making important decisions about us. If we are unsure what the model is doing and how it makes decisions, how can we be sure that its decisions are fair and just? Pachyderm profoundly cares about the reproducibility of data science experiments and puts data lineage, reproducibility, and version control at its core. But before we proceed, let's discuss why reproducibility is so important.

This chapter explains the concepts of reproducibility, ethical Artificial Intelligence (AI), and Machine Learning Operations (MLOps), as well as providing an overview of the existing data science platforms and how they compare to Pachyderm.

In this chapter, we're going to cover the following main topics:

  • Why is reproducibility important?
  • The reproducibility crisis in science
  • Demystifying MLOps
  • Types of data science platforms
  • Explaining ethical AI

Why is reproducibility important?

First of all, let's define AI, ML, and data science.

Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.

AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine learning is a subset of AI that is based on the idea that an algorithm can learn based on past experiences.

Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. Although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in data science.

Not only is a reproducible experiment more likely to be free of errors, but it also takes the experiment further and allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.

It's no secret that data science has become one of the hottest topics of the last 10 years. Many big tech companies have opened dozens of high-paying data scientist, data engineer, and data analyst positions, and the demand to join the profession has risen sharply. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.

Figure 1.1 – AI publications trend, from the AI Index 2019 Annual Report (p. 5)

Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.

The number of AI conferences has been steadily growing as well. Even in the pandemic world, where in-person events have become impossible, the AI community has continued to meet in a virtual format. Such flagship conferences as Neural Information Processing Systems (NeurIPS) and International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors, took place online with significant attendance.

According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.

The following graph shows the growth of AI-related start-ups in recent years:

Figure 1.2 – Total private investment in AI-related start-ups worldwide, from the AI Index 2019 Annual Report (p. 88)

The total private investment in AI start-ups grew by more than 30 times in the last 10 years.

And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:

Figure 1.3 – Total number of AI patents (2015-2018), from the AI Index 2019 Annual Report (p. 32)

The United States leads all other countries in the number of published patents.

These trends boost the economy and industry but inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to ensure the validation of data science models. The replication of experiments is an important part of a data science model's quality control.

Next, let's learn what a model is.

What is a model?

A data science or AI model is a simplified representation of a process that also suggests possible results. Whether it is a weather-prediction algorithm or a website attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When data scientists create a model, they cannot include everything, so they must decide which critical parameters to include. This is where sacrifices are made, based on the data scientist's or organization's definition of success.

The following diagram demonstrates a data model:

Figure 1.4 – Data science model

Every model needs a continuous data flow to improve and perform correctly. Consider the Amazon Go stores where shoppers' behavior is analyzed by multiple cameras inside the store. The models that ensure safety in the store are trained continuously on real-life customer behavior. These models had to learn that sometimes shoppers might pick up an item and then change their mind and put it back; sometimes shoppers can drop an item on the floor, damaging the product, and so on. The Amazon Go store model is likely good because it has access to a lot of real data, and it improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.

A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.

IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients from a provided list of symptoms in a matter of seconds. Such an invention could greatly speed up the diagnosis process, and in places where people have no access to healthcare, a system like that could save many lives. Unfortunately, although the original promise was to replace doctors, Watson remains a recommendation system that can assist in diagnosis, but nothing more than that. One of the reasons is that Watson was trained on a synthetic dataset and not on real data.

There are cases when detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed by the University of Washington that was built to identify whether an image had a husky portrayed in it or a wolf. The model was seemingly working really well, predicting the correct result with almost 90% accuracy. However, when the scientists dug a bit deeper into the algorithm and data, they learned that the model was basing its predictions on the background. The majority of images with huskies had grass in the background, while the majority of images with wolves had snow in the background.

The main principles of reproducibility

How can you ensure that a data science process in your company adheres to the principles of reproducibility? Here is a list of the main principles of reproducibility:

  • Use open data: The data that is used for training models should not be a black box. It has to be available to other data scientists in an unmodified state.
  • Train the model on many examples: Information about the experiments, including the number of examples the model was trained on, must be available for review.
  • Rigorously document the process: The process of data modifications, statistical failures, and experiment performance must be thoroughly documented so that the author and other data scientists can reproduce the experiment in the future.
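
As a rough illustration of these principles, the following Python sketch records what a reviewer would need to repeat a run: a fingerprint of the unmodified dataset, the number of examples, and the random seed. The function names, the toy dataset, and the "experiment" itself (a seeded sample mean) are all hypothetical and chosen only to keep the example small; this is not how any particular platform implements it.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 digest of the dataset so others can check it is unmodified."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, seed):
    """Hypothetical experiment: draw a seeded sample and report its mean."""
    rng = random.Random(seed)  # a fixed seed makes the sampling repeatable
    sample = rng.sample(rows, k=3)
    return sum(sample) / len(sample)


def experiment_record(rows, seed):
    """Bundle the details another data scientist needs to repeat the run."""
    return {
        "dataset_sha256": fingerprint(rows),
        "n_examples": len(rows),
        "seed": seed,
        "result": run_experiment(rows, seed),
    }


data = [2, 4, 6, 8, 10, 12]  # made-up illustrative data
record = experiment_record(data, seed=42)
print(json.dumps(record, indent=2))
```

Anyone holding the same data and the same seed should obtain an identical record, while any change to the dataset changes the fingerprint and makes the modification visible.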

Let's consider a few examples where reproducibility, collaboration, and open data principles were not part of the experiment process.

A few years ago, a group of scientists at Duke University became wildly popular because they emerged with an ambitious claim of predicting the course of lung cancer based on the data collected from patients. The medical community was very excited about the prospect of such a discovery. However, a group of other scientists at the MD Anderson Cancer Center in Houston found severe errors in that research when they tried to reproduce the original result. They discovered mislabeling in the chemotherapy prediction model, mismatches between genes and gene-expression data, and other issues that would make correct treatment prescription based on the model calculations significantly less likely. While the flaws were eventually unraveled, it took almost 3 years and more than 2,000 working hours for the researchers to get to the bottom of the problem, which could have been easily avoided if proper research practices had been established in the first place.

Now let's look at how AI can go wrong based on a chatbot example. You might remember the infamous Microsoft chatbot called Tay. Tay was a bot that could learn from its conversations with internet users. When Tay went live, its first conversations were friendly, but overnight its language changed, and it started to post harmful, racist, and overall inappropriate responses. It learned from the users who taught it to be rude, and as the bot was designed to mirror human behavior, it did what it was created for. Why was it not racist from the very beginning, you might ask? The answer is that it was trained on clean, cherry-picked data that did not include vulgar and abusive language. But we cannot control the web and what people post, and the bot did not have any sense of morals programmed into it. This experiment raised many questions about AI ethics and how we can ensure that the AI that we build does not turn on us one day.

The new generation of chatbots is built on the recently released GPT-3 language model. These chatbots are trained with neural networks that, during training, form associations that are difficult to undo. Although the technology behind them is seemingly more advanced than that of their predecessors, these chatbots can still easily become racist and hateful depending on the data they are trained on. If a bot is trained on misogynistic and hateful conversations, it will be offensive and will likely reply inappropriately.

As you can see, data science, AI, and machine learning are powerful technologies that help us solve many difficult problems, but at the same time, they can endanger their users and have devastating consequences. The data science community needs to work on devising better ways of minimizing adverse outcomes by establishing proper standards and processes to ensure the quality of data science experiments and AI software.

Now that we've seen why reproducibility is so important, let's look at what consequences it has on the scientific community and data science.

The reproducibility crisis in science

The reproducibility crisis is a problem that has been around for more than a decade. Because data science is closely related to science, it is important to review the issues many scientists have outlined in the past and correlate them with similar problems the data science space is facing today.

One of the most important issues is replication. The ability to reproduce the results of a scientific experiment has long been one of the founding principles of good research. In other words, if an experiment can be reproduced, it is considered valid; if not, it could be a one-time occurrence that does not represent a real phenomenon. Unfortunately, in recent years, more and more research papers in sociology, medicine, biology, and other areas of science have failed to withstand retesting against an increased number of samples, even when these papers were published in well-known and trustworthy scientific journals, such as Nature. This tendency could lead to public mistrust in science, and in AI as part of it.

As mentioned previously, because of the popularity and growth of the AI industry, the number of AI papers has increased severalfold. Unfortunately, the quality of these papers has not grown along with their number.

Nature recently conducted a survey among scientists, asking them whether they feel that there is a reproducibility crisis in science. The majority of scientists agreed that false-positive results caused by the pressure to publish frequently definitely exist. Researchers need sponsorship, and sponsors need to see results before investing additional money in the research, which leads to many published papers with declining credibility. Ultimately, the fight for grants and bureaucracy are often named as the main causes of the lack of a reproducibility process in labs.

The research papers that were questioned for integrity have the following common attributes:

  • No code or data were publicly shared for other researchers to attempt to replicate the results.
  • The scientists who attempted to replicate the results failed completely or partially to do it by following the provided instructions.

Even the papers published by Nobel laureates can sometimes be questioned due to an inability to reproduce the results. For example, in 2014, Science magazine retracted a paper published by Nobel Prize winner and immunologist Bruce Beutler. His paper was about the response to pathogens by virus-like organisms in the human genome. This paper was cited over 50 times before it was retracted.

When COVID-19 became a major topic in 2020, multiple papers were published on it. According to Retraction Watch, an online blog that tracks retracted scientific papers, more than 86 of them had been retracted as of March 2021.

In 2019, more than 1,400 science papers were retracted by multiple publishers. This number is huge and has been growing steadily, compared to only 50 papers in the early 2000s, and it has raised awareness of the so-called reproducibility crisis in science. Not all papers are retracted because their results could not be reproduced, but that is often the reason.

Data fishing

Data fishing, or data dredging, is a method of achieving a statistically significant result by running the computation multiple times until the desired result appears, and then reporting only that result while ignoring the inconvenient ones. Sometimes, scientists unintentionally dredge the data to achieve the result they think is most probable and confirm their hypothesis. A more sinister scenario is also possible: a scientist might intentionally manipulate the result of the experiment to reach a predefined conclusion.

An example of such a misuse of data analysis would be if you decided to prove that there is a correlation between banana consumption and an increased level of IQ in children of age 10 and older. This is a completely made-up example, but say you wanted to establish this connection. You would need to get information about IQ level and banana consumption of a big enough sample of children – let's say 5,000.

Then, you would run tests, such as: do kids who eat bananas and exercise have a higher IQ level than the ones who only exercise? Do kids who watch TV and eat bananas have a higher level of IQ compared to the ones who do not? After conducting these tests enough times, you would most likely find some kind of correlation purely by chance. However, this result would not be meaningful, and the scientific community considers data dredging extremely unethical. Similar problems are being seen in data science specifically.
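
To see why repeated testing produces spurious findings, consider the following Python sketch of the made-up banana example. It generates banana consumption and IQ independently, so no real effect exists, and then tests the correlation in many randomly defined subgroups. All names, distributions, and sample sizes here are assumptions made for illustration, and the significance check is a simple large-sample approximation rather than a full statistical test.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def looks_significant(xs, ys, z_crit=1.96):
    """Large-sample approximation of a two-sided test at the 5% level."""
    n = len(xs)
    r = pearson_r(xs, ys)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return abs(t) > z_crit


def dredge(n_children=5000, n_subgroups=100, seed=0):
    """Count subgroups in which two unrelated variables appear correlated."""
    rng = random.Random(seed)
    # Bananas per week and IQ are generated independently: no real effect exists.
    bananas = [rng.randint(0, 14) for _ in range(n_children)]
    iq = [rng.gauss(100, 15) for _ in range(n_children)]
    hits = 0
    for _ in range(n_subgroups):
        # A made-up yes/no habit (exercises, watches TV, ...) defines each subgroup.
        members = [i for i in range(n_children) if rng.random() < 0.5]
        xs = [bananas[i] for i in members]
        ys = [iq[i] for i in members]
        if looks_significant(xs, ys):
            hits += 1
    return hits


n_subgroups = 100
hits = dredge(n_subgroups=n_subgroups)
print(f"{hits} of {n_subgroups} subgroups look 'significant' although no effect exists")
```

At a 5% significance level, each test has about a 1-in-20 chance of passing on pure noise, so reporting only the subgroups that pass, and staying silent about the rest, is exactly the selective reporting described above.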

Without conducting a full investigation, detecting data dredging might be difficult. Possible factors to look for include the following:

  • Was the research conducted by a reputable institution or group of scientists?
  • What does other research in similar areas suggest?
  • Is financial interest involved?
  • Is the claim sensational?

Without a proper process, data dredging and unreliable research will continue to be published. Recently, Nature surveyed around 1,500 researchers from different areas of science, and more than 50% of respondents reported that they had tried and failed to reproduce the results of research in the past. Even more shockingly, in many cases, they failed to reproduce the results of their own experiments.

Out of all respondents, only 24% were able to successfully publish their reproduction attempts and the majority were never contacted with a request to reproduce someone else's research.

Of course, increasing the reproducibility of experiments is costly and can double the time required to conduct an experiment, which many research laboratories might not be able to afford. But if reproducibility is planned into the original research timeline and follows a proper process, it should not be as difficult or burdensome as adding it midway through the research lifecycle.

Even worse, retracting a paper after it was published can be a tedious task. Some publishers even charge researchers a significant amount of money if a paper is retracted. Such practices are truly discouraging.

All of this negatively impacts research all over the world and results in growing mistrust in science. Organizations must take steps to improve processes in their scientific departments and scientific journals must raise the bar of publishing research.

Now that we have learned about data fishing, let's review better reproducibility guidelines.

Better reproducibility in science research guidelines

The Center for Open Science (COS), a non-profit organization that focuses on supporting and promoting open-science initiatives, reproducibility, and integrity of scientific research, has published the Guidelines for Transparency and Openness Promotion (TOP) in Journal Policies and Practices, or the TOP Guidelines. These guidelines emphasize the importance of transparency in published research papers. Researchers can use them to justify the necessity of sharing research artifacts publicly to avoid any possible inquiries regarding the integrity of their work.

The main principles of the TOP Guidelines include the following:

  • Proper citation and credit to original authors: All text, code, and data artifacts that belong to other authors must be outlined in the paper and credit given as needed.
  • Data, methodology, and research material transparency: The authors of the paper must share the written code, methodology, and research materials in a publicly accessible location with instructions on how to access and use them.
  • Design and analysis transparency: The authors should be transparent about the methodology as much as possible, although this might vary by industry. At a minimum, they must disclose the standards that have been applied during the research.
  • Preregistrations of the research and analysis plans: Even if research does not get published, preregistration makes it more discoverable.
  • Reproducibility of obtained results: The authors must include sufficient details on how to reproduce the original results.

Each of these metrics is rated as either not implemented or at one of three levels:

  • Not implemented—information is not included in the report
  • Level 1—available upon request
  • Level 2—access before publication
  • Level 3—verification before publication

Level 3 is the highest level of transparency that a metric can achieve. Having this level of transparency justifies the quality of submitted research. COS applies the TOP factor to rate a journal's efforts to ensure transparency and ultimately the quality of the published research.

Apart from data and code reproducibility, the environment and software used during the research often play a big role. New technologies, such as containers and virtual and cloud environments, make it easier to achieve uniformity in conducted research. Of course, in biochemistry or other fields that require precise lab conditions, achieving uniformity might be more complex.

Now let's learn about common practices that help improve reproducibility.

Common practices to improve reproducibility

Thanks to the work of reproducibility advocates and the problem being widely discussed in scientific communities in recent years, some positive tendencies in increasing reproducibility seem to be emerging. These practices include the following:

  • Request a colleague to reproduce your work.
  • Develop extensive documentation.
  • Standardize research methodology.
  • Preregister your research before publication to avoid data cherry-picking.

There are scientific groups that make it their mission to reproduce and notify researchers about mistakes in their papers. Their typical process is to try to reproduce the result of a paper and write a letter to the researchers or lab to request a correction or retraction. Some researchers willingly collaborate and correct the mistakes in the paper, but in other cases, it is unclear and difficult. One such group has identified the following problems in the 25 papers that they analyzed:

  • Lack of process or point of contact regarding to whom they should address feedback on a paper. Scientific journals do not provide a clear statement on whether feedback can be addressed to the chief editor or whether there is a feedback submission form of some sort.
  • Scientific journal editors accept and act on submissions unwillingly. In some cases, it might take up to a year to publish a warning on a paper that has received critical feedback, even if it was provided by a reputable institution.
  • Some publishers expect you to pay if you want to publish a correction letter and delay retractions.
  • Raw data is not always available publicly. In many cases, publishers did not have a unified process around a shared location for the raw data used in the research. If you have to directly contact an author, you might not be able to get the requested information and it might significantly delay the process. Moreover, they can simply deny such a request.

The lack of a standard for submitting corrections and retracting research papers contributes to the overall reproducibility crisis and hinders knowledge sharing. Papers that used data dredging and other techniques to manipulate the results will become a source of information for future researchers, contributing to the overall misinformation and chaos. Researchers, publishers, and editors should work together on establishing unified post-publication review guidelines that encourage other scientists to participate in testing and providing feedback.

We've learned how reproducibility affects the quality of research. Now, let's review how organizations can establish a process to ensure their data science experiments adhere to best industry practices to ensure high standards.

Demystifying MLOps

This section defines Machine Learning Operations (MLOps) and describes why it is crucial to establish a reliable MLOps process within your data science department.

In many organizations, data science departments have been created fairly recently, in the last few years. The profession of data scientist is fairly new as well. Therefore, many of these departments have to find a way to integrate into the existing corporate process and devise ways to ensure the reliability and scalability of data science deliverables.

In many cases, the burden of building a suitable infrastructure falls on the shoulders of the data scientists themselves, who are often not as familiar with the latest infrastructure trends. Another problem is how to make it all work for different languages, platforms, and environments. In the end, data scientists spend more time building the infrastructure than working on the model itself. This is where a new discipline has emerged to help bridge the gap between data science and infrastructure.

MLOps is a set of practices that defines the machine learning development lifecycle and its stages, ensuring the reliability of the data science process. Although the term was coined fairly recently, most data scientists agree that a successful MLOps process should adhere to the following principles:

  • Collaboration: This principle implies that everything that goes into developing an ML model must be shared among data scientists to preserve knowledge.
  • Reproducibility: This principle implies that not only the code but datasets, metadata, and parameters should be versioned and reproducible for all production models.
  • Continuity: This principle implies that a lifecycle of a model is a continuous process that means repetition of the lifecycle stages and improvement of the model with each iteration.
  • Testability: This principle implies that the organization implements ML testing and monitoring practices to ensure the model's quality.

Before we dive into the MLOps process stages, let's take a look at more established software development practices. DevOps is a software development practice that is used in many enterprise-level software projects. A typical DevOps lifecycle includes the following stages that continuously repeat, ensuring product improvement:

  • Planning: In this stage, the overall vision for the software is developed, and a more detailed design is devised.
  • Development: In this stage, the code is written, and the planned functionality is implemented. The code is shared through version control systems, such as Git, which ensures collaboration between software developers.
  • Testing: In this stage, the developed code is tested for defects through an automated or manual process.
  • Deployment: In this stage, the code is released to production servers, and the users have a chance to test it and provide feedback.
  • Monitoring: In this stage, the DevOps engineers focus on software performance and causes of outages, identifying possible areas of improvement.
  • Operations: This stage ensures the automated release of software updates.

The following diagram illustrates the DevOps lifecycle:

Figure 1.5 – DevOps Lifecycle

All these phases are continuously repeated, enabling communication between departments and a customer feedback loop. This practice has brought enterprises such benefits as a faster development cycle, better products, and continuous innovation. Better teamwork enabled by the close relationships between departments is one of the key factors that make this process efficient.

Data scientists deserve a process that brings the same level of reliability. One of the biggest problems of enterprise data science is that very few machine learning models make it to production. Many companies are just starting to adopt data science, and the new departments face unprecedented challenges. Often, the teams lack an understanding of the workflows that need to be implemented in order to make enterprise-level data science work.

Another important challenge is that, unlike in traditional software development, data scientists operate not only with code but also with data and parameters. Data is taken from the real world, while the code is carefully developed in the office. The only time they cross is when they are combined in a data model.

The challenges that all data science departments face include the following:

  • Inconsistent or totally absent data science processes
  • No way to track data changes and reproduce past results
  • Slow performance

In many enterprises, data science departments are still small and struggle to create a reliable workflow. Building such a process requires certain expertise: an understanding of traditional software practices, such as DevOps, mixed with an understanding of data science challenges. That is where MLOps started to emerge, combining data science with best practices of software development.

If we try to apply similar DevOps practices to data science, here is what we might see:

  • Design: In this phase, data scientists work on acquiring the data and designing a data pipeline, also known as an Extract, Transform, Load (ETL) pipeline. A data pipeline is a sequence of transformation steps data goes through, which ends with an output result.
  • Development: In this stage, data scientists work on writing the algorithmic code for the previously developed data pipeline.
  • Training: In this stage, the model is trained with the selected or autogenerated data. During this stage, such techniques as hyperparameter tuning can be used.
  • Validation: In this stage, the trained model is validated to work with the rest of the data pipeline.
  • Deployment: In this stage, the trained and validated model is deployed into production.
  • Monitoring: In this stage, the model is constantly monitored for performance and possible flaws, and feedback is delivered directly to the data scientist for further improvement.
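
The stages above can be sketched as a chain of small Python functions, where the output of each step feeds the next and deployment is gated on the validation result. This is a minimal sketch under stated assumptions: the synthetic data, the straight-line model, the function names, and the max_error threshold are all hypothetical, and a real pipeline would involve far more than a single script.

```python
import random


def extract(seed=0, n=200):
    """Design: acquire data. Here, made-up points scattered around y = 2x + 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2 * x + 1 + rng.gauss(0, 0.5)))
    return rows


def split(rows, train_fraction=0.8):
    """Hold back part of the data so validation uses unseen examples."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def train(rows):
    """Training: ordinary least squares fit of a line y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    return slope, mean_y - slope * mean_x


def validate(model, rows):
    """Validation: mean absolute error of the model on held-out rows."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)


def run_pipeline(max_error=1.0):
    """Run every stage in order and only approve deployment if validation passes."""
    train_rows, holdout = split(extract())
    model = train(train_rows)
    error = validate(model, holdout)
    return {"model": model, "error": error, "deploy": error <= max_error}


result = run_pipeline()
print(result)
```

Because every stage is a function with explicit inputs and outputs, the whole sequence can be rerun on each iteration, which is what makes the repeated lifecycle practical to automate.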

Similar to DevOps, the stages of MLOps are constantly repeated. The following diagram shows the stages of MLOps:

Figure 1.6 – MLOps Lifecycle

As you can see, the two practices are very similar, and the latter borrows the main concepts from the former. Using MLOps in practice has brought the following advantages to enterprise-level data science:

  • Faster go-to-market delivery: A data science model only has value when it is successfully deployed in production. With so many companies struggling to implement a proper process in their data science departments, an MLOps solution can genuinely make a difference.
  • Cross-team collaboration and communication: Software-development practices applied to data science create a common ground for developers, data scientists, and IT operations to work together and speak the same language.
  • Reproducibility and knowledge transfer: Keeping the code, the datasets, and the history of changes plays a big role in the improvement of overall model quality and enables data scientists to learn from each other's examples, contributing to innovation and feature development.
  • Automation: Automating a data pipeline helps to keep the process consistent across multiple releases and speeds up the promotion of a Proof of Concept (POC) model to a production-grade pipeline.
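
To make the reproducibility point concrete, here is a minimal sketch of recording a fingerprint for an experiment run from its data, code, and parameters. The `fingerprint` function and its inputs are hypothetical names for this example, and this is not how any particular platform implements versioning; it only shows that the same inputs can be tied to one stable identifier.

```python
import hashlib
import json


def fingerprint(data: bytes, code: str, params: dict) -> str:
    """Return a SHA-256 hex digest that changes if data, code, or params change."""
    digest = hashlib.sha256()
    digest.update(hashlib.sha256(data).digest())
    digest.update(hashlib.sha256(code.encode("utf-8")).digest())
    # sort_keys makes the digest independent of dictionary insertion order
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


dataset = b"x,y\n1,2\n2,4\n"
source = "def train(rows): ..."
run_a = fingerprint(dataset, source, {"lr": 0.1, "epochs": 5})
run_b = fingerprint(dataset, source, {"epochs": 5, "lr": 0.1})
run_c = fingerprint(dataset + b"3,6\n", source, {"lr": 0.1, "epochs": 5})
print(run_a == run_b)  # same data, code, and parameters
print(run_a == run_c)  # the dataset gained one row
```

Storing such an identifier next to each result lets a team check later whether a past result was produced from exactly the same inputs.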

In this section, we've learned about the important stages of the MLOps process. In the next section, we will learn more about the types of data science platforms that can help you implement MLOps in your organization.

Types of data science platforms

This section walks you through the data science platforms that are available in the open source world and on the market today and will help you understand the difference between them.

As new fields of AI and machine learning emerge, more and more engineers are working on new ways of solving data science problems, creating an infrastructure for better, faster AI adoption. Some platforms provide end-to-end capabilities for data from a data warehouse all the way to production, while others offer partial functionality and work in combination with other tools. Generally, there is no solution that fits all use cases, and certainly not every budget.

However, all of these solutions completely or partially facilitate the following stages of a data science lifecycle:

  • Data Engineering
  • Data Acquisition and Transformation
  • Data Training
  • Model Deployment
  • Monitoring and Improvement

The following diagram shows the types of data science tools:

Figure 1.7 – Types of data science tools

Let's take a look at the existing data science platforms that can help you to build your data science workflow at scale.

End-to-end platforms

An end-to-end data science solution should be able to provide the tooling for all the stages of the ML lifecycle listed in the previous section. However, in some use cases, the definition of the end-to-end workflow could be different and might mostly work with the ML pipelines and projects, excluding the data engineering part. Since the definition may still fluctuate, it is likely that the end-to-end tools will continue to provide different functionalities as the field evolves.

If such a platform does exist, it should bring the following benefits:

  • A unified user interface that eliminates the need to stitch multiple interfaces together
  • Collaboration for all involved individuals, including data scientists, data engineers, and IT operations
  • The convenience of infrastructure support being offloaded to the solution provider, which offers the team additional time to focus on data models rather than on infrastructure problems

However, you might find the following disadvantages of an end-to-end platform to be inconsistent with your organization's goals:

  • Portability: Such a platform would likely be proprietary, and migration to a different platform would be difficult.
  • Price: An end-to-end platform will likely be subscription-based, which many data science departments might not be able to afford. If GPU-based workflows are involved, the price increases even more.
  • Bias: When you are using a proprietary solution that offers built-in pipelines, your models are bound to inherit bias from these automated tools. The problem is that bias might be difficult to recognize and address in automated ML solutions, which could potentially have negative consequences for your business.

Now that we are aware of the advantages and disadvantages of end-to-end data science platforms, let's consider the ones that are available on the market today. Because the AI field is developing rapidly, new platforms emerge every year. We'll look into the top five such platforms.

Big tech giants, such as Microsoft, Google, and Amazon, all offer automated ML features that a lot of users might find useful. Google's AI Platform offers Kubeflow pipelines to help manage ML workflows. Amazon offers tools that assist with hyperparameter tuning and labeling. Microsoft offers Azure Machine Learning services that support GPU-based workflows and are similar in functionality to Amazon's services.

However, as stated previously, all these automated ML features are prone to bias and require the data science team to build additional tools that can verify model performance and reliability. For many organizations, automated ML is not the right answer. Another issue is vendor lock-in, as you will have to keep all your data in the underlying cloud storage.

The Databricks solution provides a more flexible approach as it can be deployed on any cloud. Databricks is based on Apache Spark, one of the most popular tools for AI and ML workflows, and offers end-to-end ML pipeline management through a platform called MLflow. MLflow enables data scientists to track their pipeline progress from model development to deployment to production. Many users enjoy the built-in notebook interface. One disadvantage is the lack of data visualization tools, which might be added in the future.

Algorithmia is another proprietary solution that can be deployed on any cloud platform and that provides an end-to-end ML workflow with model training, deployment, versioning, and other built-in functionality. It supports batch processing and can be integrated with GitHub Actions. While Algorithmia is a great tool, it has some of the traditional software developer tools built in, which some engineering teams might find redundant.

Pluggable solutions

While end-to-end platforms might sound like the right solution for your data science department, in reality, it is not always the case. Big companies often have requirements that end-to-end platforms cannot meet. These requirements might include the following:

  • Data security: Some companies might have privacy limitations on storing their data in the cloud. These limitations also apply to the use of automated ML features.
  • Pipeline outputs: Often, the final product of a pipeline is a library that is packaged and used in other projects within the organization.
  • Existing infrastructure constraints: Some existing infrastructure components might prevent the integration of an end-to-end platform. Some parts of the infrastructure might already exist and satisfy the user's needs.

Pluggable solutions give data infrastructure teams the flexibility to build their own solution, which also comes with the need to support it. However, most of the big companies end up doing just that.

Pluggable solutions can be divided into the following categories:

  • Data ingestion tools
  • Data transformation tools
  • Model serving tools
  • Data visualization and monitoring tools

Let's consider some of these tools, which can be combined together to build a data science solution.

Data ingestion tools

Data ingestion is the process of collecting data from all sources in your company, such as databases, social media, and other platforms, into a centralized location for further consumption by machine learning pipelines and other AI processes.

One of the most popular open source tools to ingest data is Apache NiFi, which can ingest data into Apache Kafka, an open source streaming platform. From there, data pipelines can consume the data for processing.

Among commercial cloud-hosted platforms, we can name Wavefront, which enables not only ingestion but data processing as well. Wavefront is notable for its ability to scale and support high query loads.

Data transformation tools

Data transformation is the process of running your code against the data you have. This includes training and testing your model on that data as part of a data pipeline. The tool should be able to consume the data from a centralized location. Tools such as TensorFlow and Keras provide extended functionality for this type of operation.

Pachyderm is a data transformation and pipeline tool as well, although its main value is in version control for large datasets. Unlike other transformation tools, Pachyderm gives data scientists the freedom to define their own pipelines and supports any language and library.

If you have taken any data science classes, chances are you have used MATLAB or Octave for model training. These tools provide a great playground to start exploring machine learning. However, when it comes to production-grade data science that requires continuous training, collaboration, version control, and model productization, these tools might not be the best choice. MATLAB and Octave are mainly for numerical computing for academic purposes. Another issue with platforms such as MATLAB is that they often use proprietary languages, while tools like Pachyderm support any language, including the most popular ones in the data science community.

Model serving tools

After you train your model and it gives satisfactory results, you need to think about moving that model into production, which often is convenient to do in the form of a REST API or through a table that is ingested into a database. Depending on the language that is used in your model, serving a REST API can be as easy as using a web framework such as Flask.
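
As a rough illustration of serving a model over REST, the sketch below uses a bare WSGI callable, which is the interface that Flask applications also expose, so that it runs with the standard library alone. The `predict` function is a hypothetical stand-in for a trained model, and the `/predict` route, the JSON payload shape, and the `call` helper are assumptions for this example; input validation is left out.

```python
import io
import json


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed weighted sum."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """WSGI application that answers POST /predict with a JSON prediction."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call(method, path, payload=None):
    """Invoke the app in-process, without opening a network socket."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


print(call("POST", "/predict", {"features": [2, 4]}))
print(call("GET", "/predict"))
```

To expose the same callable on a local port, it could be passed to `wsgiref.simple_server.make_server` from the standard library; a production deployment would typically sit behind a dedicated WSGI server instead.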

However, there are more advanced tools that give data scientists end-to-end control over the machine learning process. One such open source tool is Seldon. Seldon converts REST API endpoints into a production microservice, where you can easily promote each version of your model from staging to production.

Another tool that provides similar functionality is KFServing. Both solutions use Kubernetes' Custom Resource Definition (CRD) to define a Deployment class for model serving.

Often, in big companies, different teams are responsible for training models and serving models, and therefore, decisions can be made based on the team's familiarity and preference for one or the other solution.

Data monitoring tools

After the model is deployed in production, data scientists need to continue to receive feedback about model performance, possible bias, and other metrics. For example, if you have an e-commerce website with a recommendation system that suggests to users what to buy with the current order based on their past choices, you need to make sure that the system is still on track with the latest fashion trends. You might not know the trends, but the feedback loop should signal a decrease in model performance when it occurs.

Often, enterprises fail to employ a good monitoring solution for ML workflows, which can have a potentially devastating outcome for your business. Seldon Alibi is one of the tools that provide model inspection functionality, which enables data scientists to monitor models running in production and identify areas of improvement. Seldon Alibi provides outlier detection, which helps to discover anomalies; drift detection, which helps monitor changes in correlation between input and output data; and adversarial detection, which exposes malicious changes in the original data inputs.
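
To give a feel for what a drift check does, here is a deliberately simplistic sketch that flags a shift in the mean of one input feature. It is not the Seldon Alibi API and does not use the statistical tests that a real monitoring tool would apply; it also looks only at the input distribution, not at the relationship between inputs and outputs. The function names, sample values, and threshold are assumptions for the example.

```python
from statistics import mean, stdev


def mean_shift_score(reference, live):
    """Distance between the live and reference means, in reference standard deviations."""
    return abs(mean(live) - mean(reference)) / stdev(reference)


def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the live mean moved more than `threshold` standard deviations."""
    return mean_shift_score(reference, live) > threshold


# Values of one feature seen at training time and in two production windows
reference_window = [10, 12, 11, 9, 10, 11, 12, 10]
stable_window = [11, 10, 12, 10]
shifted_window = [18, 19, 20, 17]

print("stable window drifted:", has_drifted(reference_window, stable_window))
print("shifted window drifted:", has_drifted(reference_window, shifted_window))
```

A check like this would run on a schedule against recent production inputs and alert the data science team when it fires.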

Fiddler is another popular tool that monitors a production model for integrity, bias, performance, and outlier anomalies.

Putting it all together

As you can see, there are multiple ways to create a production-grade data science solution, and one size likely will not fit all. Although end-to-end solutions provide the convenience of using one vendor, they also have multiple disadvantages and are likely equipped with inferior domain functionality compared to pluggable tools. Pluggable tools, on the other hand, require certain expertise and culture to be present in your organization, which will allow different departments, such as DevOps engineers, to collaborate with data scientists to build an optimal solution and workflow.

The next section will walk us through the ethical problems that plague modern AI applications, how they might affect your business, and what you can do about them.

Explaining ethical AI

This section describes aspects of ethical problems in AI and what organizations need to be aware of when they build artificial intelligence applications.

With AI and machine learning technologies becoming more widespread and accepted, it is easy to lose track of the data and the decision-making process origins. When an AI algorithm suggests which pair of shoes to buy based on your recent searches, it might not be a big deal. But suppose an AI algorithm is used to decide whether you qualify for a job, how likely you are to commit a crime, or whether you qualify for mortgage approval. In that case, it is essential to know how the algorithm was created, on which data it was trained, what was included in the dataset, and, more importantly, what was not. At a minimum, we need to question whether a proper process existed to validate the data used for producing the model. Not only is this the right thing to do, but it could also save your organization from undesirable legal consequences.

While AI applications bring certain advantages and improve the quality of our lives, they can make mistakes that sometimes can have adverse, and even devastating, effects on people's lives. These tendencies resulted in the emergence of ethical AI teams and ethical AI advocates in leading AI companies and big tech.

Ethical AI has been an increasingly discussed topic in the data science community over the last few years. According to the Artificial Intelligence Index Report 2019, ethics has been a steadily growing keyword in the total number of AI papers at leading AI conferences.

Figure 1.8 – Number of AI conference papers mentioning Ethics since 1970, from the AI Index 2019 Annual Report (p. 44)

Let's consider one of the most widely criticized AI technologies: facial recognition. A facial recognition application can identify a person in an image or video. In recent years, this technology has become widespread and is now used in home security, authentication, and other areas. In 2018-2019, more than 1,000 newspaper articles worldwide mentioned facial recognition and data privacy. One such cloud-based facial recognition technology, Rekognition, developed by Amazon, has been used by police departments in a few states. Law enforcement departments used the software to search for suspects in databases and in video surveillance footage, including the feed from police body cameras. Independent research showed that the software was biased against people of color: in a 2018 test by the American Civil Liberties Union (ACLU), it falsely matched 28 members of Congress with mugshots of people who had been arrested, and people of color were disproportionately represented among the false matches. The tool performed especially poorly when identifying women of color.

The problem with this and other facial recognition technologies is that they were trained on non-inclusive datasets that consisted mostly of photographs of white men. Such outcomes are difficult to predict, but predicting them is what ethical AI tries to do. Implementing a surveillance system like that in public places would negatively affect thousands of people. Advances in AI have produced facial recognition technology that requires little to no human involvement in subject identification, which raises the problems of total control and privacy. While a system like that could help identify criminals, possibly prevent crimes, and make our society safer, it needs to be thoroughly audited for potential errors and protected from misuse. With great power comes great responsibility.

Another interesting example comes from Natural Language Processing (NLP) applications. NLP is an ML technology that enables machines to automatically interpret texts written in a human language and, among other things, translate them into another. In recent years, NLP applications have seen major advances. Tools such as Google Translate solved a problem that was unsolvable even 20 years ago. NLP breaks down a sentence into chunks and tries to make connections between those chunks to provide a meaningful interpretation. NLP applications deal not only with translations but can also summarize what is written in a lengthy research paper or convert text to speech.

But these applications can make mistakes as well. One example was discovered in translations from Turkish to English. The Turkish language has a single third-person singular pronoun, "o", which can mean either "she" or "he". It was discovered that Google Translate was discriminating based on gender, diminishing women's roles based on common stereotypes. For example, it would produce the translations "She is a secretary" and "He is a doctor", although in Turkish, both of these sentences could be written about a man or a woman.

From these examples, you can see that bias is one of the biggest problems of AI applications. A biased dataset is a dataset that does not include enough samples of a studied phenomenon to output an objective result, as in the facial recognition example above, where the training data did not include enough people of color to make correct predictions.
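
One simple first check for this kind of bias is to measure how each group is represented in a training set. The sketch below is only an assumption-laden illustration: the group labels, the sample counts, and the 10% minimum share are hypothetical, and an adequate share of samples does not by itself guarantee a fair model.

```python
from collections import Counter


def group_shares(labels):
    """Return the fraction of samples that belongs to each group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(labels, min_share):
    """List groups whose share of the dataset is below `min_share`."""
    return sorted(g for g, share in group_shares(labels).items() if share < min_share)


# Hypothetical group label attached to each training sample
sample_groups = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5

print(group_shares(sample_groups))
print(underrepresented(sample_groups, min_share=0.10))
```

Running a check like this before training makes gaps in the dataset visible early, when collecting more data is still an option.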

While many companies are becoming aware of the adverse effects and risks of bias in datasets, few of them are taking steps to mitigate the possible negative consequences. According to the Artificial Intelligence Index Report 2019, only 13% of organizations that responded were working toward improving the equity and fairness of the datasets used:

Figure 1.9 – Types of organizations taking steps to mitigate the risks of AI, from the AI Index 2019 Annual Report (p. 102)

Another aspect of bias is financial inequality. It's no secret that people from less economically advantaged backgrounds have a harder time getting credit deals than those from more fortunate backgrounds. Credit reports are known to contain errors that cause higher borrowing rates.

Companies whose business is creating customer profiles, or personalization, go even further, collecting intimate information about users and their behavior from public records, credit card transactions, sweepstakes, and other sources. These reports can be sold to marketers and even law enforcement organizations. Individuals are categorized according to their sex, age, marital status, wealth, medical conditions, and other factors. Sometimes these reports contain outdated information about things such as criminal records. In one case, an elderly woman could not get into a senior living facility because of an arrest on her record. Although she had been arrested, the incident was a case of domestic violence by her partner, and she was never prosecuted. She was able to correct her police records, but not the report created by a profiling company. Correcting mistakes in the reports created by these companies is extremely difficult, and such mistakes can affect people's lives for decades.

Sometimes, people get flagged because of a misidentified profile. Imagine that you are applying for a job and are denied because, according to your profile, you have been prosecuted for theft or burglary in the past. This could come as a shock and might not make any sense, but cases like that do happen to people who have common names. To clear up a mistake like that, you need the intervention of a person who is willing to spend time correcting it for you. But do you meet people like that often?

With machine learning now being used in customer profiling, many data privacy advocates question the methods used in these algorithms. Because these algorithms learn from past experiences, they assume that anything you've done in the past, you are likely to repeat in the future. According to these algorithms, criminals will commit more crimes and the poor will get poorer. There is no room for mistakes in their reality. This means that people with prior convictions will likely get arrested again, which gives law enforcement a basis for discrimination. The opposite is also true: those with a perfect record, from a better neighborhood, are considered unlikely to commit a crime. This does not sound fair.

The problem with recidivism models, which predict how likely a person is to reoffend, is that most of them are proprietary black boxes. A black-box model is an end-to-end model that an algorithm creates directly from the provided data, so that even a data scientist cannot explain how it makes decisions. Since AI algorithms learn similarly to humans, as a machine learning algorithm evolves over time, it learns the same biases that we have.

Figure 1.10 – Black-box model

Let's move on to the next section!

Trustworthy AI

While a few years ago, ethical AI was something only a few groups of independent advocates and academics were working on, today more and more big tech companies have established ethical AI departments to protect the companies from reputational and legal risks.

Establishing standards for trustworthy AI models is an ambitious task and one size does not fit all. However, the following principles apply to most cases:

  • Create an ethical AI committee that works on discussing the AI-associated risks in alignment with the overall company strategy.
  • Raise awareness of the dangers of non-transparent machine learning algorithms and the potential risks they pose to society and your organization.
  • Create a process of identifying, communicating, and evaluating biased models and privacy concerns. For example, in healthcare, protecting patient personal information is vitally important. Create ownership around ethical risk in the product management department.
  • Establish a process of notifying users about how their data will be used, explaining the risk of bias and other concepts in plain English. The earlier the user becomes aware of the implications of using your application, the less legal risk this will pose in the future.
  • Build a culture around praising efforts to promote ethical programs and initiatives to motivate employees to contribute to those efforts. Engage employees from different departments, including engineering, data science, product management, and others, to contribute to those efforts.

According to the Artificial Intelligence Index Report 2019, the top AI ethics challenges include fairness, interpretability and explainability, and transparency.

The following figure shows a more complete list of challenges present in the ethical AI space:

Figure 1.11 – Ethical AI challenges, from the AI Index 2019 Annual Report (p. 149)

The following is a list of issues that non-transparent machine learning algorithms may cause:

  • Disproportional spread of economic and financial opportunities, including credit discrimination and unequal access to discounts and promotions based on predefined buying habits
  • Access to information and social circles, such as algorithms that promote news based on socio-economic groups and suggestions to join specific groups or circles
  • Employment discrimination, including algorithms that filter candidates based on their race, religion, or gender
  • Unequal use of police force and punishment, including algorithms that predict the possibility of an individual committing a crime in the future based on social status and race
  • Housing discrimination, including the denial of equal rental and mortgage opportunities to people of color, LGBT groups, and other minorities

AI has brought unprecedented benefits to our society similar to what the industrial revolution did. But with all these benefits, we should be aware of the societal changes that these benefits carry. If the future of driving is self-driving cars, this will mean that driving as a profession will disappear in the foreseeable future. Many other industries will be affected and will cease to exist. It does not mean that progress should not happen, but it needs to happen in a controlled way.

Software is only as perfect as its creators, and flaws in new AI-powered products are inevitable. But if these new applications are the first level in the decision-making process about human lives and destinies, there has to be a way to ensure that we minimize potential harmful consequences. Therefore, deeply understanding our models is paramount. Part of that is reproducibility, which is one of the key factors in minimizing the negative consequences of AI.

Summary

In this chapter, we have discussed a number of important concepts that help define why reproducibility is important and why it should be a part of a successful data science process.

We've learned that data science models analyze historical data in order to calculate the most probable and most successful result for a target goal. We've established that replication, the ability to reproduce the results of a scientific experiment, is one of the fundamental principles of good research and that it is one of the best ways to ensure that your team is doing everything to reduce bias in your models. Bias can creep into a calculation from misrepresentation in a training dataset. Often, this reflects historical and social realities and norms accepted in society. Another way to reduce bias in your training data is to have a diverse team that includes representatives of all genders, races, and backgrounds.

We've learned that data dredging, or fishing, is an unethical technique used by some data scientists to prove a predefined hypothesis by cherry-picking the results of an experiment and only selecting the results that prove the desired outcome and ignoring any inconvenient trends.

We've also learned about the MLOps methodology, a lifecycle of a machine learning application, similar in its principle to the DevOps software lifecycle technique. MLOps includes the following main phases: design, development, training, validation, deployment, and monitoring. All of the phases are continuously repeated, creating a feedback loop that ensures seamless experiment management from design through development and testing to production and post-production phases.

We've also reviewed some of the most important aspects of ethical AI, a discipline of data science that focuses on ethical aspects of artificial intelligence, robotics, and data science. A failure to implement an ethical AI process in your organization might lead to undesirable legal consequences if deployed production models are found to be discriminatory.

In the next chapter, we will learn about the main concepts of the Pachyderm version-control system, which can help you address many of the issues described in this chapter.

Further reading

  • Raymond Perrault, Yoav Shoham, Erik Brynjolfsson, Jack Clark, John Etchemendy, Barbara Grosz, Terah Lyons, James Manyika, Saurabh Mishra, and Juan Carlos Niebles, The AI Index 2019 Annual Report, AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA, December 2019: https://hai.stanford.edu/sites/default/files/ai_index_2019_report.pdf

Figures 1.1, 1.2, 1.3, 1.8, 1.9, and 1.11 were reproduced from the AI Index 2019 Annual Report.

Key benefits

  • Learn how to build an enterprise-level reproducible data science platform with Pachyderm
  • Deploy Pachyderm on cloud platforms such as AWS EKS, Google Kubernetes Engine, and Microsoft Azure Kubernetes Service
  • Integrate Pachyderm with other data science tools, such as Pachyderm Notebooks

Description

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale. You'll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you'll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or a cloud platform of your choice. You'll then discover the architectural components and Pachyderm's main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common operations involving data, such as uploading data to and from Pachyderm to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you'll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks. By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and will be able to manage them on a day-to-day basis.

Who is this book for?

This book is for new as well as experienced data scientists and machine learning engineers who want to build scalable infrastructures for their data science projects. Basic knowledge of Python programming and Kubernetes will be beneficial. Familiarity with Golang will be helpful.

What you will learn

  • Understand the importance of reproducible data science for enterprise
  • Explore the basics of Pachyderm, such as commits and branches
  • Upload data to and from Pachyderm
  • Implement common pipeline operations in Pachyderm
  • Create a real-life example of hyperparameter tuning in Pachyderm
  • Combine Pachyderm with Pachyderm language clients in Python and Go

Product Details

Publication date : Mar 18, 2022
Length: 364 pages
Edition : 1st
Language : English
ISBN-13 : 9781801074483

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Australia

Economy delivery 7 - 10 business days

AU$19.95

Product Details

Publication date : Mar 18, 2022
Length: 364 pages
Edition : 1st
Language : English
ISBN-13 : 9781801074483
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

AU$24.99 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

AU$249.99 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
  • Exclusive print discounts

AU$349.99 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
  • Exclusive print discounts

Frequently bought together


  • Reproducible Data Science with Pachyderm (AU$67.99)
  • Hands-On Data Preprocessing in Python (AU$75.99)
  • Machine Learning for Time-Series with Python (AU$75.99)

Total: AU$219.97

Table of Contents

15 Chapters
Section 1: Introduction to Pachyderm and Reproducible Data Science
Chapter 1: The Problem of Data Reproducibility
Chapter 2: Pachyderm Basics
Chapter 3: Pachyderm Pipeline Specification
Section 2: Getting Started with Pachyderm
Chapter 4: Installing Pachyderm Locally
Chapter 5: Installing Pachyderm on a Cloud Platform
Chapter 6: Creating Your First Pipeline
Chapter 7: Pachyderm Operations
Chapter 8: Creating an End-to-End Machine Learning Workflow
Chapter 9: Distributed Hyperparameter Tuning with Pachyderm
Section 3: Pachyderm Clients and Tools
Chapter 10: Pachyderm Language Clients
Chapter 11: Using Pachyderm Notebooks
Other Books You May Enjoy

Customer reviews

Rating distribution
5 out of 5 stars (3 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Amazon Customer Jun 24, 2022
5 out of 5 stars
It is always intimidating to be introduced to a new DS/ML/AI tool as it might contain new concepts or style of coding I’ve never practiced with before and can’t easily identify resources to familiarize myself with it. However, Reproducible Data Science with Pachyderm has done a great job in introducing the concept of reproducibility and in a high-level overview, how we can benefit from Pachyderm when we are building a complete ML pipeline involving mass data processing, modeling, hyperparameter tuning, deployment in the cloud and retraining. As I was reading the book, I specifically enjoyed just following along the code examples under each topic; moreover, before or after each code example, there is always a section dedicated to explaining the inputs to each method appearing in the code example to clear any possible confusions I might have towards understanding the codes. I will keep Pachyderm in mind and explore implementations using it in the future when my project requires it, and this book will serve as the best instruction to help me carry out my project.
Amazon Verified review
Amazon Customer Apr 04, 2022
5 out of 5 stars
I particularly liked the Distributed Hyper Parameter Tuning chapter, which had really good developer experience perspective, 10/10 highly recommend the same!
Amazon Verified review
DBriggs Mar 22, 2022
5 out of 5 stars
Reproducibility in data science has always been interesting to me and is not something I've had to think about in my current role. Fortunately, this book was a great primer for both subject matter and practicality in MLOps. I can't say I'll be using Pachyderm anytime soon, but it's now something I have working experience with and feel confident talking about. If you're starting from ground zero with Pachyderm, you can benefit from this book. Before you start, I'd definitely recommend having experience in Python. The Python is easy enough to understand but you'll get more out of it if you have a comprehensive understanding of the fundamentals.
Amazon Verified review

FAQs

What is the delivery time and cost of a print book?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time start printing on the next business day, so the estimated delivery times also start from that day. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or at any time on the weekend, begin printing on the second business day after the order. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is customs duty?

Customs duties are charges levied on goods when they cross international borders. They are taxes imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under EU27 will not bear customs charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

Customs duty or localized taxes may apply to shipments to recipient countries outside the EU27. These charges are levied by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $50, you will have to pay the courier service an additional import tax of 19% to receive your package (for example, $9.50 on a declared value of $50).
  • If you live in Turkey and the declared value of your ordered items is over €22, you will have to pay the courier service an additional import tax of 18% to receive your package (for example, €3.96 on a declared value of €22).
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on its way to you, you can contact us at customercare@packt.com once you receive it and use the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., Packt Publishing agrees to replace your printed book because it arrived damaged or has a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact the Customer Relations Team at customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace the item or refund its cost.
  2. If your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e., during download), contact the Customer Relations Team at customercare@packt.com within 14 days of purchase and they will be able to resolve the issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple-item order, we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on laws and regulations). A localized VAT fee is charged only to our European and UK customers on the eBooks, videos, and subscriptions that they buy. GST is charged to Indian customers for eBook and video purchases.

What payment methods can I use?

You can pay with the following payment methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal