Data Science for Decision Makers

Introducing Data Science

Data science is not a new term; in fact, it was coined in the 1960s by Peter Naur, a Danish computer science pioneer who used the term data science to describe the process of working with data in various fields, including mathematics, statistics, and computer science.

Later, the modern use of data science began to take shape in the 1990s and early 2000s, and data scientist, as a profession, became more and more common across different industries.

With the exponential rise in artificial intelligence, one may think that data science is less relevant.

However, the scientific approach to understanding data, which defines data science, is the bedrock upon which successful machine learning and artificial intelligence-based solutions can be built.

Within this book, we will explore these different terms, provide a solid foundation in the core statistical and machine learning theory and concepts that apply to statistical, machine learning, and artificial intelligence-based models alike, and walk through how to lead data science teams and projects to successful outcomes.

This first chapter introduces the reader to how statistics and data science are intertwined, and some fundamental concepts in statistics which can help you in working with data.

We will explore the differences between data science, artificial intelligence, and machine learning; explain the relationship between statistics and data science; and introduce descriptive and inferential statistics, probability, and basic methods for understanding the shape (distribution) of data.

While some readers may find that this chapter covers basic, foundational knowledge, the aim is to provide all readers, especially those from less technical backgrounds, with a solid understanding of these concepts before diving deeper into the world of data science. For more experienced readers, this chapter serves as a quick refresher and helps establish a common language that will be used throughout the book.

To begin, let's look at the terms data science, artificial intelligence, and machine learning: how they are related and how they differ.

This chapter covers the following topics:

Data science, AI, and ML – what’s the difference?
Statistics and data science
Descriptive and inferential statistics
Probability
Describing our samples
Probability distributions

Data science, AI, and ML – what’s the difference?

You may have heard the terms data science, AI, and ML used interchangeably, but they are distinct concepts with unique characteristics.

AI is a broad field that focuses on developing computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. ML is a subset of AI that involves training computer systems to learn from data and improve their performance on a specific task without being explicitly programmed.

ML algorithms enable computer systems to learn from data and identify patterns, which can then be used to make predictions or decisions. While all ML falls under the umbrella of AI, not all AI encompasses ML, as some AI systems may rely on rule-based or symbolic reasoning approaches.

Deep learning is a specific type of ML that utilizes artificial neural networks with multiple layers to extract higher-level features from raw data. This technique is highly effective for tasks such as image and speech recognition.

Data science is a multidisciplinary field that involves extracting and analyzing relevant insights from data. It focuses on discovering hidden patterns and relationships in data to derive meaningful conclusions. A data scientist leverages ML algorithms to make predictions and inform decision-making.

All these fields are grounded in the foundations of mathematics, probability theory, and statistics. Understanding these core concepts is essential for anyone interested in pursuing a career or leading initiatives in data science, AI, or ML.

The following is an attempt to visualize the relationship between these fields:

Figure 1.1: A visual representation of the relationship between data science, ML, and AI

Here, deep learning is a subset of machine learning, and artificial intelligence is a broader field which includes machine learning and other methods to perform intelligent tasks.

Data science, as a practice, overlaps with all these fields, as it can make use of whichever methods are most appropriate to extract insight, predictions, and recommendations from data.

All these fields are built upon the foundation of mathematics, probability, and statistics. For this reason, in the next section, we will investigate these mathematical and statistical underpinnings of data science.

The mathematical and statistical underpinnings of data science

This book is aimed at the business-focused decision maker, not the technical expert, so you might be wondering why we are starting by talking about mathematics.

Well, at its core, data science is based on mathematical and statistical foundations. Even if you aren’t working as a data scientist or ML/AI engineer, a basic understanding of the important mathematical and statistical concepts used within data science is one of the most valuable tools you can have at your disposal when working with data scientists or leading data science, ML, or AI initiatives. It helps you interpret the models and results that data scientists and ML engineers bring your way, understand the limitations of certain data and models, and evaluate which business use cases may or may not be appropriate for data science.

Research has found that 87% of data science projects never make it into production. In other words, only around one in eight projects gets to the stage where it can provide bottom-line value for a company.

These results seem poor at first glance, but there is a silver lining. In many cases, the missing piece of the puzzle is strong executive leadership, knowing which use cases are appropriate for data science, providing the data science teams with good-quality, relevant data, and framing the use cases in a way where data science can be applied successfully.

Knowing some of the core concepts around mathematics and statistics for data science will not only give you a better appreciation of data science but also the compass to plan and navigate data science projects from the outset to reach more successful results.

Within this book, we won’t be attempting to provide anything like a comprehensive foundation into mathematics required for AI and ML as this would require an entire degree to achieve. However, within this chapter, we will provide you with an understanding of the fundamentals.

Statistics and data science

The British mathematician Karl Pearson once stated, “Statistics is the grammar of science.”

If you’re starting your journey of leading data science, ML, or AI initiatives within your organization, or just working with data scientists and ML engineers, having a foundation in statistical knowledge is essential.

This foundation gives those embarking on leading projects or teams within data science a competitive advantage in extracting valuable insights from data. Statistics offers various tools and techniques to identify patterns and uncover deeper insights from the available data. A good grasp of statistics allows individuals to think critically, approach problem-solving creatively, and make data-driven decisions. In this section, we aim to cover essential statistical topics that are relevant to data science.

What is statistics?

Before going further, it will be helpful to define what we mean by statistics as the term can be used in several different ways. It can be used to do the following:

Indicate the whole discipline of statistics
Refer to the methods that are used to collect, process, and interpret quantitative data
Refer to collections of gathered data
Refer to calculated figures (such as the mean) that are used to interpret the data that’s been gathered

In this case, we define statistics using the second definition – the methods that are used to collect, process, and interpret quantitative data.

Today, few industries are untouched by statistical thinking. For example, within market research, statistics is used when sampling surveys and comparing results between groups to understand which insights are statistically significant; within life sciences, statistics is used to measure and evaluate the efficacy of pharmaceuticals; and within financial services, statistics is used to model and understand risk.

I’m sure you’re familiar with many of these and other applications of statistics, and you may have studied statistics before at school, college, or in your professional career, and much of what follows in this chapter may not be brand new information. Even if this is the case, it can be useful to have a refresher as unfortunately, it’s not possible to pause a career to complete a statistics course.

When you’re leading data science, ML, or AI initiatives, understanding statistics is an essential skill, whether you’re working with simple statistical models, examining the data being used, or assessing a model’s performance when training and evaluating deep learning models.

With this in mind, let’s dive into some of the core concepts within probability and statistics.

Descriptive and inferential statistics

It’s important to understand that there are two different types of statistics: descriptive statistics (methods used to summarize or describe observations) and inferential statistics (using those observations as a basis for making estimates or predictions) – that is, inferences about a situation that has not been investigated yet.

Look at the following two example statements. Which of them is “descriptive” and which is “inferential”?

Based on our forecasting, we expect sales revenue next year to increase by 35%.
Our average rating within our customer base was 8 out of 10.

The first statement is inferential as it goes beyond what has been observed in the past to make inferences about the future, while the second statement is descriptive as it summarizes historical observations.

Within data science, often, data is first explored with descriptive statistics as part of what is known as exploratory data analysis (EDA), attempting to profile and understand the data. Following this, statistical or ML models trained on a set of data (known as model training) can be used to make inferences on unseen data (known as model inference or execution). We will revisit this topic later in this book when we cover the basics of ML.

The distinction between descriptive and inferential statistics depends on the differences between samples and populations, two more important terms within statistics.

In statistical terminology, population not only refers to populations of people, but it may also equally refer to populations of transactions, products, or retail stores. The main point is that “population” refers to every example within a studied group. It may not be the case that a data scientist is interested in every attribute of a population – it may be that they are only interested in the sales revenues of retail stores or the price of products.

However, even if a data scientist is interested in one characteristic of a population, they will likely not have the luxury to study all members of it. Usually, they will have to study a sample – a relatively small selection – from within a population. This is often due to the limitations of time and expense, or the availability of data, where only a sample of the data is available.

In this case, descriptive statistics can be used to summarize the sampled data, and it is with inference that a data scientist can attempt to go beyond the available data to generalize information to the entire population.

So, to summarize, descriptive statistics involves summarizing a sample, whereas inferential statistics is concerned with generalizing a sample to make inferences about the entire population.

How accurate are these generalizations from the sample to the population? This is a large part of what statistics is about: measuring uncertainty and errors. When working with the results from statistical models or even ML models, it is useful to be comfortable with the idea of uncertainty and how to measure it, rather than shying away from it. Sometimes, business stakeholders may not want to see margins of error alongside the outputs of even simple statistical techniques because they want to know things with complete certainty, and any degree of uncertainty shown alongside results might be blown out of proportion.

However, we can rarely observe an entire population when making inferences, or have a model generalize to every possible edge case, to have absolute certainty in any result.

Even so, we can do a lot better than human intuition alone, and it is better to take a scientific stance and measure the margin of error and uncertainty in our inferences and predictions. Often without realizing it, we make decisions every day with partial information and some uncertainty. For example, if you’ve ever booked a hotel, you probably looked at a sample of hotels, read a sample of customer reviews, and decided which hotel to book based on that sample. You may have seen one hotel with a single five-star review and another with 1,000 reviews averaging 4.8 stars. Although the first hotel had a higher average rating, which hotel would you book? Probably the latter, because you could infer that the margin of error in its rating was smaller. Importantly, there is still some margin of error, as not every customer will have left a review.
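To make the hotel intuition concrete, here is a minimal Python sketch of how the margin of error around an average rating shrinks as the number of reviews grows. The spread of individual ratings (`rating_std`) and the review counts are hypothetical values chosen for illustration, and the 1.96 multiplier assumes a rough normal approximation for a 95% margin.

```python
import math


def standard_error(sample_std: float, n: int) -> float:
    """Standard error of a sample mean: the sample spread divided by sqrt(n)."""
    return sample_std / math.sqrt(n)


# Hypothetical spread of individual star ratings (an assumption for illustration).
rating_std = 0.6

for n in (10, 100, 1000):
    # Rough 95% margin of error under a normal approximation.
    margin = 1.96 * standard_error(rating_std, n)
    print(f"{n:>4} reviews: average rating is uncertain by about +/-{margin:.3f} stars")
```

With a single review, there is no way to estimate the spread at all, which is another way of saying that one five-star rating tells us very little about the hotel.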

In the data science, ML, and AI worlds, the ability to investigate and understand uncertainty, and to set criteria for what margin of error is acceptable for your business use case, is critical to knowing when to proceed with deploying a model to production.

Sampling strategies

In data science, sampling is the process of selecting a subset of data from a larger population. Sampling can be a powerful tool for decision-makers to draw inferences and make predictions about the population, but it is important to choose the right sampling strategy to ensure the validity and reliability of the results.

Random sampling

Random sampling is the most common and straightforward sampling strategy. In its simplest form, each member of the population has an equal chance of being selected for the sample. Random selection can be applied through a variety of techniques, such as simple random sampling, stratified random sampling, or cluster sampling.

Simple random sampling involves randomly selecting individuals from the population without any restrictions or stratification. Stratified random sampling involves dividing the population into strata or subgroups based on certain characteristics and then randomly selecting individuals from each stratum. Cluster sampling involves dividing the population into clusters and randomly selecting entire clusters to be included in the sample.

Random sampling can be useful when the population is large and homogenous, meaning that all members have similar characteristics. However, it may not be the best strategy when the population is diverse and there are significant differences between subgroups.

Convenience sampling

Convenience sampling involves selecting individuals from the population who are easily accessible or available. This can include individuals who are in a convenient location, such as in the same office or building, or individuals who are readily available to participate in the study.

While convenience sampling can be a quick and easy way to gather data, it is not the most reliable strategy. The sample may not be representative of the population as it may exclude certain subgroups or over-represent others.

Stratified sampling

Stratified sampling involves dividing the population into subgroups based on certain characteristics and then selecting individuals from each subgroup to be included in the sample. This strategy can be useful when the population is diverse and there are significant differences between subgroups.

In proportionate stratified sampling, the number of individuals sampled from each subgroup is proportional to that subgroup’s share of the population. This ensures that each subgroup is adequately represented in the sample, and the results can be extrapolated to the population with greater accuracy.

Cluster sampling

Cluster sampling involves dividing the population into clusters, which are typically based on geographic proximity or other shared characteristics. From these clusters, a random sample of clusters is selected, and all members within the selected clusters are included in the sample. This strategy can be useful when the population is geographically dispersed or when it is more feasible to access and survey entire clusters rather than individual participants.

Cluster sampling is often more cost-effective and efficient than other sampling methods, especially when dealing with large, spread-out populations. However, it may lead to higher sampling error compared to simple random sampling if the clusters are not representative of the entire population:

Figure 1.2: Stratified random sampling and cluster sampling

Sampling is an important tool for decision-makers to draw inferences and make predictions about a population. The choice of sampling strategy will depend on the characteristics of the population and the research question being asked. Random sampling, stratified sampling, and cluster sampling are all useful strategies, but it is important to consider the potential biases and limitations of each method. By selecting the appropriate sampling strategy, decision-makers can ensure that their results are reliable and valid and can make better-informed decisions based on the data.
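The three random strategies described in this section can be sketched in a few lines of Python. This is an illustrative sketch rather than a production sampling routine: the population of 1,000 customers, the `region` attribute, and the function names are all hypothetical, and the regions double as both strata and clusters purely to keep the example short.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 customers, each tagged with a region.
regions = ["North"] * 500 + ["South"] * 300 + ["West"] * 200
population = [{"id": i, "region": r} for i, r in enumerate(regions)]


def simple_random_sample(pop, k):
    """Every member has the same chance of being selected."""
    return rng.sample(pop, k)


def stratified_sample(pop, k, key):
    """Sample each stratum in proportion to its share of the population."""
    strata = defaultdict(list)
    for member in pop:
        strata[member[key]].append(member)
    sample = []
    for members in strata.values():
        # Rounding may shift the total slightly for less tidy proportions.
        n_stratum = round(k * len(members) / len(pop))
        sample.extend(rng.sample(members, n_stratum))
    return sample


def cluster_sample(pop, n_clusters, key):
    """Pick whole clusters at random and keep every member of each one."""
    clusters = sorted({member[key] for member in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [member for member in pop if member[key] in chosen]


srs = simple_random_sample(population, 100)
strat = stratified_sample(population, 100, "region")
clust = cluster_sample(population, 1, "region")

print(len(srs), len(strat), len(clust))
```

Note how the stratified sample always contains 50, 30, and 20 customers from the three regions, whereas the simple random sample only matches those proportions approximately, and the cluster sample contains a single region in full.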

Random variables

What do we do with the members of a sample once we have them?

This is where the concept of random variables comes in.

In data science, a random variable is a variable whose value is determined by chance. Random variables are often used to model uncertain events or outcomes, and they play a crucial role in statistical analysis, ML, and decision-making.

Random variables are mathematical functions that are utilized to assign a numerical value to each potential outcome of a random process. For example, when flipping a coin, the value of 0 can be assigned to tails and 1 to heads, effectively causing the random variable, X, to adopt the values of 0 or 1:

$$X = \begin{cases} 1, & \text{if heads} \\ 0, & \text{if tails} \end{cases}$$

There are two types of random variables: discrete and continuous. Discrete random variables can only take on a finite or countable number of values, while continuous random variables can take on any value within a specified range.

For example, the outcome of rolling a six-sided die is a discrete random variable as it can only take on the values 1, 2, 3, 4, 5, or 6. On the other hand, the height of a person is a continuous random variable as it can take on any value within a certain range.
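As a small illustration, the following Python sketch simulates one draw from each kind of random variable. The function names are hypothetical, and the height model (a normal distribution with a mean of 170 cm and a standard deviation of 10 cm) is an assumption made only for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def coin_flip() -> int:
    """Discrete random variable X: 1 for heads, 0 for tails."""
    return 1 if rng.random() < 0.5 else 0


def die_roll() -> int:
    """Discrete random variable: one of the values 1 to 6."""
    return rng.randint(1, 6)


def height_cm() -> float:
    """Continuous random variable: any value in a range (assumed normal here)."""
    return rng.gauss(170, 10)


print([coin_flip() for _ in range(5)])
print([die_roll() for _ in range(5)])
print([round(height_cm(), 1) for _ in range(3)])
```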

Random variables are often used in the context of sampling strategies as they provide a way to model and analyze uncertain outcomes in a sample.

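For example, suppose a decision maker wants to estimate the average height of students at a university. One possible sampling strategy would be simple random sampling, in which a random sample of students is selected from the population of all students at the university. The height of each selected student is then a continuous random variable, and the average of the sampled heights serves as an estimate of the average height across the whole university.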

Probability distribution

The probability distribution of a random variable describes the likelihood of each possible value of the variable. For a discrete random variable, the probability distribution is typically represented by a probability mass function (PMF), which gives the probability of each possible value. For a continuous random variable, the probability distribution is typically represented by a probability density function (PDF), which gives the probability density at each point in the range.

Probability

Probability is a way to measure how likely something is to happen. As mentioned previously, in data science, ML, and decision-making, we often deal with uncertain events or outcomes. Probability helps us understand and quantify that uncertainty.

For example, when we flip a coin, we don’t know whether it will land heads or tails. The probability of it landing heads is 50%, and the probability of it landing tails is also 50%.

Probability distribution

A probability distribution is a way to show the likelihood of each possible outcome. For example, when we roll a fair six-sided die, the probability of getting each number is the same – 1/6. Because every outcome is equally likely, this is known as a uniform distribution.
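A discrete probability distribution can be written down directly as a table of outcomes and probabilities. The sketch below does this for a fair die using exact fractions; the variable names are our own, chosen for illustration.

```python
from fractions import Fraction

# Probability distribution of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible outcomes must add up to 1.
assert sum(die_pmf.values()) == 1

# Probability of rolling an even number: add up the relevant outcomes.
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)
print(p_even)  # 1/2
```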

Conditional probability

Conditional probability is the likelihood of an event or outcome happening, given that another event or outcome has already occurred. For example, if we know that a person is over six feet tall, the conditional probability of them being a basketball player is higher than the probability of a randomly selected person being a basketball player.

Let’s say there were two different events, A and B, which had some probability of occurring, within what is known as a sample space, S, of all possible events occurring.

For example, A could be the event that a consumer purchases a particular brand’s product, and B could be the event that a consumer has visited the brand’s website. In the following Venn diagram, the probability of event A, P(A), and the probability of event B, P(B), are represented by the shaded areas. The probability of both A and B occurring is represented by the shaded area where A and B overlap. In mathematical notation, this is written as P(A ∩ B), which means the probability of the intersection of A and B. This intersection simply means both A and B occur:

Figure 1.3: A Venn diagram visualizing the probability of two events (A and B) occurring in a sample space (S)

The conditional probability of A occurring, given that B has occurred, can be calculated as follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

In our example, this would be the probability of a consumer purchasing a brand’s product, given they have visited the brand’s website. By understanding the probabilities of different events and how they are related, we can calculate things such as conditional probabilities, which can help us understand the chance of events happening based on our data.
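Here is a worked version of the website example in Python, using the standard definition of conditional probability: the probability of A given B is P(A ∩ B) divided by P(B). All of the counts are hypothetical figures invented for illustration.

```python
# Hypothetical counts for illustration only.
total_consumers = 10_000
visited_site = 2_000            # event B: visited the brand's website
purchased = 500                 # event A: purchased the product
visited_and_purchased = 300     # both A and B

p_a = purchased / total_consumers
p_b = visited_site / total_consumers
p_a_and_b = visited_and_purchased / total_consumers

# Conditional probability of a purchase, given a website visit.
p_a_given_b = p_a_and_b / p_b

print(f"P(purchase)         = {p_a:.2f}")
print(f"P(purchase | visit) = {p_a_given_b:.2f}")
```

In this made-up dataset, a consumer who has visited the website is three times as likely to purchase as a consumer picked at random, which is the kind of insight conditional probability is designed to surface.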

Describing our samples

Now that we understand the concepts of populations, samples, and random variables, what tools can we use to describe and understand our data sample?

Measures of central tendency

The expected value is a statistical measure that represents the average value of a random variable, weighted by its probability of occurrence. It provides a way to estimate the central tendency of a probability distribution and is useful for decision-making and predicting uncertain events or outcomes.
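As a minimal sketch, the expected value of a discrete random variable can be computed by multiplying each value by its probability and adding up the results. The function name `expected_value` and the fair-die example are our own illustrative choices.

```python
from fractions import Fraction


def expected_value(pmf):
    """Sum of each value weighted by its probability."""
    return sum(value * prob for value, prob in pmf.items())


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(fair_die))  # 7/2
```

The result, 3.5, is not a value the die can ever show. The expected value describes the long-run average over many rolls rather than any single outcome.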

Measures of central tendency, including mean, median, and mode, are statistical measures that describe the central or typical value of a dataset.

The mean is the arithmetic average of a dataset, calculated by adding up all the values and dividing the total by the number of values. It is a common measure of central tendency, but it is sensitive to outliers (values that are significantly higher or lower than the majority of the data points). A few extreme values can pull the mean away from what is typical, so it may not be representative of the dataset as a whole.

The median is the middle value of a dataset, with an equal number of values above and below it. It is a robust measure of central tendency and is less sensitive to outliers than the mean. The median is useful for skewed datasets, where the mean may not accurately represent the center of the data.

The mode is the value that occurs most frequently in a dataset. It is another measure of central tendency and is useful for datasets with discrete values or when the most frequent value is of particular interest. The mode can be used for both categorical and numerical data.

The following figure shows the differences between the mean, median, and mode for two different distributions of data. Imagine that this dataset shows the range of prices of a consumer product, say bottles of wine on an online wine merchant.

For symmetrical distributions with a single peak, these three measures are equal; however, for asymmetrical data, they differ. The choice of which measure to use may depend on the distribution of your data. The mean can often be skewed by extreme outliers – for example, one really expensive bottle of wine is not reflective of most of the bottles being sold on the site, so you may want to use the median to better understand the typical value within your dataset, and not get scared away from buying from the store!

Figure 1.4: The mode, median, and mean for a symmetrical distribution and an asymmetrical distribution
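The effect of a single outlier is easy to see with Python's built-in `statistics` module. The wine prices below are hypothetical and chosen so that one expensive bottle stands apart from the rest.

```python
import statistics

# Hypothetical bottle prices; the last one is an extreme outlier.
wine_prices = [8, 9, 9, 10, 11, 12, 250]

mean_price = statistics.mean(wine_prices)      # pulled upward by the 250 bottle
median_price = statistics.median(wine_prices)  # middle value of the sorted prices
mode_price = statistics.mode(wine_prices)      # most frequent price

print(f"mean={mean_price:.2f}, median={median_price}, mode={mode_price}")
```

The mean comes out at roughly 44, even though six of the seven bottles cost 12 or less, while the median (10) and the mode (9) stay close to what a typical bottle costs.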

Overall, the expected value and measures of central tendency are important statistical concepts that play a critical role in data science, ML, and decision-making. They provide you with a way to understand and describe the characteristics of a dataset, and they help decision-makers make better-informed decisions based on the analysis of uncertain events or outcomes.

Measures of dispersion

Measures of dispersion are statistical measures that describe how spread out or varied a dataset is. They provide us with a way to understand the variability of the data and can be used to compare datasets.

Range

The range is a simple measure of dispersion that represents the difference between the highest and lowest values in a dataset. It is easy to calculate and provides a rough estimate of the spread of the data. For example, the range of the heights of students in a class would be the difference between the tallest and shortest students.

Variance and standard deviation

Variance and standard deviation are more advanced measures of dispersion that provide a more accurate and precise estimate of the variability of the data.

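Variance is a measure of how far each value in a set of data is from the mean value. It is calculated by taking the sum of the squared differences between each value and the mean, divided by the total number of values:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Here, x_i is each individual value, \mu is the mean, and N is the total number of values. When the data is a sample rather than the whole population, the sum is usually divided by N − 1 instead of N.

Standard deviation is the square root of the variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$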

For example, suppose a company wants to compare the salaries of two different departments. The standard deviation of the salaries in each department can be calculated to determine the variability of the salaries within each department. The department with a higher standard deviation would have more variability in salaries than the department with a lower standard deviation.
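The department comparison could be sketched as follows. The salary figures (in thousands) are hypothetical, and we use `statistics.pvariance` and `statistics.pstdev`, which divide by the total number of values as described above; `statistics.variance` and `statistics.stdev` divide by one less than that and are the usual choice when the data is a sample.

```python
import statistics

# Hypothetical salaries in thousands; both departments have the same mean (50).
salaries_a = [50, 52, 48, 51, 49]
salaries_b = [30, 70, 45, 60, 45]

for name, salaries in (("A", salaries_a), ("B", salaries_b)):
    variance = statistics.pvariance(salaries)  # mean of squared deviations
    std_dev = statistics.pstdev(salaries)      # square root of the variance
    print(f"Department {name}: variance={variance}, std dev={std_dev:.2f}")
```

Both departments pay the same on average, but the standard deviation shows that salaries in department B are far more spread out.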

Interquartile range

The interquartile range (IQR) is a measure of dispersion that represents the difference between the 75th and 25th percentiles of a dataset. In other words, it is the range of the middle 50% of the data. It is useful for datasets with outliers as it is less sensitive to extreme values than the range.

For example, suppose a teacher wants to compare the test scores of two different classes. One class has a few students with very high or very low scores, while the other class has a more consistent range of scores. The IQR of each class can be calculated to determine the range of scores that most students fall into.
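A short sketch of the classroom example follows. The scores are hypothetical, the helper name `iqr` is our own, and `statistics.quantiles` (available from Python 3.8) is used with its default method; other tools may use slightly different quartile conventions and give slightly different numbers.

```python
import statistics


def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


# Hypothetical test scores: class A has one very low and one very high score.
scores_a = [5, 55, 58, 60, 62, 65, 67, 70, 72, 100]
scores_b = [50, 55, 58, 60, 62, 65, 67, 70, 72, 75]

for name, scores in (("A", scores_a), ("B", scores_b)):
    full_range = max(scores) - min(scores)
    print(f"Class {name}: range={full_range}, IQR={iqr(scores)}")
```

The two extreme scores make the range of class A almost four times that of class B, yet the IQR is identical for both classes because the middle 50% of scores is the same.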

Measures of dispersion are important statistical measures that provide insight into the variability of a dataset.

Degrees of freedom

Degrees of freedom is a fundamental concept in statistics that refers to the number of independent values or quantities that can vary in an analysis without breaking any constraints. It is essential to understand degrees of freedom when working with various statistical tests and models, such as t-tests, ANOVA, and regression analysis.

In simpler terms, degrees of freedom represents the amount of information in your data that is free to vary when estimating statistical parameters. The concept is used in hypothesis testing to determine the probability of obtaining your observed results if the null hypothesis is true.

For example, let’s say you have a sample of ten observations and you want to calculate the sample mean. Once you have calculated the mean, you have nine degrees of freedom remaining (10 - 1 = 9). This is because if you know the values of nine observations and the sample mean, you can always calculate the value of the 10th observation.

The general formula for calculating degrees of freedom is as follows:

df = n − p

Here, we have the following:

n is the number of observations in the sample
p is the number of parameters estimated from the data
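The ten-observation example can be checked directly in Python. The observations below are hypothetical; the point is that once the mean is fixed, the tenth value is no longer free to vary, and that the sample variance divides by the degrees of freedom rather than by the number of observations.

```python
import statistics

# Hypothetical sample of ten observations.
observations = [4, 8, 15, 16, 23, 42, 7, 11, 19, 5]

n = len(observations)
p = 1                      # one parameter estimated from the data: the mean
df = n - p                 # degrees of freedom

sample_mean = sum(observations) / n

# Knowing nine values and the mean is enough to recover the tenth.
known_nine = observations[:-1]
recovered_tenth = sample_mean * n - sum(known_nine)

# The sample variance divides the squared deviations by df, not by n.
sum_of_squares = sum((x - sample_mean) ** 2 for x in observations)
sample_variance = sum_of_squares / df

print(df, recovered_tenth, round(sample_variance, 2))
```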

Degrees of freedom is used in various statistical tests to determine the critical values for test statistics and p-values. For instance, in a t-test for comparing two sample means, degrees of freedom is used to select the appropriate critical value from the t-distribution table.

Understanding degrees of freedom is crucial for data science leaders as it helps them interpret the results of statistical tests and make informed decisions based on the data. It also plays a role in determining the complexity of models and avoiding overfitting, which occurs when a model is too complex and starts to fit the noise in the data rather than the underlying patterns.

Correlation, causation, and covariance

Correlation, causation, and covariance are important concepts in data science, ML, and decision-making. They are all related to the relationship between two or more variables and can be used to make predictions and inform decision-making.

Correlation

Correlation is a measure of the strength and direction of the linear relationship between two variables. It is a statistical measure that ranges from -1 to 1. A correlation of 1 indicates a perfect positive correlation, a correlation of 0 indicates no linear relationship, and a correlation of -1 indicates a perfect negative correlation.

For example, suppose we want to understand the relationship between a person’s age and their income. If we observe that as a person’s age increases, their income also tends to increase, this will indicate a positive correlation between age and income.

Causation

Causation refers to the relationship between two variables in which one variable causes a change in the other variable. Causation is often inferred from correlation, but it is important to note that correlation does not necessarily imply causation.

For example, suppose we observe a correlation between the number of ice cream cones sold and the number of drownings in a city. While these two variables are correlated, it would be incorrect to assume that one causes the other. Rather, there may be a third variable, such as temperature, that causes both the increase in ice cream sales and the increase in drownings.
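The ice cream example can be simulated to show how a hidden third variable produces a correlation without any causal link. Everything in this sketch is made up for illustration: the temperature range, the coefficients, and the noise levels are assumptions, and the "drownings" figure is a continuous toy index rather than a realistic count.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Daily temperature drives BOTH series; neither series depends on the other.
temperature = [rng.uniform(10, 35) for _ in range(365)]
ice_cream_sales = [20 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r = correlation(ice_cream_sales, drownings)
print(f"Correlation between ice cream sales and drownings: {r:.2f}")
```

The two series are strongly correlated even though the code that generates drownings never looks at ice cream sales. The correlation comes entirely from temperature.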

Covariance

Covariance is a measure of the joint variability of two variables. It measures how much two variables change together. A positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that the two variables tend to change in opposite directions.

For example, suppose we want to understand the relationship between a person’s height and their weight. If we observe that as a person’s height increases, their weight also tends to increase, this will indicate a positive covariance between height and weight.

Correlation, causation, and covariance are important concepts in data science. By understanding these concepts, decision-makers can better understand the relationships between variables and make better-informed decisions based on the analysis of the data.

Covariance measures how two variables change together, indicating the direction of the linear relationship between them. However, covariance values are difficult to interpret because they are affected by the scale of the variables. Correlation, on the other hand, is a standardized measure that ranges from -1 to +1, making it easier to understand and compare the strength and direction of linear relationships between variables.
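The scale dependence of covariance can be seen in a short sketch. The heights and weights below are hypothetical, and the helper names are our own. The same heights are expressed once in centimeters and once in meters.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )

# Hypothetical illustrative data.
heights_cm = [155, 162, 168, 174, 180, 188]
weights_kg = [52, 60, 63, 72, 80, 90]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

cov_cm = sample_covariance(heights_cm, weights_kg)
cov_m = sample_covariance(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print(f"Covariance (cm, kg): {cov_cm:.2f}")
print(f"Covariance (m, kg):  {cov_m:.4f}")
print(f"Correlation (cm, kg): {corr_cm:.3f}")
print(f"Correlation (m, kg):  {corr_m:.3f}")
```

Changing the unit of height changes the covariance by a factor of 100, while the correlation stays the same. This is why correlation is usually the easier quantity to compare across variables.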

It is important to note that correlation does not necessarily imply causation and that other factors may be responsible for observed relationships between variables. A strong correlation between two variables does not automatically mean that one variable causes the other as there may be hidden confounding factors influencing both variables simultaneously.

The shape of data

When working with samples of data, it is helpful to understand the “shape” of the data, or how the data is distributed. In this respect, we can consider distributions of probabilities for both continuous and discrete data. These probability distributions can be used to describe and understand your data. Probability distributions can help you identify patterns or trends in the data. For example, if your data follows a normal distribution, it suggests that most values are clustered around the mean, with fewer values at the extremes. Recognizing these patterns can help inform decision-making or further analysis.

Probability distributions

Probability distributions are mathematical functions that describe the likelihood of different outcomes in a random event or process. They help us understand the behavior of random variables and make predictions about future events. There are two main types of probability distributions: discrete distributions and continuous distributions.

Discrete probability distributions

Discrete probability distributions are used when the possible outcomes of a random event are countable, such as the number of heads in a series of coin flips. Let’s look at some common examples of discrete probability distributions.

Bernoulli distribution

This is the simplest discrete probability distribution. It models a single trial with only two possible outcomes: success (usually denoted as 1) or failure (usually denoted as 0). For example, a single flip of a fair coin follows a Bernoulli distribution with a probability of success (heads) of 0.5.

Binomial distribution

This distribution models the number of successes in a fixed number of independent trials, where each trial has the same probability of success. For example, if you flip a fair coin ten times, the number of heads you observe follows a binomial distribution with parameters of n = 10 (number of trials) and p = 0.5 (probability of success).
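As a small illustration of the coin example, the binomial probabilities can be computed directly from the standard formula, the number of ways to choose k successes from n trials multiplied by p^k (1 - p)^(n - k). The function name `binomial_pmf` is our own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, p_heads = 10, 0.5

# Full distribution of the number of heads in ten fair coin flips.
for heads in range(n_flips + 1):
    print(f"P({heads:2d} heads) = {binomial_pmf(heads, n_flips, p_heads):.4f}")

p_exactly_five = binomial_pmf(5, n_flips, p_heads)
p_at_least_eight = sum(binomial_pmf(k, n_flips, p_heads) for k in range(8, n_flips + 1))
```

Even with a fair coin, exactly five heads occurs only about a quarter of the time, while eight or more heads occurs in roughly 5% of runs.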

Negative binomial distribution

This distribution models the number of failures before a specified number of successes occurs in independent trials with the same probability of success. For instance, if you’re playing a game where you need to win three times before the game ends, the number of losses before the third win follows a negative binomial distribution.
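The game example can be sketched with the standard negative binomial formula for the number of failures before the r-th success. The win probability of 0.5 is an assumption made for illustration, as the text does not specify one, and the function name is our own.

```python
from math import comb

def negative_binomial_pmf(failures, successes, p):
    """Probability of exactly `failures` failures before the `successes`-th success."""
    return comb(failures + successes - 1, failures) * p**successes * (1 - p) ** failures

wins_needed = 3
p_win = 0.5  # assumed chance of winning each round

for losses in range(6):
    prob = negative_binomial_pmf(losses, wins_needed, p_win)
    print(f"P({losses} losses before win number {wins_needed}) = {prob:.4f}")

# Expected number of losses before the third win: r * (1 - p) / p.
expected_losses = wins_needed * (1 - p_win) / p_win
```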

Geometric distribution

This is a special case of the negative binomial distribution where the number of successes is fixed at 1. It models the number of failures before the first success in independent trials with the same probability of success. An example would be the number of rolls of a die that do not show a 6 before the first 6 appears.
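A quick simulation can make the die example concrete. This sketch assumes a fair six-sided die and counts the failed rolls before the first 6. The function name and the number of simulated trials are arbitrary choices.

```python
import random

def failures_before_first_six(rng):
    """Roll a fair die until a 6 appears; return how many rolls failed first."""
    failures = 0
    while rng.randint(1, 6) != 6:
        failures += 1
    return failures

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
results = [failures_before_first_six(rng) for _ in range(trials)]

average_failures = sum(results) / trials
share_zero = results.count(0) / trials  # a 6 on the very first roll

p = 1 / 6
theoretical_mean = (1 - p) / p  # mean of the geometric distribution: 5 failures

print(f"Simulated average failures: {average_failures:.2f}")
print(f"Theoretical average failures: {theoretical_mean:.2f}")
print(f"Share of trials with a 6 on the first roll: {share_zero:.3f}")
```

The simulated average should sit close to the theoretical value of five failed rolls, and roughly one trial in six should succeed on the first roll.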

Poisson distribution

This distribution models the number of events occurring in a fixed interval of time or space, given the average rate of occurrence and assuming that events occur independently of one another. It is often used to model rare events, such as the number of earthquakes in a year, or arrival counts, such as the number of customers arriving at a store in an hour.
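As a sketch of the store example, the Poisson probabilities follow from the standard formula e^(-rate) * rate^k / k!. The average of four customers per hour is a hypothetical figure chosen for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count per interval is `rate`."""
    return exp(-rate) * rate**k / factorial(k)

average_customers_per_hour = 4  # hypothetical average rate

for customers in range(9):
    prob = poisson_pmf(customers, average_customers_per_hour)
    print(f"P({customers} customers in an hour) = {prob:.4f}")

p_none = poisson_pmf(0, average_customers_per_hour)
# Probability of an unusually busy hour with more than eight customers.
p_more_than_eight = 1 - sum(
    poisson_pmf(k, average_customers_per_hour) for k in range(9)
)
```

A calculation like this can support staffing decisions, for example by estimating how often an hour will be much busier than average.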

Continuous probability distributions

Continuous probability distributions are used when the possible outcomes of a random event are continuous, such as measurements or time. Let’s look at some common examples of continuous probability distributions.

Normal distribution

Also known as the Gaussian distribution, this is the most well-known continuous probability distribution. It models continuous variables that have a symmetric, bell-shaped distribution, such as heights or IQ scores. Many natural phenomena approximately follow a normal distribution.
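The clustering of values around the mean can be quantified with Python's built-in `statistics.NormalDist`. The mean of 170 cm and standard deviation of 10 cm below are hypothetical parameters chosen for illustration, and `share_within` is our own helper.

```python
from statistics import NormalDist

# Hypothetical population: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Probability that a value lies within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s) of the mean: {share_within(heights, k):.1%}")
```

For any normal distribution, roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.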

Standard normal distribution

This is a special case of the normal distribution with a mean of zero and a standard deviation of one. It is often used to standardize variables and compare values across different normal distributions.
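Standardizing converts a value into a z-score, the number of standard deviations it lies from its own mean. The sketch below compares two results from exams with different scales. All figures are hypothetical, and the scores are assumed to be approximately normally distributed.

```python
from statistics import NormalDist

def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above or below `mean`."""
    return (value - mean) / std_dev

# Hypothetical exam results on two different scales.
z_exam_a = z_score(85, mean=70, std_dev=10)
z_exam_b = z_score(680, mean=500, std_dev=100)

# The standard normal distribution has mean 0 and standard deviation 1.
standard_normal = NormalDist()
percentile_a = standard_normal.cdf(z_exam_a)
percentile_b = standard_normal.cdf(z_exam_b)

print(f"Exam A: z = {z_exam_a:.2f}, better than {percentile_a:.1%} of scores")
print(f"Exam B: z = {z_exam_b:.2f}, better than {percentile_b:.1%} of scores")
```

Although the raw scores are not directly comparable, the z-scores show that the second result is further above its average than the first.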

Student’s t-distribution

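This distribution is similar to the normal distribution but has heavier tails. It is used when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample size is small (a common rule of thumb is fewer than 30 observations). As the sample size grows, it approaches the normal distribution. It is often used in hypothesis testing and constructing confidence intervals.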

Gamma distribution

This distribution models continuous variables that are positive and right-skewed. It is often used to model waiting times, such as the time until a machine fails or the time until a customer arrives.

Exponential distribution

This is a special case of the gamma distribution where the shape parameter is equal to 1. It models the time between events occurring at a constant rate, such as the time between customer arrivals or the time between radioactive particle decays.
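A brief simulation can illustrate waiting times between arrivals. This sketch assumes a hypothetical constant rate of four customer arrivals per hour and uses `random.expovariate` from the standard library to draw the gaps between arrivals.

```python
import random
from math import exp

rate_per_hour = 4  # hypothetical constant arrival rate
rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated waiting times (in hours) between consecutive arrivals.
waits = [rng.expovariate(rate_per_hour) for _ in range(100_000)]

mean_wait = sum(waits) / len(waits)
share_over_half_hour = sum(w > 0.5 for w in waits) / len(waits)

# Theory: mean wait is 1 / rate, and P(wait > t) = exp(-rate * t).
theoretical_mean = 1 / rate_per_hour
theoretical_share = exp(-rate_per_hour * 0.5)

print(f"Mean wait: simulated {mean_wait:.3f} h, theoretical {theoretical_mean:.3f} h")
print(f"P(wait > 30 min): simulated {share_over_half_hour:.3f}, theoretical {theoretical_share:.3f}")
```

With four arrivals per hour on average, the typical gap is about 15 minutes, yet gaps longer than half an hour still occur in a noticeable share of cases.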

Chi-squared distribution

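This distribution models non-negative variables. It arises as the sum of the squares of independent standard normal variables, and the number of terms in the sum is called the degrees of freedom. It is often used in hypothesis testing and to construct confidence intervals for a population variance estimated from a sample. It is also used in the chi-squared tests for independence and goodness of fit.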

F-distribution

This distribution models non-negative variables. It arises as the ratio of two independent chi-squared variables, each divided by its degrees of freedom. It is often used to test the equality of two variances or the overall significance of a regression model.
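The link between the chi-squared and F-distributions can be sketched with a simulation built only from standard normal draws. The degrees of freedom (5 and 20), the sample count, and the helper names are arbitrary choices made for illustration.

```python
import random

def chi_squared_sample(rng, df):
    """One chi-squared draw: the sum of `df` squared standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(df))

def f_sample(rng, df1, df2):
    """One F draw: ratio of two independent chi-squared draws, each scaled by its df."""
    numerator = chi_squared_sample(rng, df1) / df1
    denominator = chi_squared_sample(rng, df2) / df2
    return numerator / denominator

rng = random.Random(7)  # fixed seed so the run is repeatable
df1, df2 = 5, 20
n_samples = 20_000

chi_mean = sum(chi_squared_sample(rng, df1) for _ in range(n_samples)) / n_samples
f_mean = sum(f_sample(rng, df1, df2) for _ in range(n_samples)) / n_samples

# Theory: a chi-squared variable has mean df; an F variable has mean df2 / (df2 - 2).
print(f"Chi-squared mean: simulated {chi_mean:.2f}, theoretical {df1}")
print(f"F mean: simulated {f_mean:.3f}, theoretical {df2 / (df2 - 2):.3f}")
```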

Probability distributions allow us to understand and quantify the probabilities of different outcomes in a random event or process. By understanding the different types of probability distributions and their applications, data science leaders can better model and analyze their data, make informed decisions, and improve their predictions. Knowing which distribution to use in a given situation is crucial for accurate data analysis and decision-making.
