The Data Analysis Workshop

2. Absenteeism at Work

Overview

In this chapter, you will apply standard data analysis techniques, such as estimating conditional probabilities, applying Bayes' theorem, and comparing distributions with Kolmogorov-Smirnov tests. You will also implement data transformation techniques, such as the Box-Cox and Yeo-Johnson transformations, and apply them to a given dataset.

Introduction

In the previous chapter, we looked at some of the main techniques that are used in data analysis. We saw how hypothesis testing can be used when analyzing data, we got a brief introduction to visualizations, and finally, we explored some concepts related to time series analysis. In this chapter, we will elaborate on some of the topics we've already looked at (such as plotting and hypothesis testing) while introducing new ones coming from probability theory and data transformations.

Nowadays, work relationships are becoming more and more trust-oriented, and conservative contracts (in which working time is strictly monitored) are being replaced with more agile ones in which employees themselves are responsible for accounting for their working time. This liberty may lead to unregulated absenteeism and may reflect poorly on an employee's candidature, even if absent hours can be accounted for with genuine reasons. This can significantly undermine healthy working relationships. Furthermore, unregulated absenteeism can also have a negative impact on work productivity.

In this chapter, we'll analyze absenteeism data from a Brazilian courier company, collected between July 2007 and July 2010.

Note

The original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work.

If you're interested, take a look at the following paper, which talks about the problem from a machine learning perspective: Martiniano, A., Ferreira, R.P., Sassi, R.J., & Affonso, C. (2012). Application of neuro fuzzy network on prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

This dataset can also be found on our GitHub repository here: https://packt.live/3e4rorX.

Our goal is to discover hidden patterns in the data, which might be useful for distinguishing genuine work absences from fraudulent ones. During this chapter, the following topics will be addressed:

Introduction to probability, conditional probability, and Bayes' theorem
Kolmogorov-Smirnov tests for equality of probability distributions
Box-Cox and Yeo-Johnson transformations

We will apply these techniques to our analysis as we try to identify the main drivers for absenteeism.

Initial Data Analysis

As a rule of thumb, when starting the analysis of a new dataset, it is good practice to check the dimensionality of the data, the column types, possible missing values, and some generic statistics on the numerical columns. We can also look at the first 5 to 10 entries in order to get a feel for the data itself. We'll perform these steps in the following code snippets:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# import data from the GitHub page of the book
data = pd.read_csv('https://raw.githubusercontent.com'\
                   '/PacktWorkshops/The-Data-Analysis-Workshop'\
                   '/master/Chapter02/data/'\
                   'Absenteeism_at_work.csv', sep=";")

Note that we are providing the separator parameter when reading the data because, although the original data file is in the CSV format, the ";" symbol has been used to separate the various fields.

In order to print the dimensionality of the data, column types, and the number of missing values, we can use the following code:

"""
print dimensionality of the data, columns, types and missing values
"""
print(f"Data dimension: {data.shape}")
for col in data.columns:
    print(f"Column: {col:35} | type: {str(data[col].dtype):7} \
| missing values: {data[col].isna().sum():3d}")

This returns the following output:

Figure 2.1: Dimensions of the Absenteeism_at_work dataset

Of the 21 columns, only one (Work Load Average/day) does not contain integer values. Since no missing values are present in the data, we can consider it quite clean. We can also derive some basic statistics by using the describe method:

# compute statistics on numerical features
data.describe().T

The output will be as follows:

Figure 2.2: Output of the describe() method

Note that some of the columns, such as Month of absence, Day of the week, Seasons, Education, Disciplinary failure, Social drinker, and Social smoker, are encoding categorical values. So, we can back-transform the numerical values to their original categories so that we have better plotting features. We will perform the transformation by defining a Python dict object containing the mapping and then applying the apply() function to each feature, which applies the provided function to each of the values in the column. First, let's define the encoding dict objects:

# define encoding dictionaries
month_encoding = {1: "January", 2: "February", 3: "March", \
                  4: "April", 5: "May", 6: "June", 7: "July", \
                  8: "August", 9: "September", 10: "October", \
                  11: "November", 12: "December", 0: "Unknown"}
dow_encoding = {2: "Monday", 3: "Tuesday", 4: "Wednesday", \
                5: "Thursday", 6: "Friday"}
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
education_encoding = {1: "high_school", 2: "graduate", \
                      3: "postgraduate", 4: "master_phd"}
yes_no_encoding = {0: "No", 1: "Yes"}

Afterward, we apply the encoding dictionaries to the relevant features:

# backtransform numerical variables to categorical
preprocessed_data = data.copy()
preprocessed_data["Month of absence"] = preprocessed_data\
                                        ["Month of absence"]\
                                        .apply(lambda x: \
                                               month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data\
                                       ["Day of the week"]\
                                       .apply(lambda x: \
                                              dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"]\
                              .apply(lambda x: season_encoding[x])
preprocessed_data["Education"] = preprocessed_data["Education"]\
                                 .apply(lambda x: \
                                        education_encoding[x])
preprocessed_data["Disciplinary failure"] = \
preprocessed_data["Disciplinary failure"].apply(lambda x: \
                                                yes_no_encoding[x])
preprocessed_data["Social drinker"] = \
preprocessed_data["Social drinker"].apply(lambda x: \
                                          yes_no_encoding[x])
preprocessed_data["Social smoker"] = \
preprocessed_data["Social smoker"].apply(lambda x: \
                                         yes_no_encoding[x])
# display the first rows, transposed for readability
preprocessed_data.head().T

The output will be as follows:

Figure 2.3: Transformation of columns

In the previous code snippet, we created a clean copy of the original dataset by calling the .copy() method on the data object. In this way, a new copy of the original data is created. This is a convenient way to create new pandas DataFrames, without taking the risk of modifying the original raw data (as it might serve us later). Afterward, we created a set of dictionaries where the numerical values are keys and the categorical values are values. Finally, we used the .apply() method on each column we wanted to encode by mapping each value in the original column to its corresponding value in the encoding dictionary, which contains the target values. Note that in the Month of absence column, a 0 value is present, which is encoded as Unknown as no month corresponds to 0.
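
As a side note, pandas also offers the Series.map() method, which accepts a dictionary directly. The following is a minimal sketch on a small, made-up DataFrame (the toy name and its values are assumptions for illustration, not the absenteeism data), showing that both approaches give the same values:

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({"Social drinker": [1, 0, 1, 1, 0]})
yes_no_encoding = {0: "No", 1: "Yes"}

# approach used in this chapter: apply() with a lambda
via_apply = toy["Social drinker"].apply(lambda x: yes_no_encoding[x])

# alternative: pass the dictionary directly to map()
via_map = toy["Social drinker"].map(yes_no_encoding)

print(via_map.tolist())
print(via_apply.tolist() == via_map.tolist())
```

One difference worth keeping in mind: with map(), a value that is missing from the dictionary becomes NaN, whereas the lambda-based version raises a KeyError. Which behavior is preferable depends on whether silent gaps or loud failures are easier to handle in your analysis.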

Based on the description of the data, the Reason for absence column contains information about the absence, which is encoded based on the International Classification of Diseases (ICD). The following table represents the various encodings:

Figure 2.4: Reason for absence encoding

Note that only values 1 to 21 represent ICD encoding; values 22 to 28 are separate reasons, which do not represent a disease, while value 0 is not defined—hence the encoded reason Unknown. As all values contained in the ICD represent some type of disease, it makes sense to create a new binary variable that indicates whether the current reason for absence is related to some sort of disease or not. We will do this in the following exercise.

Exercise 2.01: Identifying Reasons for Absence

In this exercise, you will create a new variable, called Disease, which indicates whether a specific reason for absence is present in the ICD table or not. Please complete the initial data analysis before you begin this exercise. Now, follow these steps:

First, define a function that returns Yes if a provided encoded value is contained in the ICD (values 1 to 21); otherwise, No:

"""
define function, which checks if the provided integer value 
is contained in the ICD or not
"""
def in_icd(val):
    return "Yes" if val >= 1 and val <= 21 else "No"

Combine the .apply() method with the previously defined in_icd() function in order to create the new Disease column in the preprocessed dataset:
```
# add Disease column
preprocessed_data["Disease"] = \
preprocessed_data["Reason for absence"].apply(in_icd)
```

Use bar plots in order to compare the absences due to disease reasons:

plt.figure(figsize=(10, 8))
sns.countplot(data=preprocessed_data, x='Disease')
plt.savefig('figs/disease_plot.png', format='png', dpi=300)

The output will be as follows:

Figure 2.5: Comparing absence count to disease

Here, we are using the seaborn .countplot() function, which is quite handy when creating this type of bar plot, in which we want to know the total number of entries for each specific class. As we can see, the number of reasons for absence that are not listed in the ICD table is almost twice the number of listed ones.

Note

To access the source code for this specific section, please refer to https://packt.live/2B9AqVJ.

You can also run this example online at https://packt.live/2UPwIr1. You must execute the entire Notebook in order to get the desired result.

In this section, we performed some simple data exploration and transformations on the initial absenteeism dataset. In the next section, we will go deeper into our data exploration and analyze some of the possible reasons for absence.

Initial Analysis of the Reason for Absence

Let's start with a simple analysis of the Reason for absence column. We will try to address questions such as, what is the most common reason for absence? Does being a drinker or smoker have some effect on the causes? Does the distance to work have some effect on the reasons? And so on. Starting with these types of questions is often important when performing data analysis, as this is a good way to obtain confidence and understanding of the data.

The first thing we are interested in is the overall distribution of the absence reasons in the data—that is, how many entries we have for a specific reason for absence in our dataset. We can easily address this question by using the countplot() function from the seaborn package:

# get the number of entries for each reason for absence
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x="Reason for absence", hue="Disease")
ax.set_ylabel("Number of entries per reason of absence")
plt.savefig('figs/absence_reasons_distribution.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.6: Number of entries for all reasons for absence

Note that we also used the Disease column as the hue parameter. This helps us to distinguish between disease-related reasons (listed in the ICD encoding) and those that aren't. Reading Figure 2.6 together with the encoding in Figure 2.4, we can see that the most frequent reasons for absence are medical consultations (23), dental consultations (28), and physiotherapy (27). Among the reasons listed in the ICD encoding, the most frequent are diseases of the musculoskeletal system and connective tissue (13) and injury, poisoning, and certain other consequences of external causes (19).

In order to perform a more accurate and in-depth analysis of the data, we will investigate the impact of the various features on the Reason for absence and Absenteeism in hours columns in the following sections. First, we will analyze the data on social drinkers and smokers in the next section.

Analysis of Social Drinkers and Smokers

Let's begin with an analysis of the impact of being a drinker or smoker on employee absenteeism. As smoking and frequent drinking have a negative impact on health, we would expect certain diseases to be more frequent among smokers and drinkers than among other employees. Note that in the absenteeism dataset, roughly 57% of the entries belong to social drinkers, while only 7% belong to social smokers. We can produce figures similar to Figure 2.6 for the social drinkers and smokers with the following code:

# plot reasons for absence against being a social drinker/smoker
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
              hue="Social drinker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_drinkers.png', \
            format='png', dpi=300)
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
              hue="Social smoker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_smokers.png', \
            format='png', dpi=300)

The following is the output of the preceding code:

Figure 2.7: Distribution of diseases over social drinkers

Similarly, the distribution of diseases for social smokers can be visualized as follows:

Figure 2.8: Distribution of diseases over social smokers

Next, calculate the actual proportions of social drinkers and smokers from the preprocessed data:

print(preprocessed_data["Social drinker"]\
      .value_counts(normalize=True))
print(preprocessed_data["Social smoker"]\
      .value_counts(normalize=True))

The output will be as follows:

Yes    0.567568
No     0.432432
Name: Social drinker, dtype: float64
No     0.927027
Yes    0.072973
Name: Social smoker, dtype: float64

As we can see from the resulting plots, a significant difference between drinkers and non-drinkers can be observed in absences related to dental consultations (28). Furthermore, as the number of entries for social smokers is quite small (only 7% of the entries), it is very hard to say whether there is actually a relationship between the absence reasons and smoking. A more rigorous approach would be to analyze the conditional probabilities of the different absence reasons, given that the employee is a social drinker or smoker.

Conditional probability is a measure that tells us the probability of an event's occurrence, assuming that another event has occurred. From a mathematical perspective, given a set of events Ω and a probability measure P on Ω and given two events A and B in Ω with the unconditional probability of B being greater than zero (that is, P(B) > 0), we can define the conditional probability of A given B as follows:

Figure 2.9: Formula for conditional probability
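
For reference, the standard definition that this figure refers to can be written as follows. This is a reconstruction from the surrounding text, using the same symbols A, B, and P:

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
```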

In other words, the probability of A given B is equal to the probability of A and B both happening, divided by the probability of B happening. Let's consider a simple example that will help us understand the usage of conditional probability. This is a classic probability problem. Suppose that your friend has two children, and you know that at least one of them is a boy. We want to know what the probability is that your friend has two sons. First, we have to identify all the possible events in our event space Ω. If we denote with B the outcome of having a boy, and with G the outcome of having a girl (these outcome labels are unrelated to the events A and B from the definition), then the event space contains four possible events:

Figure 2.10: Event space Ω

They each have a probability of 0.25. Following the notation from the definition, we can define event A (both children are boys) like so:

Figure 2.11: Event A

We can define event B (at least one child is a boy) like so:

Figure 2.12: Event B

Now, our initial problem translates into computing P(A|B). With this, we get the following equation:

Figure 2.13: Probability of event A conditioned to B
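
Putting the pieces together, the computation can be written out step by step. This is a sketch reconstructed from the definitions in the text, where each of the four outcomes has a probability of 1/4:

```latex
\Omega = \{BB, BG, GB, GG\}, \qquad A = \{BB\}, \qquad B = \{BB, BG, GB\}

P(A \mid B) = \frac{P(A \cap B)}{P(B)}
            = \frac{P(\{BB\})}{P(\{BB, BG, GB\})}
            = \frac{1/4}{3/4}
            = \frac{1}{3}
```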

We can also perform this example computationally:

# computation of conditional probability
sample_space = set(["BB", "BG", "GB", "GG"])
event_a = set(["BB"])
event_b = set(["BB", "BG", "GB"])
cond_prob = (0.25*len(event_a.intersection(event_b))) \
            / (0.25*len(event_b))
print(round(cond_prob, 4))

The output will be as follows:

0.3333
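
As a sanity check, the same probability can also be estimated by simulation. The following is a minimal sketch that assumes each child is independently a boy or a girl with equal probability; the variable names and the number of simulated families are arbitrary choices for illustration:

```python
import random

random.seed(42)  # fixed seed so that the run is reproducible

n_families = 100_000
at_least_one_boy = 0
two_boys = 0
for _ in range(n_families):
    # each child is a boy ("B") or a girl ("G") with equal probability
    children = random.choice("BG") + random.choice("BG")
    if "B" in children:
        at_least_one_boy += 1
        if children == "BB":
            two_boys += 1

# fraction of two-boy families among those with at least one boy
estimate = two_boys / at_least_one_boy
print(f"Estimated P(A|B): {estimate:.3f}")
```

With this many simulated families, the estimate should land close to the exact value of 1/3, although the last digits depend on the random draws.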

Note that by using the definition of conditional probability, we could address questions such as, "What is the probability of a reason for absence being related to laboratory examinations, assuming that an employee is a social drinker?" In other words, if we denote the "employee is absent for laboratory examinations" event with A, and the "employee is a social drinker" event with B, the probability of the "employee is absent due to laboratory examination reasons, given that employee is a social drinker" event can be computed by the previous formula.

The following exercise illustrates how the conditional probability formula can identify reasons for absence with higher probability among smokers and drinkers.

Exercise 2.02: Identifying Reasons of Absence with Higher Probability Among Drinkers and Smokers

In this exercise, you will compute the conditional probabilities of the different reasons for absence, assuming that the employee is a social drinker or smoker. Please execute the code from the previous section and Exercise 2.01, Identifying Reasons for Absence, before attempting this exercise. Now, follow these steps:

To identify the conditional probabilities, first compute the unconditional probabilities of being a social drinker or smoker. Verify that both the probabilities are greater than zero, as they appear in the denominator of the conditional probabilities. Do this by counting the number of social drinkers and smokers and dividing these values by the total number of entries, like so:
Figure 2.14: Probability of being a social drinker
Figure 2.15: Probability of being a social smoker
The following code snippet does this for you:
```
# compute probabilities of being a drinker and smoker
drinker_prob = preprocessed_data["Social drinker"]\
               .value_counts(normalize=True)["Yes"]
smoker_prob = preprocessed_data["Social smoker"]\
              .value_counts(normalize=True)["Yes"]
print(f"P(social drinker) = {drinker_prob:.3f} \
| P(social smoker) = {smoker_prob:.3f}")
```
The output will be as follows:
```
P(social drinker) = 0.568 | P(social smoker) = 0.073
```
As you can see, the probability of being a drinker is almost 57%, while the probability of being a smoker is quite low (only 7.3%).
Next, compute the probabilities of being a social drinker/smoker and being absent for each reason of absence. For a specific reason of absence (say Ri), these probabilities are defined as follows:
Figure 2.16: Probability of being a drinker and absent
Figure 2.17: Probability of being a smoker and absent

In order to carry out the required computations, define masks in the data, which only account for entries where employees are drinkers or smokers:

# create mask for social drinkers/smokers
drinker_mask = preprocessed_data["Social drinker"] == "Yes"
smoker_mask = preprocessed_data["Social smoker"] == "Yes"

Compute the total number of entries and the number of absence reasons, masked by drinkers/smokers:

total_entries = preprocessed_data.shape[0]
absence_drinker_prob = preprocessed_data["Reason for absence"]\
                       [drinker_mask].value_counts()/total_entries
absence_smoker_prob = preprocessed_data["Reason for absence"]\
                      [smoker_mask].value_counts()/total_entries

Compute the conditional probabilities by dividing the computed probabilities for each reason of absence in Step 2 by the unconditional probabilities obtained in Step 1:

# compute conditional probabilities
cond_prob = pd.DataFrame(index=range(0,29))
cond_prob["P(Absence | social drinker)"] = absence_drinker_prob\
                                           /drinker_prob
cond_prob["P(Absence | social smoker)"] = absence_smoker_prob\
                                          /smoker_prob

Create bar plots for the conditional probabilities:

# plot probabilities
plt.figure()
ax = cond_prob.plot.bar(figsize=(10,6))
ax.set_ylabel("Conditional probability")
plt.savefig('figs/conditional_probabilities.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.18: Bar plots for conditional probabilities

As we can observe from the previous plot, the most probable reason for absence among drinkers is dental consultations (28), followed by medical consultations (23). Smokers' absences, however, are mostly due to unknown reasons (0) and laboratory examinations (25).
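
As an aside, the same kind of conditional probabilities can also be obtained with the pandas crosstab() function by normalizing over columns. The following is a minimal sketch on made-up records (the toy DataFrame is an assumption for illustration, not the absenteeism data), comparing it with the manual computation used in this exercise:

```python
import pandas as pd

# toy records, invented for illustration only
toy = pd.DataFrame({
    "Reason for absence": [23, 23, 28, 28, 28, 0, 23, 28],
    "Social drinker": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# manual computation, as in the exercise: P(R and drinker) / P(drinker)
drinker_mask = toy["Social drinker"] == "Yes"
joint = toy.loc[drinker_mask, "Reason for absence"].value_counts() / len(toy)
manual = joint / drinker_mask.mean()

# normalizing each column to sum to one gives P(reason | drinker status)
table = pd.crosstab(toy["Reason for absence"], toy["Social drinker"],
                    normalize="columns")

print(manual.sort_index())
print(table["Yes"])
```

The crosstab version also lists reasons that never occur among drinkers (with probability zero), while the manual version simply omits them.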

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7KQhv.

You can also run this example online at https://packt.live/3d7pFk3. You must execute the entire Notebook in order to get the desired result.

In the previous exercise, we saw how to compute the conditional probabilities of the reason for absence, conditioned on the employee being a social smoker or drinker. Furthermore, we saw that in order to perform the computation, we had to compute the joint probability of being absent for a given reason and being a social smoker/drinker. Depending on the problem, computing this value might be difficult, or we may only have one conditional probability (say, P(A|B)) when we actually need P(B|A). In these cases, Bayes' theorem can be used.

Let Ω denote a set of events with probability measure P on Ω. Given two events A and B in Ω, with P(B) > 0, Bayes' theorem states the following:

Figure 2.19: Bayesian theorem
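
In the notation above, the standard statement of the theorem reads as follows. It is reconstructed here from the definition of conditional probability, applied once to P(A|B) and once to P(B|A):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```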

Before proceeding further, let's look at a practical example of applying Bayes' theorem. Suppose that we have two bags. The first one contains four blue and three red balls, while the second one contains two blue and five red balls. Let's assume that a ball is drawn at random from one of the two bags, and its color is blue. We want to know what the probability is that the ball has been drawn from the first bag. Let's use B1 to denote the "ball is drawn from the first bag" event and B2 to denote the "ball is drawn from the second bag" event. Since both bags contain the same number of balls (seven each), the two events are equally likely, each with a probability of 0.5, as follows:

Figure 2.20: Probability of both events

If we use A to denote the "a blue ball has been drawn" event, then we have the following:

Figure 2.21: Probability of event A, where a blue ball is drawn

This is because four of the seven balls in the first bag are blue, while only two of the seven balls in the second bag are. Furthermore, based on the defined events, the probability we need to compute translates into P(B1 | A). By applying Bayes' theorem, we get the following:

Figure 2.22: Probability of the event that a blue ball is drawn
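
The bag example can also be checked computationally. The following is a minimal sketch that uses exact fractions; the variable names are chosen for illustration, and the probabilities are taken from the description of the two bags above:

```python
from fractions import Fraction

# prior probabilities of drawing from each bag
p_b1 = Fraction(1, 2)
p_b2 = Fraction(1, 2)

# probability of a blue ball, given the bag
p_blue_given_b1 = Fraction(4, 7)  # 4 blue out of 7 balls
p_blue_given_b2 = Fraction(2, 7)  # 2 blue out of 7 balls

# total probability of drawing a blue ball
p_blue = p_blue_given_b1 * p_b1 + p_blue_given_b2 * p_b2

# Bayes' theorem: P(B1 | blue ball)
p_b1_given_blue = p_blue_given_b1 * p_b1 / p_blue
print(p_b1_given_blue)
```

This prints 2/3, meaning that a blue ball is twice as likely to have come from the first bag as from the second one.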

Now, let's apply Bayes' theorem to our dataset in the following exercise. In addition to applying Bayes' theorem, we will also be using the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is used to determine whether two samples are statistically different from each other, i.e. whether or not they follow the same distribution. We can implement the Kolmogorov-Smirnov test directly from SciPy, as we will see in the exercise.
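
To build some intuition for what the test measures, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The following is a minimal sketch of that statistic only (the ecdf() and ks_statistic() helpers and the toy samples are assumptions for illustration; the p-value is left to SciPy):

```python
def ecdf(sample, x):
    """Fraction of observations in the sample that are <= x."""
    return sum(value <= x for value in sample) / len(sample)


def ks_statistic(sample_1, sample_2):
    """Largest absolute gap between the two empirical CDFs."""
    # the gap can only change at observed values, so checking those suffices
    points = sorted(set(sample_1) | set(sample_2))
    return max(abs(ecdf(sample_1, x) - ecdf(sample_2, x)) for x in points)


# toy samples, invented for illustration only
first = [1, 2, 3, 4]
second = [3, 4, 5, 6]
print(ks_statistic(first, second))
```

For these toy samples the statistic is 0.5. A value of 0 would mean the two empirical distributions coincide, while a value of 1 would mean the samples do not overlap at all.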

Exercise 2.03: Identifying the Probability of Being a Drinker/Smoker, Conditioned to Absence Reason

In this exercise, you will compute the conditional probability of being a social drinker or smoker, conditioned on the reason for absence. In other words (where Ri is the reason for which an employee is absent), we want to compute the probabilities of an employee being a social drinker P(social drinker |Ri), or smoker P(social smoker |Ri), as follows:

Figure 2.23: Conditional probability of being a drinker, conditioned to an absence reason Ri

Figure 2.24: Conditional probability of being a smoker, conditioned to an absence reason Ri

Execute the code mentioned in the previous section, as well as the previous exercises, before attempting this exercise. Now, follow these steps:

Since you already computed P(Ri | social drinker), P(Ri | social smoker), P(social drinker), and P(social smoker), in the previous exercise, you only need to compute P(Ri) for each reason of absence Ri:
```
# compute reason for absence probabilities
absence_prob = preprocessed_data["Reason for absence"]\
               .value_counts(normalize=True)
```
Now that you have all the necessary values, compute the conditional probabilities according to the equations given at the start of this exercise:
```
# compute conditional probabilities for drinker/smoker
cond_prob_drinker_smoker = pd.DataFrame(index=range(0,29))
cond_prob_drinker_smoker["P(social drinker | Absence)"] = \
cond_prob["P(Absence | social drinker)"]*drinker_prob/absence_prob
cond_prob_drinker_smoker["P(social smoker | Absence)"] = \
cond_prob["P(Absence | social smoker)"]*smoker_prob/absence_prob
plt.figure()
ax = cond_prob_drinker_smoker.plot.bar(figsize=(10,6))
ax.set_ylabel("Conditional probability")
plt.savefig('figs/conditional_probabilities_drinker_smoker.png', \
            format='png', dpi=300)
```
The following is the output of the preceding code:
Figure 2.25: Conditional probabilities of being a drinker/smoker, conditioned to being absent
As you can see from the resulting plot, the conditional probabilities of being a social drinker/smoker are quite high for some reasons of absence. This is because the number of entries for those reasons is very small; as such, if all the registered employees who were absent for a certain reason are smokers, the probability of being a smoker, once that reason has been registered, will be equal to one (based on the available data).
To complete your analysis on the social drinkers and smokers, analyze the distribution of the hours of absenteeism based on the two classes (being a social drinker/smoker versus not being). A useful type of plot for this type of comparison is the violin plot, which can be produced using the seaborn violinplot() function:
```
# create violin plots of the absenteeism time in hours
plt.figure(figsize=(8,6))
sns.violinplot(x="Social drinker", y="Absenteeism time in hours", \
               data=preprocessed_data, order=["No", "Yes"])
plt.savefig('figs/drinkers_hour_distribution.png', \
            format='png', dpi=300)
plt.figure(figsize=(8,6))
sns.violinplot(x="Social smoker", y="Absenteeism time in hours", \
               data=preprocessed_data, order=["No", "Yes"])
plt.savefig('figs/smokers_hour_distribution.png', \
            format='png', dpi=300)
```
The following is the output of the preceding code:
Figure 2.26: Violin plots of the absenteeism time in hours for social drinkers
Figure 2.27: Violin plots of the absenteeism time in hours for social smokers
As you can observe from Figures 2.26 and 2.27, despite some differences in the outliers between smokers and non-smokers, there is no substantial visual difference in the distribution of absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.

To assess this statement in a rigorous statistical way, perform hypothesis testing on the absenteeism hours (with a null hypothesis stating that the average absenteeism time in hours is the same for drinkers and non-drinkers):

from scipy.stats import ttest_ind
hours_col = "Absenteeism time in hours"
# test mean absenteeism time for drinkers
drinkers_mask = preprocessed_data["Social drinker"] == "Yes"
hours_drinkers = preprocessed_data.loc[drinkers_mask, hours_col]
hours_non_drinkers = preprocessed_data\
                     .loc[~drinkers_mask, hours_col]
drinkers_test = ttest_ind(hours_drinkers, hours_non_drinkers)
print(f"Statistic value: {drinkers_test[0]}, \
p-value: {drinkers_test[1]}")

The output will be as follows:

Statistic value: 1.7713833295243993, p-value: 0.07690961828294651

Perform the same test on the social smokers:
```
# test mean absenteeism time for smokers
smokers_mask = preprocessed_data["Social smoker"] == "Yes"
hours_smokers = preprocessed_data.loc[smokers_mask, hours_col]
hours_non_smokers = preprocessed_data\
                    .loc[~smokers_mask, hours_col]
smokers_test = ttest_ind(hours_smokers, hours_non_smokers)
print(f"Statistic value: {smokers_test[0]}, \
p-value: {smokers_test[1]}")
```
The output will be as follows:
```
Statistic value: -0.24277795417700243, p-value: 0.8082448720154971
```
As you can see, the p-values of both tests are above the significance level of 0.05, which means that you cannot reject the null hypothesis. In other words, you cannot say that there is a statistically significant difference in the average absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.
Note that in the previous steps, you tested the null hypothesis that the average absenteeism hours are equal for drinkers and non-drinkers (and for smokers and non-smokers). However, even when the averages are equal, the underlying distributions may still differ.

Perform a Kolmogorov-Smirnov test to assess the difference in the distributions of two samples:

# perform Kolmogorov-Smirnov test for comparing the distributions
from scipy.stats import ks_2samp
ks_drinkers = ks_2samp(hours_drinkers, hours_non_drinkers)
ks_smokers = ks_2samp(hours_smokers, hours_non_smokers)
print(f"Drinkers comparison: statistics={ks_drinkers[0]:.3f}, \
pvalue={ks_drinkers[1]:.3f}")
print(f"Smokers comparison:  statistics={ks_smokers[0]:.3f}, \
pvalue={ks_smokers[1]:.3f}")

The output will be as follows:

Drinkers comparison: statistics=0.135, pvalue=0.002
Smokers comparison:  statistics=0.104, pvalue=0.607

The p-value for the drinkers comparison is lower than the significance level of 0.05, which is strong evidence against the null hypothesis of the two distributions being equal. On the other hand, as the p-value for the smokers comparison is higher than 0.05, you cannot reject the null hypothesis.
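
The drinkers result shows that a t-test and a Kolmogorov-Smirnov test can disagree, because they compare different things. The following is a minimal sketch on synthetic samples (the narrow and wide arrays are invented for illustration) that share the same mean but have very different spreads:

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

# synthetic samples, invented for illustration: both are centered at zero,
# but the second one is spread over a much wider range
narrow = np.linspace(-1, 1, 200)
wide = np.linspace(-5, 5, 200)

# the t-test compares the means, which are equal here
t_result = ttest_ind(narrow, wide)

# the KS test compares the whole distributions, which clearly differ
ks_result = ks_2samp(narrow, wide)

print(f"t-test p-value:  {t_result.pvalue:.3f}")
print(f"KS test p-value: {ks_result.pvalue:.3g}")
```

Here, the t-test finds no evidence of a difference in means, while the Kolmogorov-Smirnov test rejects the equality of the two distributions. This is the same pattern as in the drinkers comparison, only more extreme.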

Note

To access the source code for this specific section, please refer to https://packt.live/3hxt3I6.

You can also run this example online at https://packt.live/2BeAweq. You must execute the entire Notebook in order to get the desired result.

In this section, we investigated the relationship between the different reasons for absence and social information about the employees (such as being smokers or drinkers). In the next section, we will analyze the impact of the employees' body mass index on their absenteeism.

Body Mass Index

The Body Mass Index (BMI) is defined as a person's weight in kilograms, divided by the square of their height in meters:

Figure 2.28: Expression for BMI
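
The definition translates directly into code. The following is a minimal sketch; the compute_bmi() helper is hypothetical, and the sample values assume that weight is recorded in kilograms and height in centimeters, so the height has to be converted to meters first:

```python
def compute_bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by squared height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# sample values, assumed to be in kilograms and centimeters
weight, height_cm = 90, 172
print(round(compute_bmi(weight, height_cm / 100), 1))
```

For these values the result is 30.4. Note that the absenteeism dataset already provides a Body mass index column, so this computation is only needed when that column is not available.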

BMI is a widely used way to classify people as underweight, healthy weight, overweight, or obese, based on tissue mass (muscle, fat, and bone) and height. The following plot indicates the relationship between weight and height for the various categories:

Figure 2.29: Body Mass Index categories (source: https://en.wikipedia.org/wiki/Body_mass_index)

According to the preceding plot, we can build the four categories (underweight, healthy weight, overweight, and obese) based on the BMI values:

"""
define function for computing the BMI category, based on BMI value
"""
def get_bmi_category(bmi):
    if bmi < 18.5:
        category = "underweight"
    elif bmi >= 18.5 and bmi < 25:
        category = "healthy weight"
    elif bmi >= 25 and bmi < 30:
        category = "overweight"
    else:
        category = "obese"
    return category
# compute BMI category
preprocessed_data["BMI category"] = preprocessed_data\
                                    ["Body mass index"]\
                                    .apply(get_bmi_category)
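
The same binning can also be expressed without a hand-written function. The following is a sketch that uses pandas.cut() on a few made-up BMI values (the toy Series is an assumption, not part of the dataset); right=False makes each interval closed on the left, which matches the >= comparisons in get_bmi_category().

```python
import numpy as np
import pandas as pd

# toy BMI values chosen to hit every boundary (hypothetical data)
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 38.0])

# interval edges: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [-np.inf, 18.5, 25, 30, np.inf]
labels = ["underweight", "healthy weight", "overweight", "obese"]

# right=False -> intervals are closed on the left, open on the right
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_category.tolist())
```

A side effect of this approach is that the result is an ordered categorical column, so plots can pick up the category order without an explicit order argument.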

We can plot the number of entries for each category:

# plot number of entries for each category
plt.figure(figsize=(10, 6))
sns.countplot(data=preprocessed_data, x='BMI category', \
              order=["underweight", "healthy weight", \
                     "overweight", "obese"], \
              palette="Set2")
plt.savefig('figs/bmi_categories.png', format='png', dpi=300)

The following is the output of the preceding code:

Figure 2.30: BMI categories

We can see that no entries fall into the underweight category, with the data being almost uniformly distributed among the remaining three categories. This is an alarming indicator, as more than 60% of the entries belong to employees who are either overweight or obese.

Now, let's check how the different BMI categories are related to the reason for absence. More precisely, we would like to see how many employees there are based on their body mass index and their reason for absence. This can be done with the following code:

# plot BMI categories vs Reason for absence
plt.figure(figsize=(10, 16))
ax = sns.countplot(data=preprocessed_data, \
                   y="Reason for absence", hue="BMI category", \
                   hue_order=["underweight", "healthy weight", \
                              "overweight", "obese"], \
                   palette="Set2")
ax.set_xlabel("Number of employees")
plt.savefig('figs/reasons_bmi.png', format='png', dpi=300)

The output will be as follows:

Figure 2.31: Absence reasons, based on BMI category

Unfortunately, no clear pattern arises from the preceding plot. In other words, for each reason for absence, an (almost) equal number of employees with different body mass indexes are present.

We can also investigate the distribution of absence hours for the different BMI categories:

# plot distribution of absence time, based on BMI category
plt.figure(figsize=(8,6))
sns.violinplot(x="BMI category", \
               y="Absenteeism time in hours", \
               data=preprocessed_data, \
               order=["healthy weight", "overweight", "obese"])
plt.savefig('figs/bmi_hour_distribution.png', format='png')

The output will be as follows:

Figure 2.32: Absence time in hours, based on the BMI category

Neither Figure 2.31 nor Figure 2.32 provides evidence that BMI and obesity levels influence the employees' absenteeism.

Age and Education Factors

Age and education may also influence employees' absenteeism. For instance, older employees might need more frequent medical treatment, while employees with higher education degrees, covering positions of higher responsibility, might be less prone to being absent.

First, let's investigate the correlation between age and absence hours. We will create a regression plot, in which we'll plot the Age column on the x axis and Absenteeism time in hours on the y axis. We'll also include the Pearson's correlation coefficient and its p-value, where the null hypothesis is that the correlation coefficient between the two features is equal to zero:

from scipy.stats import pearsonr
# compute Pearson's correlation coefficient and p-value
pearson_test = pearsonr(preprocessed_data["Age"], \
               preprocessed_data["Absenteeism time in hours"])
"""
create regression plot and add correlation coefficient in the title
"""
plt.figure(figsize=(10, 6))
ax = sns.regplot(x="Age", y="Absenteeism time in hours", \
                 data=preprocessed_data, scatter_kws={"alpha":0.1})
ax.set_title(f"Correlation={pearson_test[0]:.03f} \
| p-value={pearson_test[1]:.03f}")
plt.savefig('figs/correlation_age_hours.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.33: Correlation plot for absenteeism time and age

As we can observe from the resulting plot, no significant pattern occurs. Furthermore, the correlation coefficient is extremely small (0.066), and its p-value is above the threshold of 0.05, which is an additional indicator that no linear relationship is present between the Age and Absenteeism time in hours features.
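
To make the meaning of the coefficient concrete, the following sketch computes Pearson's correlation by hand (the covariance of the centered variables divided by the product of their spreads) and compares it with pearsonr(). The synthetic x and y arrays are assumptions standing in for two numeric columns.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=1)

# hypothetical stand-ins for two numeric columns
x = rng.uniform(20, 60, size=200)
y = 0.1 * x + rng.normal(0, 5, size=200)

# library result: correlation coefficient and two-sided p-value
r_scipy, p_value = pearsonr(x, y)

# manual computation from the definition
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

print(f"scipy: {r_scipy:.6f} | manual: {r_manual:.6f} | p-value: {p_value:.3g}")
```

Keep in mind that the coefficient only measures linear association; a value near zero does not rule out a nonlinear dependence.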

We can also check whether age has some impact on the reason for absence. We'll perform this analysis in the next exercise.

Exercise 2.04: Investigating the Impact of Age on Reason for Absence

In this exercise, we'll investigate the relationship between the Age feature and the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

First, create a violin plot between the Age and Disease features. This will give you your first insight into the relationship between the two columns:

# create violin plot between the Age and Disease columns
plt.figure(figsize=(8,6))
sns.violinplot(x="Disease", y="Age", data=preprocessed_data)
plt.savefig('figs/exercise_204_age_disease.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.34: Violin plot for disease versus age

From Step 1, you can see some differences between the two distributions of age. In fact, for samples with ICD encoded reasons for absence (labeled Yes in the Disease column), you can observe that slightly more samples are present for older employees. To confirm this difference in distributions, perform hypothesis tests on the means and distributions of the two groups:

"""
get Age entries for employees with Disease == Yes and Disease == No
"""
disease_mask = preprocessed_data["Disease"] == "Yes"
disease_ages = preprocessed_data["Age"][disease_mask]
no_disease_ages = preprocessed_data["Age"][~disease_mask]
# perform hypothesis test for equality of means
test_res = ttest_ind(disease_ages, no_disease_ages)
print(f"Test for equality of means: \
statistic={test_res[0]:0.3f}, pvalue={test_res[1]:0.3f}")
# test equality of distributions via Kolmogorov-Smirnov test
ks_res = ks_2samp(disease_ages, no_disease_ages)
print(f"KS test for equality of distributions: \
statistic={ks_res[0]:0.3f}, pvalue={ks_res[1]:0.3f}")

The output will be as follows:

Test for equality of means: statistic=0.630, pvalue=0.529
KS test for equality of distributions: statistic=0.057, pvalue=0.619

From the results of the two tests, you can conclude that there is no statistically significant difference between the two distributions. Thus, age is neither an indicator for the length of an absence nor for its type.

Now investigate the relationship between age and reason for absence:

# violin plot of reason for absence vs age
plt.figure(figsize=(20,8))
sns.violinplot(x="Reason for absence", y="Age", \
               data=preprocessed_data)
plt.savefig('figs/exercise_204_age_reason.png', format='png')

The output will be as follows:

Figure 2.35: Violin plot for age and reason for absence

In light of the previously performed analysis, you can conclude that age has no impact on the employees' absenteeism.

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7jEj6.

You can also run this example online at https://packt.live/3d7q5qD. You must execute the entire Notebook in order to get the desired result.

Now, let's analyze the impact of education level on absenteeism.

Exercise 2.05: Investigating the Impact of Education on Reason for Absence

In this exercise, you will analyze the existing relationship between the Reason for absence and Education columns. You will start by looking at the percentage of employees with a certain educational degree, and then relate those numbers to the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

Before starting the analysis, check the percentage of employees in the data that hold a certain degree:

# compute percentage of employees per education level
education_types = ["high_school", "graduate", \
                   "postgraduate", "master_phd"]
counts = preprocessed_data["Education"].value_counts()
percentages = preprocessed_data["Education"]\
              .value_counts(normalize=True)
for educ_type in education_types:
    print(f"Education type: {educ_type:12s} \
| Counts : {counts[educ_type]:6.0f} \
| Percentage: {100*percentages[educ_type]:4.1f}")

The output will be as follows:

Education type: high_school  | Counts :    611 | Percentage: 82.6
Education type: graduate     | Counts :     46 | Percentage:  6.2
Education type: postgraduate | Counts :     79 | Percentage: 10.7
Education type: master_phd   | Counts :      4 | Percentage:  0.5

You can see that most of the employees in the data have a high school degree (82.6%), which means that the data is highly biased toward these employees.

Create a distribution plot of the number of hours of absence, based on the level of education of the employees:

# distribution of absence hours, based on education level
plt.figure(figsize=(8,6))
sns.violinplot(x="Education", y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["high_school", "graduate", \
                      "postgraduate", "master_phd"])
plt.savefig('figs/exercise_205_education_hours.png', format='png')

The output will be as follows:

Figure 2.36: Violin plot for number of hours of absence for each level of education

It seems most of the extreme cases of absence are among employees with lower education levels. Compute the mean and standard deviation of the absence duration for the different levels of education:

# compute mean and standard deviation of absence hours
education_types = ["high_school", "graduate", \
                   "postgraduate", "master_phd"]
for educ_type in education_types:
    mask = preprocessed_data["Education"] == educ_type
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Education type: {educ_type:12s} | Mean : {mean:.03f} \
| Stddev: {stddev:.03f}")

The output will be as follows:

Education type: high_school  | Mean : 7.190 | Stddev: 14.259
Education type: graduate     | Mean : 6.391 | Stddev: 6.754
Education type: postgraduate | Mean : 5.266 | Stddev: 7.963
Education type: master_phd   | Mean : 5.250 | Stddev: 3.202

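You can see that the mean of the hours of absence decreases as the education level increases, meaning that highly educated employees tend to have shorter absences. The standard deviation is largest for the high_school group but does not decrease monotonically (it is higher for postgraduate than for graduate employees), and the master_phd estimates rest on only four entries. Of course, a higher degree of education is not necessarily the cause of this phenomenon; it is merely associated with it.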
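
The per-group loop can also be written as a single groupby() aggregation, which makes it easy to report the group size next to each estimate. This is a sketch on a small made-up frame (the toy values are assumptions; only the column names follow the chapter).

```python
import pandas as pd

# toy frame with the chapter's column names but hypothetical values
toy = pd.DataFrame({
    "Education": ["high_school"] * 4 + ["graduate"] * 2,
    "Absenteeism time in hours": [2, 4, 8, 40, 3, 5],
})

# count, mean, and sample standard deviation per education level
summary = (
    toy.groupby("Education")["Absenteeism time in hours"]
       .agg(["count", "mean", "std"])
)
print(summary)
```

Showing the count alongside the mean is a useful habit here, since an estimate based on a handful of entries should be read with caution.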

Now, check the reasons for absence based on the education level:

# plot reason for absence, based on education level
plt.figure(figsize=(10, 16))
sns.countplot(data=preprocessed_data, y="Reason for absence",\
              hue="Education", \
              hue_order=["high_school", "graduate", \
                         "postgraduate", "master_phd"])
plt.savefig('figs/exercise_205_education_reason.png', format='png')

The output will be as follows:

Figure 2.37: Reasons for absence for each level of education

From the preceding plot, you can observe that most of the absences relate to employees with a high_school level of education. This is, of course, due to the fact that most of the employees only have a high school degree (as observed in Step 1). Furthermore, from our analysis in Step 2, we saw that most of the absences that consisted of a greater number of hours were among employees with a high_school education level.

One question that comes to mind is whether the probability of being absent for more than one working week (40 hours) is greater for employees with a high school degree compared to graduates. In order to address this question, use the definition of conditional probability:

Figure 2.38: Conditional probability for extreme absences by employees with a high school degree

Figure 2.39: Conditional probability for extreme absences by employees without a high school degree

The following code snippet computes the conditional probabilities:

"""
define threshold for extreme hours of absenteeism and get total number of entries
"""
threshold = 40
total_entries = len(preprocessed_data)
# find entries with Education == high_school
high_school_mask = preprocessed_data["Education"] == "high_school"
# find entries with absenteeism time in hours more than threshold
extreme_mask = preprocessed_data\
               ["Absenteeism time in hours"] > threshold
# compute probability of having high school degree
prob_high_school = len(preprocessed_data[high_school_mask])\
                   /total_entries
# compute probability of having more than high school degree
prob_graduate = len(preprocessed_data[~high_school_mask])\
                /total_entries
"""
compute probability of having high school and being absent for more than "threshold" hours
"""
prob_extreme_high_school = len(preprocessed_data\
                               [high_school_mask & extreme_mask])\
                               /total_entries
"""
compute probability of having more than high school and being absent for more than "threshold" hours
"""
prob_extreme_graduate = len(preprocessed_data\
                            [~high_school_mask & extreme_mask])\
                            /total_entries
# compute and print conditional probabilities
cond_prob_extreme_high_school = prob_extreme_high_school\
                                /prob_high_school
cond_prob_extreme_graduate = prob_extreme_graduate/prob_graduate
print(f"P(extreme absence | degree = high_school) = \
{100*cond_prob_extreme_high_school:3.2f}")
print(f"P(extreme absence | degree != high_school) = \
{100*cond_prob_extreme_graduate:3.2f}")
preprocessed_data.head().T

The output will be as follows:

P(extreme absence | degree = high_school) = 2.29
P(extreme absence | degree != high_school) = 0.78

The preprocessed data now looks as follows:

Figure 2.40: Analysis of data

Note

To access the source code for this specific section, please refer to https://packt.live/3fxhorg.

You can also run this example online at https://packt.live/2YDVBr0. You must execute the entire Notebook in order to get the desired result.

From the preceding computations, we can see that the probability of having an absence of more than 40 hours for employees with a high school degree is 2.29%, which is approximately three times greater than the same probability for employees with a higher degree (0.78%).
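
Because both the joint and the marginal probability are divided by the same total number of entries, the conditional probability reduces to a ratio of counts. The helper below is a sketch of that shortcut; the function name conditional_probability and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def conditional_probability(event_mask, given_mask):
    """Estimate P(event | given) from two boolean Series."""
    # entries satisfying both conditions, divided by entries satisfying "given"
    return (event_mask & given_mask).sum() / given_mask.sum()

# toy data: 5 high school entries and 3 entries with a higher degree
toy = pd.DataFrame({
    "Education": ["high_school"] * 5 + ["graduate"] * 3,
    "Absenteeism time in hours": [8, 48, 2, 64, 3, 4, 120, 1],
})

high_school = toy["Education"] == "high_school"
extreme = toy["Absenteeism time in hours"] > 40

p_extreme_hs = conditional_probability(extreme, high_school)
p_extreme_other = conditional_probability(extreme, ~high_school)
print(f"P(extreme | high_school)     = {p_extreme_hs:.3f}")
print(f"P(extreme | not high_school) = {p_extreme_other:.3f}")
```

In the toy data, two of the five high school entries exceed 40 hours, and one of the three remaining entries does.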

Transportation Costs and Distance to Work Factors

Two possible indicators for absenteeism may also be the distance between home and work (the Distance from Residence to Work column) and transportation costs (the Transportation expense column). Employees who have to travel longer, or whose costs for commuting to work are high, might be more prone to absenteeism.

In this section, we will investigate the relationship between these variables and the absence time in hours. Since we do not believe the aforementioned factors might be indicative of disease problems, we will not consider a possible relationship with the Reason for absence column.

First, let's start our analysis by plotting the previously mentioned columns (Distance from Residence to Work and Transportation expense) against the Absenteeism time in hours column:

# plot transportation costs and distance to work against hours
plt.figure(figsize=(10, 6))
sns.jointplot(x="Distance from Residence to Work", \
              y="Absenteeism time in hours", \
              data=preprocessed_data, kind="reg")
plt.savefig('figs/distance_vs_hours.png', format='png')
plt.figure(figsize=(10, 6))
sns.jointplot(x="Transportation expense", \
              y="Absenteeism time in hours", \
              data=preprocessed_data, kind="reg")
plt.savefig('figs/costs_vs_hours.png', format='png')

Note that, here, we used the seaborn jointplot() function, which not only produces the regression plot between the two variables but also estimates their distribution. The output will be as follows:

Figure 2.41: Regression plot of distance from work versus absenteeism in hours

Figure 2.42: Regression plot of transportation costs versus absenteeism in hours

As we can see, the distributions of Distance from Residence to Work and Transportation expense look close to normal distributions, while the absenteeism time in hours is heavily right-skewed. This makes the comparison between the variables difficult to interpret. One solution to this problem is to transform the data into something close to a normal distribution. A handy way to perform this transformation is to use the Box-Cox or Yeo-Johnson transformations. Both are defined as a family of functions, depending on a parameter λ, under which the transformed data is as close to normal as possible.

The Box-Cox transformation is defined as follows:

Figure 2.43: Expression for Box-Cox transformation if λ is not equal to 0

Figure 2.44: Expression for Box-Cox transformation if λ is equal to 0
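
For readers who prefer code to formulas, the following sketch writes out the standard textbook form of the Box-Cox transformation, (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, and checks it against scipy.stats.boxcox() for a few fixed values of λ. The helper name box_cox_manual and the sample values are assumptions.

```python
import numpy as np
from scipy.stats import boxcox

def box_cox_manual(x, lmbda):
    """Box-Cox transform of strictly positive data for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)              # limiting case for lambda = 0
    return (x ** lmbda - 1.0) / lmbda

# strictly positive toy values (hypothetical)
x = np.array([0.5, 1.0, 2.0, 8.0, 40.0])

# compare with scipy for several fixed lambdas
matches = {
    lmbda: bool(np.allclose(box_cox_manual(x, lmbda), boxcox(x, lmbda=lmbda)))
    for lmbda in (0.0, 0.5, 2.0)
}
print(matches)
```

When lmbda is passed explicitly, boxcox() returns only the transformed array; when it is omitted, the function also estimates and returns the optimal λ.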

The optimal value of the parameter λ is the one that results in the best approximation of a normal distribution. Note that the Box-Cox transformation fails if the data assumes negative values or zero. If this is the case, the Yeo-Johnson transformation can be used:

Figure 2.45: Expression for Yeo-Johnson transformation

In Python, both transformations can be found in the scipy.stats module (in the boxcox() and yeojohnson() functions, respectively).
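
The difference between the two functions can be seen on a small synthetic example. The sketch below assumes a right-skewed sample that contains a single zero (the data, seed, and variable names are illustrative): boxcox() rejects it, while yeojohnson() returns the transformed values together with the fitted λ.

```python
import numpy as np
from scipy.stats import boxcox, skew, yeojohnson

rng = np.random.default_rng(seed=2)

# right-skewed synthetic sample with one explicit zero (hypothetical data)
data = np.append(rng.exponential(scale=5.0, size=999), 0.0)

# Box-Cox requires strictly positive input, so this raises a ValueError
try:
    boxcox(data)
    boxcox_failed = False
except ValueError:
    boxcox_failed = True

# Yeo-Johnson accepts zeros (and negatives); returns (values, fitted lambda)
transformed, fitted_lambda = yeojohnson(data)

skew_before = skew(data)
skew_after = skew(transformed)
print(f"Box-Cox failed: {boxcox_failed}")
print(f"lambda={fitted_lambda:.3f}, "
      f"skew before={skew_before:.3f}, after={skew_after:.3f}")
```

Comparing the sample skewness before and after is a quick sanity check that the transformation moved the data toward symmetry.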

Since the Absenteeism time in hours column contains zeros, we will apply the Yeo-Johnson transformation in order to reproduce the plots from Figure 2.41 and Figure 2.42:

# run Yeo-Johnson transformation and recreate previous plots
from scipy.stats import yeojohnson
hours = yeojohnson(preprocessed_data\
                   ["Absenteeism time in hours"].apply(float))
distances = preprocessed_data["Distance from Residence to Work"]
expenses = preprocessed_data["Transportation expense"]
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=distances, y=hours[0], kind="reg")
ax.set_axis_labels("Distance from Residence to Work",\
                   "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed.png', format='png')
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=expenses, y=hours[0], kind="reg")
ax.set_axis_labels("Transportation expense", \
                   "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed.png', format='png')

The output will be as follows:

Figure 2.46: Regression plot of distance from work versus transformed absenteeism in hours

Figure 2.47: Regression plot of transportation costs versus transformed absenteeism in hours

We can also produce kernel density estimation plots (that is, plots that help us visualize the probability density functions of continuous variables) by setting the kind parameter of the jointplot() function to "kde":

# produce KDE plots 
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=distances, y=hours[0], kind="kde")
ax.set_axis_labels("Distance from Residence to Work",\
                   "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed_kde.png', \
            format='png')
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=expenses, y=hours[0], kind="kde")
ax.set_axis_labels("Transportation expense", \
                   "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed_kde.png', \
            format='png')

The KDE plot for distance from residence to work versus absent hours will be as follows:

Figure 2.48: KDE plot for distance from residence to work versus absent hours

The KDE plot for transport expense versus absent hours will be as follows:

Figure 2.49: KDE plot for transport expense versus absent hours

From Figure 2.46 and Figure 2.47, we can also see that the regression line between the variables is almost flat for the Distance from Residence to Work column (which is a clear indicator of zero correlation) but has a slight upward slope for the Transportation expense column. Therefore, we can expect a small positive correlation:

# investigate correlation between the columns
distance_corr = pearsonr(hours[0], distances)
expenses_corr = pearsonr(hours[0], expenses)
print(f"Distances correlation: corr={distance_corr[0]:.3f}, \
pvalue={distance_corr[1]:.3f}")
print(f"Expenses correlation:  corr={expenses_corr[0]:.3f}, \
pvalue={expenses_corr[1]:.3f}")

The output will be as follows:

Distances correlation: corr=-0.000, pvalue=0.999
Expenses correlation:  corr=0.113, pvalue=0.002

These results confirm our observation, stating that there is a slight positive correlation between Transportation expense and Absenteeism time in hours.

Temporal Factors

Factors such as day of the week and month may also be indicators for absenteeism. For instance, employees might prefer to have their medical examinations on Friday, when the workload is lower and the weekend is closer. In this section, we will analyze the impact of the Day of the week and Month of absence columns on the employees' absenteeism.

Let's begin with an analysis of the number of entries for each day of the week and each month:

# count entries per day of the week and month
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Day of the week', \
                   order=["Monday", "Tuesday", \
                          "Wednesday", "Thursday", "Friday"])
ax.set_title("Number of absences per day of the week")
plt.savefig('figs/dow_counts.png', format='png', dpi=300)
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Month of absence', \
                   order=["January", "February", "March", \
                          "April", "May", "June", "July", \
                          "August", "September", "October", \
                          "November", "December", "Unknown"])
ax.set_title("Number of absences per month")
plt.savefig('figs/month_counts.png', format='png', dpi=300)

The output will be as follows:

Figure 2.50: Number of absences per day of the week

The number of absences per month can be visualized as follows:

Figure 2.51: Number of absences per month

From the preceding plots, we can't really see a substantial difference between the different days of the week or months. It seems that fewer absences occur on Thursday, while the month with the most absences is March, but it is hard to say that the difference is significant.
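
One way to quantify such an impression, which is not used in this chapter, is a chi-square goodness-of-fit test against equal counts for every day. The sketch below uses made-up counts (an assumption, not the chapter's data). Note also that the test assumes independent observations, which may not hold here because the same employee can appear in several entries.

```python
from scipy.stats import chisquare

# hypothetical absence counts for Monday through Friday
observed = [150, 100, 100, 100, 50]

# with no expected frequencies given, chisquare() tests against uniform counts
result = chisquare(observed)

print(f"statistic={result.statistic:.1f}, pvalue={result.pvalue:.3g}")
```

For these toy counts, the expected value is 100 per day, so only Monday and Friday contribute to the statistic.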

Now, let's focus on the distribution of absence hours among the days of the week and the months of the year. This analysis will be performed in the following exercise.

Exercise 2.06: Investigating Absence Hours, Based on the Day of the Week and the Month of the Year

In this exercise, you will look at how the hours of absence are distributed across the days of the week and the months of the year. Execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

Consider the distribution of absence hours among the days of the week and months of the year:

# analyze average distribution of absence hours 
plt.figure(figsize=(12,5))
sns.violinplot(x="Day of the week", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["Monday", "Tuesday", \
                      "Wednesday", "Thursday", "Friday"])
plt.savefig('figs/exercise_206_dow_hours.png', \
            format='png', dpi=300)
plt.figure(figsize=(12,5))
sns.violinplot(x="Month of absence", \
               y="Absenteeism time in hours",\
               data=preprocessed_data, \
               order=["January", "February", \
                      "March", "April", "May", "June", "July",\
                      "August", "September", "October", \
                      "November", "December", "Unknown"])
plt.savefig('figs/exercise_206_month_hours.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.52: Average absent hours during the week

The violin plot for the average absent hours over the year can be visualized as follows:

Figure 2.53: Average absent hours over the year

Compute the mean and standard deviation of the absences based on the day of the week:

"""
compute mean and standard deviation of absence hours per day of the week
"""
dows = ["Monday", "Tuesday", "Wednesday", \
        "Thursday", "Friday"]
for dow in dows:
    mask = preprocessed_data["Day of the week"] == dow
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Day of the week: {dow:10s} | Mean : {mean:.03f} \
| Stddev: {stddev:.03f}")

The output will be as follows:

Figure 2.54: Mean and standard deviation of absent hours per day of the week

Similarly, compute the mean and standard deviation based on the month, as follows:

"""
compute mean and standard deviation of absence hours per month
"""
months = ["January", "February", "March", "April", "May", \
          "June", "July", "August", "September", "October", \
          "November", "December"]
for month in months:
    mask = preprocessed_data["Month of absence"] == month
    hours = preprocessed_data["Absenteeism time in hours"][mask]
    mean = hours.mean()
    stddev = hours.std()
    print(f"Month: {month:10s} | Mean : {mean:8.03f} \
| Stddev: {stddev:8.03f}")

The output will be as follows:

Figure 2.55: Mean and standard deviation of absent hours per month

Observe that the average duration of the absences is slightly shorter on Thursday (4.424 hours), while absences during July have the longest average duration (10.955 hours). To determine whether these differences from the rest of the days and months are statistically significant, use the following code snippet:

# perform statistical test for avg duration difference
thursday_mask = preprocessed_data\
                ["Day of the week"] == "Thursday"
july_mask = preprocessed_data\
            ["Month of absence"] == "July"
thursday_data = preprocessed_data\
                ["Absenteeism time in hours"][thursday_mask]
no_thursday_data = preprocessed_data\
                   ["Absenteeism time in hours"][~thursday_mask]
july_data = preprocessed_data\
            ["Absenteeism time in hours"][july_mask]
no_july_data = preprocessed_data\
               ["Absenteeism time in hours"][~july_mask]
thursday_res = ttest_ind(thursday_data, no_thursday_data)
july_res = ttest_ind(july_data, no_july_data)
print(f"Thursday test result: statistic={thursday_res[0]:.3f}, \
pvalue={thursday_res[1]:.3f}")
print(f"July test result: statistic={july_res[0]:.3f}, \
pvalue={july_res[1]:.3f}")

The output will be as follows:

Thursday test result: statistic=-2.307, pvalue=0.021
July test result: statistic=2.605, pvalue=0.009

Summarize and visualize the data as follows:

```
preprocessed_data.head().T
preprocessed_data["Service time"].hist()
```

The output will be as follows:

Figure 2.56: Statistics of data

Visualize the plot as follows:

Figure 2.57: Histogram for preprocessed data

Note

To access the source code for this specific section, please refer to https://packt.live/2AIFO1X.

You can also run this example online at https://packt.live/37y5omt. You must execute the entire Notebook in order to get the desired result.

Since the p-values from both the statistical tests are below the critical value of 0.05, we can conclude the following:

There is a statistically significant difference between Thursdays and other days of the week. Absences on Thursday have a shorter duration, on average.
Absences during July are, on average, the longest over the year. In this case, too, we can reject the null hypothesis of no difference.

From the analysis we've performed in this exercise, we can conclude that our initial observations about the difference in absenteeism during the month of July and on Thursdays are correct. Of course, we cannot claim that this is the cause, but only state that certain trends exist in the data.
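
The "one group versus all the others" comparison used in this exercise can be wrapped in a small helper so that it can be repeated for any day or month. The following is a sketch; the function name one_vs_rest_ttest and the toy data are assumptions, while the column names follow the chapter.

```python
import pandas as pd
from scipy.stats import ttest_ind

def one_vs_rest_ttest(data, group_col, group_value, target_col):
    """t-test of target_col for rows in one group against all other rows."""
    mask = data[group_col] == group_value
    return ttest_ind(data[target_col][mask], data[target_col][~mask])

# toy data: short absences on Thursday, longer ones on other days
toy = pd.DataFrame({
    "Day of the week": ["Thursday"] * 5 + ["Monday"] * 3 + ["Friday"] * 3,
    "Absenteeism time in hours": [1, 2, 1, 2, 1, 8, 9, 10, 8, 9, 11],
})

res = one_vs_rest_ttest(toy, "Day of the week", "Thursday",
                        "Absenteeism time in hours")
print(f"statistic={res.statistic:.3f}, pvalue={res.pvalue:.3g}")
```

A negative statistic indicates that the selected group has the smaller mean. If the helper is run for every day or month in turn, remember that running many tests increases the chance of a spurious significant result.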

Activity 2.01: Analyzing the Service Time and Son Columns

In this activity, you will extend the analysis of the absenteeism dataset by exploring the impact of two additional columns: Service time and Son.

This activity is based on the techniques that have been presented in this chapter: distribution analysis, hypothesis testing, and conditional probability estimation.

The following steps will help you complete this activity:

Import the data and the necessary libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Analyze the distribution of the Service time column by creating a kernel density estimation plot (use the seaborn.kdeplot() function). Perform a hypothesis test for normality (that is, a Kolmogorov-Smirnov test with the scipy.stats.kstest() function). The KDE plot will be as follows:

Figure 2.58: KDE plot for service time

Create a violin plot of the Service time column and the Reason for absence column. Draw a conclusion about the observed relationship. The output will be as follows:

Figure 2.59: Violin plot for the Service time column

Create a correlation plot between the Service time and Absenteeism time in hours columns, similar to the one in Figure 2.47. The output will be as follows:

Figure 2.60: Correlation plot for service time

Analyze the distributions of Absenteeism time in hours for employees with a different number of children (the Son column). The output will be as follows:

Figure 2.61: Distribution of absent time for employees with a different number of children
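
As a hint for the normality test in this activity, the following sketch shows one possible way to call scipy.stats.kstest() against a standard normal distribution. It is not the official solution: the synthetic sample is only a stand-in for the Service time column, and standardizing with the sample mean and standard deviation is an assumption that makes the resulting p-value approximate.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(seed=3)

# hypothetical, clearly non-normal stand-in for the "Service time" column
service_time = rng.exponential(scale=10.0, size=500)

# standardize so the sample can be compared with the standard normal N(0, 1)
standardized = (service_time - service_time.mean()) / service_time.std(ddof=1)

# one-sample KS test: H0 = the sample follows the given distribution
result = kstest(standardized, "norm")
print(f"statistic={result.statistic:.3f}, pvalue={result.pvalue:.3g}")
```

A p-value below 0.05 means that the hypothesis of normality is rejected for the sample.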

Note

The solution for this activity can be found via this link.

From this analysis, we can infer that the number of absence hours for employees with a greater number of children lies in the range of 10-15 hours. Employees with fewer than three children appear to be absent for a varying range of 1-20 hours. To be specific, employees with no children still have a varying number of absent hours within the range of 10-15 hours, owing to other reasons, which opens up a new area of analysis. In contrast, employees with one child are absent for an average of only 5 hours. Employees with two children have an average of 15-25 absent hours, which could be analyzed further.

Thus, we have successfully drawn measurable conclusions to help us understand employee behavior in an organization to tackle unregulated absenteeism and take necessary measures to ensure the optimal utilization of human resources.

