The Data Analysis Workshop

The Data Analysis Workshop: Solve business problems with state-of-the-art data analysis models, developing expert data analysis skills along the way

By Gururajan Govindan, Shubhangi Hora, Konstantin Palagachev, Brent Broadnax, John Wesley Doyle, Ashish Jain, Robert Thas John, Ravi Ranjan Prasad Karn, Pritesh Tiwari, and 5 more
Paperback Jul 2020 626 pages 1st Edition

2. Absenteeism at Work

Overview

In this chapter, you will apply standard data analysis techniques, such as estimating conditional probabilities, applying Bayes' theorem, and using Kolmogorov-Smirnov tests to compare distributions. You will also implement data transformation techniques, such as the Box-Cox and Yeo-Johnson transformations, and apply these techniques to a given dataset.

Introduction

In the previous chapter, we looked at some of the main techniques that are used in data analysis. We saw how hypothesis testing can be used when analyzing data, we got a brief introduction to visualizations, and finally, we explored some concepts related to time series analysis. In this chapter, we will elaborate on some of the topics we've already looked at (such as plotting and hypothesis testing) while introducing new ones coming from probability theory and data transformations.

Nowadays, work relationships are becoming more and more trust-oriented, and conservative contracts (in which working time is strictly monitored) are being replaced with more agile ones in which employees themselves are responsible for accounting for their working time. This liberty may lead to unregulated absenteeism and may reflect poorly on an employee's candidature, even if absent hours can be accounted for with genuine reasons. This can significantly undermine healthy working relationships. Furthermore, unregulated absenteeism can also have a negative impact on work productivity.

In this chapter, we'll analyze absenteeism data from a Brazilian courier company, collected between July 2007 and July 2010.

Note

The original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work.

If you're interested, take a look at the following paper, which talks about the problem from a machine learning perspective: Martiniano, A., Ferreira, R.P., Sassi, R.J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

This dataset can also be found on our GitHub repository here: https://packt.live/3e4rorX.

Our goal is to discover hidden patterns in the data, which might be useful for distinguishing genuine work absences from fraudulent ones. During this chapter, the following topics will be addressed:

  • Introduction to probability, conditional probability, and Bayes' theorem
  • Kolmogorov-Smirnov tests for equality of probability distributions
  • Box-Cox and Yeo-Johnson transformations

We will apply these techniques to our analysis as we try to identify the main drivers for absenteeism.

Initial Data Analysis

As a rule of thumb, when starting the analysis of a new dataset, it is good practice to check the dimensionality of the data, type of columns, possible missing values, and some generic statistics on the numerical columns. We can also get the first 5 to 10 entries in order to acquire a feeling for the data itself. We'll perform these steps in the following code snippets:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# import data from the GitHub page of the book
data = pd.read_csv('https://raw.githubusercontent.com'\
                   '/PacktWorkshops/The-Data-Analysis-Workshop'\
                   '/master/Chapter02/data/'\
                   'Absenteeism_at_work.csv', sep=";")

Note that we are providing the separator parameter when reading the data because, although the original data file is in the CSV format, the ";" symbol has been used to separate the various fields.
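
As a small illustration of what the sep parameter changes, the following sketch reads the same semicolon-separated text twice. The inline text is made up for the example (it only mimics the layout of the real file), and the variable names raw, wrong, and right are hypothetical:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout
raw = "ID;Reason for absence;Absenteeism time in hours\n11;26;4\n36;0;0\n"

# With the default separator (","), each line is read as a single field
wrong = pd.read_csv(io.StringIO(raw))
# With sep=";", the three fields are split into separate columns
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A DataFrame with a single, very long column name is therefore a quick hint that the wrong separator was used.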

In order to print the dimensionality of the data, column types, and the number of missing values, we can use the following code:

"""
print dimensionality of the data, columns, types and missing values
"""
print(f"Data dimension: {data.shape}")
for col in data.columns:
    print(f"Column: {col:35} | type: {str(data[col].dtype):7} \
| missing values: {data[col].isna().sum():3d}")

This returns the following output:

Figure 2.1: Dimensions of the Absenteeism_at_work dataset

As we can see from these 21 columns, only one (Work Load Average/day) does not contain integer values. Since no missing values are present in the data, we can consider it quite clean. We can also derive some basic statistics by using the describe method:

# compute statistics on numerical features
data.describe().T

The output will be as follows:

Figure 2.2: Output of the describe() method

Note that some of the columns, such as Month of absence, Day of the week, Seasons, Education, Disciplinary failure, Social drinker, and Social smoker, are encoding categorical values. So, we can back-transform the numerical values to their original categories so that we have better plotting features. We will perform the transformation by defining a Python dict object containing the mapping and then applying the apply() function to each feature, which applies the provided function to each of the values in the column. First, let's define the encoding dict objects:

# define encoding dictionaries
month_encoding = {1: "January", 2: "February", 3: "March", \
                  4: "April", 5: "May", 6: "June", 7: "July", \
                  8: "August", 9: "September", 10: "October", \
                  11: "November", 12: "December", 0: "Unknown"}
dow_encoding = {2: "Monday", 3: "Tuesday", 4: "Wednesday", \
                5: "Thursday", 6: "Friday"}
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
education_encoding = {1: "high_school", 2: "graduate", \
                      3: "postgraduate", 4: "master_phd"}
yes_no_encoding = {0: "No", 1: "Yes"}

Afterward, we apply the encoding dictionaries to the relevant features:

# backtransform numerical variables to categorical
preprocessed_data = data.copy()
preprocessed_data["Month of absence"] = preprocessed_data\
                                        ["Month of absence"]\
                                        .apply(lambda x: \
                                               month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data\
                                       ["Day of the week"]\
                                       .apply(lambda x: \
                                              dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"]\
                              .apply(lambda x: season_encoding[x])
preprocessed_data["Education"] = preprocessed_data["Education"]\
                                 .apply(lambda x: \
                                        education_encoding[x])
preprocessed_data["Disciplinary failure"] = \
preprocessed_data["Disciplinary failure"].apply(lambda x: \
                                                yes_no_encoding[x])
preprocessed_data["Social drinker"] = \
preprocessed_data["Social drinker"].apply(lambda x: \
                                          yes_no_encoding[x])
preprocessed_data["Social smoker"] = \
preprocessed_data["Social smoker"].apply(lambda x: \
                                         yes_no_encoding[x])
# show the first rows of the transformed data (transposed)
preprocessed_data.head().T

The output will be as follows:

Figure 2.3: Transformation of columns

In the previous code snippet, we created a clean copy of the original dataset by calling the .copy() method on the data object. In this way, a new copy of the original data is created. This is a convenient way to create new pandas DataFrames, without taking the risk of modifying the original raw data (as it might serve us later). Afterward, we created a set of dictionaries where the numerical values are keys and the categorical values are values. Finally, we used the .apply() method on each column we wanted to encode by mapping each value in the original column to its corresponding value in the encoding dictionary, which contains the target values. Note that in the Month of absence column, a 0 value is present, which is encoded as Unknown as no month corresponds to 0.
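
As a side note, pandas also offers the .map() method, which accepts the encoding dictionary directly. The sketch below uses a made-up toy column (the names months and decoded are hypothetical) and is meant only to show how the two approaches differ when a value is missing from the dictionary:

```python
import pandas as pd

# Hypothetical toy column; the real column comes from the dataset
months = pd.Series([7, 0, 12], name="Month of absence")
month_encoding = {0: "Unknown", 7: "July", 12: "December"}

# .map() looks each value up in the dictionary
decoded = months.map(month_encoding)
print(decoded.tolist())  # ['July', 'Unknown', 'December']

# A value missing from the dictionary becomes NaN with .map(),
# whereas apply(lambda x: month_encoding[x]) would raise a KeyError
unmapped = pd.Series([13]).map(month_encoding)
print(unmapped.isna().all())  # True
```

Which behavior is preferable depends on the situation: a KeyError surfaces unexpected codes immediately, while NaN values let the transformation finish and can be inspected afterward.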

Based on the description of the data, the Reason for absence column contains information about the absence, which is encoded based on the International Classification of Diseases (ICD). The following table represents the various encodings:

Figure 2.4: Reason for absence encoding

Note that only values 1 to 21 represent ICD encoding; values 22 to 28 are separate reasons, which do not represent a disease, while value 0 is not defined—hence the encoded reason Unknown. As all values contained in the ICD represent some type of disease, it makes sense to create a new binary variable that indicates whether the current reason for absence is related to some sort of disease or not. We will do this in the following exercise.

Exercise 2.01: Identifying Reasons for Absence

In this exercise, you will create a new variable, called Disease, which indicates whether a specific reason for absence is present in the ICD table or not. Please complete the initial data analysis before you begin this exercise. Now, follow these steps:

  1. First, define a function that returns Yes if a provided encoded value is contained in the ICD (values 1 to 21); otherwise, No:
    """
    define function, which checks if the provided integer value 
    is contained in the ICD or not
    """
    def in_icd(val):
        return "Yes" if val >= 1 and val <= 21 else "No"
  2. Combine the .apply() method with the previously defined in_icd() function in order to create the new Disease column in the preprocessed dataset:
    # add Disease column
    preprocessed_data["Disease"] = \
    preprocessed_data["Reason for absence"].apply(in_icd)
  3. Use bar plots in order to compare the absences due to disease reasons:
    import os
    os.makedirs('figs', exist_ok=True)  # make sure the output folder exists
    plt.figure(figsize=(10, 8))
    sns.countplot(data=preprocessed_data, x='Disease')
    plt.savefig('figs/disease_plot.png', format='png', dpi=300)

    The output will be as follows:

    Figure 2.5: Comparing absence count to disease

Here, we are using the seaborn .countplot() function, which is quite handy when creating this type of bar plot, in which we want to know the total number of entries for each specific class. As we can see, the number of absences whose reason is not listed in the ICD table is almost twice the number of absences whose reason is listed.
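
The same Yes/No flag can also be computed without a Python-level function. The following sketch assumes a made-up series of reason codes (reasons is a hypothetical name) and uses Series.between(), which is inclusive on both ends by default:

```python
import numpy as np
import pandas as pd

# Hypothetical reason codes, used only for illustration
reasons = pd.Series([0, 1, 13, 21, 22, 28])

# between(1, 21) is True for 1 <= value <= 21, matching in_icd()
disease = pd.Series(np.where(reasons.between(1, 21), "Yes", "No"))
print(disease.tolist())  # ['No', 'Yes', 'Yes', 'Yes', 'No', 'No']
```

For a dataset of this size, the difference in speed is negligible, so the choice is mostly a matter of readability.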

Note

To access the source code for this specific section, please refer to https://packt.live/2B9AqVJ.

You can also run this example online at https://packt.live/2UPwIr1. You must execute the entire Notebook in order to get the desired result.

In this section, we performed some simple data exploration and transformations on the initial absenteeism dataset. In the next section, we will go deeper into our data exploration and analyze some of the possible reasons for absence.

Initial Analysis of the Reason for Absence

Let's start with a simple analysis of the Reason for absence column. We will try to address questions such as, what is the most common reason for absence? Does being a drinker or smoker have some effect on the causes? Does the distance to work have some effect on the reasons? And so on. Starting with these types of questions is often important when performing data analysis, as this is a good way to obtain confidence and understanding of the data.

The first thing we are interested in is the overall distribution of the absence reasons in the data—that is, how many entries we have for a specific reason for absence in our dataset. We can easily address this question by using the countplot() function from the seaborn package:

# get the number of entries for each reason for absence
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x="Reason for absence", hue="Disease")
ax.set_ylabel("Number of entries per reason of absence")
plt.savefig('figs/absence_reasons_distribution.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.6: Number of entries for all reasons for absence

Note that we also used the Disease column as the hue parameter. This helps us to distinguish between disease-related reasons (listed in the ICD encoding) and those that aren't. Using the encoding in Figure 2.4, we can assert that the most frequent reasons for absence are related to medical consultations (23), dental consultations (28), and physiotherapy (27). On the other hand, the most frequent reasons for absence encoded in the ICD encoding are related to diseases of the musculoskeletal system and connective tissue (13) and injury, poisoning, and certain other consequences of external causes (19).

In order to perform a more accurate and in-depth analysis of the data, we will investigate the impact of the various features on the Reason for absence and Absenteeism in hours columns in the following sections. First, we will analyze the data on social drinkers and smokers in the next section.

Analysis of Social Drinkers and Smokers

Let's begin with an analysis of the impact of being a drinker or smoker on employee absenteeism. As smoking and frequent drinking have a negative impact on health conditions, we would expect certain diseases to be more frequent among smokers and drinkers than among other employees. Note that in the absenteeism dataset, about 57% of the entries belong to social drinkers, while only about 7% belong to social smokers. We can produce a figure similar to Figure 2.6 for the social drinkers and smokers with the following code:

# plot reasons for absence against being a social drinker/smoker
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
              hue="Social drinker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_drinkers.png', \
            format='png', dpi=300)
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
              hue="Social smoker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_smokers.png', \
            format='png', dpi=300)

The following is the output of the preceding code:

Figure 2.7: Distribution of diseases over social drinkers

Similarly, the distribution of diseases for social smokers can be visualized as follows:

Figure 2.8: Distribution of diseases over social smokers

Next, calculate the actual proportions of social drinkers and smokers from the preprocessed data:

print(preprocessed_data["Social drinker"]\
      .value_counts(normalize=True))
print(preprocessed_data["Social smoker"]\
      .value_counts(normalize=True))

The output will be as follows:

Yes    0.567568
No     0.432432
Name: Social drinker, dtype: float64
No     0.927027
Yes    0.072973
Name: Social smoker, dtype: float64	

As we can see from the resulting plots, a significant difference between drinkers and non-drinkers can be observed in absences related to Dental consultations (28). Furthermore, as the number of social smokers is quite small (only 7% of the entries), it is very hard to say whether there is actually a relationship between the absence reasons and smoking. A more rigorous approach in this direction would be to analyze the conditional probabilities of the different absence reasons, given that the employee is a social drinker or smoker.

Conditional probability is a measure that tells us the probability of an event's occurrence, assuming that another event has occurred. From a mathematical perspective, given a set of events Ω and a probability measure P on Ω and given two events A and B in Ω with the unconditional probability of B being greater than zero (that is, P(B) > 0), we can define the conditional probability of A given B as follows:

P(A|B) = P(A ∩ B) / P(B)

Figure 2.9: Formula for conditional probability

In other words, the probability of A given B is equal to the probability of A and B both happening, divided by the probability of B happening. Let's consider a simple example that will help us understand the usage of conditional probability. This is a classic probability problem. Suppose that your friend has two children, and you know that at least one of them is a boy. We want to know the probability that your friend has two sons. First, we have to identify all the possible outcomes in our event space Ω. If we denote with B the event of having a boy, and with G the event of having a girl, then the event space contains four possible outcomes:

Ω = {BB, BG, GB, GG}

Figure 2.10: Event space Ω

Each outcome has a probability of 0.25. Following the notation from the definition, we can define the first event (both children are boys) like so:

A = {BB}

Figure 2.11: Event A

We can define the latter event (at least one child is a boy) like so:

B = {BB, BG, GB}

Figure 2.12: Event B

Now, our initial problem translates into computing P(A|B). With this, we get the following equation:

P(A|B) = P(A ∩ B) / P(B) = P({BB}) / P({BB, BG, GB}) = 0.25 / 0.75 = 1/3 ≈ 0.3333

Figure 2.13: Probability of event A conditioned to B

We can also perform this example computationally:

# computation of conditional probability
sample_space = set(["BB", "BG", "GB", "GG"])
event_a = set(["BB"])
event_b = set(["BB", "BG", "GB"])
cond_prob = (0.25*len(event_a.intersection(event_b))) \
            / (0.25*len(event_b))
print(round(cond_prob, 4))

The output will be as follows:

0.3333
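
The result depends on exactly what is known. As a sketch (the helper cond_prob() and the three event functions are hypothetical names introduced here), the outcomes can be enumerated and counted, which also shows that knowing that the first child is a boy leads to a different answer than knowing that at least one child is a boy:

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes for two children: BB, BG, GB, GG
outcomes = ["".join(p) for p in product("BG", repeat=2)]

def cond_prob(event_a, event_b):
    """P(A | B) for equally likely outcomes, computed as |A ∩ B| / |B|."""
    in_b = [o for o in outcomes if event_b(o)]
    in_a_and_b = [o for o in in_b if event_a(o)]
    return Fraction(len(in_a_and_b), len(in_b))

def both_boys(outcome):
    return outcome == "BB"

def at_least_one_boy(outcome):
    return "B" in outcome

def first_is_boy(outcome):
    return outcome[0] == "B"

print(cond_prob(both_boys, at_least_one_boy))  # 1/3
print(cond_prob(both_boys, first_is_boy))      # 1/2
```

Counting works here only because all four outcomes are equally likely; otherwise, the probabilities of the outcomes would have to be summed instead.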

Note that by using the definition of conditional probability, we could address questions such as, "What is the probability of a reason for absence being related to laboratory examinations, assuming that an employee is a social drinker?" In other words, if we denote the "employee is absent for laboratory examinations" event with A, and the "employee is a social drinker" event with B, the probability of the "employee is absent due to laboratory examination reasons, given that employee is a social drinker" event can be computed by the previous formula.

The following exercise illustrates how the conditional probability formula can identify reasons for absence with higher probability among smokers and drinkers.

Exercise 2.02: Identifying Reasons of Absence with Higher Probability Among Drinkers and Smokers

In this exercise, you will compute the conditional probabilities of the different reasons for absence, assuming that the employee is a social drinker or smoker. Please execute the code mentioned in the previous section and Exercise 2.01, Identifying Reasons for Absence before attempting this exercise. Now, follow these steps:

  1. To identify the conditional probabilities, first compute the unconditional probabilities of being a social drinker or smoker. Verify that both the probabilities are greater than zero, as they appear in the denominator of the conditional probabilities. Do this by counting the number of social drinkers and smokers and dividing these values by the total number of entries, like so:
    P(social drinker) = (number of social drinkers) / (total number of entries)

    Figure 2.14: Probability of being a social drinker

    P(social smoker) = (number of social smokers) / (total number of entries)

    Figure 2.15: Probability of being a social smoker

    The following code snippet does this for you:

    # compute probabilities of being a drinker and smoker
    drinker_prob = preprocessed_data["Social drinker"]\
                   .value_counts(normalize=True)["Yes"]
    smoker_prob = preprocessed_data["Social smoker"]\
                  .value_counts(normalize=True)["Yes"]
    print(f"P(social drinker) = {drinker_prob:.3f} \
    | P(social smoker) = {smoker_prob:.3f}")

    The output will be as follows:

    P(social drinker) = 0.568 | P(social smoker) = 0.073

    As you can see, the probability of being a drinker is almost 57%, while the probability of being a smoker is quite low (only 7.3%).

  2. Next, compute the probabilities of being a social drinker/smoker and being absent for each reason of absence. For a specific reason of absence (say Ri), these probabilities are defined as follows:
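    P(social drinker ∩ Ri) = (number of entries with reason Ri recorded for a social drinker) / (total number of entries)

    Figure 2.16: Probability of being a drinker and absent

    P(social smoker ∩ Ri) = (number of entries with reason Ri recorded for a social smoker) / (total number of entries)

    Figure 2.17: Probability of being a smoker and absent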

  3. In order to carry out the required computations, define masks in the data, which only account for entries where employees are drinkers or smokers:
    # create masks for social drinkers/smokers
    drinker_mask = preprocessed_data["Social drinker"] == "Yes"
    smoker_mask = preprocessed_data["Social smoker"] == "Yes"
  4. Compute the total number of entries and the number of absence reasons, masked by drinkers/smokers:
    total_entries = preprocessed_data.shape[0]
    absence_drinker_prob = preprocessed_data["Reason for absence"]\
                           [drinker_mask].value_counts()/total_entries
    absence_smoker_prob = preprocessed_data["Reason for absence"]\
                          [smoker_mask].value_counts()/total_entries
  5. Compute the conditional probabilities by dividing the joint probabilities computed in Step 4 for each reason of absence by the unconditional probabilities obtained in Step 1:
    # compute conditional probabilities
    cond_prob = pd.DataFrame(index=range(0,29))
    cond_prob["P(Absence | social drinker)"] = absence_drinker_prob\
                                               /drinker_prob
    cond_prob["P(Absence | social smoker)"] = absence_smoker_prob\
                                              /smoker_prob
  6. Create bar plots for the conditional probabilities:
    # plot probabilities
    plt.figure()
    ax = cond_prob.plot.bar(figsize=(10,6))
    ax.set_ylabel("Conditional probability")
    plt.savefig('figs/conditional_probabilities.png', \
                format='png', dpi=300)

    The output will be as follows:

    Figure 2.18: Bar plots for conditional probabilities

As we can observe from the previous plot, the most probable reason for absence among drinkers is dental consultations (28), followed by medical consultations (23). Smokers' absences, however, are mostly due to unknown reasons (0) and laboratory examinations (25).
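
As a cross-check on the steps of the exercise, the same kind of conditional probabilities can be obtained in a single call with pd.crosstab(). The sketch below runs on a made-up toy table (toy and table are hypothetical names) rather than on the real dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for preprocessed_data
toy = pd.DataFrame({
    "Reason for absence": [23, 23, 28, 28, 28, 0],
    "Social drinker": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# normalize="columns" divides each column by its total, so the "Yes"
# column holds P(reason | social drinker)
table = pd.crosstab(toy["Reason for absence"], toy["Social drinker"],
                    normalize="columns")
print(table["Yes"])

# The same value computed as joint probability / marginal probability
mask = toy["Social drinker"] == "Yes"
joint = ((toy["Reason for absence"] == 28) & mask).mean()
print(joint / mask.mean())  # 0.5
```

In the toy table, four entries belong to drinkers, two of which have reason 28, so both routes give 0.5 for that reason.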

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7KQhv.

You can also run this example online at https://packt.live/3d7pFk3. You must execute the entire Notebook in order to get the desired result.

In the previous exercise, we saw how to compute the conditional probabilities of the reason for absence, conditioned on the employee being a social smoker or drinker. Furthermore, we saw that in order to perform the computation, we had to compute the probability of being absent and being a social smoker/drinker. Due to the nature of the problem, computing this value might be difficult, or we may only have one conditional probability (say, P(A|B)) where we actually need P(B|A). In these cases, Bayes' theorem can be used.

Let Ω denote a set of events with probability measure P on Ω. Given two events A and B in Ω with P(B) > 0, Bayes' theorem states the following:

P(A|B) = P(B|A) P(A) / P(B)

Figure 2.19: Bayes' theorem

Before proceeding further, let's look at a practical example of applying Bayes' theorem. Suppose that we have two bags. The first one contains four blue and three red balls, while the second one contains two blue and five red balls. Let's assume that a ball is drawn at random from one of the two bags, and its color is blue. We want to know what the probability is that the ball has been drawn from the first bag. Let's use B1 to denote the "ball is drawn from the first bag" event and B2 to denote the "ball is drawn from the second bag" event. Since the number of balls is equal in both bags, the probability of the two events is equal to 0.5, as follows:

P(B1) = P(B2) = 0.5

Figure 2.20: Probability of both events

If we use A to denote the "a blue ball has been drawn" event, then we have the following:

P(A | B1) = 4/7,  P(A | B2) = 2/7

Figure 2.21: Probability of event A, where a blue ball is drawn

This is because four of the seven balls in the first bag are blue, while only two of the seven balls in the second bag are. Furthermore, based on the defined events, the probability we need to compute translates into P(B1 | A). By applying Bayes' theorem, we get the following:

P(B1 | A) = P(A | B1) P(B1) / P(A) = (4/7 × 1/2) / (4/7 × 1/2 + 2/7 × 1/2) = 2/3

Figure 2.22: Probability of the event that a blue ball is drawn
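
The two-bag example can also be checked numerically. The following is a minimal sketch using exact fractions; the variable names are chosen here for illustration, and P(A) is obtained from the law of total probability:

```python
from fractions import Fraction

# Priors: each bag is equally likely to be the source of the ball
p_b1 = Fraction(1, 2)
p_b2 = Fraction(1, 2)
# Likelihoods of drawing a blue ball from each bag
p_a_given_b1 = Fraction(4, 7)
p_a_given_b2 = Fraction(2, 7)

# Law of total probability gives P(A), the chance of a blue ball
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2
# Bayes' theorem gives P(B1 | A)
p_b1_given_a = p_a_given_b1 * p_b1 / p_a

print(p_a, p_b1_given_a)  # 3/7 2/3
```

Observing a blue ball therefore raises the probability of the first bag from 1/2 to 2/3.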

Now, let's apply Bayes' theorem to our dataset in the following exercise. In addition to applying Bayes' theorem, we will also be using the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is used to determine whether two samples are statistically different from each other, i.e. whether or not they follow the same distribution. We can use the implementation of the Kolmogorov-Smirnov test provided by SciPy, as we will see in the exercise.
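
To get a feeling for what the Kolmogorov-Smirnov test adds to a comparison of means, consider the following sketch. It uses synthetic data only (two normal samples with the same mean but different spread; the names narrow and wide are hypothetical) and compares the two-sample t-test with ks_2samp() from SciPy:

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(0)
# Two synthetic samples with the same mean (0) but different spread
narrow = rng.normal(loc=0.0, scale=1.0, size=2000)
wide = rng.normal(loc=0.0, scale=3.0, size=2000)

# The t-test compares the sample means only
t_res = ttest_ind(narrow, wide)
# The KS test compares the empirical distribution functions
ks_res = ks_2samp(narrow, wide)

print(f"t-test p-value:  {t_res.pvalue:.3f}")
print(f"KS test p-value: {ks_res.pvalue:.3g}")
```

Because the means are equal by construction, the t-test has no real difference to detect, whereas the KS test should return a very small p-value, since the shapes of the two distributions clearly differ.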

Exercise 2.03: Identifying the Probability of Being a Drinker/Smoker, Conditioned to Absence Reason

In this exercise, you will compute the conditional probability of being a social drinker or smoker, conditioned on the reason for absence. In other words (where Ri is the reason for which an employee is absent), we want to compute the probabilities of an employee being a social drinker P(social drinker |Ri), or smoker P(social smoker |Ri), as follows:

P(social drinker | Ri) = P(Ri | social drinker) P(social drinker) / P(Ri)

Figure 2.23: Conditional probability of being a drinker, conditioned to an absence reason Ri

P(social smoker | Ri) = P(Ri | social smoker) P(social smoker) / P(Ri)

Figure 2.24: Conditional probability of being a smoker, conditioned to an absence reason Ri

Execute the code mentioned in the previous section, as well as the previous exercises, before attempting this exercise. Now, follow these steps:

  1. Since you already computed P(Ri | social drinker), P(Ri | social smoker), P(social drinker), and P(social smoker) in the previous exercise, you only need to compute P(Ri) for each reason of absence Ri:
    # compute reason for absence probabilities
    absence_prob = preprocessed_data["Reason for absence"]\
                   .value_counts(normalize=True)
  2. Now that you have all the necessary values, compute the conditional probabilities according to the equations given at the start of this exercise (Figure 2.23 and Figure 2.24):
    # compute conditional probabilities for drinker/smoker
    cond_prob_drinker_smoker = pd.DataFrame(index=range(0,29))
    cond_prob_drinker_smoker["P(social drinker | Absence)"] = \
    cond_prob["P(Absence | social drinker)"]*drinker_prob/absence_prob
    cond_prob_drinker_smoker["P(social smoker | Absence)"] = \
    cond_prob["P(Absence | social smoker)"]*smoker_prob/absence_prob
    plt.figure()
    ax = cond_prob_drinker_smoker.plot.bar(figsize=(10,6))
    ax.set_ylabel("Conditional probability")
    plt.savefig('figs/conditional_probabilities_drinker_smoker.png', \
                format='png', dpi=300)

    The following is the output of the preceding code:

    Figure 2.25: Conditional probabilities of being a drinker/smoker, conditioned to being absent

    As you can see from the resulting plot, the conditional probabilities of being a social drinker/smoker are quite high for some reasons of absence. This is because the number of entries for those reasons is very small; as such, if all the registered employees who were absent for a certain reason are smokers, the probability of being a smoker, once that reason has been registered, will be equal to one (based on the available data).

  3. To complete your analysis on the social drinkers and smokers, analyze the distribution of the hours of absenteeism based on the two classes (being a social drinker/smoker versus not being). A useful type of plot for this type of comparison is the violin plot, which can be produced using the seaborn violinplot() function:
    # create violin plots of the absenteeism time in hours
    plt.figure(figsize=(8,6))
    sns.violinplot(x="Social drinker", y="Absenteeism time in hours", \
                   data=preprocessed_data, order=["No", "Yes"])
    plt.savefig('figs/drinkers_hour_distribution.png', \
                format='png', dpi=300)
    plt.figure(figsize=(8,6))
    sns.violinplot(x="Social smoker", y="Absenteeism time in hours", \
                   data=preprocessed_data, order=["No", "Yes"])
    plt.savefig('figs/smokers_hour_distribution.png', \
                format='png', dpi=300)

    The following is the output of the preceding code:

    Figure 2.26: Violin plots of the absenteeism time in hours for social drinkers

    Figure 2.27: Violin plots of the absenteeism time in hours for social smokers

    As you can observe from Figure 2.26 and Figure 2.27, despite some differences in the outliers, there is no substantial difference in the distribution of absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.

  4. To assess this statement in a rigorous statistical way, perform hypothesis testing on the absenteeism hours (with a null hypothesis stating that the average absenteeism time in hours is the same for drinkers and non-drinkers):
    from scipy.stats import ttest_ind
    hours_col = "Absenteeism time in hours"
    # test mean absenteeism time for drinkers
    drinkers_mask = preprocessed_data["Social drinker"] == "Yes"
    hours_drinkers = preprocessed_data.loc[drinkers_mask, hours_col]
    hours_non_drinkers = preprocessed_data\
                         .loc[~drinkers_mask, hours_col]
    drinkers_test = ttest_ind(hours_drinkers, hours_non_drinkers)
    print(f"Statistic value: {drinkers_test[0]}, \
    p-value: {drinkers_test[1]}")

    The output will be as follows:

    Statistic value: 1.7713833295243993, p-value: 0.07690961828294651
  5. Perform the same test on the social smokers:
    # test mean absenteeism time for smokers
    smokers_mask = preprocessed_data["Social smoker"] == "Yes"
    hours_smokers = preprocessed_data.loc[smokers_mask, hours_col]
    hours_non_smokers = preprocessed_data\
                        .loc[~smokers_mask, hours_col]
    smokers_test = ttest_ind(hours_smokers, hours_non_smokers)
    print(f"Statistic value: {smokers_test[0]}, \
    p-value: {smokers_test[1]}")

    The output will be as follows:

    Statistic value: -0.24277795417700243, p-value: 0.8082448720154971

    As you can see, the p-values of both tests are above the significance level of 0.05, which means that you cannot reject the null hypothesis. In other words, you cannot say that there is a statistically significant difference in the average absenteeism hours between drinkers (and smokers) and non-drinkers (and non-smokers).

    Note that the preceding hypothesis tests only compare means: the null hypothesis was that the average absenteeism hours are equal for drinkers (and smokers) and non-drinkers (and non-smokers). Two groups can have equal averages and still follow different distributions.

  6. Perform a Kolmogorov-Smirnov test to assess the difference in the distributions of two samples:
    # perform Kolmogorov-Smirnov test for comparing the distributions
    from scipy.stats import ks_2samp
    ks_drinkers = ks_2samp(hours_drinkers, hours_non_drinkers)
    ks_smokers = ks_2samp(hours_smokers, hours_non_smokers)
    print(f"Drinkers comparison: statistics={ks_drinkers[0]:.3f}, \
    pvalue={ks_drinkers[1]:.3f}")
    print(f"Smokers comparison:  statistics={ks_smokers[0]:.3f}, \
    pvalue={ks_smokers[1]:.3f}")

    The output will be as follows:

    Drinkers comparison: statistics=0.135, pvalue=0.002
    Smokers comparison:  statistics=0.104, pvalue=0.607

The p-value for the drinkers dataset is lower than the 0.05 significance level, which is strong evidence against the null hypothesis of the two distributions being equal. On the other hand, as the p-value for the smokers dataset is higher than 0.05, you cannot reject the null hypothesis.
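To make the idea behind the Kolmogorov-Smirnov test more concrete: the statistic reported by ks_2samp() is the largest vertical distance between the empirical cumulative distribution functions (ECDFs) of the two samples. The following is a minimal sketch of that statistic only (it does not compute a p-value), written with the standard library. The function name ks_statistic and the two toy samples are hypothetical and are not part of the absenteeism data:

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the largest gap between the ECDFs of two samples."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    max_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough
    for value in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        max_gap = max(max_gap, abs(ecdf_a - ecdf_b))
    return max_gap


# Hypothetical samples with the same mean (4.0) but a different spread
same_mean_narrow = [3, 4, 4, 4, 5]
same_mean_wide = [0, 1, 4, 7, 8]
print(f"KS statistic: {ks_statistic(same_mean_narrow, same_mean_wide):.3f}")
```

The two toy samples have identical means, so a test on the means has nothing to detect, yet the statistic is clearly greater than zero. This is the situation in which a distribution test can reach a different verdict than a test on the means.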

Note

To access the source code for this specific section, please refer to https://packt.live/3hxt3I6.

You can also run this example online at https://packt.live/2BeAweq. You must execute the entire Notebook in order to get the desired result.

In this section, we investigated the relationship between the different reasons for absence, as well as social information about the employees (such as being smokers or drinkers). In the next section, we will analyze the impact of the employees' body mass index on their absenteeism.

Body Mass Index

The Body Mass Index (BMI) is defined as a person's weight in kilograms, divided by the square of their height in meters:

$$\mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^{2}}$$

Figure 2.28: Expression for BMI
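As a quick illustration of the formula, the following sketch computes the BMI for a single person. It assumes that the height is recorded in centimeters and therefore converts it to meters first; the function name compute_bmi and the example values are hypothetical:

```python
def compute_bmi(weight_kg, height_cm):
    """Compute the BMI from a weight in kilograms and a height in centimeters."""
    height_m = height_cm / 100  # the formula expects meters
    return weight_kg / height_m ** 2


# Hypothetical person: 90 kg, 172 cm
print(f"BMI: {compute_bmi(90, 172):.1f}")
```

Forgetting the unit conversion is an easy mistake to make here, as a height passed in centimeters would shrink the result by a factor of 10,000.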

BMI is a universal way to classify people as underweight, healthy weight, overweight, and obese, based on tissue mass (muscle, fat, and bone) and height. The following plot indicates the relationship between weight and height for the various categories:

Figure 2.29: Body Mass Index categories (source: https://en.wikipedia.org/wiki/Body_mass_index)

According to the preceding plot, we can build the four categories (underweight, healthy weight, overweight, and obese) based on the BMI values:

"""
define function for computing the BMI category, based on BMI value
"""
def get_bmi_category(bmi):
    if bmi < 18.5:
        category = "underweight"
    elif bmi >= 18.5 and bmi < 25:
        category = "healthy weight"
    elif bmi >= 25 and bmi < 30:
        category = "overweight"
    else:
        category = "obese"
    return category
# compute BMI category
preprocessed_data["BMI category"] = preprocessed_data\
                                    ["Body mass index"]\
                                    .apply(get_bmi_category)
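As an alternative to the row-wise apply() call, the same binning can be sketched with the pandas cut() function. The BMI values below are hypothetical and were chosen to sit on the category boundaries; note that cut() returns a categorical column rather than plain strings:

```python
import pandas as pd

# Hypothetical BMI values, chosen to hit each boundary
bmi_values = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 38.0])

bins = [float("-inf"), 18.5, 25, 30, float("inf")]
labels = ["underweight", "healthy weight", "overweight", "obese"]

# right=False makes each bin closed on the left, for example [18.5, 25)
categories = pd.cut(bmi_values, bins=bins, labels=labels, right=False)
print(categories.tolist())
```

Setting right=False matters: with the default (right-closed bins), a BMI of exactly 25 would be labeled healthy weight, which disagrees with the thresholds used in get_bmi_category().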

We can plot the number of entries for each category:

# plot number of entries for each category
plt.figure(figsize=(10, 6))
sns.countplot(data=preprocessed_data, x='BMI category', \
              order=["underweight", "healthy weight", \
                     "overweight", "obese"], \
              palette="Set2")
plt.savefig('figs/bmi_categories.png', format='png', dpi=300)

The following is the output of the preceding code:

Figure 2.30: BMI categories

We can see that no entries for the underweight category are present, with the data being almost uniformly distributed among the remaining three categories. This is an alarming indicator, as more than 60% of the entries belong to employees who are either overweight or obese.

Now, let's check how the different BMI categories are related to the reason for absence. More precisely, we would like to see how many employees there are based on their body mass index and their reason for absence. This can be done with the following code:

# plot BMI categories vs Reason for absence
plt.figure(figsize=(10, 16))
ax = sns.countplot(data=preprocessed_data, \
                   y="Reason for absence", hue="BMI category", \
                   hue_order=["underweight", "healthy weight", \
                              "overweight", "obese"], \
                   palette="Set2")
ax.set_xlabel("Number of employees")
plt.savefig('figs/reasons_bmi.png', format='png', dpi=300)

The output will be as follows:

Figure 2.31: Absence reasons, based on BMI category

Unfortunately, no clear pattern arises from the preceding plot. In other words, for each reason for absence, an (almost) equal number of employees with different body mass indexes are present.
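A count plot can hide differences in proportions when the reasons differ a lot in frequency. One way to double-check the visual impression is to tabulate, for each reason, the share of entries per BMI category. The following is a minimal sketch on hypothetical toy data (only the column names mirror the ones used in this chapter):

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the chapter
toy = pd.DataFrame({
    "Reason for absence": ["A", "A", "A", "B", "B", "B", "B"],
    "BMI category": ["obese", "overweight", "obese",
                     "healthy weight", "overweight", "obese",
                     "healthy weight"],
})

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["Reason for absence"], toy["BMI category"],
                     normalize="index")
print(shares.round(2))
```

Rows with similar shares across the BMI categories would support the reading that no clear pattern is present.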

We can also investigate the distribution of absence hours for the different BMI categories:

# plot distribution of absence time, based on BMI category
plt.figure(figsize=(8,6))
sns.violinplot(x="BMI category", \
               y="Absenteeism time in hours", \
               data=preprocessed_data, \
               order=["healthy weight", "overweight", "obese"])
plt.savefig('figs/bmi_hour_distribution.png', format='png')

The output will be as follows:

Figure 2.32: Absence time in hours, based on the BMI category

As we can observe from Figure 2.31 and Figure 2.32, there is no evidence that BMI and obesity levels influence the employees' absenteeism.

Age and Education Factors

Age and education may also influence employees' absenteeism. For instance, older employees might need more frequent medical treatment, while employees with higher education degrees, covering positions of higher responsibility, might be less prone to being absent.

First, let's investigate the correlation between age and absence hours. We will create a regression plot, in which we'll plot the Age column on the x axis and Absenteeism time in hours on the y axis. We'll also include the Pearson's correlation coefficient and its p-value, where the null hypothesis is that the correlation coefficient between the two features is equal to zero:

from scipy.stats import pearsonr
# compute Pearson's correlation coefficient and p-value
pearson_test = pearsonr(preprocessed_data["Age"], \
               preprocessed_data["Absenteeism time in hours"])
"""
create regression plot and add correlation coefficient in the title
"""
plt.figure(figsize=(10, 6))
ax = sns.regplot(x="Age", y="Absenteeism time in hours", \
                 data=preprocessed_data, scatter_kws={"alpha":0.1})
ax.set_title(f"Correlation={pearson_test[0]:.03f} \
| p-value={pearson_test[1]:.03f}")
plt.savefig('figs/correlation_age_hours.png', \
            format='png', dpi=300)

The output will be as follows:

Figure 2.33: Correlation plot for absenteeism time and age

As we can observe from the resulting plot, no significant pattern occurs. Furthermore, the correlation coefficient is extremely small (0.066), and its p-value is above the threshold of 0.05, which is an additional indicator that no relationship is present between the Age and Absenteeism time in hours features.
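For reference, the coefficient returned by pearsonr() is the covariance of the two columns divided by the product of their standard deviations. The following sketch computes it by hand with the standard library; the function name pearson_r and the toy values are hypothetical, and the p-value is not computed:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares; the common 1/n factor cancels out
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ages and absence hours
ages = [28, 33, 38, 41, 50]
absence_hours = [8, 2, 4, 8, 3]
print(f"r = {pearson_r(ages, absence_hours):.3f}")
```

The coefficient always lies between -1 and 1, and it only measures linear association, so a value close to zero does not rule out a nonlinear relationship.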

We can also check whether age has some impact on the reason for absence. We'll perform this analysis in the next exercise.

Exercise 2.04: Investigating the Impact of Age on Reason for Absence

In this exercise, we'll investigate the relationship between the Age feature and the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

  1. First, create a violin plot between the Age and Disease features. This will give you your first insight into the relationship between the two columns:
    # create violin plot between the Age and Disease columns
    plt.figure(figsize=(8,6))
    sns.violinplot(x="Disease", y="Age", data=preprocessed_data)
    plt.savefig('figs/exercise_204_age_disease.png', \
                format='png', dpi=300)

    The output will be as follows:

    Figure 2.34: Violin plot for disease versus age

  2. From Step 1, you can see some differences between the two distributions of age. In fact, for samples with ICD encoded reasons for absence (labeled Yes in the Disease column), you can observe that slightly more samples are present for older employees. To confirm this difference in distributions, perform hypothesis tests on the means and distributions of the two groups:
    """
    get Age entries for employees with Disease == Yes and Disease == No
    """
    disease_mask = preprocessed_data["Disease"] == "Yes"
    disease_ages = preprocessed_data["Age"][disease_mask]
    no_disease_ages = preprocessed_data["Age"][~disease_mask]
    # perform hypothesis test for equality of means
    test_res = ttest_ind(disease_ages, no_disease_ages)
    print(f"Test for equality of means: \
    statistic={test_res[0]:0.3f}, pvalue={test_res[1]:0.3f}")
    # test equality of distributions via Kolmogorov-Smirnov test
    ks_res = ks_2samp(disease_ages, no_disease_ages)
    print(f"KS test for equality of distributions: \
    statistic={ks_res[0]:0.3f}, pvalue={ks_res[1]:0.3f}")

    The output will be as follows:

    Test for equality of means: statistic=0.630, pvalue=0.529
    KS test for equality of distributions: statistic=0.057, pvalue=0.619

    Neither test rejects its null hypothesis, so there is no statistically significant difference between the age distributions of the two groups. Together with the earlier correlation analysis, this suggests that age is an indicator neither of the length of an absence nor of its type.

  3. Now investigate the relationship between age and reason for absence:
    # violin plot of reason for absence vs age
    plt.figure(figsize=(20,8))
    sns.violinplot(x="Reason for absence", y="Age", \
                   data=preprocessed_data)
    plt.savefig('figs/exercise_204_age_reason.png', format='png')

    The output will be as follows:

    Figure 2.35: Violin plot for age and reason for absence

In light of the previously performed analysis, you can conclude that age has no impact on the employees' absenteeism.
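The test for equality of means used in this exercise relies on ttest_ind(), which by default (equal_var=True) computes Student's t statistic with a pooled variance. The following sketch reproduces only that statistic with the standard library, without the p-value; the function name pooled_t_statistic and the toy samples are hypothetical:

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err


# Hypothetical ages for two small groups
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(f"t = {pooled_t_statistic(group_a, group_b):.3f}")
```

The statistic is the difference of the means measured in units of its standard error, which is why a large spread within the groups (as in the absenteeism hours) can make even a visible difference in means statistically insignificant.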

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7jEj6.

You can also run this example online at https://packt.live/3d7q5qD. You must execute the entire Notebook in order to get the desired result.

Now, let's analyze the impact of education level on absenteeism.

Exercise 2.05: Investigating the Impact of Education on Reason for Absence

In this exercise, you will analyze the existing relationship between the Reason for absence and Education columns. You will start by looking at the percentage of employees with a certain educational degree, and then relate those numbers to the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

  1. Before starting the analysis, check the percentage of employees in the data that hold a certain degree:
    # compute percentage of employees per education level
    education_types = ["high_school", "graduate", \
                       "postgraduate", "master_phd"]
    counts = preprocessed_data["Education"].value_counts()
    percentages = preprocessed_data["Education"]\
                  .value_counts(normalize=True)
    for educ_type in education_types:
        print(f"Education type: {educ_type:12s} \
    | Counts : {counts[educ_type]:6.0f} \
    | Percentage: {100*percentages[educ_type]:4.1f}")

    The output will be as follows:

    Education type: high_school  | Counts :    611 | Percentage: 82.6
    Education type: graduate     | Counts :     46 | Percentage:  6.2
    Education type: postgraduate | Counts :     79 | Percentage: 10.7
    Education type: master_phd   | Counts :      4 | Percentage:  0.5

    You can see that most of the employees in the data have a high school degree (82.6%), which means that the data is highly biased toward these employees.

  2. Create a distribution plot of the number of hours of absence, based on the level of education of the employees:
    # distribution of absence hours, based on education level
    plt.figure(figsize=(8,6))
    sns.violinplot(x="Education", y="Absenteeism time in hours",\
                   data=preprocessed_data, \
                   order=["high_school", "graduate", \
                          "postgraduate", "master_phd"])
    plt.savefig('figs/exercise_205_education_hours.png', format='png')

    The output will be as follows:

    Figure 2.36: Violin plot for number of hours of absence for each level of education

  3. It seems most of the extreme cases of absence are among employees with lower education levels. Compute the mean and standard deviation of the absence duration for the different levels of education:
    # compute mean and standard deviation of absence hours
    education_types = ["high_school", "graduate", \
                       "postgraduate", "master_phd"]
    for educ_type in education_types:
        mask = preprocessed_data["Education"] == educ_type
        hours = preprocessed_data["Absenteeism time in hours"][mask]
        mean = hours.mean()
        stddev = hours.std()
        print(f"Education type: {educ_type:12s} | Mean : {mean:.03f} \
    | Stddev: {stddev:.03f}")

    The output will be as follows:

    Education type: high_school  | Mean : 7.190 | Stddev: 14.259
    Education type: graduate     | Mean : 6.391 | Stddev: 6.754
    Education type: postgraduate | Mean : 5.266 | Stddev: 7.963
    Education type: master_phd   | Mean : 5.250 | Stddev: 3.202

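    You can see that the mean of the hours of absence decreases with the education level, while the standard deviation is largest for the high_school group and smallest for the master_phd group (although it does not decrease monotonically). This suggests that highly educated employees tend to have shorter absences. Keep in mind that the master_phd group contains only four entries. Of course, a higher degree of education is not necessarily the cause of this phenomenon; it is better read as an indicator of it.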

  4. Now, check the reasons for absence based on the education level:
    # plot reason for absence, based on education level
    plt.figure(figsize=(10, 16))
    sns.countplot(data=preprocessed_data, y="Reason for absence",\
                  hue="Education", \
                  hue_order=["high_school", "graduate", \
                             "postgraduate", "master_phd"])
    plt.savefig('figs/exercise_205_education_reason.png', format='png')

    The output will be as follows:

    Figure 2.37: Reasons for absence for each level of education

From the preceding plot, you can observe that most of the absences relate to employees with a high_school level of education. This is, of course, due to the fact that most of the employees only have a high school degree (as observed in Step 1). Furthermore, from our analysis in Step 2, we saw that most of the absences that consisted of a greater number of hours were among employees with a high_school education level.
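The per-group loop used in Step 3 of this exercise can also be expressed as a single groupby() aggregation. The following sketch runs on hypothetical toy data (only the column names follow the chapter) and adds the group size, which is worth keeping in view when some groups are very small:

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the chapter
toy = pd.DataFrame({
    "Education": ["high_school", "high_school", "high_school",
                  "graduate", "graduate"],
    "Absenteeism time in hours": [2, 8, 40, 4, 8],
})

# One row per education level with group size, mean and standard deviation
summary = (toy.groupby("Education")["Absenteeism time in hours"]
              .agg(["count", "mean", "std"]))
print(summary)
```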

One question that comes to mind is whether the probability of being absent for more than one working week (40 hours) is greater for employees with a high school degree compared to graduates. In order to address this question, use the definition of conditional probability:

$$P(\text{extreme absence} \mid \text{degree} = \text{high\_school}) = \frac{P(\text{extreme absence} \cap \text{degree} = \text{high\_school})}{P(\text{degree} = \text{high\_school})}$$

Figure 2.38: Conditional probability for extreme absences by employees with a high school degree

$$P(\text{extreme absence} \mid \text{degree} \neq \text{high\_school}) = \frac{P(\text{extreme absence} \cap \text{degree} \neq \text{high\_school})}{P(\text{degree} \neq \text{high\_school})}$$

Figure 2.39: Conditional probability for extreme absences by employees without a high school degree

The following code snippet computes the conditional probabilities:

"""
define threshold for extreme hours of absenteeism and get total number of entries
"""
threshold = 40
total_entries = len(preprocessed_data)
# find entries with Education == high_school
high_school_mask = preprocessed_data["Education"] == "high_school"
# find entries with absenteeism time in hours more than threshold
extreme_mask = preprocessed_data\
               ["Absenteeism time in hours"] > threshold
# compute probability of having high school degree
prob_high_school = len(preprocessed_data[high_school_mask])\
                   /total_entries
# compute probability of having more than high school degree
prob_graduate = len(preprocessed_data[~high_school_mask])\
                /total_entries
"""
compute probability of having high school and being absent for more than "threshold" hours
"""
prob_extreme_high_school = len(preprocessed_data\
                               [high_school_mask & extreme_mask])\
                               /total_entries
"""
compute probability of having more than high school and being absent for more than "threshold" hours
"""
prob_extreme_graduate = len(preprocessed_data\
                            [~high_school_mask & extreme_mask])\
                            /total_entries
# compute and print conditional probabilities
cond_prob_extreme_high_school = prob_extreme_high_school\
                                /prob_high_school
cond_prob_extreme_graduate = prob_extreme_graduate/prob_graduate
print(f"P(extreme absence | degree = high_school) = \
{100*cond_prob_extreme_high_school:3.2f}")
print(f"P(extreme absence | degree != high_school) = \
{100*cond_prob_extreme_graduate:3.2f}")
preprocessed_data.head().T

The output will be as follows:

P(extreme absence | degree = high_school) = 2.29
P(extreme absence | degree != high_school) = 0.78

The preprocessed data now looks as follows:

Figure 2.40: Analysis of data

Note

To access the source code for this specific section, please refer to https://packt.live/3fxhorg.

You can also run this example online at https://packt.live/2YDVBr0. You must execute the entire Notebook in order to get the desired result.

From the preceding computations, we can see that the probability of having an absence of more than 40 hours for employees with a high school education degree is 2.29%, which is approximately three times greater than the same probability for employees with a university degree (0.78%).
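Note that, in the definition of conditional probability, the total number of entries appears in both the numerator and the denominator and therefore cancels out. The estimate reduces to the number of entries satisfying both conditions divided by the number of entries satisfying the condition. The following is a minimal sketch of this shortcut; the function name conditional_probability and the toy flags are hypothetical:

```python
def conditional_probability(event_flags, condition_flags):
    """Estimate P(event | condition) from two parallel lists of booleans."""
    n_condition = sum(condition_flags)
    n_both = sum(e and c for e, c in zip(event_flags, condition_flags))
    return n_both / n_condition


# Hypothetical entries: education flag and extreme-absence flag
is_high_school = [True, True, True, True, False, False]
is_extreme = [True, False, False, False, False, False]
is_not_high_school = [not flag for flag in is_high_school]

p_hs = conditional_probability(is_extreme, is_high_school)
p_other = conditional_probability(is_extreme, is_not_high_school)
print(f"P(extreme | high_school)  = {100 * p_hs:.2f}")
print(f"P(extreme | !high_school) = {100 * p_other:.2f}")
```

With a pandas DataFrame, the same quantity can be read off as the mean of the boolean extreme-absence mask restricted to the rows that satisfy the condition.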

Transportation Costs and Distance to Work Factors

Two possible indicators for absenteeism may also be the distance between home and work (the Distance from Residence to Work column) and transportation costs (the Transportation expense column). Employees who have to travel longer, or whose costs for commuting to work are high, might be more prone to absenteeism.

In this section, we will investigate the relationship between these variables and the absence time in hours. Since we do not believe the aforementioned factors might be indicative of disease problems, we will not consider a possible relationship with the Reason for absence column.

First, let's start our analysis by plotting the previously mentioned columns (Distance from Residence to Work and Transportation expense) against the Absenteeism time in hours column:

# plot transportation costs and distance to work against hours
plt.figure(figsize=(10, 6))
sns.jointplot(x="Distance from Residence to Work", \
              y="Absenteeism time in hours", \
              data=preprocessed_data, kind="reg")
plt.savefig('figs/distance_vs_hours.png', format='png')
plt.figure(figsize=(10, 6))
sns.jointplot(x="Transportation expense", \
              y="Absenteeism time in hours", \
              data=preprocessed_data, kind="reg")
plt.savefig('figs/costs_vs_hours.png', format='png')

Note that, here, we used the seaborn jointplot() function, which not only produces the regression plot between the two variables but also estimates their distribution. The output will be as follows:

Figure 2.41: Regression plot of distance from work versus absenteeism in hours

Figure 2.42: Regression plot of transportation costs versus absenteeism in hours

As we can see, the distributions of Distance from Residence to Work and Transportation expense look close to normal distributions, while the absenteeism time in hours is heavily right-skewed. This makes the comparison between the variables difficult to interpret. One solution to this problem is to transform the data into something close to a normal distribution. A handy way to perform this transformation is to use the Box-Cox or Yeo-Johnson transformations. Both are families of functions that depend on a parameter λ, which is chosen so that the transformed data is as close to normally distributed as possible.

The Box-Cox transformation is defined as follows:

$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}, \quad \lambda \neq 0$$

Figure 2.43: Expression for Box-Cox transformation if λ is not equal to 0

$$y^{(\lambda)} = \ln y, \quad \lambda = 0$$

Figure 2.44: Expression for Box-Cox transformation if λ is equal to 0

The optimal value of the parameter λ is the one that results in the best approximation of a normal distribution. Note that the Box-Cox transformation fails if the data assumes negative values or zero. If this is the case, the Yeo-Johnson transformation can be used:

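$$
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ y \geq 0 \\
\ln(y + 1) & \text{if } \lambda = 0,\ y \geq 0 \\
-\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ y < 0 \\
-\ln(-y + 1) & \text{if } \lambda = 2,\ y < 0
\end{cases}
$$

Figure 2.45: Expression for Yeo-Johnson transformation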

In Python, both transformations can be found in the scipy.stats module (in the boxcox() and yeojohnson() functions, respectively).
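To see what the two transformations do for a fixed value of λ, the following sketch implements them for a single number with the standard library. It is an illustration only: the function names box_cox and yeo_johnson are hypothetical, and, unlike the scipy functions (which, when no λ is passed, also estimate the optimal λ and return it together with the transformed data), the value of λ has to be supplied by the caller:

```python
from math import log


def box_cox(y, lam):
    """Box-Cox transform of a single strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return log(y)
    return (y ** lam - 1) / lam


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value of any sign."""
    if y >= 0:
        if lam == 0:
            return log(y + 1)
        return ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -log(-y + 1)
    return -((-y + 1) ** (2 - lam) - 1) / (2 - lam)


# Zero hours of absence is fine for Yeo-Johnson but not for Box-Cox
print(yeo_johnson(0, 0.5))
print(yeo_johnson(8, 0.5))
```

For non-negative inputs, the Yeo-Johnson transformation is simply the Box-Cox transformation applied to y + 1, which is why it can handle the zeros that make Box-Cox fail.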

Since the Absenteeism time in hours column contains zeros, we will apply the Yeo-Johnson transformation in order to reproduce the plots from Figure 2.41 and Figure 2.42:

# run Yeo-Johnson transformation and recreate previous plots
from scipy.stats import yeojohnson
hours = yeojohnson(preprocessed_data\
                   ["Absenteeism time in hours"].apply(float))
distances = preprocessed_data["Distance from Residence to Work"]
expenses = preprocessed_data["Transportation expense"]
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=distances, y=hours[0], kind="reg")
ax.set_axis_labels("Distance from Residence to Work",\
                   "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed.png', format='png')
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=expenses, y=hours[0], kind="reg")
ax.set_axis_labels("Transportation expense", \
                   "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed.png', format='png')

The output will be as follows:

Figure 2.46: Regression plot of distance from work versus transformed absenteeism in hours

Figure 2.47: Regression plot of transportation costs versus transformed absenteeism in hours

We can also produce kernel density estimation plots (that is, plots that help us visualize the probability density functions of continuous variables) by setting the kind parameter of the jointplot() function to "kde":

# produce KDE plots 
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=distances, y=hours[0], kind="kde")
ax.set_axis_labels("Distance from Residence to Work",\
                   "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed_kde.png', \
            format='png')
plt.figure(figsize=(10, 6))
ax = sns.jointplot(x=expenses, y=hours[0], kind="kde")
ax.set_axis_labels("Transportation expense", \
                   "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed_kde.png', \
            format='png')

The KDE plot for distance from residence to work versus absent hours will be as follows:

Figure 2.48: KDE plot for distance from residence to work versus absent hours

The KDE plot for transport expense versus absent hours will be as follows:

Figure 2.49: KDE plot for transport expense versus absent hours

From Figure 2.46 and Figure 2.47, we can also see that the regression line is almost flat for the Distance from Residence to Work column (which indicates a correlation close to zero) but has a slight upward slope for the Transportation expense column. Therefore, we can expect a small positive correlation in the latter case:

# investigate correlation between the columns
distance_corr = pearsonr(hours[0], distances)
expenses_corr = pearsonr(hours[0], expenses)
print(f"Distances correlation: corr={distance_corr[0]:.3f}, \
pvalue={distance_corr[1]:.3f}")
print(f"Expenses comparison:  corr={expenses_corr[0]:.3f}, \
pvalue={expenses_corr[1]:.3f}")

The output will be as follows:

Distances correlation: corr=-0.000, pvalue=0.999
Expenses comparison: corr=0.113, pvalue=0.002

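These results confirm our observation: there is a slight, but statistically significant, positive correlation between Transportation expense and the transformed Absenteeism time in hours, while the correlation with Distance from Residence to Work is essentially zero.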

Temporal Factors

Factors such as day of the week and month may also be indicators for absenteeism. For instance, employees might prefer to have their medical examinations on Friday, when the workload is lower and the weekend is closer. In this section, we will analyze the impact of the Day of the week and Month of absence columns on the employees' absenteeism.

Let's begin with an analysis of the number of entries for each day of the week and each month:

# count entries per day of the week and month
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Day of the week', \
                   order=["Monday", "Tuesday", \
                          "Wednesday", "Thursday", "Friday"])
ax.set_title("Number of absences per day of the week")
plt.savefig('figs/dow_counts.png', format='png', dpi=300)
plt.figure(figsize=(12, 5))
ax = sns.countplot(data=preprocessed_data, \
                   x='Month of absence', \
                   order=["January", "February", "March", \
                          "April", "May", "June", "July", \
                          "August", "September", "October", \
                          "November", "December", "Unknown"])
ax.set_title("Number of absences per month")
plt.savefig('figs/month_counts.png', format='png', dpi=300)

The output will be as follows:

Figure 2.50: Number of absences per day of the week

The number of absences per month can be visualized as follows:

Figure 2.51: Number of absences per month

From the preceding plots, we can't really see a substantial difference between the different days of the week or months. It seems that fewer absences occur on Thursday, while the month with the most absences is March, but it is hard to say that the difference is significant.

Now, let's focus on the distribution of absence hours among the days of the week and the months of the year. This analysis will be performed in the following exercise.

Exercise 2.06: Investigating Absence Hours, Based on the Day of the Week and the Month of the Year

In this exercise, you will look at how the hours of absence are distributed across the days of the week and the months of the year. Execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

  1. Consider the distribution of absence hours among the days of the week and months of the year:
    # analyze average distribution of absence hours 
    plt.figure(figsize=(12,5))
    sns.violinplot(x="Day of the week", \
                   y="Absenteeism time in hours",\
                   data=preprocessed_data, \
                   order=["Monday", "Tuesday", \
                          "Wednesday", "Thursday", "Friday"])
    plt.savefig('figs/exercise_206_dow_hours.png', \
                format='png', dpi=300)
    plt.figure(figsize=(12,5))
    sns.violinplot(x="Month of absence", \
                   y="Absenteeism time in hours",\
                   data=preprocessed_data, \
                   order=["January", "February", \
                          "March", "April", "May", "June", "July",\
                          "August", "September", "October", \
                          "November", "December", "Unknown"])
    plt.savefig('figs/exercise_206_month_hours.png', \
                format='png', dpi=300)

    The output will be as follows:

    Figure 2.52: Average absent hours during the week

    The violin plot for the average absent hours over the year can be visualized as follows:

    Figure 2.53: Average absent hours over the year

  2. Compute the mean and standard deviation of the absences based on the day of the week:
    """
    compute mean and standard deviation of absence hours per day of the week
    """
    dows = ["Monday", "Tuesday", "Wednesday", \
            "Thursday", "Friday"]
    for dow in dows:
        mask = preprocessed_data["Day of the week"] == dow
        hours = preprocessed_data["Absenteeism time in hours"][mask]
        mean = hours.mean()
        stddev = hours.std()
        print(f"Day of the week: {dow:10s} | Mean : {mean:.03f} \
    | Stddev: {stddev:.03f}")

    The output will be as follows:

    Figure 2.54: Mean and standard deviation of absent hours per day of the week

  3. Similarly, compute the mean and standard deviation based on the month, as follows:
    """
    compute mean and standard deviation of absence hours per month
    """
    months = ["January", "February", "March", "April", "May", \
              "June", "July", "August", "September", "October", \
              "November", "December"]
    for month in months:
        mask = preprocessed_data["Month of absence"] == month
        hours = preprocessed_data["Absenteeism time in hours"][mask]
        mean = hours.mean()
        stddev = hours.std()
        print(f"Month: {month:10s} | Mean : {mean:8.03f} \
    | Stddev: {stddev:8.03f}")

    The output will be as follows:

    Figure 2.55: Mean and standard deviation of absent hours per month


  4. Observe that the average duration of the absences is slightly shorter on Thursday (4.424 hours), while absences during July have the longest average duration (10.955 hours). To determine whether these differences are statistically significant, that is, whether Thursdays and July differ significantly from the rest of the days and months, use the following code snippet:
    # perform statistical test for avg duration difference
    # (import shown here in case ttest_ind was not imported earlier)
    from scipy.stats import ttest_ind
    thursday_mask = preprocessed_data\
                    ["Day of the week"] == "Thursday"
    july_mask = preprocessed_data\
                ["Month of absence"] == "July"
    thursday_data = preprocessed_data\
                    ["Absenteeism time in hours"][thursday_mask]
    no_thursday_data = preprocessed_data\
                       ["Absenteeism time in hours"][~thursday_mask]
    july_data = preprocessed_data\
                ["Absenteeism time in hours"][july_mask]
    no_july_data = preprocessed_data\
                   ["Absenteeism time in hours"][~july_mask]
    thursday_res = ttest_ind(thursday_data, no_thursday_data)
    july_res = ttest_ind(july_data, no_july_data)
    print(f"Thursday test result: statistic={thursday_res[0]:.3f}, \
    pvalue={thursday_res[1]:.3f}")
    print(f"July test result: statistic={july_res[0]:.3f}, \
    pvalue={july_res[1]:.3f}")

    The output will be as follows:

    Thursday test result: statistic=-2.307, pvalue=0.021
    July test result: statistic=2.605, pvalue=0.009
  5. Preview the first rows of the data (transposed for readability) and plot a histogram of the Service time column:
    preprocessed_data.head().T
    preprocessed_data["Service time"].hist()

    The output will be as follows:

    Figure 2.56: Statistics of data


  6. The histogram of the Service time column will look as follows:
    Figure 2.57: Histogram for preprocessed data


Note

To access the source code for this specific section, please refer to https://packt.live/2AIFO1X.

You can also run this example online at https://packt.live/37y5omt. You must execute the entire Notebook in order to get the desired result.
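
The loops in steps 2 and 3 compute one mean and one standard deviation per group. As a possible alternative, the same summary can be sketched with the pandas groupby method. The small toy frame below contains made-up values and only stands in for preprocessed_data; the names toy and stats are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

# Tiny hand-made frame with hypothetical values (NOT the real dataset).
toy = pd.DataFrame({
    "Day of the week": ["Monday", "Monday", "Thursday", "Thursday", "Friday"],
    "Absenteeism time in hours": [8.0, 4.0, 2.0, 4.0, 16.0],
})

# One pass over the data: mean and sample standard deviation per day.
stats = (toy.groupby("Day of the week")["Absenteeism time in hours"]
            .agg(["mean", "std"]))

# Put the days in calendar order; days without any rows appear as NaN.
dows = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
stats = stats.reindex(dows)
print(stats)
```

A group with a single observation (Friday in this toy frame) has an undefined sample standard deviation, so pandas reports NaN for it. Swapping the grouping column for "Month of absence" would give the per-month version.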

Since the p-values from both statistical tests are below the significance level of 0.05, we can conclude the following:

  • There is a statistically significant difference between Thursdays and other days of the week. Absences on Thursday have a shorter duration, on average.
  • Absences during July are, on average, the longest of the year. In this case too, we can reject the null hypothesis of no difference.

From the analysis we've performed in this exercise, we can conclude that our initial observations about the difference in absenteeism during the month of July and on Thursdays are supported by the data. Of course, we cannot claim that the day or the month causes the difference; we can only state that certain trends exist in the data.
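
By default, ttest_ind assumes that both groups have the same variance. If the group standard deviations from steps 2 and 3 differ noticeably, one possible robustness check is Welch's variant of the test, selected with equal_var=False. The sketch below runs both versions on synthetic data; the toy frame, the random seed, and the exponential distribution are assumptions made for illustration and do not reproduce the results above.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for the absenteeism data (NOT the real dataset).
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame({
    "Day of the week": rng.choice(
        ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"], size=500),
    "Absenteeism time in hours": rng.exponential(scale=7.0, size=500),
})

thursday_mask = toy["Day of the week"] == "Thursday"
thursday = toy.loc[thursday_mask, "Absenteeism time in hours"]
others = toy.loc[~thursday_mask, "Absenteeism time in hours"]

# Student's t-test (equal variances assumed), as used in the exercise.
student = ttest_ind(thursday, others)
# Welch's t-test: does not assume equal variances in the two groups.
welch = ttest_ind(thursday, others, equal_var=False)

print(f"Student: statistic={student.statistic:.3f}, "
      f"pvalue={student.pvalue:.3f}")
print(f"Welch:   statistic={welch.statistic:.3f}, "
      f"pvalue={welch.pvalue:.3f}")
```

When the two versions lead to the same decision at the chosen significance level, the conclusion does not hinge on the equal-variance assumption.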

Activity 2.01: Analyzing the Service Time and Son Columns

In this activity, you will extend the analysis of the absenteeism dataset by exploring the impact of two additional columns: Service time and Son.

This activity is based on the techniques that have been presented in this chapter—that is, distribution analysis, hypothesis testing, and conditional probability estimation.

The following steps will help you complete this activity:

  1. Import the data and the necessary libraries:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Analyze the distribution of the Service time column by creating a kernel density estimation plot (use the seaborn.kdeplot() function). Perform a hypothesis test for normality (that is, a Kolmogorov-Smirnov test with the scipy.stats.kstest() function). The KDE plot will be as follows:
    Figure 2.58: KDE plot for service time


  3. Create a violin plot of the Service time column and the Reason for absence column. Draw a conclusion about the observed relationship.

    The output will be as follows:

    Figure 2.59: Violin plot for the Service time column


  4. Create a correlation plot between the Service time and Absenteeism time in hours columns, similar to the one in Figure 2.47. The output will be as follows:
    Figure 2.60: Correlation plot for service time


  5. Analyze the distributions of Absenteeism time in hours for employees with a different number of children (the Son column).

    The output will be as follows:

    Figure 2.61: Distribution of absent time for employees with a different number of children


Note

The solution for this activity can be found via this link.
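
As a hint for step 2, a Kolmogorov-Smirnov normality check with scipy.stats.kstest() might look like the sketch below. It is not the book's solution: the two samples are synthetic, and ks_normality is a hypothetical helper name. Because the mean and standard deviation are estimated from the same sample, the reported p-value is only approximate and tends to be too lenient.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic samples (assumptions for illustration, not the dataset).
rng = np.random.default_rng(seed=42)
normal_sample = rng.normal(loc=12.0, scale=4.0, size=1000)
skewed_sample = rng.exponential(scale=4.0, size=1000)


def ks_normality(values):
    """Standardize the sample and compare it with the standard normal CDF."""
    values = np.asarray(values, dtype=float)
    standardized = (values - values.mean()) / values.std(ddof=1)
    return kstest(standardized, "norm")


normal_res = ks_normality(normal_sample)
skewed_res = ks_normality(skewed_sample)

for name, res in [("normal", normal_res), ("skewed", skewed_res)]:
    print(f"{name:7s} | statistic={res.statistic:.3f} "
          f"| pvalue={res.pvalue:.3f}")
```

To apply the same idea to the activity, the Service time column would be passed to ks_normality in place of the synthetic samples.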

From the analysis of the Son column, we can infer that the absence hours for employees with a greater number of children lie in the range of 10-15 hours. Employees with fewer than three children are absent for a more variable 1-20 hours. Specifically, employees with no children still show absence hours in the range of 10-15 hours, which must be owing to other reasons and opens up a new area of analysis. In contrast, employees with one child are absent for only 5 hours on average. Employees with two children are absent for 15-25 hours on average, which could be analyzed further.
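
Ranges such as those quoted above are easier to compare when read from a per-group summary table rather than from a plot alone. A minimal sketch, assuming a frame with the same two column names and filled here with synthetic values (so the numbers it prints are not those of the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the real dataset): 0 to 4 children per employee.
rng = np.random.default_rng(seed=1)
toy = pd.DataFrame({
    "Son": rng.integers(0, 5, size=300),
    "Absenteeism time in hours": rng.exponential(scale=7.0, size=300),
})

# Count, mean, median, and standard deviation of absence hours
# for each number of children.
summary = (toy.groupby("Son")["Absenteeism time in hours"]
              .agg(["count", "mean", "median", "std"]))
print(summary.round(2))
```

The count column is worth checking first: a group with only a handful of employees can produce an extreme mean that says little about the group as a whole.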

Thus, we have drawn measurable conclusions that help us understand employee behavior in an organization, so that unregulated absenteeism can be tackled and the necessary measures taken to ensure the optimal utilization of human resources.

Summary

In this chapter, we analyzed a dataset containing employees' absences and their relationship to additional health-related and social factors. We introduced various data analysis techniques, such as distribution plots, conditional probabilities, Bayes' theorem, data transformation techniques (such as Box-Cox and Yeo-Johnson), and the Kolmogorov-Smirnov test, and applied these to the dataset.

In the next chapter, we will be analyzing the marketing campaign dataset of a Portuguese bank and the impact it had on acquiring new customers.


Key benefits

  • Get to grips with data analysis by studying use cases from different fields
  • Develop your critical thinking skills by following tried-and-true data analysis
  • Learn how to use conclusions from data analyses to make better business decisions

Description

Businesses today operate online and generate data almost continuously. While not all data in its raw form may seem useful, if processed and analyzed correctly, it can provide you with valuable hidden insights. The Data Analysis Workshop will help you learn how to discover these hidden patterns in your data, to analyze them, and leverage the results to help transform your business. The book begins by taking you through the use case of a bike rental shop. You'll be shown how to correlate data, plot histograms, and analyze temporal features. As you progress, you’ll learn how to plot data for a hydraulic system using the Seaborn and Matplotlib libraries, and explore a variety of use cases that show you how to join and merge databases, prepare data for analysis, and handle imbalanced data. By the end of the book, you'll have learned different data analysis techniques, including hypothesis testing, correlation, and null-value imputation, and will have become a confident data analyst.

Who is this book for?

The Data Analysis Workshop is for programmers who already know how to code in Python and want to use it to perform data analysis. If you are looking to gain practical experience in data science with Python, this book is for you.

What you will learn

  • Get to grips with the fundamental concepts and conventions of data analysis
  • Understand how different algorithms help you to analyze the data effectively
  • Determine the variation between groups of data using hypothesis testing
  • Visualize your data correctly using appropriate plotting points
  • Use correlation techniques to uncover the relationship between variables
  • Find hidden patterns in data using advanced techniques and strategies

Product Details

Publication date : Jul 29, 2020
Length: 626 pages
Edition : 1st
Language : English
ISBN-13 : 9781839211386


Table of Contents

10 Chapters
1. Bike Sharing Analysis
2. Absenteeism at Work
3. Analyzing Bank Marketing Campaign Data
4. Tackling Company Bankruptcy
5. Analyzing the Online Shopper's Purchasing Intention
6. Analysis of Credit Card Defaulters
7. Analyzing the Heart Disease Dataset
8. Analyzing Online Retail II Dataset
9. Analysis of the Energy Consumed by Appliances
10. Analyzing Air Quality

Customer reviews

Rating distribution
4.4 (21 Ratings)
5 star 57.1%
4 star 23.8%
3 star 19%
2 star 0%
1 star 0%

Nithin Feb 15, 2021
Rating: 5 out of 5
I found the book "The Data Analysis Workshop" really helpful. I like the approach the author has taken to go step by step through the process. Every problem follows the path from data exploration and preprocessing to data visualization in Python. The book uses a variety of real-world datasets with well-formatted color visualizations. Code snippets are clear and explain the problem statement with clarity across the entire book. The book covers a lot of important concepts: sklearn, classification, regression, hypothesis testing, clustering, time series, and many more. It also features "Activities" for every section, which help with a better understanding of the problem statement. I would highly recommend this book.
Amazon Verified review
Gennaro Maida, MS, BSBME -- CTO/Co-founder Vital Intelligence, Inc. Oct 05, 2020
Rating: 5 out of 5
This book does exactly what the author intends. There are many real-world examples that are used to build knowledge and expertise in some of today's most powerful Python tools. Jupyter notebooks, matplotlib, seaborn, scikit-learn, numpy, scipy, and pandas are employed, diving into many of the most useful methods. The author takes the time to walk the reader through the steps of solving analytical problems, from data exploration and preprocessing to data visualization. Additionally, the author lightly digs into the statistics and probability behind the analyses to build concepts rather than repetitive memorization. I would recommend this book, not only as a quick reference for intermediate to advanced data scientists, but as a book to introduce beginners (with Python knowledge) to the world of data science. A definite thumbs up.
Amazon Verified review
Richard Dec 28, 2020
Rating: 5 out of 5
Excellent book that clearly explains vital data analysis techniques using real-world examples. The author does a particularly great job highlighting the use of Python-based statistical analyses via Jupyter notebooks by importing pandas, matplotlib, seaborn, scikit-learn, numpy, and scipy. This book also covers some great data visualization techniques, which are must-haves for anyone who is interested in crafting meaningful data storytelling. Overall, this is a great read for anyone who is interested in expanding their data analysis skill set. A highly recommended read!
Amazon Verified review
Alan Dec 09, 2020
Rating: 5 out of 5
This book is a good learning tool for beginners as well as a great reference book for the more experienced. You do need to be versed in Python before reading, although being a programmer in VBA, I was able to get the gist of what the programs were doing (programming fundamentals). It covers many different and increasingly complex datasets and shows how to turn them into meaningful data insights. I would recommend it to anyone that wants to gain serious knowledge about data analysis.
Amazon Verified review
Nhikki v May 04, 2021
Rating: 5 out of 5
Great data analysis with Python book with 10 different scenarios. Datasets are provided too, so you can do your own analysis and the exercises. Perfect to practice and gain experience.
Amazon Verified review