Initial Analysis of the Reason for Absence
Let's start with a simple analysis of the Reason for absence column. We will try to address questions such as: What is the most common reason for absence? Does being a drinker or smoker have some effect on the reasons? Does the distance to work have some effect on them? Starting with questions of this type is often useful when performing data analysis, as it is a good way to build confidence in, and understanding of, the data.
The first thing we are interested in is the overall distribution of the absence reasons in the data, that is, how many entries the dataset contains for each reason for absence. We can easily address this question by using the countplot() function from the seaborn package:
# get the number of entries for each reason for absence
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, x="Reason for absence")
ax.set_ylabel("Number of entries per...
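The numbers behind such a count plot can also be inspected directly, which is handy for checking the chart against the raw counts. The following is a minimal sketch, not part of the original analysis: toy_data and its reason codes are made-up stand-ins for preprocessed_data, and only the column name Reason for absence is taken from the text.

```python
import pandas as pd

# Hypothetical stand-in for preprocessed_data: the reason codes below are
# invented for illustration and are not taken from the real dataset.
toy_data = pd.DataFrame(
    {"Reason for absence": [23, 28, 23, 13, 28, 23, 0]}
)

# Count how many entries there are for each reason, ordered by reason code.
# These are the same per-category counts that a count plot would draw as bars.
reason_counts = toy_data["Reason for absence"].value_counts().sort_index()

# The most common reason is the index label with the largest count.
most_common_reason = reason_counts.idxmax()

print(reason_counts)
print("Most common reason code:", most_common_reason)
```

Applied to the real data, the same two calls on preprocessed_data["Reason for absence"] would give the per-reason counts and the most frequent reason, assuming the column holds one reason code per entry.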