# Chapter 21: Absenteeism in manufacturing
Employee absenteeism is a significant problem for most organizations. Absenteeism excludes paid leave and occasions where an employer has granted an employee time off.
If you’re a manager, supervisor, or team leader, you’ll likely have experienced cases of absenteeism. Staff will be absent from time to time, for reasons such as illness, jury duty, or bereavement. However, when absenteeism is frequent and excessive, it becomes a problem.
Absenteeism at the workplace affects both employers and employees alike.
Some consequences for employers include:

- Reduced productivity levels.
- High administration costs.
- Increased labour costs if you hire replacement temporary workers.
- Understaffing, which could lead to poor customer service.
- Poor morale among colleagues, particularly if employees constantly have to fill in for absent staff and see no consequences for the absences.
To address this issue, HR professionals are encouraged to develop a business case and to devise and implement plans for managing employee attendance in their organizations. The following topics related to managing employee attendance are presented in this article:
- The direct and indirect costs of absenteeism.
- Collecting attendance data for the organization.
- Measuring the cost of absenteeism.
- Calculating rates of absenteeism.
- Strategies aimed at addressing the organization’s particular attendance issues.
- Legal issues that may be implicated in designing and implementing absence control strategies.
## 1. Define the goal (or HR business problem/issue)
A hypothetical company MFG has decided that it needs to do something about absenteeism. To start with, it wants answers to the following questions:
- What is its rate of absenteeism?
- Does anyone have excessive absenteeism?
- Is it the same across the organization?
- Does it vary by gender?
- Does it vary by length of service or age? Its initial guess is that age and length of service may be related to absenteeism rates.
- Can it predict next year’s absenteeism?
- If so, how well can it predict?
- Can we reduce our absenteeism?
If MFG can make future People Management decisions driven by what the data tells it, then it will feel it has started the People Analytics journey.
## 2. Collect and Manage Data
Let us suppose this is a skunkworks project: formal, separate resources have not been identified for this initiative, and only an initial look at recent data is possible. The HRIS system is able to provide some rudimentary information, covering absences only for 2015. It was able to generate the following information as a CSV (comma-separated values) file:
- EmployeeNumber
- Surname
- GivenName
- Gender
- City
- JobTitle
- DepartmentName
- StoreLocation
- Division
- Age
- LengthService
- AbsentHours
- BusinessUnit
First, ensure all the needed libraries are installed and loaded.
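The package list below is a sketch inferred from the functions used later in this chapter; install any that are missing with install.packages() before loading:

```r
library(tidyverse)     # read_csv(), dplyr pipelines, ggplot2
library(car)           # Anova()
library(rpart)         # regression trees
library(rattle)        # fancyRpartPlot()
library(caret)         # createDataPartition(), train(), trainControl()
library(modelr)        # add_predictions()
library(scatterplot3d) # 3D scatterplots
```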
### Let’s read in the data provided
MFGEmployees <- read_csv("https://hranalytics.netlify.com/data/MFGEmployees4.csv")
The first thing we should do is check the quality of the data. Data will rarely be clean or perfect when we receive it. Questionable data should either be corrected (preferred) or deleted.
summary(MFGEmployees)
EmployeeNumber Surname GivenName Gender
Min. : 1 Length:8336 Length:8336 Length:8336
1st Qu.:2085 Class :character Class :character Class :character
Median :4168 Mode :character Mode :character Mode :character
Mean :4168
3rd Qu.:6252
Max. :8336
City JobTitle DepartmentName StoreLocation
Length:8336 Length:8336 Length:8336 Length:8336
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Division Age LengthService AbsentHours
Length:8336 Min. : 3.5 Min. : 0.0 Min. : 0.0
Class :character 1st Qu.:35.3 1st Qu.: 3.6 1st Qu.: 19.1
Mode :character Median :42.1 Median : 4.6 Median : 56.0
Mean :42.0 Mean : 4.8 Mean : 61.3
3rd Qu.:48.7 3rd Qu.: 5.6 3rd Qu.: 94.3
Max. :77.9 Max. :43.7 Max. :272.5
BusinessUnit
Length:8336
Class :character
Mode :character
The only thing that stands out initially is that Age has some questionable data: someone who is about 3 and someone who is nearly 78. The range for the purposes of this example should be 18 to 65. Normally you would want to clean the data by obtaining the correct information and changing it.
### For expediency we will delete the problem records in the dataset
MFGEmployees <-
  MFGEmployees %>%
  filter(Age >= 18 & Age <= 65)
Now let’s summarize again with the cleaned-up data:
summary(MFGEmployees)
EmployeeNumber Surname GivenName Gender
Min. : 1 Length:8165 Length:8165 Length:8165
1st Qu.:2081 Class :character Class :character Class :character
Median :4166 Mode :character Mode :character Mode :character
Mean :4165
3rd Qu.:6245
Max. :8336
City JobTitle DepartmentName StoreLocation
Length:8165 Length:8165 Length:8165 Length:8165
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Division Age LengthService AbsentHours
Length:8165 Min. :18.2 Min. : 0.1 Min. : 0.0
Class :character 1st Qu.:35.5 1st Qu.: 3.6 1st Qu.: 20.1
Mode :character Median :42.1 Median : 4.6 Median : 55.9
Mean :42.0 Mean : 4.8 Mean : 60.5
3rd Qu.:48.5 3rd Qu.: 5.6 3rd Qu.: 93.4
Max. :65.0 Max. :43.7 Max. :252.2
BusinessUnit
Length:8165
Class :character
Mode :character
### Transform the data
Now calculate the absenteeism rate by dividing the absent hours by the total standard hours for the year (52 weeks/year × 40 hours/week = 2080 hours/year):
MFGEmployees <-
  MFGEmployees %>%
  mutate(AbsenceRate = AbsentHours / 2080 * 100)
str(MFGEmployees)
tibble [8,165 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ EmployeeNumber: num [1:8165] 1 2 3 4 5 6 7 8 9 10 ...
$ Surname : chr [1:8165] "Gutierrez" "Hardwick" "Delgado" "Simon" ...
$ GivenName : chr [1:8165] "Molly" "Stephen" "Chester" "Irene" ...
$ Gender : chr [1:8165] "F" "M" "M" "F" ...
$ City : chr [1:8165] "Burnaby" "Courtenay" "Richmond" "Victoria" ...
$ JobTitle : chr [1:8165] "Baker" "Baker" "Baker" "Baker" ...
$ DepartmentName: chr [1:8165] "Bakery" "Bakery" "Bakery" "Bakery" ...
$ StoreLocation : chr [1:8165] "Burnaby" "Nanaimo" "Richmond" "Victoria" ...
$ Division : chr [1:8165] "Stores" "Stores" "Stores" "Stores" ...
$ Age : num [1:8165] 32 40.3 48.8 44.6 35.7 ...
$ LengthService : num [1:8165] 6.02 5.53 4.39 3.08 3.62 ...
$ AbsentHours : num [1:8165] 36.6 30.2 83.8 70 0 ...
$ BusinessUnit : chr [1:8165] "Stores" "Stores" "Stores" "Stores" ...
$ AbsenceRate : num [1:8165] 1.76 1.45 4.03 3.37 0 ...
- attr(*, "spec")=
.. cols(
.. EmployeeNumber = col_double(),
.. Surname = col_character(),
.. GivenName = col_character(),
.. Gender = col_character(),
.. City = col_character(),
.. JobTitle = col_character(),
.. DepartmentName = col_character(),
.. StoreLocation = col_character(),
.. Division = col_character(),
.. Age = col_double(),
.. LengthService = col_double(),
.. AbsentHours = col_double(),
.. BusinessUnit = col_character()
.. )
We can now see that our metric, AbsenceRate, has been calculated and added to the data.
### Explore The Data
Part of collecting and managing data is exploratory data analysis (EDA).
Let’s start with bar graphs of some of the categorical data:
MFGEmployees %>%
  ggplot() +
  aes(x = BusinessUnit) +
  geom_bar() +
  labs(x = "Business Units",
       y = "Count",
       title = "Employee Count By Business Units") +
  theme_minimal() +
  coord_flip()
MFGEmployees %>%
  ggplot() +
  aes(x = Gender) +
  geom_bar() +
  labs(x = "Gender",
       y = "Count",
       title = "Employee Count By Gender") +
  theme_minimal() +
  coord_flip()
MFGEmployees %>%
  ggplot() +
  aes(x = Division) +
  geom_bar() +
  labs(x = "Division",
       y = "Count",
       title = "Employee Count By Division") +
  theme_minimal() +
  coord_flip()
Let’s get some of our questions answered through this exploratory analysis.
First of all, what is our absenteeism rate?
MFGEmployees %>%
  summarise(median = median(AbsenceRate),
            mean = mean(AbsenceRate)) # Median and mean absence rate
# A tibble: 1 x 2
median mean
<dbl> <dbl>
1 2.68572 2.90726
means <- aggregate(AbsenceRate ~ 1, MFGEmployees, mean)

ggplot() +
  geom_boxplot(aes(y = AbsenceRate, x = 1), data = MFGEmployees) +
  coord_flip() +
  geom_text(data = round(means, digits = 3),
            aes(y = AbsenceRate, x = 1, label = AbsenceRate),
            check_overlap = TRUE) + # Add the mean absence rate as a text label
  theme_minimal()
The average absence rate is about 2.91%.
Does anyone have excessive absenteeism?
The boxplot shows the median (the line inside the box), the interquartile range (the box itself), and the mean overlaid as a three-digit text label (2.907). Observations beyond 1.5 times the interquartile range from the box show up as dots. So, at least under that definition of outliers, some people show far more absenteeism than the vast majority of employees.
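If we want names to go with those dots, here is a minimal sketch, assuming the same 1.5 × IQR fence that ggplot2’s boxplot uses:

```r
# Flag employees whose absence rate lies above the boxplot's upper fence (Q3 + 1.5 * IQR)
upper_fence <- quantile(MFGEmployees$AbsenceRate, 0.75) +
  1.5 * IQR(MFGEmployees$AbsenceRate)

MFGEmployees %>%
  filter(AbsenceRate > upper_fence) %>%
  select(EmployeeNumber, AbsenceRate) %>%
  arrange(desc(AbsenceRate))
```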
Does it vary across the organization?
When you conduct a piece of quantitative research, you are inevitably attempting to answer a research question or hypothesis that you have set. One method of evaluating this research question is via a process called hypothesis testing, which is sometimes also referred to as significance testing.
In order to undertake hypothesis testing you need to express your research hypothesis as a null and an alternative hypothesis. The null and alternative hypotheses are statements regarding the differences or effects that occur in the population. You will use your sample to test which statement (i.e., the null hypothesis or the alternative hypothesis) is most likely (although technically, you test the evidence against the null hypothesis). So, with respect to our example, the null and alternative hypotheses will reflect statements about absence rates across the organization.
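For the gender comparison that follows, for example, the hypotheses read:

$$H_0: \mu_F = \mu_M \qquad \text{versus} \qquad H_1: \mu_F \neq \mu_M$$

where $\mu_F$ and $\mu_M$ denote the mean absence rates of female and male employees.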
In the following we will use ANOVA as the statistical test.
ggplot() +
geom_boxplot(aes(y = AbsenceRate, x = Gender), data = MFGEmployees) +
coord_flip() +
theme_minimal()
AnovaModel.1 <- lm(AbsenceRate ~ Gender, data = MFGEmployees) %>%
  Anova()
AnovaModel.1
Anova Table (Type II tests)
Response: AbsenceRate
Sum Sq Df F value Pr(>F)
Gender 496 1 97.8 <0.0000000000000002 ***
Residuals 41379 8163
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# means
MFGEmployees %>% group_by(Gender) %>% summarise(avg = mean(AbsenceRate))
# A tibble: 2 x 2
Gender avg
<chr> <dbl>
1 F 3.15762
2 M 2.66481
The output of the function is a classical ANOVA table with the following columns:

- Df = degrees of freedom
- Sum Sq = the sum of squares (for the factor, and for the residuals)
- F value = the Fisher test statistic, computed as (mean square of the factor) / (mean square of the residuals)
- Pr(>F) = the p-value
Please note that a value like 2e-16 actually means 2 × 10^-16. It is just the way R prints numbers that are either very big or very small.
Since the p-value is much less than 0.05, we reject the null hypothesis and accept the alternative hypothesis, i.e. absence levels differ significantly between genders.
ggplot() +
geom_boxplot(aes(y = AbsenceRate, x = Division), data = MFGEmployees) +
coord_flip() +
theme_minimal()
AnovaModel.2 <- lm(AbsenceRate ~ Division, data = MFGEmployees) %>%
  Anova()
AnovaModel.2
Anova Table (Type II tests)
Response: AbsenceRate
Sum Sq Df F value Pr(>F)
Division 91 5 3.56 0.0032 **
Residuals 41783 8159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# means
MFGEmployees %>% group_by(Division) %>% summarise(avg = mean(AbsenceRate))
# A tibble: 6 x 2
Division avg
<chr> <dbl>
1 Executive 2.32358
2 FinanceAndAccounting 1.92189
3 HumanResources 2.65174
4 InfoTech 1.92600
5 Legal 2.47172
6 Stores 2.92086
Absence levels also differ significantly across divisions.
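A significant overall ANOVA does not tell us which divisions differ from which. A minimal follow-up sketch (not part of the original analysis) using Tukey’s HSD for pairwise comparisons:

```r
# Pairwise division comparisons with family-wise error control
TukeyHSD(aov(AbsenceRate ~ Division, data = MFGEmployees))
```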
AnovaModel.3 <- lm(AbsenceRate ~ Division * Gender, data = MFGEmployees) %>%
  Anova()
AnovaModel.3
Anova Table (Type II tests)
Response: AbsenceRate
Sum Sq Df F value Pr(>F)
Division 92 5 3.61 0.0029 **
Gender 496 1 97.94 <0.0000000000000002 ***
Division:Gender 5 5 0.18 0.9708
Residuals 41283 8153
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# means
MFGEmployees %>% group_by(Division, Gender) %>% summarise(avg = mean(AbsenceRate))
# A tibble: 12 x 3
# Groups: Division [6]
Division Gender avg
<chr> <chr> <dbl>
1 Executive F 2.97642
2 Executive M 1.77955
3 FinanceAndAccounting F 2.17280
4 FinanceAndAccounting M 1.63408
5 HumanResources F 3.01449
6 HumanResources M 2.21431
7 InfoTech F 3.29811
8 InfoTech M 1.77354
9 Legal F 3.29811
10 Legal M 2.05853
11 Stores F 3.16905
12 Stores M 2.68079
Division and gender each remain significant as main effects, but their interaction is not significant (p = 0.97): the gender gap in absence rates is roughly the same in every division.
These are just a handful of the categorical summaries we could do.
Does AbsenceRate vary by length of service and age?
Scatterplots and correlations help answer this.
# Basic scatterplot
MFGEmployees %>%
  ggplot() +
  aes(x = Age, y = AbsenceRate) +
  geom_point(colour = "blue") +
  theme_minimal()
cor(MFGEmployees$Age, MFGEmployees$AbsenceRate)
[1] 0.825
There is a strong correlation between Age and AbsenceRate (r = 0.825).
MFGEmployees %>%
  ggplot() +
  aes(x = LengthService, y = AbsenceRate) +
  geom_point(colour = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  theme_minimal()
`geom_smooth()` using formula 'y ~ x'
cor(MFGEmployees$LengthService, MFGEmployees$AbsenceRate)
[1] -0.0467
There is not a strong correlation between LengthService and AbsenceRate (r = -0.047).
MFGEmployees %>%
  ggplot() +
  aes(x = Age, y = LengthService) +
  geom_point(colour = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  theme_minimal()
`geom_smooth()` using formula 'y ~ x'
cor(MFGEmployees$Age, MFGEmployees$LengthService)
[1] 0.0562
There is not much correlation between age and length of service either.
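Rather than checking these one pair at a time, here is a compact sketch computing the full correlation matrix of the numeric variables:

```r
# Correlation matrix of the numeric variables in one call
MFGEmployees %>%
  select(Age, LengthService, AbsenceRate) %>%
  cor() %>%
  round(3)
```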
So far we have done data exploration; now we will start building models and answering the more difficult questions.
## 3. Build The Model

One of the questions asked in the goal-definition step was whether it is possible to predict absenteeism.
AbsenceRate is a numeric continuous value. In the ‘build the model’ step we have to choose which models/statistical algorithms to use. Predicting a numeric continuous value suggests a couple of models that could be brought to bear: regression trees and linear regression. There are many more, but for the purposes of this article we will look at these two.
### 3.1 Regression Trees

Regression trees allow the use of both categorical and numeric values as predictors. Let’s choose the following data as potential predictors in this analysis:
- Gender
- Department Name
- Store Location
- Division
- Age
- Length of Service
- Business Unit
AbsenceRate will be the ‘target’, the thing to be predicted.
<- c("Gender", "DepartmentName", "StoreLocation", "Division",
MYinput "Age", "LengthService", "BusinessUnit")
<- "AbsenceRate"
MYtarget
library(rpart, quietly = TRUE)
set.seed(crv$seed) # crv$seed is set by rattle-generated scripts; any fixed seed works

MYrpart <- rpart(AbsenceRate ~ .,
                 data = MFGEmployees[, c(MYinput, MYtarget)],
                 method = "anova",
                 parms = list(split = "information"),
                 control = rpart.control(minsplit = 10,
                                         maxdepth = 10,
                                         usesurrogate = 0,
                                         maxsurrogate = 0))
fancyRpartPlot(MYrpart, main="Decision Tree MFGEmployees and AbsenceRate")
The regression tree shows that age is a big factor in determining absence rate, with gender playing a small part in one of the age ranges (roughly 43 to 52, where males have a lower absence rate). Almost all the categorical information other than gender does not appear to help the prediction.
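To double-check what the plot suggests, two standard rpart accessors (a quick sketch, not part of the original write-up) show the complexity-parameter table and the tree’s own ranking of predictors:

```r
# Complexity-parameter table and variable importance for the fitted tree
printcp(MYrpart)
MYrpart$variable.importance
```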
Now let’s look at linear regression as another model. The restriction we adopt here is to use only non-categorical variables. Categorical variables can be made numeric through transformation (for example, dummy coding), but that is beyond the scope of this article.
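For the curious, base R’s model.matrix() gives a glimpse of that transformation; a sketch only, not used further in this chapter:

```r
# Dummy-code Gender: "M" becomes a 0/1 indicator column
head(model.matrix(~ Gender, data = MFGEmployees))
```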
### 3.2 Linear Regression
In linear regression, then, we will need to restrict it to numeric variables:
- Age
- Length of Service
being used to predict absence rate.
# Linear Regression Model
RegressionCurrentData <- lm(AbsenceRate ~ Age + LengthService, data = MFGEmployees)

summary(RegressionCurrentData)
Call:
lm(formula = AbsenceRate ~ Age + LengthService, data = MFGEmployees)
Residuals:
Min 1Q Median 3Q Max
-5.358 -0.847 -0.023 0.852 5.103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.19021 0.06896 -75.3 <0.0000000000000002 ***
Age 0.20259 0.00151 134.2 <0.0000000000000002 ***
LengthService -0.08531 0.00565 -15.1 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.26 on 8162 degrees of freedom
Multiple R-squared: 0.689, Adjusted R-squared: 0.689
F-statistic: 9.03e+03 on 2 and 8162 DF, p-value: <0.0000000000000002
The summary shows an adjusted R-squared of 0.689, which means approximately 69% of the variance is accounted for by age and length of service. Both variables are significant, with Pr(>|t|) values below 2e-16. These results use the entirety of the existing data to predict itself.
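Reading the coefficients off the summary, the fitted model is approximately:

$$\widehat{\text{AbsenceRate}} \approx -5.19 + 0.203 \times \text{Age} - 0.085 \times \text{LengthService}$$

So each additional year of age adds about 0.2 percentage points to the predicted absence rate, while each additional year of service subtracts about 0.09.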
Graphically it looks like this:
#2D plot of Age and AbsenceRate
ggplot() +
geom_point(aes(x = Age,y = AbsenceRate),data=MFGEmployees) +
geom_smooth(aes(x = Age,y = AbsenceRate),data=MFGEmployees,method = 'lm') +
theme_minimal()
`geom_smooth()` using formula 'y ~ x'
# 3D scatterplot of Age and Length of Service with Absence Rate,
# with colouring, vertical lines and a regression plane
library(scatterplot3d)

s3d <- scatterplot3d(MFGEmployees$Age, MFGEmployees$LengthService, MFGEmployees$AbsenceRate,
                     pch = 16, highlight.3d = TRUE,
                     type = "h", main = "Absence Rate By Age And Length of Service")
fit <- lm(MFGEmployees$AbsenceRate ~ MFGEmployees$Age + MFGEmployees$LengthService)
s3d$plane3d(fit)
## 4. Evaluate and Critique the Model
Up until now we have concentrated on producing a couple of models. The effort so far has had one weakness: we have used all of our data for 2015 to generate the models. They can both predict, but the predictions are based on existing data that are already known. We don’t know how well the models will predict on data they haven’t seen yet.
To evaluate and critique the models, we need to train each model using part of the data and hold out a portion to test on. We will divide the data into 10 parts, using 9 parts as training data and 1 part as testing data, and alternate which are the 9 and the 1, so that each of the 10 parts serves as training data 9 times and as testing data once.
The R “caret” library helps us do that. We will run both a regression tree and linear regression and compare how they do against each other.
First the Linear Regression
library(caret)
set.seed(998)
inTraining <- createDataPartition(MFGEmployees$BusinessUnit, p = .75, list = FALSE)

training <- MFGEmployees[inTraining, ]
testing  <- MFGEmployees[-inTraining, ]
fitControl <- trainControl(method = "repeatedcv", # 10-fold CV
                           number = 10,
                           repeats = 10)          # repeated ten times
set.seed(825)
lmFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
                method = "lm",
                trControl = fitControl)
lmFit1
Linear Regression
6124 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 5511, 5511, 5512, 5512, 5511, 5512, ...
Resampling results:
RMSE Rsquared MAE
1.27 0.691 1.01
Tuning parameter 'intercept' was held constant at a value of TRUE
Testingdatasetandpredictions <- testing %>% add_predictions(lmFit1, type = "raw")

Testingdatasetandpredictions$pred[Testingdatasetandpredictions$pred < 0] <- 0 # Set negative predictions to zero
The Rsquared shows a value of approximately 0.69, which means that even when sampling different parts of the data in 10-fold cross-validation, the use of age and length of service seems fairly robust so far.
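We also held back 25% of the data that the model never trained on. A sketch of checking against that holdout, using caret’s postResample() helper:

```r
# RMSE, R-squared and MAE on the untouched 25% holdout
postResample(pred = Testingdatasetandpredictions$pred,
             obs  = Testingdatasetandpredictions$AbsenceRate)
```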
Next the decision tree. The first time with just the numeric variables.
set.seed(825)
rpartFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
                   method = "rpart",
                   trControl = fitControl,
                   maxdepth = 5)
rpartFit1
CART
6124 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 5511, 5511, 5512, 5512, 5511, 5512, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.0633 1.43 0.606 1.14
0.0945 1.56 0.534 1.26
0.4892 1.96 0.474 1.61
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.0633.
You will notice that the decision tree with 10-fold cross-validation didn’t perform as well, with an Rsquared of approximately 0.6.
The second time, with the original categorical and numeric variables used:
set.seed(825)
rpartFit2 <- train(AbsenceRate ~ Gender + DepartmentName + StoreLocation + Division +
                     Age + LengthService + BusinessUnit,
                   data = training,
                   method = "rpart",
                   trControl = fitControl,
                   maxdepth = 5)
rpartFit2
CART
6124 samples
7 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 5511, 5511, 5512, 5512, 5511, 5512, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.0633 1.43 0.606 1.14
0.0945 1.56 0.534 1.26
0.4892 1.96 0.474 1.61
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.0633.
Here, when you include all the originally used variables in 10-fold cross-validation, the Rsquared changes little and is still around 0.6.
So far, the linear regression is performing better.
## 5. Present Results and Document
Presenting results and documenting them is something R helps with. You may not have realized it, but the R Markdown language has been used to create the full layout of these two blog articles. HTML, PDF and Word formats can be produced.
R Markdown lets the reader see exactly what you have been doing, so that an independent person can replicate your results and confirm what you have done. The R code/commands, the statistical results and the graphics are all included in the Absenteeism-Part.Rmd file.
## 6. Deploy Model
Once you have evaluated your model(s) and chosen to use them, they need to be deployed so that they can be used. At the simplest level, ‘deploy’ can mean using the ‘predict’ function in R (where applicable) in conjunction with your model.
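For instance, a minimal sketch for a single hypothetical employee (a 45-year-old with 5 years of service; the numbers are illustrative only):

```r
# Predicted absence rate (%) for one hypothetical employee
predict(lmFit1, newdata = data.frame(Age = 45, LengthService = 5))
```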
Can it predict next year’s absenteeism?
Let’s predict the 2016 Absenteeism from the 2015 model.
If we make the simplifying assumption that nobody quits and nobody new comes in, we can take the 2015 data and add 1 to age and 1 to years of service for an approximation of new 2016 data before we get to 2016.
#Apply model
#Generate 2016 data
Absence2016Data <- MFGEmployees
Absence2016Data$Age <- Absence2016Data$Age + 1
Absence2016Data$LengthService <- Absence2016Data$LengthService + 1

Absence2016Data <- Absence2016Data %>% add_predictions(lmFit1, type = "raw")
Absence2016Data$pred[Absence2016Data$pred < 0] <- 0 # Set negative predictions to zero
To get a single estimate for 2016, we take the mean of the predicted absence rates.
mean(Absence2016Data$pred)
[1] 3.08
mean(MFGEmployees$AbsenceRate)
[1] 2.91
The first figure above is the 2016 prediction, the second is the 2015 actual for comparison.
If so, how well can it predict?
As mentioned previously, about 69% of the variation is accounted for by a linear regression model using age and length of service.
Can we reduce our absenteeism?
On the surface, under this model only reducing age or increasing length of service would reduce absenteeism.
Obviously, absenteeism is much more complex than just the rudimentary data we have collected. A serious look at this metric and problem would require more, and different kinds of, data. As mentioned before, the raw data used in this article and analysis is entirely contrived to illustrate an example.