Chapter 25 Interview attendance problem
The dataset consists of details of more than 1,200 candidates and the interviews they attended during the period 2014-2016.
The dataset contains the following columns:
- Date of Interview refers to the day the candidates were scheduled for the interview. The formats vary.
- Client Name refers to the client that gave the recruitment vendor the requisite mandate.
- Industry refers to the sector the client belongs to. (Candidates can job hunt in various industries.)
- Location refers to the current location of the candidate.
- Position to be closed: Niche refers to rare skill sets, while routine refers to more common skill sets.
- Nature of Skillset specifies the particular skill set the position involves.
- Interview Type: There are three interview types:
- Walk-in drives: These are unscheduled. Candidates are either contacted or they come to the interview of their own volition.
- Scheduled: Here the candidates' profiles are screened by the client and, subsequent to this, the vendor fixes an appointment between the client and the candidate.
- Scheduled walk-in: Here the number of candidates is larger, and the candidates are informed beforehand of a tentative date to ascertain their availability. The profiles are screened as in a scheduled interview. In a sense it bears features of both a walk-in and a scheduled interview.
- Name (Cand ID): This is a substitute identifier to keep the candidate's identity a secret.
- Gender
- Candidate Current Location
- Candidate Job Location
- Interview Venue
- Candidate Native location
- Have you obtained the necessary permission to start at the required time?
- I hope there will be no unscheduled meetings.
- Can I call you three hours before the interview and follow up on your attendance for the interview?
- Can I have an alternative telephone number? I assure you that I will not trouble you too much.
- Have you taken a printout of your updated resume? Have you read the JD and understood it?
- Are you clear with the venue details and the landmark?
- Has the call letter been shared?
- Expected Attendance: Whether the candidate was expected to attend the interview. Here the alternatives are yes, no or uncertain.
- Observed Attendance: Whether the candidate attended the interview. This is binary and will form our dependent variable to be predicted.
- Marital Status: Single or married.
Source: https://www.kaggle.com/hugohk/learning-ml-with-caret
25.1 Data reading
library(tidyverse)
interview_attendance <- read_csv("https://hranalyticslive.netlify.com/data/interview.csv")
head(interview_attendance, 5)
# A tibble: 5 x 28
`Date of Interv~ `Client name` Industry Location `Position to be~
<chr> <chr> <chr> <chr> <chr>
1 13.02.2015 Hospira Pharmac~ Chennai Production- Ste~
2 13.02.2015 Hospira Pharmac~ Chennai Production- Ste~
3 13.02.2015 Hospira Pharmac~ Chennai Production- Ste~
4 13.02.2015 Hospira Pharmac~ Chennai Production- Ste~
5 13.02.2015 Hospira Pharmac~ Chennai Production- Ste~
# ... with 23 more variables: `Nature of Skillset` <chr>, `Interview
# Type` <chr>, `Name(Cand ID)` <chr>, Gender <chr>, `Candidate Current
# Location` <chr>, `Candidate Job Location` <chr>, `Interview Venue` <chr>,
# `Candidate Native location` <chr>, `Have you obtained the necessary
# permission to start at the required time` <chr>, `Hope there will be no
# unscheduled meetings` <chr>, `Can I Call you three hours before the
# interview and follow up on your attendance for the interview` <chr>, `Can I
# have an alternative number/ desk number. I assure you that I will not
# trouble you too much` <chr>, `Have you taken a printout of your updated
# resume. Have you read the JD and understood the same` <chr>, `Are you clear
# with the venue details and the landmark.` <chr>, `Has the call letter been
# shared` <chr>, `Expected Attendance` <chr>, `Observed Attendance` <chr>,
# `Marital Status` <chr>, X24 <lgl>, X25 <lgl>, X26 <lgl>, X27 <lgl>,
# X28 <lgl>
25.2 Data cleaning
interview_attendance <- interview_attendance[-1234,] # remove the last row, which contains only missing values
interview_attendance$X24 <- NULL # get rid of the unnecessary columns on the right side
interview_attendance$X25 <- NULL
interview_attendance$X26 <- NULL
interview_attendance$X27 <- NULL
interview_attendance$X28 <- NULL
# Here we create a vector with column titles
mycolnames <- c('date_of_interview','client_name','industry','location','position',
                'skillset','interview_type','name','gender','current_location',
                'cjob_location','interview_venue','cnative_location',
                'permission_obtained','unscheduled_meetings','call_three_hours_before',
                'alternative_number','printout_resume_jd','clear_with_venue',
                'letter_been_shared','expected_attendance','observed_attendance',
                'marital_status')
# Here we assign the previously defined column titles
colnames(interview_attendance) <- mycolnames
rm(mycolnames) # Remove the mycolnames vector, as it is no longer required

# Convert all character values to lower case
interview_attendance <- mutate_all(interview_attendance, tolower)
# Removes all spaces in the observed attendance column
interview_attendance$observed_attendance <- gsub(" ", "", interview_attendance$observed_attendance)

# Removes all spaces in the location column
interview_attendance$location <- gsub(" ", "", interview_attendance$location)

# Removes all spaces in the interview type column
interview_attendance$interview_type <- gsub(" ", "", interview_attendance$interview_type)

# Corrects a typo in the interview type column
interview_attendance$interview_type <- gsub("sceduledwalkin", "scheduledwalkin", interview_attendance$interview_type)

# Removes all spaces in the candidate current location column
interview_attendance$current_location <- gsub(" ", "", interview_attendance$current_location)
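gsub() replaces every match of its pattern, so stripping spaces also collapses spelling variants of the same value into one token. A tiny base-R illustration (toy input string, not from the dataset):

```r
# gsub() replaces all matches, so removing spaces collapses variants
# such as "scheduled  walk in" into a single token (toy input):
cleaned <- gsub(" ", "", "scheduled  walk in")
```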
# Collapse the Yes/No answer columns to a uniform "yes"/"no" coding, just to keep things simple.
colstoyesno <- c(14:22) # Here we define which column numbers to look at.
for (i in 1:length(colstoyesno)){ # Examine all variables in the previously defined columns
  j <- colstoyesno[i]
  interview_attendance[,j][interview_attendance[,j] != "yes"] <- "no"
  interview_attendance[,j][is.na(interview_attendance[,j]) == TRUE] <- "no"
  # With the previous two lines all values different from "yes" become "no",
  # i.e. "uncertain" and NA are set to "no".
}
rm(colstoyesno, i, j) # Remove the three helper objects as a clean-up.
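The loop above recodes the question columns one at a time. With newer dplyr (>= 1.0), the same collapse-to-yes/no can be written as one vectorized step; a minimal sketch on a toy tibble (the column names `q1`/`q2` and values are made up for illustration):

```r
library(dplyr)

answers <- tibble(q1 = c("yes", "uncertain", NA),
                  q2 = c("no", "yes", "na"))

# Anything that is not exactly "yes" (including NA) becomes "no"
answers <- answers %>%
  mutate(across(everything(),
                ~ if_else(!is.na(.x) & .x == "yes", "yes", "no")))
```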
In the following step we inspect the content of each relevant column, identify its unique values, and assign each unique value a number for use in a later step.
dir.create("codefiles_interview_attendance", showWarnings = FALSE)
#detach("package:plyr", unload = TRUE)
for(i in 1:length(colnames(interview_attendance))){
  vvar <- colnames(interview_attendance)[i]
  outdata <- interview_attendance %>% dplyr::group_by(.dots = vvar) %>% dplyr::count(.dots = vvar)
  outdata$idt <- LETTERS[seq(from=1, to=nrow(outdata))]
  outdata$id <- row.names(outdata)
  outfile <- paste0("codefiles_interview_attendance/", vvar, ".csv")
  write.csv(outdata, file=outfile, row.names = FALSE)
}
rm(outdata, i, outfile, vvar)
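The body of the loop is easier to see on a single toy column. The sketch below (column name `gender` and values are illustrative, not from the dataset) builds the same value/count/letter/number table that each CSV file contains:

```r
library(dplyr)

toy <- tibble(gender = c("male", "female", "male", "male"))

outdata <- toy %>% count(gender)                 # one row per unique value, with its frequency
outdata$idt <- LETTERS[seq_len(nrow(outdata))]   # letter code per unique value
outdata$id  <- row.names(outdata)                # numeric code (as character) per unique value
```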
This step uses the CSV files created in the previous step and replaces the words with a number (except for the variable that we would like to predict), in order to prepare the data for machine learning.
colstomap <- c(2:7, 9:21, 23)
library(plyr)
for(i in 1:length(colstomap)){
  j <- colstomap[i]
  vfilename <- paste0("codefiles_interview_attendance/", colnames(interview_attendance)[j], ".csv")
  dfcodes <- read.csv(vfilename, stringsAsFactors=FALSE)
  vfrom <- as.vector(dfcodes[,1])
  vto <- as.vector(dfcodes[,4])
  interview_attendance[,j] <- mapvalues(interview_attendance[,j], from=vfrom, to=vto)
  interview_attendance[,j] <- as.integer(interview_attendance[,j])
}
rm(colstomap, i, j, vfilename, vfrom, vto, dfcodes)
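plyr::mapvalues() does the heavy lifting here: it swaps each element of a vector according to a from/to lookup. A minimal illustration on a toy vector (values made up):

```r
library(plyr)

v <- c("chennai", "bangalore", "chennai")
# Replace each value by its code, then convert the codes to integers
mapped <- mapvalues(v, from = c("bangalore", "chennai"), to = c("1", "2"))
codes <- as.integer(mapped)
```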
Here we pick the data that are needed for the predictor.
interview_attendanceml <- interview_attendance %>% dplyr::select(-date_of_interview, -name)
interview_attendanceml <- interview_attendanceml %>% dplyr::select(client_name:expected_attendance,
                                                                   observed_attendance)
head(interview_attendanceml, 5)
Here we start the real machine learning part. The training dataset is created containing 75% of the observations [interview_attendanceml_train] and the remaining observations are assigned to the test dataset [interview_attendanceml_test].
library(caret) # Calls the caret library
set.seed(144) # Sets a seed for reproducibility

# Creates an index vector covering 75% of the observations
index <- createDataPartition(interview_attendanceml$observed_attendance, p=0.75, list=FALSE)
interview_attendanceml_train <- interview_attendanceml[index,] # Subset of 75% of the data for the training dataset
interview_attendanceml_test <- interview_attendanceml[-index,] # Subset of the remaining 25% of the data for the test dataset
rm(index, interview_attendanceml)
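Note that createDataPartition() samples within the levels of the outcome, so the yes/no balance is preserved in both subsets. A self-contained sketch on a toy outcome vector (the 80/20 class split below is made up, not the book's data):

```r
library(caret)

set.seed(1)
y <- factor(rep(c("yes", "no"), times = c(80, 20)))  # toy outcome: 80% yes, 20% no
idx <- createDataPartition(y, p = 0.75, list = FALSE)
table(y[idx])  # the 75% partition keeps roughly the same 80/20 balance
```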
25.3 Choosing a model
We didn’t want to tune anything and just let the code figure it out. Since the output is 1 for a no-show and 2 for a show, we decided to use the gbm method (gradient boosting machine) from caret to create the algorithm.
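For readers who do want control over resampling and tuning, caret's train() also accepts trControl and tuneGrid arguments. The parameter names below are the ones the gbm method expects, but the values are illustrative only, not the book's choices:

```r
library(caret)

fit_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
gbm_grid <- expand.grid(n.trees = c(50, 100),           # number of boosting iterations
                        interaction.depth = c(1, 3),    # maximum tree depth
                        shrinkage = 0.1,                # learning rate
                        n.minobsinnode = 10)            # minimum observations per terminal node
# The grid would be passed to train() like so (not run here):
# myml_model <- train(x, y, method = "gbm", trControl = fit_control, tuneGrid = gbm_grid)
```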
25.4 Training
We use the train function of the caret library to train the training data set determined in the previous step.
myml_model <- train(interview_attendanceml_train[,1:19], interview_attendanceml_train[,20], method='gbm')
summary(myml_model)

predictions <- predict(object = myml_model, interview_attendanceml_test, type = 'raw')
head(predictions)
print(postResample(pred=predictions, obs=as.factor(interview_attendanceml_test[,20])))
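postResample() reports overall accuracy and kappa; caret's confusionMatrix() additionally breaks the errors down per class, which is useful when no-shows are the costly outcome. A sketch on toy prediction vectors (values made up, not model output):

```r
library(caret)

pred <- factor(c("yes", "no", "yes", "yes"), levels = c("no", "yes"))  # toy predictions
obs  <- factor(c("yes", "no", "no",  "yes"), levels = c("no", "yes"))  # toy observed values
cm <- confusionMatrix(pred, obs)  # counts of correct/incorrect predictions per class
cm
```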