Chapter 25 Interview attendance problem

The dataset consists of details of nore than 1200 candidates and the interviews they have attended during the course of the period 2014-2016.

The following are the variables columns:

Date of Interview refers to the day the candidates were scheduled for the interview. The formats vary.
Client that gave the recruitment vendor the requisite mandate
Industry refers to the sector the client belongs to (Candidates can job hunt in vrious industries.)
Location refers to the current location of the candidate.
Position to be closed: Niche refers to rare skill sets, while routine refers to more common skill sets.
Nature of Skillset refers to the skill the client has and specifies the same.
Interview Type: There are three interview types:
1. Walk in drives - These are unscheduled. Candidates are either contacted or they come to the interview on their own volition,
2. Scheduled - Here the candidates profiles are screened by the client and subsequent to this, the vendor fixes an appointment between the client and the candidate.
3. The third one is a scheduled walkin. Here the number of candidates is larger and the candidates are informed beforehand of a tentative date to ascertain their availability. The profiles are screened as in a scheduled interview. In a sense it bears features of both a walk-in and a scheduled interview.
Name( Cand ID) This is a substitute to keep the candidates identity a secret
Gender
Candidate Current Location
Candidate Job Location
Interview Venue
Candidate Native location
Have you obtained the necessary permission to start at the required time?
I hope there will be no unscheduled meetings.
Can I call you three hours before the interview and follow up on your attendance for the interview?
Can I have an alternative telephone number? I assure you that I will not trouble you too much.
Have you taken a printout of your updated resume? Have you read the JD and understood it?
Are you clear with the venue details and the landmark?
Has the call letter been shared
Expected Attendance: Whether the candidate was expected to attend the interview. Here the alternatives are yes, no or uncertain.
Observed Attendance: Whether the candidate attended the interview. This is binary and will form our dependent variable to be predicted.
Marital Status: Single or married.

Source: https://www.kaggle.com/hugohk/learning-ml-with-caret

25.1 Data reading

library(tidyverse)

interview_attendance <- read_csv("https://hranalyticslive.netlify.com/data/interview.csv")

head(interview_attendance, 5)

# A tibble: 5 x 28
  `Date of Interv~ `Client name` Industry Location `Position to be~
  <chr>            <chr>         <chr>    <chr>    <chr>           
1 13.02.2015       Hospira       Pharmac~ Chennai  Production- Ste~
2 13.02.2015       Hospira       Pharmac~ Chennai  Production- Ste~
3 13.02.2015       Hospira       Pharmac~ Chennai  Production- Ste~
4 13.02.2015       Hospira       Pharmac~ Chennai  Production- Ste~
5 13.02.2015       Hospira       Pharmac~ Chennai  Production- Ste~
# ... with 23 more variables: `Nature of Skillset` <chr>, `Interview
#   Type` <chr>, `Name(Cand ID)` <chr>, Gender <chr>, `Candidate Current
#   Location` <chr>, `Candidate Job Location` <chr>, `Interview Venue` <chr>,
#   `Candidate Native location` <chr>, `Have you obtained the necessary
#   permission to start at the required time` <chr>, `Hope there will be no
#   unscheduled meetings` <chr>, `Can I Call you three hours before the
#   interview and follow up on your attendance for the interview` <chr>, `Can I
#   have an alternative number/ desk number. I assure you that I will not
#   trouble you too much` <chr>, `Have you taken a printout of your updated
#   resume. Have you read the JD and understood the same` <chr>, `Are you clear
#   with the venue details and the landmark.` <chr>, `Has the call letter been
#   shared` <chr>, `Expected Attendance` <chr>, `Observed Attendance` <chr>,
#   `Marital Status` <chr>, X24 <lgl>, X25 <lgl>, X26 <lgl>, X27 <lgl>,
#   X28 <lgl>

25.2 Data cleaning

interview_attendance <- interview_attendance[-1234,] #remove the last row that contains only missing values

interview_attendance$X24   <- NULL #get rid of unnecesary columns on the right side
interview_attendance$X25 <- NULL #get rid of unnecesary columns on the right side
interview_attendance$X26 <- NULL #get rid of unnecesary columns on the right side
interview_attendance$X27 <- NULL #get rid of unnecesary columns on the right side
interview_attendance$X28 <- NULL #get rid of unnecesary columns on the right side

# Here we create a vector with column titles
mycolnames <-c('date_of_interview','client_name','industry','location','position',
               'skillset','interview_type','name', 'gender','current_location',
               'cjob_location','interview_venue','cnative_location',
               'permission_obtained','unscheduled_meetings','call_three_hours_before',
               'alternative_number','printout_resume_jd','clear_with_venue',
               'letter_been_shared','expected_attendance','observed_attendance',
               'marital_status')  

#Here we assign the previously defined column titles
colnames(interview_attendance) <- mycolnames 

rm(mycolnames) # Here we remove the mycolnames vector, as it is not required anymore

interview_attendance<- mutate_all(interview_attendance, funs(tolower)) # Sets all words to lower case

# Cancels all empty spaces in the observed attendance column
interview_attendance$observed_attendance <- gsub(" ", "", interview_attendance$observed_attendance)

# Cancels all empty spaces in the location column
interview_attendance$location <- gsub(" ", "", interview_attendance$location) 

# Cancels all empty spaces in the interview type column
interview_attendance$interview_type <- gsub(" ", "", interview_attendance$interview_type)

# Corrects a typo in the interview type column
interview_attendance$interview_type <- gsub("sceduledwalkin", "scheduledwalkin", interview_attendance$interview_type)

# Cancels all empty spaces in the candidate current location column
interview_attendance$current_location <- gsub(" ", "", interview_attendance$current_location)


#Converts values from character to numbers for Yes/no answers, just to keep things simple.
  colstoyesno <- c(14:22) # Here we define which column numbers to look at.
for (i in 1:length(colstoyesno)){   # Here we tell R to examine all variables in the previously defined columns
  j <- colstoyesno[i]
  interview_attendance[,j][interview_attendance[,j] !="yes"] <- "no"   
  interview_attendance[,j][is.na(interview_attendance[,j]) == TRUE] <- "no"
  #With the previous two lines all values different to yes, become a no, i.e. "uncertain" and "NA" are set to a "no".
}
rm(colstoyesno, i, j) #Here we remove the three just created vectors as a claen up.

In the following step we figure out what is the content of each relevant column and identify its unique values. Once we have done that a number is assigned for a later step.

dir.create("codefiles_interview_attendance", showWarnings = FALSE)
#detach("package:plyr", unload = TRUE)
for(i in 1:length(colnames(interview_attendance))){
  vvar <- colnames(interview_attendance)[i]
  outdata <- interview_attendance %>% dplyr::group_by(.dots = vvar) %>% dplyr::count(.dots = vvar)
  outdata$idt <- LETTERS[seq(from=1, to=nrow(outdata))]
  outdata$id  <- row.names(outdata)
  outfile <- paste0("codefiles_interview_attendance/", vvar, ".csv")
  write.csv(outdata, file=outfile, row.names = FALSE)
}
rm(outdata, i, outfile, vvar)

This step uses the csv files created in the previous step and replaces the words with a number (except for the variable that we would like to predict) in order to prepare the data for machine learning.

colstomap <- c(2:7, 9:21, 23)
library(plyr)
for(i in 1:length(colstomap)){
    j <- colstomap[i]
    vfilename <- paste0("codefiles_interview_attendance/", colnames(interview_attendance)[j], ".csv")
    dfcodes <- read.csv(vfilename, stringsAsFactors=FALSE)
    vfrom <- as.vector(dfcodes[,1])
    vto <- as.vector(dfcodes[,4])
    interview_attendance[,j] <- mapvalues(interview_attendance[,j], from=vfrom, to=vto)
    interview_attendance[,j] <- as.integer(interview_attendance[,j])
}
rm(colstomap, i, j, vfilename, vfrom, vto, dfcodes)

Here we pick the data that are needed for the predictor.

interview_attendanceml <- interview_attendance %>% dplyr::select(-date_of_interview, -name)
interview_attendanceml <- interview_attendanceml %>% dplyr::select(client_name:expected_attendance, 
                                observed_attendance)
head(interview_attendanceml, 5)

Here we start the real machine learning part. The training dataset is created containing 75% of the observations [interview_attendanceml_train] and the remaining observations are assigned to the test dataset [interview_attendanceml_test].

library(caret) # Calls the caret library

set.seed(144) # Sets a seed for reproducability

index <- createDataPartition(interview_attendanceml$observed_attendance, p=0.75, list=FALSE) #Creates an index vector of the length of all the observations

interview_attendanceml_train <- interview_attendanceml[index,] # Creates a subset of 75% of data for training dataset
interview_attendanceml_test  <- interview_attendanceml[-index,] # Creates a subset of the remaining 25% of data for test dataset

rm(index,interview_attendanceml)

25.3 Choosing a model

We didn’t want to tune anything and just let the code figure it out. Since the output is 1 for a no show and 2 for a show, we decided to use the gbm method (generalized boosting machine) from caret for creating the algorithm.

25.4 Training

We use the train function of the caret library to train the training data set determined in the previous step.

myml_model <- train(interview_attendanceml_train[,1:19], interview_attendanceml_train[,20], method='gbm')

summary(myml_model)

predictions <- predict(object = myml_model, interview_attendanceml_test,
                       type = 'raw')

head(predictions)

print(postResample(pred=predictions, obs=as.factor(interview_attendanceml_test[,20])))