Chapter 27 Webscraping LinkedIn
Did you know that the internet is the greatest source of publicly available data? One of the key skills for obtaining data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information.
This technique can be used to collect data from databases or to gather data that is scattered across a website.
In the following scenario, you are asked to extract all the education qualifications from LinkedIn and later compare them with those of your own company. Your senior management is seriously worried that the employees working for your main competitor are better skilled.
You are tasked with extracting all the forenames, the surnames, the job titles and the education qualifications, including the “from–to” dates and the education organisations.
In our example, we have chosen to look at Innocent, the famous drinks manufacturer. Innocent ranks third among “The Sunday Times 100 Best Companies to Work For.”
From the Sunday Times you know that Innocent has 289 employees and that brand managers earn salaries of around £50,000 plus bonuses. Furthermore, the male-to-female ratio is 40:60, the average age is 39, the proportion of voluntary leavers is 15% and 70% of its staff earn over £35,000.
Innocent makes a convincing offer to ambitious types. Nicki Garland, the firm’s team leader for UK and Ireland finance, enjoys the best of both worlds at Innocent. “I wanted to work for a brand that had purpose and meaning,” she says. “When you walk in and see a wall that says the company’s mission is to ‘help people to live well and die old,’ you realise this is definitely a very different business.”
This will be the starting page of your web scraping. After a few checks, you realise that, according to LinkedIn, the company has 543 employees spread over a total of 55 pages of records. From looking at the URLs you also realise that %5B%2224177%22%5D is the code LinkedIn uses to identify Innocent.
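That code looks cryptic, but it is simply a URL-encoded string. A quick check with base R’s URLdecode() reveals the underlying company identifier:

# Decode the identifier found in the LinkedIn URLs (base R, no packages needed)
URLdecode("%5B%2224177%22%5D")
#> [1] "[\"24177\"]"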
After lengthy googling, you find an excellent DataCamp tutorial on web scraping which does mostly what you plan to do, but unfortunately it scrapes Trustpilot, which on top of everything is a website that looks completely different to LinkedIn. You will need to adapt the code given in that tutorial, open each individual LinkedIn profile across all 55 pages and scrape the required information (a sketch of this adaptation follows the code listing below). The end product has to be a table which respects “Hadley Wickham’s tidy data rule”:
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
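Before touching the scraping code, it helps to picture what that final tidy table could look like. The tibble below is purely illustrative: the column names and placeholder rows are assumptions about how the required fields (forename, surname, job title, qualification, from/to dates and organisation) might be laid out, with one education entry per row.

library(tibble)

# Purely illustrative target structure: each variable is a column,
# each education entry of an employee is a row (placeholder values only)
target_shape <- tribble(
  ~forename,    ~surname,    ~job_title,      ~qualification, ~from, ~to,  ~organisation,
  "Forename_1", "Surname_1", "Brand Manager", "BSc",          2008,  2011, "University_A",
  "Forename_1", "Surname_1", "Brand Manager", "MSc",          2011,  2012, "University_B",
  "Forename_2", "Surname_2", "Finance Lead",  "BA",           2005,  2008, "University_C"
)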
library(tidyverse) # General purpose data wrangling
library(rvest)     # Parsing of html/xml files
library(rebus)     # Verbose regular expressions
library(lubridate) # Datetime parsing (ymd_hms() below)
get_last_page <- function(html){

  pages_data <- html %>%
    html_nodes('.pagination-page') %>% # The '.' indicates the class
    html_text()                        # Extracts the raw text as a list

  pages_data[(length(pages_data) - 1)] %>% # The second to last of the buttons is the one
    unname() %>%                           # Take the raw string
    as.numeric()                           # Convert to number
}
get_reviews <- function(html){
  html %>%
    html_nodes('.review-info__body__text') %>% # The relevant tag
    html_text() %>%
    str_trim() %>% # Trims additional white space
    unlist()       # Converts the list into a vector
}
get_reviewer_names <- function(html){
  html %>%
    html_nodes('.consumer-info__details__name') %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}
get_review_dates <- function(html){

  status <- html %>%
    html_nodes('time') %>%
    html_attrs() %>% # The status information is this time a tag attribute
    map(2) %>%       # Extracts the second element
    unlist()

  dates <- html %>%
    html_nodes('time') %>%
    html_attrs() %>%
    map(1) %>%
    ymd_hms() %>% # Using a lubridate function to parse the string into a datetime object
    unlist()

  # You combine the status and the date information to filter one via the other
  return_dates <- tibble(status = status, dates = dates) %>%
    filter(status == 'ndate') %>%              # Only these are actual reviews
    pull(dates) %>%                            # Selects and converts to vector
    as.POSIXct(origin = '1970-01-01 00:00:00') # Converts datetimes to POSIX objects

  # The lengths still occasionally do not line up. You then arbitrarily crop the dates to fit.
  # This can cause data imperfections, however reviews on one page are generally close in time.
  length_reviews <- length(get_reviews(html))

  return_reviews <- if (length(return_dates) > length_reviews){
    return_dates[1:length_reviews]
  } else {
    return_dates
  }

  return_reviews
}
get_star_rating <- function(html){

  # The pattern we look for
  pattern = 'star-rating-' %R% capture(DIGIT)

  vector <- html %>%
    html_nodes('.star-rating') %>%
    html_attr('class') %>%
    str_match(pattern) %>%
    map(1) %>%
    as.numeric()

  vector <- vector[!is.na(vector)]
  vector <- vector[3:length(vector)]
  vector
}
get_data_table <- function(html, company_name){

  # Extract the basic information from the HTML
  reviews        <- get_reviews(html)
  reviewer_names <- get_reviewer_names(html)
  dates          <- get_review_dates(html)
  ratings        <- get_star_rating(html)

  # Minimum length
  min_length <- min(length(reviews), length(reviewer_names), length(dates), length(ratings))

  # Combine into a tibble
  combined_data <- tibble(reviewer = reviewer_names[1:min_length],
                          date     = dates[1:min_length],
                          rating   = ratings[1:min_length],
                          review   = reviews[1:min_length])

  # Tag the individual data with the company name
  combined_data %>%
    mutate(company = company_name) %>%
    select(company, reviewer, date, rating, review)
}
get_data_from_url <- function(url, company_name){
  html <- read_html(url)
  get_data_table(html, company_name)
}
scrape_write_table <- function(url, company_name){

  # Read first page
  first_page <- read_html(url)

  # Extract the number of pages that have to be queried
  latest_page_number <- get_last_page(first_page)

  # Generate the target URLs
  list_of_pages <- str_c(url, '?page=', 1:latest_page_number)

  # Apply the extraction and bind the individual results back into one table,
  # which is then written as a tsv file into the working directory
  list_of_pages %>%
    map(get_data_from_url, company_name) %>% # Apply to all URLs
    bind_rows() %>%                          # Combines the tibbles into one tibble
    write_tsv(str_c(company_name, '.tsv'))   # Writes a tab-separated file
}
# url <-'http://www.trustpilot.com/review/www.amazon.com'
# scrape_write_table(url, 'amazon')
# amazon_table <- read_tsv('amazon.tsv')
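To adapt this pipeline to the LinkedIn scenario, the Trustpilot-specific CSS selectors and the page logic have to be replaced with their LinkedIn counterparts. The outline below is only a sketch under stated assumptions: linkedin_search_url is a hypothetical placeholder (the real search URL containing the %5B%2224177%22%5D company code has to be copied from your browser), and the extraction functions themselves would need LinkedIn-specific selectors found by inspecting the pages.

# A minimal sketch, assuming a placeholder search URL (not a verified LinkedIn endpoint)
# and the 55 result pages observed earlier
linkedin_search_url <- 'https://www.linkedin.com/...%5B%2224177%22%5D...' # copy the real URL from your browser
list_of_pages <- str_c(linkedin_search_url, '&page=', 1:55)

# The same map()/bind_rows()/write_tsv() pattern as scrape_write_table()
# would then apply a LinkedIn-specific get_data_from_url() to every page:
# list_of_pages %>%
#   map(get_data_from_url, 'innocent') %>%
#   bind_rows() %>%
#   write_tsv('innocent.tsv')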