During my summer internship, my supervisor was researching how social networks mediated the effect that crises (e.g., economic crises or natural disasters) had on employment-related im/migration in Brazil. For example, would a person with strong social ties be more or less likely to im/migrate after a crisis than someone with weaker ties? My supervisor had been granted access to a wealth of confidential information from the Brazilian federal government: full name, social security number, location(s), employer(s), occupation(s), wage(s), date(s) of hiring and separation, birth date, gender, ethnicity, education level, and so on. However, to capture the full effect of a person's social network, he also needed data on each individual's educational background: the university they attended, their major, their graduation date, and their advisors.

He therefore tasked me with web scraping Escavador—the Brazilian equivalent of LinkedIn—using R. The main difference is that Escavador generates its profiles itself by scraping publicly available data, and it profiles not only individuals but also large organizations such as colleges and businesses. A person's profile includes their academic history as well as their employment history. The plan was, first, to scrape Escavador profiles and match them to individuals in the federal dataset by employment history, in order to assess the accuracy of Escavador's profiles. Once that accuracy was confirmed, my supervisor would use my script to scrape more Escavador profiles, thereby obtaining both an individual's employment and academic history.
Before this project, I had some experience with statistical coding in STATA, but no experience with R, and I had never done anything remotely close to web scraping. I am by no means an expert now, but my coding abilities have improved dramatically over the past few years, and I credit this experience with igniting my curiosity and love for coding.
I was given a few links to study over the span of a week to a week and a half:

- Basics of R
- Web Scraping Tutorial
During this process, I took meticulous notes to help me understand and remember the copious amount of information.
*Please refer to documents part1.pdf and part2.pdf
To test my web scraping skills, my supervisor had me gather and organize the data from every page of a given link, following the template shown below.
- Link
- Template
Data de publicação | Data de apresentação | Título | Autor | Orientador | Coorientador | Course |
---|---|---|---|---|---|---|
5-Ago-2011 | 11-Jun-2011 | Acidente ofídico com serpentes brasileiras do gênero Bothrops | Alex | Kim | - | Administracao |
3-Ago-2011 | 11-Jun-2011 | Alterações na qualidade de vida de portadores de Diabetes mellitus tipo I | Hannah | Fred | - | Administracao Civil |
*Please refer to test.R for step 2
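For context, here is a minimal sketch of what that exercise looked like with rvest. The URL, CSS selectors, and page count below are placeholders rather than the real ones; the actual code is in test.R.

```r
library(rvest)

# Placeholder URL: the real repository link was provided by my supervisor
base_url <- "https://example-repository.example/browse?page="

scrape_page <- function(page_number) {
  page <- read_html(paste0(base_url, page_number))
  # The CSS classes here are invented for illustration
  data.frame(
    titulo     = page %>% html_nodes(".titulo")     %>% html_text(trim = TRUE),
    autor      = page %>% html_nodes(".autor")      %>% html_text(trim = TRUE),
    orientador = page %>% html_nodes(".orientador") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Scrape every page, stack the results, and save them in the template's layout
all_pages <- do.call(rbind, lapply(1:20, scrape_page))  # the page count is illustrative
write.csv(all_pages, "step2_output.csv", row.names = FALSE)
```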
My supervisor provided me with a CSV file of Escavador links for a couple of universities. My first task was to go through all of the profiles from these two universities and compile their information into a CSV file like the one below. We started with only two universities since we were still in the testing phase. I should also mention that my secondary task was to transfer my knowledge of web scraping to my supervisor, who mostly dealt with statistical coding in STATA, so throughout the script I made a lot of comments for clarity.
- Template
Page number | Name | Study/Work | link |
---|---|---|---|
1 | Greg | Estudou em 2017 | ____.com |
2 | Ester | Estudou em 2016 | ____.com |
*Please refer to outer_loop.R for step 3
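Roughly speaking, the outer loop pages through each university's listing and collects one row per profile. The file name, pagination scheme, and selectors below are assumptions made for this sketch; the real details are in outer_loop.R.

```r
library(rvest)

# "university_links.csv" stands in for the CSV of Escavador links my supervisor provided
university_links <- read.csv("university_links.csv", stringsAsFactors = FALSE)

profiles <- list()
for (u in university_links$link) {
  page_number <- 1
  repeat {
    # The "?pagina=" pagination scheme and the CSS classes are assumptions
    listing <- read_html(paste0(u, "?pagina=", page_number))
    names_  <- listing %>% html_nodes(".profile-card .name")  %>% html_text(trim = TRUE)
    study   <- listing %>% html_nodes(".profile-card .study") %>% html_text(trim = TRUE)
    links   <- listing %>% html_nodes(".profile-card a")      %>% html_attr("href")

    if (length(names_) == 0) break  # ran out of result pages for this university

    profiles[[length(profiles) + 1]] <- data.frame(
      page = page_number, name = names_, study = study, link = links,
      stringsAsFactors = FALSE
    )

    Sys.sleep(5)  # pause between requests, as mentioned later in this write-up
    page_number <- page_number + 1
  }
}

write.csv(do.call(rbind, profiles), "step3_output.csv", row.names = FALSE)
```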
Next, I went into each profile and gathered the individual's employment history as well as their academic history.
Name | link | school_1 | major_1 | dates_school_1 | advisor_s1 | school_2 | major_2 | dates_school_2 | advisor_s2 | school_3 | major_3 | dates_school_3 | advisor_s3 | school_4 | major_4 | dates_school_4 | advisor_s4 | work_1 | dates_work_1 | work_2 | dates_work_2 | work_3 | dates_work_3 | work_4 | dates_work_4 | work_5 | dates_work_5 | work_6 | dates_work_6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rachel | ____.com | UA | Art | 2013 - 2018 | Alex |
Steve | ____.com | UB | Business | 2015 - 2016 | Brittney | UD | Dentistry | 2009 - 2013 | Daniella | | | | | | | | | A INC | 2011 - 2012 | C INC | 2014 - Atual |
Lily | ____.com | UC | Chemistry | 2001 - 2005 | Charles | | | | | | | | | | | | | B INC | 2011 - Atual | D INC | 2007 - 2011 | E INC | 2006 - 2007 | F INC | 2008 - 2009 | G INC | 2008 - 2009 | H INC | 2009 - 2010 |
*Please refer to inner_loop.R for step 4
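The inner loop visits each profile link, pulls its variable-length academic and work history, and pads it out to the template's fixed four-school / six-work layout. The sketch below only grabs institution and employer names to stay short, and its selectors and helper name are invented; inner_loop.R has the real ones.

```r
library(rvest)

# Hypothetical helper: scrape one profile and pad it to the fixed template width
scrape_profile <- function(profile_link) {
  page    <- read_html(profile_link)
  schools <- page %>% html_nodes(".academico .instituicao") %>% html_text(trim = TRUE)
  jobs    <- page %>% html_nodes(".profissional .empresa")  %>% html_text(trim = TRUE)

  # Pad (or truncate) to 4 school slots and 6 work slots, matching the table above
  schools <- c(schools, rep(NA, 4))[1:4]
  jobs    <- c(jobs,    rep(NA, 6))[1:6]

  row <- data.frame(link = profile_link, t(schools), t(jobs), stringsAsFactors = FALSE)
  names(row) <- c("link", paste0("school_", 1:4), paste0("work_", 1:6))
  row
}
```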
My supervisor measured the reliability of the scraped information by comparing it to his federal dataset. According to him, it was accurate, and the script was ready for implementation.
Admittedly, for a novice, the whole process was a challenge: figuring out HTML nodes, working with regular expressions, and so on. However, most of the difficulties revolved around implementing a rotating proxy. My supervisor thought it would be best to use one in case Escavador banned our IP address. I knew of proxies but had barely any knowledge of how they worked, so it took some trial and error. When I finally had a rotating proxy in place, another issue emerged: some of the proxies I was using were either defunct or unstable. After many hours of research, I stumbled upon tryCatch(), which I used to swap out unreliable proxies.
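A simplified sketch of that rotation-plus-tryCatch() idea is below. The proxy addresses, timeout, and retry policy are placeholders, not the values we actually used.

```r
library(httr)
library(rvest)

# Example proxy list; the real addresses came from a separate source and changed over time
proxies <- data.frame(
  host = c("203.0.113.1", "203.0.113.2", "203.0.113.3"),
  port = c(8080, 8080, 3128)
)

# Request a page through successive proxies; a defunct or unstable proxy simply
# triggers the error handler, and we move on to the next one in the list
read_html_via_proxy <- function(url, max_attempts = 5) {
  for (attempt in seq_len(max_attempts)) {
    i <- ((attempt - 1) %% nrow(proxies)) + 1  # rotate through the list
    result <- tryCatch({
      resp <- GET(url, use_proxy(proxies$host[i], proxies$port[i]), timeout(15))
      read_html(content(resp, as = "text", encoding = "UTF-8"))
    }, error = function(e) NULL)

    if (!is.null(result)) return(result)
    Sys.sleep(2)  # brief pause before retrying with the next proxy
  }
  stop("All proxy attempts failed for ", url)
}
```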
The last obstacle had to do with my ignorance of functions and apply(). As a beginner, I found the syntax of the apply() family a bit abstract, so I simply stuck with for-loops, not understanding how much memory they waste when a data frame is grown inside them. As the script continued to run and the compiled data grew larger, R would freeze every couple of hours. Ironically, it only resumed once a user clicked the "STOP" button, and further testing showed that pressing "STOP" preemptively did not actually stop the script. My supervisor and I had no idea what was causing this, and with my internship coming to an end, I had to quickly jury-rig a solution. Luckily, I was taking a Python course alongside my internship, so I created a Python script that finds the coordinates of the "STOP" button and clicks it at intervals set by the user, ideally around two hours. I was quite proud of my work-around until I realized later down the road that the issue was probably the for-loops.
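With hindsight, the fix would probably have looked something like the snippet below: build each row with lapply() and bind everything once at the end, instead of growing a single data frame inside the for-loop, which forces R to re-copy it on every iteration. Here profile_links stands for the vector of profile URLs from step 3, and scrape_profile() is the hypothetical helper sketched earlier.

```r
# What I actually did: grow one data frame on every pass through the loop
# results <- data.frame()
# for (link in profile_links) {
#   results <- rbind(results, scrape_profile(link))
# }

# The less memory-hungry alternative: build the rows with lapply() and bind them once
results <- do.call(rbind, lapply(profile_links, scrape_profile))
```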
I should add that my supervisor was a paying customer of Escavador and that we looked over their robots.txt file before embarking on this project. Furthermore, we added Sys.sleep() calls after every iteration to pause the script for a specified amount of time. Lastly, because Escavador itself gathers its information through web scraping and because we were not using the information for commercial purposes, we believed our actions were ethically justifiable.
*Please refer to "button_clicker.py" for the button clicking Python script