During my summer internship, my supervisor was researching how social networks mediated the effect that crises (e.g., economic crises or natural disasters) had on employment-related im/migration in Brazil. For example, would a person with strong social ties be more or less likely to im/migrate after a crisis than someone with weaker ties? My supervisor had been granted access to a wealth of confidential information from the Brazilian federal government: full name, social security number, location(s), employer(s), occupation(s), wage(s), date(s) of hiring and separation, birth date, gender, ethnicity, education level, and so on. However, to capture the full effect of a person's social network, he also needed data on each individual's educational background: the university they attended, their major, their graduation date, and their advisors.

He therefore tasked me with web scraping Escavador—the Brazilian equivalent of LinkedIn—using R. The main difference is that Escavador generates its profiles itself by scraping publicly available data, and it profiles not only individuals but also large organizations such as colleges and businesses. A person's profile includes their academic history as well as their employment history. The plan was, first, to scrape Escavador profiles and match them to individuals in the federal dataset by employment history, in order to assess the accuracy of Escavador's profiles. Once that accuracy was confirmed, my supervisor would use my script to scrape more Escavador profiles, thereby obtaining both an individual's employment and academic history.
Before this project, I had some experience with statistical coding in STATA, but no experience with R, and I had never done anything remotely close to web scraping. I am by no means an expert now, but my coding abilities have improved dramatically over the past few years, and I credit this experience with igniting my curiosity and love for coding.
I was given a few links to study over the span of a week to a week and a half:

- Basics of R
- Web Scraping Tutorial
During this process, I took meticulous notes to help me understand and remember the copious amount of information.
*Please refer to documents part1.pdf and part2.pdf
To test my web scraping skills, my supervisor had me gather and organize the data from every page of a given link, following the template shown below.
- Link
- Template
Data de publicação | Data de apresentação | Título | Autor | Orientador | Coorientador | Course |
---|---|---|---|---|---|---|
5-Ago-2011 | 11-Jun-2011 | Acidente ofídico com serpentes brasileiras do gênero Bothrops | Alex | Kim | - | Administracao |
3-Ago-2011 | 11-Jun-2011 | Alterações na qualidade de vida de portadores de Diabetes mellitus tipo I | Hannah | Fred | - | Administracao Civil |
*Please refer to test.R for step 2
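For context, here is a minimal sketch of what that exercise looked like with rvest. The URL, CSS selectors, and page count below are placeholders rather than the real ones; the actual code is in test.R.

```r
library(rvest)

# Placeholder URL: the real repository link was provided by my supervisor
base_url <- "https://example-repository.example/browse?page="

scrape_page <- function(page_number) {
  page <- read_html(paste0(base_url, page_number))
  # The CSS classes here are invented for illustration
  data.frame(
    titulo     = page %>% html_nodes(".titulo")     %>% html_text(trim = TRUE),
    autor      = page %>% html_nodes(".autor")      %>% html_text(trim = TRUE),
    orientador = page %>% html_nodes(".orientador") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Scrape every page, stack the results, and save them in the template's layout
all_pages <- do.call(rbind, lapply(1:20, scrape_page))  # the page count is illustrative
write.csv(all_pages, "step2_output.csv", row.names = FALSE)
```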
My supervisor provided me with a CSV file of Escavador links for a couple of universities. My first task was to go through all of the profiles from these two universities and compile their information into a CSV file like the one below. We started with only two universities since we were still in the testing phase. I should also mention that my secondary task was to transfer my knowledge of web scraping to my supervisor, who mostly dealt with statistical coding in STATA, so throughout the script I made a lot of comments for clarity.
- Template
Page number | Name | Study/Work | link |
---|---|---|---|
1 | Greg | Estudou em 2017 | ____.com |
2 | Ester | Estudou em 2016 | ____.com |
*Please refer to outer_loop.R for step 3
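Roughly speaking, the outer loop pages through each university's listing and collects one row per profile. The file name, pagination scheme, and selectors below are assumptions made for this sketch; the real details are in outer_loop.R.

```r
library(rvest)

# "university_links.csv" stands in for the CSV of Escavador links my supervisor provided
university_links <- read.csv("university_links.csv", stringsAsFactors = FALSE)

profiles <- list()
for (u in university_links$link) {
  page_number <- 1
  repeat {
    # The "?pagina=" pagination scheme and the CSS classes are assumptions
    listing <- read_html(paste0(u, "?pagina=", page_number))
    names_  <- listing %>% html_nodes(".profile-card .name")  %>% html_text(trim = TRUE)
    study   <- listing %>% html_nodes(".profile-card .study") %>% html_text(trim = TRUE)
    links   <- listing %>% html_nodes(".profile-card a")      %>% html_attr("href")

    if (length(names_) == 0) break  # ran out of result pages for this university

    profiles[[length(profiles) + 1]] <- data.frame(
      page = page_number, name = names_, study = study, link = links,
      stringsAsFactors = FALSE
    )

    Sys.sleep(5)  # pause between requests, as mentioned later in this write-up
    page_number <- page_number + 1
  }
}

write.csv(do.call(rbind, profiles), "step3_output.csv", row.names = FALSE)
```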
Next, I went into each profile and gathered the individual's employment history as well as their academic history.
Name | link | school_1 | major_1 | dates_school_1 | advisor_s1 | school_2 | major_2 | dates_school_2 | advisor_s2 | school_3 | major_3 | dates_school_3 | advisor_s3 | school_4 | major_4 | dates_school_4 | advisor_s4 | work_1 | dates_work_1 | work_2 | dates_work_2 | work_3 | dates_work_3 | work_4 | dates_work_4 | work_5 | dates_work_5 | work_6 | dates_work_6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rachel | ____.com | UA | Art | 2013 - 2018 | Alex |
Steve | ____.com | UB | Business | 2015 - 2016 | Brittney | UD | Dentistry | 2009 - 2013 | Daniella | | | | | | | | | A INC | 2011 - 2012 | C INC | 2014 - Atual |
Lily | ____.com | UC | Chemistry | 2001 - 2005 | Charles | | | | | | | | | | | | | B INC | 2011 - Atual | D INC | 2007 - 2011 | E INC | 2006 - 2007 | F INC | 2008 - 2009 | G INC | 2008 - 2009 | H INC | 2009 - 2010 |
*Please refer to inner_loop.R for step 4
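The inner loop visits each profile link, pulls its variable-length academic and work history, and pads it out to the template's fixed four-school / six-work layout. The sketch below only grabs institution and employer names to stay short, and its selectors and helper name are invented; inner_loop.R has the real ones.

```r
library(rvest)

# Hypothetical helper: scrape one profile and pad it to the fixed template width
scrape_profile <- function(profile_link) {
  page    <- read_html(profile_link)
  schools <- page %>% html_nodes(".academico .instituicao") %>% html_text(trim = TRUE)
  jobs    <- page %>% html_nodes(".profissional .empresa")  %>% html_text(trim = TRUE)

  # Pad (or truncate) to 4 school slots and 6 work slots, matching the table above
  schools <- c(schools, rep(NA, 4))[1:4]
  jobs    <- c(jobs,    rep(NA, 6))[1:6]

  row <- data.frame(link = profile_link, t(schools), t(jobs), stringsAsFactors = FALSE)
  names(row) <- c("link", paste0("school_", 1:4), paste0("work_", 1:6))
  row
}
```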
My supervisor measured the reliability of the scraped information by comparing it to his federal dataset. According to him, it was accurate, and the script was ready for implementation.
Admittedly, for a novice, the whole process was a challenge: figuring out HTML nodes, working with regular expressions, and so on. However, most of the difficulties revolved around implementing a rotating proxy. My supervisor thought it would be best to use one in case Escavador banned our IP address. I knew of proxies but had barely any knowledge of how they worked, so it took some trial and error. When I finally had a rotating proxy in place, another issue emerged: some of the proxies I was using were either defunct or unstable. After many hours of research, I stumbled upon tryCatch(), which I used to swap out unreliable proxies.
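A simplified sketch of that rotation-plus-tryCatch() idea is below. The proxy addresses, timeout, and retry policy are placeholders, not the values we actually used.

```r
library(httr)
library(rvest)

# Example proxy list; the real addresses came from a separate source and changed over time
proxies <- data.frame(
  host = c("203.0.113.1", "203.0.113.2", "203.0.113.3"),
  port = c(8080, 8080, 3128)
)

# Request a page through successive proxies; a defunct or unstable proxy simply
# triggers the error handler, and we move on to the next one in the list
read_html_via_proxy <- function(url, max_attempts = 5) {
  for (attempt in seq_len(max_attempts)) {
    i <- ((attempt - 1) %% nrow(proxies)) + 1  # rotate through the list
    result <- tryCatch({
      resp <- GET(url, use_proxy(proxies$host[i], proxies$port[i]), timeout(15))
      read_html(content(resp, as = "text", encoding = "UTF-8"))
    }, error = function(e) NULL)

    if (!is.null(result)) return(result)
    Sys.sleep(2)  # brief pause before retrying with the next proxy
  }
  stop("All proxy attempts failed for ", url)
}
```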
The last obstacle had to do with my ignorance of functions and apply(). As a beginner, I found the syntax of the apply() family a bit abstract, so I simply stuck with for-loops, not understanding how much memory they waste when a data frame is grown inside them. As the script continued to run and the compiled data grew larger, R would freeze every couple of hours. Ironically, it only resumed once a user clicked the "STOP" button, and further testing showed that pressing "STOP" preemptively did not actually stop the script. My supervisor and I had no idea what was causing this, and with my internship coming to an end, I had to quickly jury-rig a solution. Luckily, I was taking a Python course alongside my internship, so I created a Python script that finds the coordinates of the "STOP" button and clicks it at intervals set by the user, ideally around two hours. I was quite proud of my work-around until I realized later down the road that the issue was probably the for-loops.
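With hindsight, the fix would probably have looked something like the snippet below: build each row with lapply() and bind everything once at the end, instead of growing a single data frame inside the for-loop, which forces R to re-copy it on every iteration. Here profile_links stands for the vector of profile URLs from step 3, and scrape_profile() is the hypothetical helper sketched earlier.

```r
# What I actually did: grow one data frame on every pass through the loop
# results <- data.frame()
# for (link in profile_links) {
#   results <- rbind(results, scrape_profile(link))
# }

# The less memory-hungry alternative: build the rows with lapply() and bind them once
results <- do.call(rbind, lapply(profile_links, scrape_profile))
```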
I should add that my supervisor was a paying customer of Escavador and that we looked over their robots.txt file before embarking on this project. Furthermore, we added Sys.sleep() calls after every iteration to pause the script for a specified amount of time. Lastly, because Escavador itself gathers its information through web scraping and because we were not using the information for commercial purposes, we believed our actions were ethically justifiable.
*Please refer to "button_clicker.py" for the button clicking Python script