Web Scraping Project

Background

During my summer internship, my supervisor was researching how social networks mediated the effect that crises (e.g., economic crises or natural disasters) had on employment-related im/migration in Brazil. For example, would a person with strong social ties be more or less likely to im/migrate than someone with weaker ties following a crisis? My supervisor had been granted access to a wealth of confidential information from the Brazilian federal government: full name, social security number, location(s), employer(s), occupation(s), wage(s), date(s) of hiring and separation, birth date, gender, ethnicity, education level, etc.

However, to capture the full effect of a person's social network, he also needed data on each individual's educational background: the university they attended, their major, their graduation date, and their advisors. As a result, he tasked me with web scraping Escavador (the Brazilian equivalent of LinkedIn) using R. The main difference was that Escavador generated its profiles by scraping publicly available data, and it profiled not only individuals but also large organizations such as colleges and businesses. The profiles of people included information on their academic history as well as their employment history. The plan was, first, to scrape the profiles on Escavador and match them to individuals in the federal dataset using their employment history, in order to assess the accuracy of Escavador's profiles. Once their accuracy was confirmed, my supervisor would use my script to scrape more Escavador profiles, thereby obtaining both an individual's employment and academic history.

Before this project, I had some experience with statistical coding in STATA, but no experience with R, nor had I done anything remotely close to web scraping. By no means am I an expert now, but I believe my coding abilities have improved dramatically over the past few years, and I credit this experience with igniting my curiosity and love for coding.

Step 1: Learning

I was given a few links to study over a span of a week to a week and a half.

During this process, I meticulously took notes to help me understand and remember the copious amount of information.

*Please refer to documents part1.pdf and part2.pdf

Step 2: Testing

To test my web scraping skills, my supervisor had me gather and organize data from every page of a given link, following the template shown below.

| Data de publicação | Data de apresentação | Título | Autor | Orientador | Coorientador | Course |
| --- | --- | --- | --- | --- | --- | --- |
| 5-Ago-2011 | 11-Jun-2011 | Acidente ofídico com serpentes brasileiras do gênero Bothrops | Alex Kim | - | | Administracao |
| 3-Ago-2011 | 11-Jun-2011 | Alterações na qualidade de vida de portadores de Diabetes mellitus tipo I | Hannah Fred | - | | Administracao Civil |

*Please refer to test.R for step 2
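
Below is a minimal sketch of how such a page-by-page scrape could look in R with rvest. It is not the actual test.R: the listing URL, the page parameter, and the CSS selectors are placeholders for illustration.

```r
# Sketch only: the URL and selectors are made up, and the real repository
# pages may paginate and label their fields differently.
library(rvest)

base_url <- "https://repositorio.example.br/listagem?page="  # hypothetical listing URL

scrape_page <- function(page_number) {
  page <- read_html(paste0(base_url, page_number))
  data.frame(
    data_publicacao   = page %>% html_nodes(".data-publicacao")   %>% html_text(trim = TRUE),
    data_apresentacao = page %>% html_nodes(".data-apresentacao") %>% html_text(trim = TRUE),
    titulo            = page %>% html_nodes(".titulo")            %>% html_text(trim = TRUE),
    autor             = page %>% html_nodes(".autor")             %>% html_text(trim = TRUE),
    stringsAsFactors  = FALSE
  )
}

# Combine a handful of pages into one data frame and save it.
results <- do.call(rbind, lapply(1:5, scrape_page))
write.csv(results, "step2_output.csv", row.names = FALSE)
```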

Step 3: Outer Loop

My supervisor provided me with a CSV file of the Escavador links for a couple of universities. My first task was to go through all the profiles from these two universities and compile their information into a CSV file like the one below. We started off with only two universities since we were still in the testing phase. I should also mention that my secondary task was to transfer my knowledge of web scraping to my supervisor, who mostly dealt with statistical coding in STATA, so throughout the script I made a lot of comments for clarity.

  • Template

| Page number | Name | Study/Work | link |
| --- | --- | --- | --- |
| 1 | Greg | Estudou em 2017 | ____.com |
| 2 | Ester | Estudou em 2016 | ____.com |

*Please refer to outer_loop.R for step 3
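
For context, here is a rough sketch of what the outer loop could look like. It is not the actual outer_loop.R: the CSV file name and column, the pagination scheme, and the selectors are assumptions made for illustration.

```r
# Sketch only: walk each university's listing pages and collect
# every profile's name, study/work summary, and profile link.
library(rvest)

universities <- read.csv("university_links.csv", stringsAsFactors = FALSE)

profiles <- list()
for (uni_url in universities$link) {
  page_number <- 1
  repeat {
    page <- read_html(paste0(uni_url, "?page=", page_number))
    profile_names <- page %>% html_nodes(".profile-name") %>% html_text(trim = TRUE)
    if (length(profile_names) == 0) break          # no more profiles on this page
    profiles[[length(profiles) + 1]] <- data.frame(
      page  = page_number,
      name  = profile_names,
      study = page %>% html_nodes(".profile-study") %>% html_text(trim = TRUE),
      link  = page %>% html_nodes(".profile-name")  %>% html_attr("href"),
      stringsAsFactors = FALSE
    )
    Sys.sleep(5)                                   # pause between requests (see below)
    page_number <- page_number + 1
  }
}
write.csv(do.call(rbind, profiles), "outer_loop_output.csv", row.names = FALSE)
```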

Step 4: Inner Loop

I then went into each profile and gathered that person's employment history as well as their academic history.

| Name | link | school_1 | major_1 | dates_school_1 | advisor_s1 | school_2 | major_2 | dates_school_2 | advisor_s2 | school_3 | major_3 | dates_school_3 | advisor_s3 | school_4 | major_4 | dates_school_4 | advisor_s4 | work_1 | dates_work_1 | work_2 | dates_work_2 | work_3 | dates_work_3 | work_4 | dates_work_4 | work_5 | dates_work_5 | work_6 | dates_work_6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rachel | ____.com | UA | Art | 2013 - 2018 | Alex | | | | | | | | | | | | | | | | | | | | | | | | |
| Steve | ____.com | UB | Business | 2015 - 2016 | Brittney | UD | Dentistry | 2009 - 2013 | Daniella | | | | | | | | | A INC | 2011 - 2012 | C INC | 2014 - Atual | | | | | | | | |
| Lily | ____.com | UC | Chemistry | 2001 - 2005 | Charles | | | | | | | | | | | | | B INC | 2011 - Atual | D INC | 2007 - 2011 | E INC | 2006 - 2007 | F INC | 2008 - 2009 | G INC | 2008 - 2009 | H INC | 2009 - 2010 |

*Please refer to inner_loop.R for step 4
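
To illustrate the idea, the sketch below pads one profile's education and work entries into a fixed-width row matching the template above (4 education slots, 6 work slots). It is not the actual inner_loop.R, and the selectors are placeholders.

```r
# Sketch only: Escavador's real page structure differs from these selectors.
library(rvest)

scrape_profile <- function(profile_url) {
  page <- read_html(profile_url)
  name <- page %>% html_node(".profile-name") %>% html_text(trim = TRUE)

  # Education entries: school, major, dates, advisor.
  schools  <- page %>% html_nodes(".education .school")  %>% html_text(trim = TRUE)
  majors   <- page %>% html_nodes(".education .major")   %>% html_text(trim = TRUE)
  s_dates  <- page %>% html_nodes(".education .dates")   %>% html_text(trim = TRUE)
  advisors <- page %>% html_nodes(".education .advisor") %>% html_text(trim = TRUE)

  # Employment entries: employer and dates.
  works   <- page %>% html_nodes(".experience .employer") %>% html_text(trim = TRUE)
  w_dates <- page %>% html_nodes(".experience .dates")    %>% html_text(trim = TRUE)

  # Pad to fixed widths so every profile yields the same 30-column row.
  pad <- function(x, n) { length(x) <- n; x }
  row <- c(name, profile_url,
           rbind(pad(schools, 4), pad(majors, 4), pad(s_dates, 4), pad(advisors, 4)),
           rbind(pad(works, 6), pad(w_dates, 6)))
  as.data.frame(t(row), stringsAsFactors = FALSE)
}
```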

Step 5: Assessment

My supervisor measured the reliability of the scraped information by comparing it to his federal dataset. According to him, it was accurate, and the script was ready for implementation.

Difficulties along the way

Admittedly, for a novice, the whole process was a challenge: figuring out HTML nodes, using regular expressions, and so on. However, most of the difficulties revolved around implementing a rotating proxy. My supervisor thought it best to use a rotating proxy in case Escavador banned our IP address. I knew of proxies but had little knowledge of how they worked, so it took a fair bit of trial and error. When I finally implemented a rotating proxy, another issue emerged: some of the proxies I was using were either defunct or unstable. After many hours of research, I stumbled upon tryCatch(), which I used to switch out unreliable proxies, as in the sketch below.
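
The snippet below is a small sketch of that idea rather than the original code: each proxied request is wrapped in tryCatch(), and a failure (a defunct or unstable proxy) simply moves on to the next proxy in the list. The proxy addresses, the use of httr, and the timeout are illustrative assumptions.

```r
# Sketch only: the proxy addresses are placeholders.
library(httr)
library(rvest)

proxies <- data.frame(
  ip   = c("10.0.0.1", "10.0.0.2", "10.0.0.3"),
  port = c(8080, 8080, 3128),
  stringsAsFactors = FALSE
)

read_with_proxy <- function(url) {
  for (i in sample(nrow(proxies))) {              # try the proxies in random order
    page <- tryCatch({
      resp <- GET(url, use_proxy(proxies$ip[i], proxies$port[i]), timeout(10))
      stop_for_status(resp)                       # treat HTTP errors as failures too
      read_html(content(resp, "text"))
    }, error = function(e) NULL)                  # defunct/unstable proxy: try the next one
    if (!is.null(page)) return(page)
  }
  stop("all proxies failed for ", url)
}
```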

The last obstacle had to do with my ignorance of functions and apply(). As a beginner, I found the syntax of the apply() family a bit abstract, so I simply stuck with for-loops, not understanding the memory implications. As the script continued to run, the data it compiled grew larger, causing R to freeze every couple of hours. Ironically, it only resumed once a user clicked the "STOP" button; through further testing, I also found that pressing the "STOP" button preemptively did not stop the script. My supervisor and I had no idea what was causing this, and with my internship coming to an end, I had to quickly jury-rig a solution. Luckily, I was taking a Python course alongside my internship, so I created a Python script that located the coordinates of the "STOP" button and clicked it at user-specified intervals, ideally around every two hours. I was quite proud of my workaround until I realized later on that the issue was probably the for-loops, as illustrated below.
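
As a small illustration of the pattern I eventually learned about (not the original script), the first version below grows a data frame inside a for-loop, which copies the whole object on every iteration, while the second builds the pieces with lapply() and combines them once. Here, profile_links and scrape_profile() are hypothetical stand-ins.

```r
# What I was doing: repeated rbind() inside the loop, which is memory-hungry
# because the growing data frame is copied on every iteration.
results <- data.frame()
for (url in profile_links) {
  results <- rbind(results, scrape_profile(url))
}

# What I should have done: build each piece with lapply() and combine once.
results <- do.call(rbind, lapply(profile_links, scrape_profile))
```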

I should add that my supervisor was a paying customer of Escavador and that we looked over their robots.txt file before embarking on this project. Furthermore, we added Sys.sleep() calls after every iteration to pause the script for a specified amount of time. Lastly, because Escavador itself gathered its information through web scraping, and because we weren't using the information for commercial purposes, we believed our actions were ethically justifiable.

*Please refer to "button_clicker.py" for the button clicking Python script
