This is a small web-scraping project built around a centralised controller that engages multiple agents for efficiency. For now, scraping targets LinkedIn profiles; if other targets are needed, they will live on separate branches of this same repo.
- Python 3.6 or higher
- Libraries: `requests`, `beautifulsoup4`, `selenium`, `psycopg2`
Install the required Python libraries with pip:

    pip install requests beautifulsoup4 selenium psycopg2
The data requirements are:
- Name
- Current Position
- Skills
- LinkedIn URL
Run:

    python ./parent.py

A dataset `data-mini.csv` is created in the `./data/` directory, and `data.csv` is also initialised there. `data-mini.csv` contains basic details such as `name`, `role` and, most importantly for now, `profile_URL`. You can then check the size of `data-mini.csv` and, based on it, engage multiple `minion.py` instances to download and process the data faster in parallel, as sketched below.
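For example, here is a small sketch (assuming `data-mini.csv` has a header row; the number of minions is up to you) of counting its rows and splitting them into index ranges for several `minion.py` runs:

```python
import csv

# Count the profile rows in data-mini.csv (minus the header row).
with open("./data/data-mini.csv", newline="", encoding="utf-8") as f:
    total = sum(1 for _ in csv.reader(f)) - 1

minions = 4  # how many minion.py instances you plan to run
chunk = (total + minions - 1) // minions
for i in range(minions):
    start, end = i * chunk, min((i + 1) * chunk, total)
    print(f"minion {i + 1}: start_index={start}, end_index={end}")
```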
Run:

    python ./minion.py
    >>> Enter the starting index: <start_index as per your choice>
    >>> Enter the ending index: <end_index as per your choice>

The data of all the profiles in the chosen range is appended to `data.csv` (all running minions write to it simultaneously) and is then further processed into PostgreSQL.
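A rough sketch of that flow, under stated assumptions: the column names and the `scrape_profile` helper below are placeholders for the scraping described in the next section, not the repo's exact code.

```python
import csv

def scrape_profile(url):
    # Placeholder for the Selenium/BeautifulSoup scraping described below.
    return {"name": "", "profile_URL": url, "current_position": ""}

start = int(input("Enter the starting index: "))
end = int(input("Enter the ending index: "))

# The slice of profile URLs assigned to this minion, taken from data-mini.csv.
with open("./data/data-mini.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))[start:end]

# Append the scraped details for each profile to the shared data.csv.
with open("./data/data.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "profile_URL", "current_position"])
    for row in rows:
        writer.writerow(scrape_profile(row["profile_URL"]))
```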
- Initially, I considered using the LinkedIn APIs available for public or developer use. However, they were either deprecated or not relevant (mostly for scraping jobs, not profiles).
- The most relevant and straightforward method I found was to use Google search with some filters, such as:
  - The site filter `site:`, as `"site:linkedin.com/in/" OR "site:linkedin.com/pub/"`
  - Since we are searching for profiles, the intitle filter as `-intitle:"profiles"`
  - The job profile, simply as `"Software Developer"`
  - Email, by adding `"@gmail.com" OR "@yahoo.com"` (not done here)
  - Location can also be added as a field alongside `Software Developer` (not done for this case)
- Final URL used was `https://www.google.com/search?q=+"Software+Developers" -intitle:"profiles" -inurl:"dir/+"+site:linkedin.com/in/+OR+site:linkedin.com/pub/` (a sketch of fetching and parsing this page is given after this list).
- With this method, the majority of the data for each person, such as `name`, `profile URL` and `position`, is obtained from the search results. We just need to extract it by parsing the response returned for the filtered URL.
- Fields such as `Current Position`, `Past Experiences`, `Education`, `About` and other possible data are a bit tougher to get. These are obtained by parsing each individual `profile URL` collected in the previous step.
- I've used Selenium to avoid carrying over the cached count of requested pages, since it opens a completely new Chromium window each time (see the per-profile sketch after this list). LinkedIn limits repeated profile requests made without logging in, hides some data so that it can only be viewed after login, and shows pop-ups whose classes need to be found before the page can be accessed.
- After this, most of the processing is done by parsing the received data and finding the spans containing the data, using the BeautifulSoup library.
- In the end we get all the required data, namely `name`, `profile URL`, `Current Position`, etc. I've also included the fields `About`, `Past Experiences` and `Education`, which can also be potentially useful for `Skills`.
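To make the search step above concrete, here is a minimal sketch of fetching the filtered Google results page with `requests` and pulling out the basic fields with BeautifulSoup. The link and text handling is an assumption for illustration; Google's markup changes often, and the repo's actual selectors may differ.

```python
import requests
from bs4 import BeautifulSoup

# The filtered Google search described above.
url = ('https://www.google.com/search?q=+"Software+Developers"'
       ' -intitle:"profiles" -inurl:"dir/+"'
       '+site:linkedin.com/in/+OR+site:linkedin.com/pub/')
headers = {"User-Agent": "Mozilla/5.0"}  # a browser-like UA avoids an immediate block
html = requests.get(url, headers=headers, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
profiles = []
# Each result link pointing at a LinkedIn profile gives the profile URL;
# its anchor text usually reads "<name> - <position> - LinkedIn".
for a in soup.find_all("a", href=True):
    if "linkedin.com/in/" in a["href"] or "linkedin.com/pub/" in a["href"]:
        name, _, position = a.get_text(" ", strip=True).partition(" - ")
        profiles.append({"name": name, "position": position, "profile_URL": a["href"]})
```

And a hedged sketch of the per-profile step: a fresh Chromium window per request via Selenium, then BeautifulSoup to locate the spans. The headless flag and the class names are placeholders (assumptions); the real class names have to be inspected in the fetched HTML.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_profile_fields(profile_url):
    # A completely new Chromium window per profile, so no cached state
    # (cookies, request counts) carries over between requests.
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(profile_url)
        # Login pop-ups may need to be located by class and dismissed here.
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    # Hypothetical class names -- inspect the real HTML for the actual ones.
    about = soup.find("span", class_="about-section-text")
    position = soup.find("span", class_="current-position-text")
    return {
        "About": about.get_text(strip=True) if about else "",
        "Current Position": position.get_text(strip=True) if position else "",
    }
```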
In the `./xtras/scraper.py` or `./parent.py` script, there are a few parameters you can change:

- `url`: the URL of the Google search results page. You can change the search query to scrape different LinkedIn profiles.
- `number_of_swipes`: the number of times the script scrolls down on the Google search results page to load more results. You can increase this number to scrape more profiles (see the scroll-loop sketch below).
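For reference, here is a sketch of how a scroll loop driven by `number_of_swipes` typically looks in Selenium; the variable name matches the parameter above, while the rest (the simplified query, the 2-second wait) is an illustrative assumption rather than the repo's exact code.

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.google.com/search?q="Software+Developers"+site:linkedin.com/in/')

number_of_swipes = 10  # how many times to scroll to load more results
for _ in range(number_of_swipes):
    # Scroll to the bottom so the results page loads another batch.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give new results time to appear

html = driver.page_source  # parsed later with BeautifulSoup
driver.quit()
```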
The comments in the code explain the extraction process. A PostgreSQL (`.sql`) file will be included with the data, and the images are self-explanatory.
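For the PostgreSQL side, here is a minimal sketch of pushing the rows from `data.csv` into a table with `psycopg2`; the connection settings, table name and column names are assumptions for illustration.

```python
import csv
import psycopg2

# Placeholder connection details -- adjust to your own database.
conn = psycopg2.connect(dbname="linkedin", user="postgres",
                        password="postgres", host="localhost")
cur = conn.cursor()

with open("./data/data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        cur.execute(
            "INSERT INTO profiles (name, profile_url, current_position) VALUES (%s, %s, %s)",
            (row["name"], row["profile_URL"], row["current_position"]),
        )

conn.commit()
cur.close()
conn.close()
```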
The `Past Experience` field is not in the current `data.csv` but was implemented later in `./xtras/scrapper.py` and `./minion.py`. Re-running on the entire dataset was time-consuming, and the IP address was also restricted from requesting the data. A re-run of `./parent.py` and `./minion.py` will populate that field too.