This repository contains scripts and code for four tasks:
- Download 50 public profile PDFs of your connections (randomly) from LinkedIn.
- Extract text from the above PDFs and store them in a CSV.
- For each profile's text, find the most frequent words and the most important words. The result should not contain stop words (like is, the, an, etc.).
- Create two web APIs using Flask, Django, or another framework of your choice.
  - The first web API should take a PDF file as input and return the text in it in JSON format.
  - The second web API should take text data as input and return the most frequent words and important words (as mentioned in 3) in JSON format.
PDF Scraper downloads the profile PDFs of your LinkedIn connections. The user must provide certain variables, such as the username and password. It uses Selenium (yes, it is a testing suite, but here it is being hacked into a web scraper) and Google Chrome.
Note: The script may be a bit slow because it pauses for 30 seconds before making another request, to keep LinkedIn from blocking requests or taking undesirable action against your account. If you want to speed it up a bit, change time.sleep(10) on line 61 (I don't recommend it).
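The pause between requests described in the note can be sketched as below. This is only an illustration, not the scraper's actual code: `run_throttled`, `action`, and `delay_seconds` are hypothetical names, and the 30 second default simply mirrors the figure in the note.

```python
import time


def run_throttled(items, action, delay_seconds=30, sleep=time.sleep):
    """Call action(item) for each item, pausing between consecutive calls.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(action(item))
    return results
```

In the scraper, `action` would be whatever function downloads one profile PDF. Lowering `delay_seconds` speeds things up at the cost of looking more like automated traffic.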
This script defines a function and uses the pandas and pdftotext libraries from PyPI. It takes the path of the directory where the LinkedIn resumes are stored, named Profile.pdf, Profile (1).pdf, ... up to Profile (N).pdf. In our case, N = 50.
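A minimal sketch of how the file naming scheme above could be walked to build CSV rows. The names `profile_filenames`, `build_rows`, and `extract_text` are hypothetical, and the sketch assumes `count` is the total number of files, so the last one is Profile (count - 1).pdf.

```python
from pathlib import Path


def profile_filenames(count):
    """Return Profile.pdf, Profile (1).pdf, ... for `count` downloads."""
    return ["Profile.pdf" if i == 0 else f"Profile ({i}).pdf" for i in range(count)]


def build_rows(pdf_dir, count, extract_text):
    """Build one {"file", "text"} row per PDF.

    `extract_text` is any callable that maps a path to a string. With the
    pdftotext package it could open the file in binary mode, wrap it in
    pdftotext.PDF(handle), and join the resulting pages.
    """
    rows = []
    for name in profile_filenames(count):
        rows.append({"file": name, "text": extract_text(Path(pdf_dir) / name)})
    return rows
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` and written out with `to_csv`.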
This script contains functions that remove stop words (words used very often in a language) from a corpus, which in our case is the text from LinkedIn profiles. RemoveStopWords reads the text from the CSV at a given path, removes symbols, and then removes the stop words listed in NLTK. TF-IDF is then used to find the most significant words in the corpus. This file also contains the function the API uses to find the most significant words.
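To make the steps concrete, here is a small pure-Python sketch of stop word removal, word counting, and TF-IDF scoring. It is not the script's code: the tiny `STOP_WORDS` set stands in for NLTK's list, all function names are hypothetical, and the score is the plain tf * log(N / df) form, which differs from scikit-learn's smoothed and normalised variant.

```python
import math
import re
from collections import Counter

# Tiny stand-in list (assumption); the script itself uses NLTK's stop words.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text, drop symbols and digits, and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def most_frequent(text, top_n=3):
    """Return the top_n (word, count) pairs for one profile."""
    return Counter(tokenize(text)).most_common(top_n)


def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    total = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(total / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

A word that appears in every profile gets a score of zero, which is why TF-IDF surfaces words that distinguish one profile from the rest rather than words that are merely common.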
This is the API required in task 4, built with the Flask framework. The API has 4 routes:
- /uploadPDF
This route shows a form to upload a file. It then makes an API request to "/textExtractor", which performs the text extraction from the PDF with functions from "".
- /textExtractor
This is the route where the actual extraction of text happens.
- /uploadWords
This route renders the template containing the form that receives the text input from which the most significant and important words are extracted. On submit, it makes a request to "/significantWords".
- /significantWords
This is the API route that takes the input 'textdata' in a POST request and returns the most important words along with their counts.
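As a rough illustration of calling the last route from a script rather than from the form, the sketch below builds the POST request with the standard library. The base URL is an assumption (Flask's usual development address), `build_significant_words_request` is a hypothetical helper, and it assumes the route reads 'textdata' as a form field, as a form submission would send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:5000"  # assumption: Flask's default dev address


def build_significant_words_request(text, base_url=BASE_URL):
    """Build a form-encoded POST request carrying the 'textdata' field."""
    body = urlencode({"textdata": text}).encode("utf-8")
    return Request(base_url + "/significantWords", data=body, method="POST")
```

With the API running, passing the request to `urllib.request.urlopen` and decoding the response body with `json.loads` should give the words and their counts.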
Templates contains the HTML templates for the routes presented to the user.
Make sure Google Chrome is already installed on your system and we're good to go.
After that, download ChromeDriver from the Google Storage API and store it somewhere, preferably in the project folder.
I used version 2.24. Choose a version that is compatible with your installed Chrome.
Run the scripts in the order in which they are listed here, except for the API.
If you already have Python 3 installed, well and good; if not, you will need to install it.
Then install Selenium by executing:
pip install selenium
After selenium is installed you will need to change a few variables in
DOWNLOAD_DIR = Path where the PDFs should be downloaded (for me it was AI_Champ/PDFs)
CHROME_EXECUTABLE = Path to the chrome driver
Email = Your LinkedIn Email
Password = Your LinkedIn Password
Once those are set up you can execute
First, install pdftotext by executing:
pip install pdftotext
If it shows an error, try again after installing its dependencies, which are listed on PyPI.
Change the following variables in pdfToCSV
PDF_DIRECTORY = Path to where the pdfs are located or DOWNLOAD_DIR in PDFScraper
CSV_FILE = Path to where the CSV should be stored
NUMBER_OF_PDFs = Total number of pdfs in PDF_DIRECTORY
Then execute the following and everything should work like a charm.
If it shows an error, check the variables once again.
It requires NLTK and scikit-learn. So, execute :
pip install nltk
pip install scikit-learn
Variables you might want to change :
DIRECTORY = Path to the csv file generated via previous script or value CSV_FILE in pdfToCSV
Then run the script by executing
Install flask via pip install and then run
cd API
The API should be up and running by now. You can try it by opening or in your browser.
- Selenium
- Python 3
- Google Chrome
- pdftotext
- Pandas
- Flask
- scikit-learn
- NLTK