job-ad-compare

Compare job ads from several sites to find the best job opportunity.

This project uses Python web scraping libraries, namely BeautifulSoup and Selenium, to extract job postings for a given role (e.g. software developer) from various sites. The descriptions are then fed to a Scikit-Learn model that calculates a TF-IDF score for each term in the vocabulary. The highest-scoring terms are taken as the most important keywords for that role, which in turn indicate the most important skills required.
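As a rough illustration of the keyword step, the sketch below ranks terms by their summed TF-IDF score using Scikit-Learn's `TfidfVectorizer`. The sample descriptions, the `top_keywords` helper and the choice to sum scores across documents are assumptions made for this example; the project's actual model settings may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical job descriptions standing in for scraped postings.
descriptions = [
    "Develop backend services in Python and maintain SQL databases.",
    "Build web applications with JavaScript; Python experience is a plus.",
    "Design cloud infrastructure and automate deployments using Python.",
]


def top_keywords(docs, n=5):
    """Return the n terms with the highest summed TF-IDF score across docs."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)
    totals = np.asarray(matrix.sum(axis=0)).ravel()  # one summed score per term
    # vocabulary_ maps each term to its column index in the matrix.
    scored = [(term, float(totals[idx])) for term, idx in vectorizer.vocabulary_.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


keywords = top_keywords(descriptions)
for term, score in keywords:
    print(f"{term}: {score:.3f}")
```

Summing per-term scores is only one way to aggregate; taking the mean or the maximum per term would be equally reasonable starting points.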

The project is still in development. Currently, descriptions can be extracted from various sites, and keywords can be extracted from those descriptions.

Testing shows that some irrelevant words often show up as keywords, so the stopword list needs to be modified.
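One possible way to modify the stopword list is to merge Scikit-Learn's built-in English list with extra words and pass the result to the vectorizer. The words in `EXTRA_STOP_WORDS` and the sample descriptions are hypothetical; the real additions would come from inspecting actual keyword results.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical noise words; not taken from the project's own list.
EXTRA_STOP_WORDS = {"apply", "job", "role", "team", "work"}

# stop_words accepts a list, so merge the built-in set with the extras.
stop_words = sorted(ENGLISH_STOP_WORDS.union(EXTRA_STOP_WORDS))

# Hypothetical descriptions standing in for scraped postings.
descriptions = [
    "Apply now to join our team and work with Python.",
    "This role requires SQL; the job involves team work.",
]

vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorizer.fit(descriptions)

# Terms that survive the filtering and can still become keywords.
remaining_terms = sorted(vectorizer.vocabulary_)
print(remaining_terms)
```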

What's new - 7/4/2021

Standardised on the Selenium web driver for scraping all information.
New feature to allow writing jobs to an Excel file.
Enable separation of job description from skills/qualifications (new section).

How to use

To generalise the extraction algorithm, the HTML tags containing the relevant information are stored as config in the job_ad_sites.csv file. A sample of those tags is uploaded. To obtain the config (i.e. the HTML tags needed), first analyse the job posting website's HTML page with the Chrome developer tools.
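To make the config idea concrete, here is a minimal sketch of loading per-site settings from CSV and turning them into a search URL and a CSS selector. The column names, the example sites and the `SiteConfig` class are all hypothetical; the real job_ad_sites.csv layout may differ.

```python
import csv
import io
from dataclasses import dataclass
from typing import List
from urllib.parse import quote_plus

# Hypothetical config layout; the real job_ad_sites.csv columns may differ.
SAMPLE_CONFIG = """site,url_template,description_tag,description_class
ExampleJobs,https://jobs.example.com/search?q={query}&l={location},div,job-description
SampleBoard,https://board.example.org/find?q={query}&where={location},section,posting-body
"""


@dataclass
class SiteConfig:
    site: str
    url_template: str
    description_tag: str
    description_class: str

    def search_url(self, query: str, location: str) -> str:
        """Fill the URL-encoded query and location into the site's template."""
        return self.url_template.format(
            query=quote_plus(query), location=quote_plus(location)
        )

    def description_selector(self) -> str:
        """CSS selector for the element holding the job description."""
        return f"{self.description_tag}.{self.description_class}"


def load_site_configs(csv_file) -> List[SiteConfig]:
    """Read one SiteConfig per CSV row; header names must match the fields."""
    return [SiteConfig(**row) for row in csv.DictReader(csv_file)]


configs = load_site_configs(io.StringIO(SAMPLE_CONFIG))
for cfg in configs:
    print(cfg.site, cfg.search_url("software developer", "Hong Kong"))
    print("  selector:", cfg.description_selector())
```

With a config like this, adding a new site means adding a CSV row rather than changing code. The selector string could then be handed to the scraper, for instance to Selenium's `find_elements` with `By.CSS_SELECTOR`.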

The script GUI.py can be run directly. Then enter the job category and job location as filters. You may use the default config paths.

For my testing and thought process, please refer to the Jupyter notebooks in the Scraping_Test folder.

Future developments

(DONE) Generalise the program to take in any query keywords (i.e. the variable in ?q={} in the URL) for job role and location.
(DONE) Scraping of multiple pages of results.
Addition of certain spurious keywords to the stopword list (further testing needed to identify them).
(DONE) Removal of company names from the keyword results, each replaced by the next word.
(IN PROGRESS) Addition of a UI for inputting new scraping sites and HTML tags, and for triggering the whole scraping process.
Consideration of using other texts as a control sample to calculate the IDF score.
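
The control-sample idea in the last item could look roughly like the sketch below: document frequencies are counted on unrelated texts only, so words that are common in general English score low even when every job ad contains them. This is a hand-rolled, standard-library illustration and not the project's implementation. The sample texts, function names and the smoothed formula ln((1 + n) / (1 + df)) + 1 are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def control_idf(control_docs):
    """Build a smoothed IDF function from the control documents only."""
    n_docs = len(control_docs)
    doc_freq = Counter()
    for doc in control_docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    # Terms unseen in the control corpus have doc_freq 0, hence the highest IDF.
    return lambda term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def score_terms(job_docs, idf):
    """Term frequency over all job ads, weighted by the control-corpus IDF."""
    counts = Counter(token for doc in job_docs for token in tokenize(doc))
    return {term: count * idf(term) for term, count in counts.items()}


# Hypothetical control texts (general English) and job ads.
control_docs = [
    "The team will work on the report this week.",
    "We work with a great team and enjoy the job.",
]
job_docs = [
    "Work with the team on Python services.",
    "The job requires Python and SQL.",
]

scores = score_terms(job_docs, control_idf(control_docs))
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:5]:
    print(f"{term}: {score:.3f}")
```

In this toy data, generic words such as "team" and "work" appear in every control text and so get the minimum weight, while role-specific terms absent from the control sample get the maximum.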

Credits

Credits to the fantastic tutorials on BeautifulSoup, Selenium and TF-IDF calculations.
