The Indeed Search Optimizer is a Python-based tool that automates job searches on Indeed. It scrapes job postings for specific queries and locations and stores the job key of every posting it sees. On later runs it reports only postings whose job keys are new since your last search. Occasionally an old posting is reposted under a new job key, in which case the same job appears again, but this is infrequent.
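The new-posting check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it keeps the seen job keys in a JSON file rather than in Redis, and the function names and the file-based store are assumptions made for the example. Only the `jobkey` field comes from the scraped data.

```python
import json
from pathlib import Path
from typing import Dict, List, Set


def load_seen_keys(path: Path) -> Set[str]:
    """Return the job keys recorded by earlier runs (empty on the first run)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def filter_new_jobs(jobs: List[Dict], seen: Set[str]) -> List[Dict]:
    """Keep only the jobs whose 'jobkey' has not been seen before."""
    return [job for job in jobs if job["jobkey"] not in seen]


def save_seen_keys(path: Path, seen: Set[str]) -> None:
    """Persist every known job key so the next run can compare against it."""
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")
```

A run would load the stored keys, filter the freshly scraped jobs, report the result, and then save the union of old and new keys. A reposted job arrives with a different `jobkey`, which is why it passes the filter again.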
- Automated Job Scraping: Scrapes job listings from Indeed based on specified keywords, locations, and radius.
- Data Handling: Collects and stores job data, allowing for comparison between new and old job postings.
- Reporting: Generates reports highlighting only new job postings and key job characteristics.
- Scheduler: Manages periodic scraping tasks with configurable frequency and staggering.
- GUI Notifications: Displays notifications for newly found jobs.
- Redis Integration: Uses Redis for state management and scheduling of scraping tasks.
- Enhanced Logging: Provides detailed logs for debugging and monitoring.
- Automatic detection of execution environment (Docker container vs. WSL)
- Dynamic Redis configuration based on environment:
  - In Docker: Uses Redis Stack with JSON storage capabilities
  - In WSL/Local: Uses standard Redis for state management
- Environment-specific connection handling:
  - Docker: Connects to 'redis' host (internal Docker network)
  - WSL/Local: Connects to 'localhost'
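One way the environment-specific connection handling above could work is sketched below. How the project actually detects Docker is not stated here, so the `/.dockerenv` check is an assumption (a common heuristic, since Docker normally creates that file inside containers), and both function names are hypothetical. Only the two host names come from the list above.

```python
from pathlib import Path


def running_in_docker() -> bool:
    """Heuristic (assumed): Docker usually creates /.dockerenv in a container."""
    return Path("/.dockerenv").exists()


def redis_host(in_docker: bool) -> str:
    """Pick the Redis host name for the detected environment."""
    # 'redis' resolves on the internal Docker network; elsewhere use localhost.
    return "redis" if in_docker else "localhost"
```

Calling `redis_host(running_in_docker())` gives the host name to pass to the Redis client. Keeping detection separate from the host choice makes the choice easy to test without a container.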
- Windows Subsystem for Linux 2 (WSL 2) running Ubuntu with WSLg (see the official WSLg GitHub repository)
- Python 3.7 or later
- Scrapfly API key
- Redis
- Access WSL: Open your WSL terminal.
- Clone the Repository: Clone the repo and cd into the root folder:

      cd indeedOptimizer

- Set Up the Virtual Environment: Set up a Python virtual environment to manage dependencies:

      python3 -m venv .venv
      source .venv/bin/activate

- Install Dependencies: Install the required Python packages using pip:

      pip install -r requirements.txt

- Set Up Environment Variables: Create a .env file in the root directory of the project and add your Scrapfly API key:

      API_KEY=your_scrapfly_api_key

- Ensure Redis is Running: Make sure Redis is installed and running on your system.
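Before the first run it can help to confirm that the .env file really contains the key. The sketch below parses the file with the standard library only. The project may load it differently (for example with a dotenv package), so the function names and the parsing rules here are assumptions. Only the `API_KEY` variable name comes from the steps above.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env: Dict[str, str]) -> str:
    """Return the Scrapfly key, or fail early with a clear message."""
    key = env.get("API_KEY", "")
    if not key:
        raise RuntimeError("API_KEY is missing from .env")
    return key
```

For example, `require_api_key(read_env_file(Path(".env")))` raises immediately if the key was never set, instead of failing later during a scrape.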
These steps assume that Docker is installed and running.
- Open terminal: Open your WSL, Linux, or OSX terminal.
- Clone the Repository: Clone the repo and cd into the root folder:

      cd indeedOptimizer

- Set Up Environment Variables: Create a .env file in the root directory of the project and add your Scrapfly API key:

      API_KEY=your_scrapfly_api_key

- Run docker compose: If you make any changes to the code, like modifying the search queries, you need to rebuild the images:

      docker compose up --build
-
- Configure Search Parameters: Edit the main.py file to specify your job search queries and locations, how often the scrapes repeat (run_every_minutes), and the staggering_minutes between consecutive task scrapes. The start_scheduler function in main.py accepts these parameters. For example:

      tasks = [
          ("php_developer", "tampa"),
          ("software_engineer", "miami"),
      ]
      run_every_minutes = 60  # example value only; choose your own frequency
      staggering_minutes = 5
      start_scheduler(tasks, run_every_minutes, staggering_minutes)

- Run the Scraper: Execute the script in a Docker container or in your WSL terminal:

      python main.py

The program then runs continuously, performing scrapes on the configured schedule and displaying notifications for new jobs found.
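To make the two timing parameters concrete, the sketch below computes when each task would run. It assumes each task starts staggering_minutes after the previous one and then repeats every run_every_minutes. The scheduler's internals are not described here, so treat this as an illustration of the idea and not as the behavior of start_scheduler. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Task = Tuple[str, str]  # (query, location), as in the tasks list in main.py


def first_run_times(
    tasks: List[Task], start: datetime, staggering_minutes: int
) -> List[Tuple[Task, datetime]]:
    """Offset each task from the previous one by staggering_minutes."""
    return [
        (task, start + timedelta(minutes=index * staggering_minutes))
        for index, task in enumerate(tasks)
    ]


def next_run(previous: datetime, run_every_minutes: int) -> datetime:
    """Time of the following run for a task that repeats at a fixed interval."""
    return previous + timedelta(minutes=run_every_minutes)
```

Under these assumptions, two tasks with a 5-minute stagger never hit Indeed at the same moment, and each keeps its own offset on every later cycle.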
- Scraped job data will be stored in the scrapped_data directory as JSON files.
- Reports on new job postings will also be generated in the same directory.
- Logs are stored in the logs directory.
- Redis is now used for state management and scheduling of scraping tasks.
- Ensure Redis is installed and running on your system.
- The program now includes a scheduler for managing periodic scraping tasks.
- Users can configure the frequency of scrapes and the staggering time between tasks in the main.py file.
- When new jobs are found, the program displays GUI notifications.
- Users can interact with these notifications to mark jobs as viewed.
- A new logging system provides detailed logs for debugging and monitoring.
- Logs are stored in the logs directory.
The JSON data you scrape from Indeed contains a wealth of information about each job posting. Notably, the organicApplyStartCount is a piece of information not available directly on the website. This data point can help you be more strategic when applying for jobs. Below is an explanation of some of the more notable keys you might find useful:
- adBlob: A string likely containing encrypted or encoded data for internal tracking or state management.
- adId: A unique identifier for the advertisement itself.
- advn: An advertiser number, which could be a unique identifier for the entity that posted the job.
- company: The name of the company that has posted the job.
- companyBrandingAttributes: Contains URLs to the company's logo and a header image which might be used in the job advertisement.
- companyOverviewLink: A URL to the company's overview page on Indeed.
- companyRating: The average rating of the company given by reviewers.
- companyReviewCount: The number of reviews that contributed to the company rating.
- createDate: The timestamp (likely in milliseconds since the Unix epoch) when the job was posted.
- displayTitle: The title of the job as displayed in the listing.
- estimatedSalary: An object containing the salary range for the job, including minimum and maximum values and the type of salary (e.g., yearly).
- formattedLocation: The location of the job, formatted for display.
- indeedApplyEnabled: Indicates whether the job supports applying directly through Indeed's platform.
- jobCardRequirementsModel: Details specific requirements for the job, such as necessary skills or experience.
- jobLocationCity, jobLocationState, jobLocationPostal: Specific location details of the job.
- link: A URL to the specific job posting on Indeed.
- organicApplyStartCount: The number of organic (non-sponsored) applications started for this job.
- remoteWorkModel: Details about the remote work options available for the job, such as hybrid work.
- snippet: A brief HTML snippet describing the job, often containing key points or requirements.
- title: The official title of the job posting.
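As an illustration of using these keys strategically, the sketch below ranks scraped jobs by organicApplyStartCount so that postings with fewer started applications come first. It assumes each job is a plain dict loaded from the scraped JSON and that createDate is in milliseconds since the Unix epoch, which the description above only calls likely. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def created_at(job: Dict) -> datetime:
    """Convert createDate (assumed to be epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(job["createDate"] / 1000, tz=timezone.utc)


def least_contested(jobs: List[Dict], limit: int = 10) -> List[Dict]:
    """Sort by started applications, fewest first; jobs without a count go last."""

    def sort_key(job: Dict):
        count = job.get("organicApplyStartCount")
        return (count is None, count or 0)

    return sorted(jobs, key=sort_key)[:limit]
```

Placing jobs with no count at the end is a design choice for this example: a missing value says nothing about competition, so it should not be read as zero applicants.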
The following keys are included in the final report generated by the scraper:
- company: The name of the company that has posted the job.
- companyRating: The average rating of the company given by reviewers.
- companyReviewCount: The number of reviews that contributed to the company rating.
- createDate: The timestamp (likely in milliseconds since the Unix epoch) when the job was posted.
- formattedCreateDate: Human-readable version of createDate.
- displayTitle: The title of the job as displayed in the listing.
- estimatedSalary: An object containing the salary range for the job, including minimum and maximum values and the type of salary (e.g., yearly).
- extractedSalary
- employerResponsive
- expired: A boolean value indicating whether the job posting has expired.
- formattedLocation: The location of the job, formatted for display.
- formattedRelativeTime: A human-readable string indicating how long ago the job was posted (e.g., "3 days ago").
- hiringMultipleCandidatesModel: Information about whether the employer is hiring multiple candidates for this position.
- jobCardRequirementsModel: Details specific requirements for the job, such as necessary skills or experience.
- jobDescription: Description of the job.
- jobkey: A unique identifier for the job posting.
- link: A URL to the specific job posting on Indeed.
- newJob: A boolean value indicating whether this is a newly posted job.
- organicApplyStartCount: The number of organic (non-sponsored) applications started for this job.
- pubDate: The timestamp (likely in milliseconds since the Unix epoch) when the job was published.
- formattedPubDate: Human-readable version of pubDate.
- remoteLocation: Information about the remote work location, if applicable.
- remoteWorkModel: Details about the remote work options available for the job, such as hybrid work.
- salarySnippet
- taxonomyAttributes: Classification attributes for the job, which might include industry, job type, or other categorizations.
- title: The official title of the job posting.
- urgentlyHiring: A boolean value indicating whether the employer is urgently trying to fill this position.
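The relationship between the raw and formatted fields in the report can be sketched as below. This is an assumption-laden illustration: the date format, the subset of keys copied, and the function names are all hypothetical, and the timestamp is assumed to be in milliseconds. It shows how a field such as formattedCreateDate could be derived from createDate, not how the scraper actually builds its report.

```python
from datetime import datetime, timezone
from typing import Dict

# A small, illustrative subset of the report keys listed above.
REPORT_KEYS = ("company", "displayTitle", "formattedLocation", "jobkey", "link")


def format_ms(timestamp_ms: int) -> str:
    """Render an epoch-millisecond timestamp as a readable UTC string."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M UTC")


def report_row(job: Dict, is_new: bool) -> Dict:
    """Copy selected keys and add the derived fields used in the report."""
    row = {key: job.get(key) for key in REPORT_KEYS}
    if job.get("createDate") is not None:
        row["createDate"] = job["createDate"]
        row["formattedCreateDate"] = format_ms(job["createDate"])
    row["newJob"] = is_new
    return row
```

Keys missing from a posting come through as `None`, so every row has the same shape even when Indeed omits a field.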