A tutorial for collecting job postings from Indeed using Python and Oxylabs Web Scraper API.

oxylabs/how-to-scrape-indeed

How to Scrape Indeed

This guide walks through extracting job postings from Indeed with Python and Oxylabs Web Scraper API (1-week free trial).

For the complete guide with in-depth explanations and visuals, check our blog post.

Project setup

Creating a virtual environment

python -m venv indeed_env # Windows
python3 -m venv indeed_env # macOS and Linux

Activating the virtual environment

.\indeed_env\Scripts\Activate # Windows
source indeed_env/bin/activate # macOS and Linux

Installing libraries

pip install requests pandas

Overview of Web Scraper API

The following is an example that shows how Web Scraper API works.

# scraper_api_demo.py
import requests

# Request body: scrape the Indeed home page with the "universal" source.
payload = {
    "source": "universal",
    "url": "https://www.indeed.com"
}
response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=("username", "password"),  # replace with your API credentials
)
print(response.json())
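The demo above hardcodes the credentials and sets no timeout. A minimal sketch of a slightly more defensive variant follows. The environment variable names `OXYLABS_USERNAME` and `OXYLABS_PASSWORD`, the helper names, and the 60-second timeout are assumptions made for illustration and are not part of the API.

```python
import os

# Endpoint taken from the demo above.
API_URL = "https://realtime.oxylabs.io/v1/queries"


def build_payload(url, source="universal"):
    """Return the minimal request body used in the demo."""
    return {"source": source, "url": url}


def scrape(url, timeout=60):
    """Send one scraping job and return the decoded JSON response."""
    # Imported here so build_payload() stays usable without the dependency.
    import requests

    # Hypothetical variable names; set them in your shell beforehand.
    auth = (os.environ["OXYLABS_USERNAME"], os.environ["OXYLABS_PASSWORD"])
    response = requests.post(
        API_URL, json=build_payload(url), auth=auth, timeout=timeout
    )
    response.raise_for_status()  # raise on HTTP 4xx/5xx responses
    return response.json()
```

Calling `scrape("https://www.indeed.com")` should return the same JSON that the demo prints, provided both variables are set.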

Web Scraper API parameters

Parsing the page title and retrieving results in JSON

"title": {
    "_fns": [
        {
            "_fn": "xpath_one",
            "_args": ["//title/text()"]
        }
    ]
}

If you send this as parsing_instructions, the output would be the following JSON.

{ "title": "Job Search | Indeed", "parse_status_code": 12000 }

Note that a parse_status_code of 12000 means the page was parsed successfully.
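Since the status code travels with the parsed content, a script can check it before trusting the fields. The following is a small sketch under that assumption. The helper name `extract_title` is hypothetical, and the sample dict simply repeats the output shown above.

```python
PARSE_OK = 12000  # success code shown in the sample output above


def extract_title(content):
    """Return the parsed title, or raise if parsing did not succeed."""
    status = content.get("parse_status_code")
    if status != PARSE_OK:
        raise ValueError(f"parsing failed with parse_status_code={status}")
    return content["title"]


# Sample copied from the output above.
sample = {"title": "Job Search | Indeed", "parse_status_code": 12000}
print(extract_title(sample))  # prints: Job Search | Indeed
```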

The following code prints the title of the Indeed page.

# indeed_title.py

import requests

# "parse": True enables custom parsing; the instructions extract the <title> text.
payload = {
    "source": "universal",
    "url": "https://www.indeed.com",
    "parse": True,
    "parsing_instructions": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//title/text()"]
                }
            ]
        }
    },
}
response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=("username", "password"),  # replace with your API credentials
)
# The parsed output is in the first result's "content" field.
print(response.json()["results"][0]["content"])

Scraping Indeed job postings

Selecting a job listing

`.job_seen_beacon`

Creating the placeholder for a job listing

"job_listings": {
    "_fns": [
        {
            "_fn": "css",
            "_args": [".job_seen_beacon"]
        }
    ],
    "_items": {
        "job_title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
                }
            ]
        },
        "company_name": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@data-testid='company-name']/text()"]
                }
            ]
        }
    }
}
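Each field in `_items` repeats the same `_fns` / `xpath_one` wrapper, so the instructions can also be generated in Python. This is a sketch only. The helper names `xpath_field` and `listing_instructions` are made up for illustration, and the selectors are the ones from the snippet above.

```python
import json


def xpath_field(expression):
    """Build the instruction dict for one field extracted with xpath_one."""
    return {"_fns": [{"_fn": "xpath_one", "_args": [expression]}]}


def listing_instructions(container_css, fields):
    """Wrap per-item fields in a CSS container selector."""
    return {
        "_fns": [{"_fn": "css", "_args": [container_css]}],
        "_items": {name: xpath_field(expr) for name, expr in fields.items()},
    }


parsing_instructions = {
    "job_listings": listing_instructions(
        ".job_seen_beacon",
        {
            "job_title": ".//h2[contains(@class,'jobTitle')]/a/span/text()",
            "company_name": ".//span[@data-testid='company-name']/text()",
        },
    )
}
print(json.dumps(parsing_instructions, indent=2))
```

Adding another data point then takes one more entry in the fields dict instead of a new nested block.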

Adding other selectors

{
  "source": "universal",
  "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
  "parse": true,
  "parsing_instructions": {
    "job_listings": {
      "_fns": [
        {
          "_fn": "css",
          "_args": [".job_seen_beacon"]
        }
      ],
      "_items": {
        "job_title": {
          "_fns": [
            {
              "_fn": "xpath_one",
              "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
            }
          ]
        },
        "company_name": {
          "_fns": [
            {
              "_fn": "xpath_one",
              "_args": [".//span[@data-testid='company-name']/text()"]
            }
          ]
        }
      }
    }
  }
}
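The `url` in the payload above is a search URL with the query and location already percent-encoded. To search for other terms, the URL can be assembled with the standard library. The helper name `indeed_search_url` is hypothetical, and the `q` and `l` parameter names are taken from the URL above.

```python
from urllib.parse import urlencode


def indeed_search_url(query, location):
    """Build an Indeed search URL with the q (query) and l (location) parameters."""
    params = urlencode({"q": query, "l": location})
    return f"https://www.indeed.com/jobs?{params}"


url = indeed_search_url("work from home", "San Francisco, CA")
print(url)
# prints: https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA
```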

For other data points, see the file here.

Saving the payload as a separate JSON file

# parse_jobs.py

import requests
import json

# Load the request body (including parsing instructions) from the JSON file.
with open("job_search_payload.json") as f:
    payload = json.load(f)
response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=("username", "password"),  # replace with your API credentials
)
print(response.status_code)
# Save the full API response for later processing.
with open("results.json", "w") as f:
    json.dump(response.json(), f, indent=4)
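The script above expects `job_search_payload.json` to exist. If the payload was first written as a Python dict, one way to produce that file is sketched below. The helper names `save_payload` and `load_payload` are illustrative, and the payload is a shortened copy of the one shown earlier.

```python
import json


def save_payload(payload, path):
    """Write the payload dict as JSON; Python True is serialized as JSON true."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)


def load_payload(path):
    """Read a payload back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Shortened copy of the payload from this tutorial (one field only).
payload = {
    "source": "universal",
    "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
    "parse": True,
    "parsing_instructions": {
        "job_listings": {
            "_fns": [{"_fn": "css", "_args": [".job_seen_beacon"]}],
            "_items": {
                "job_title": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"],
                        }
                    ]
                }
            },
        }
    },
}
```

Calling `save_payload(payload, "job_search_payload.json")` writes the file that parse_jobs.py reads.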

Exporting to JSON and CSV

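# parse_jobs.py (continued)
import pandas as pd

data = response.json()
# Save the full API response as JSON.
with open("results.json", "w") as f:
    json.dump(data, f, indent=4)
# Load the parsed job listings into a DataFrame and export them as CSV.
df = pd.DataFrame(data["results"][0]["content"]["job_listings"])
df.to_csv("job_search_results.csv", index=False)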
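If installing pandas is not an option, the same CSV export can be approximated with the standard library. This is a sketch, not the tutorial's method. It assumes `job_listings` is a list of flat dicts, and the helper name `listings_to_csv` and the sample rows in the usage note are made-up placeholders.

```python
import csv


def listings_to_csv(listings, path):
    """Write a list of job-listing dicts to CSV.

    The header is the union of all keys, in first-seen order;
    missing values are written as empty cells.
    """
    fieldnames = []
    for row in listings:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(listings)
```

For example, `listings_to_csv([{"job_title": "Example Role", "company_name": "Example Co"}], "job_search_results.csv")` writes a header row followed by one data row.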

Final word

Check our documentation for more API parameters and variables found in this tutorial.

If you have any questions, feel free to contact us at support@oxylabs.io.
