# ReadEuraxess

Generates a dataset of job offers published on Euraxess.

## Setting up the ingestion system

The process is composed of two phases:

1. **Daily download of offers**

    This is a simple job that should be scheduled with cron. Every day, run the following command:

    > wget "ANONYMIZED_EURAXESS_JOBPOSTS_URL" --output-document=[your_path]/jobs_`date +%Y-%m-%d_%H:%M:%S`.xml

    **Warning:** Permission to crawl this dataset must be granted by the website owners. Contact them to obtain the URL to use in the command above.
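The daily download could be scheduled with a crontab entry along these lines. This is a sketch: the 03:00 run time is arbitrary, and the URL and path placeholders must be replaced with your own values. Note that `%` has a special meaning in crontab and must be escaped.

```shell
# Illustrative crontab entry: fetch the job posts every day at 03:00.
# Replace ANONYMIZED_EURAXESS_JOBPOSTS_URL and [your_path] with your values.
0 3 * * * wget "ANONYMIZED_EURAXESS_JOBPOSTS_URL" --output-document=[your_path]/jobs_`date +\%Y-\%m-\%d_\%H:\%M:\%S`.xml
```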
2. **Consolidation of the downloaded data**

    Since offers appear repeatedly across the retrieved files, consolidate them by running the following Python script:

    > python main.py -c config_file

    This script processes the downloaded XML files, extracts the necessary information, and consolidates the offers into a final CSV file.

The script keeps track of which downloaded files have already been processed. To regenerate the whole dataset from scratch, run Step 2 with the --resetCSV flag.
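The consolidation step can be sketched as follows. This is a minimal illustration, not the actual `main.py`: it assumes each downloaded file contains `<job>` elements with `<id>`, `<title>`, and `<deadline>` children, whereas the real Euraxess schema and the fields extracted by the script will differ.

```python
# Sketch of the consolidation step: merge all downloaded XML files into one
# CSV, deduplicating offers by id. Tag names are assumptions, not the real
# Euraxess schema.
import csv
import glob
import xml.etree.ElementTree as ET


def consolidate(xml_glob: str, out_csv: str) -> int:
    """Merge every XML file matching xml_glob into out_csv; return offer count."""
    offers = {}  # offer id -> row; later files overwrite earlier duplicates
    for path in sorted(glob.glob(xml_glob)):
        root = ET.parse(path).getroot()
        for job in root.iter("job"):
            offer_id = job.findtext("id")
            if offer_id is None:
                continue
            offers[offer_id] = {
                "id": offer_id,
                "title": job.findtext("title", default=""),
                "deadline": job.findtext("deadline", default=""),
            }
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "title", "deadline"])
        writer.writeheader()
        writer.writerows(offers.values())
    return len(offers)
```

Keying the dictionary by offer id is what makes the merge idempotent: re-running over the same files, or over files that repeat earlier offers, yields the same consolidated CSV.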

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004870. H2020-SC6-GOVERNANCE-2018-2019-2020 / H2020-SC6-GOVERNANCE-2020