Flows to fine-tune transformer models for a variety of downstream tasks using job adverts
This repo contains metaflows that train transformer models for both domain adaptation and a variety of downstream tasks using job adverts from Nesta's Open Jobs Observatory. With the permission of job board sites, we have been collecting online job adverts since 2021 and building algorithms to extract and structure information. We have collected millions of job adverts since the project's inception.
Although we are unable to share the raw data openly, we aim to build open source tools, algorithms and models that anyone can use for their own research and analysis. For example, we have built an open-source Skills Extractor library and have an open locations extraction repo.
This repo contains the metaflows used to fine-tune transformer models with job adverts for a variety of downstream tasks, including:
- next-sentence prediction
- masked language modelling
- skill semantic similarity
- named entity recognition
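To illustrate one of these tasks: masked language modelling fine-tunes a model to recover randomly hidden tokens. Below is a minimal, library-free sketch of how such training examples could be built from a job-advert sentence; the sentence, masking rate, and function name are illustrative, not taken from this repo's flows.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return masked tokens and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)  # unmasked positions are ignored in the loss
    return masked, labels

tokens = "We are hiring a senior data engineer in London".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

Real MLM pipelines (e.g. in the `transformers` library) work on subword tokens and sometimes substitute random tokens instead of `[MASK]`, but the training signal is the same: predict the hidden token from its context.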
The fine-tuned models (and their associated model cards) can be accessed via the Hugging Face Hub:
To run the flows, you will need to:

- Meet the data science cookiecutter requirements, in brief:
  - Install `direnv` and `conda`
  - Run `make install` to configure the development environment:
    - Set up the conda environment
    - Configure `pre-commit`
- Download the spaCy model: `python -m spacy download en_core_web_sm`
- Install PyTorch: `conda install pytorch torchvision -c pytorch` (if you are using macOS 13.4, use `pip install torch` instead)
- Set up batch processing with Metaflow
- Sign in to the Hugging Face Hub so that you can push models to it
- Run `export LC_ALL="en_GB.UTF-8"` in your terminal
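The prerequisite and locale steps above can be sanity-checked with a short Python sketch; the tool list is illustrative, and note that setting `os.environ` only affects the current process and its children, unlike the shell `export` above.

```python
import os
import shutil

# Check that the tools from the setup steps are on PATH (illustrative list)
for tool in ("direnv", "conda"):
    if shutil.which(tool) is None:
        print(f"missing: {tool}")

# Equivalent of `export LC_ALL="en_GB.UTF-8"` for this process only
os.environ["LC_ALL"] = "en_GB.UTF-8"
print(os.environ["LC_ALL"])
```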
However, if you simply want to use the models, please refer to the 💘 Using fine-tuned model checkpoints section.
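As a rough sketch of what using a checkpoint looks like, the `transformers` library can load a model and tokenizer directly from the Hub. The checkpoint id below is a placeholder, not a real model name; see the model cards on the Hub for the actual ids.

```python
# A minimal sketch of loading a fine-tuned checkpoint for inference.
# Assumes the `transformers` library is installed; the checkpoint id
# below is a placeholder, not a real model on the Hub.
try:
    from transformers import AutoModel, AutoTokenizer
except ImportError:  # transformers is an optional dependency here
    AutoModel = AutoTokenizer = None

CHECKPOINT = "nestauk/your-chosen-checkpoint"  # placeholder id

def load_checkpoint(name: str = CHECKPOINT):
    """Return (tokenizer, model) for a Hub checkpoint."""
    if AutoModel is None:
        raise RuntimeError("Install transformers first: pip install transformers")
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
```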
Technical and working style guidelines
Project based on Nesta's data science project template (Read the docs here).