Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP'23)

Licenses: CC-BY-4.0 (LICENSE), MIT (LICENSE-CODE)

yuzhimanhua/SciMult

Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding

License: MIT

This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).

Installation

We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.

git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult

# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../

# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult

# add `src/` and `third_party/DPR/` to Python's package search path
conda develop src/ third_party/DPR/

# download spacy models
python -m spacy download en_core_web_sm
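
After the steps above, a quick sanity check can confirm that the environment looks as expected. The following is a minimal sketch, not part of the repository: the script and the function name `missing_search_paths` are hypothetical, and it only assumes what the installation steps state (Python 3.8, `src/` and `third_party/DPR/` on the search path, and the `en_core_web_sm` spaCy model installed as a package).

```python
import importlib.util
import sys
from pathlib import Path


def missing_search_paths(repo_root, search_path=None):
    """Return the expected directories that are absent from the module search path."""
    search_path = sys.path if search_path is None else search_path
    resolved = {Path(p).resolve() for p in search_path if p}
    expected = [Path(repo_root, "src"), Path(repo_root, "third_party", "DPR")]
    return [str(p) for p in expected if p.resolve() not in resolved]


if __name__ == "__main__":
    # The README states that the code is written in Python 3.8.
    if sys.version_info[:2] != (3, 8):
        print("Note: running Python", sys.version.split()[0], "rather than 3.8")

    # Run this from the repository root so "." points at the right folder.
    for path in missing_search_paths("."):
        print("Not on the search path:", path)

    # spaCy models are installed as importable packages.
    if importlib.util.find_spec("en_core_web_sm") is None:
        print("spaCy model en_core_web_sm is not installed")
```

If the script prints nothing beyond what you expect, the search path and the spaCy model are in place.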

Quick Start

First, download the evaluation datasets and the pre-trained models. After unzipping the dataset file, place the resulting folder (i.e., data/) in the repository root ./. Then place the four model checkpoints (i.e., scimult_vanilla.ckpt, scimult_moe.ckpt, scimult_moe_pmcpatients_par.ckpt, and scimult_moe_pmcpatients_ppr.ckpt) in the model folder ./model/.
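
To confirm that the files landed where the evaluation scripts expect them, a small check like the one below may help. It is a sketch under the layout described above (data/ and ./model/ in the repository root, with the four checkpoint names listed). The function name `missing_artifacts` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Checkpoint file names as listed in the Quick Start instructions.
CHECKPOINTS = [
    "scimult_vanilla.ckpt",
    "scimult_moe.ckpt",
    "scimult_moe_pmcpatients_par.ckpt",
    "scimult_moe_pmcpatients_ppr.ckpt",
]


def missing_artifacts(repo_root="."):
    """Return the dataset folder and checkpoints that are not in place yet."""
    root = Path(repo_root)
    missing = []
    if not (root / "data").is_dir():
        missing.append("data/")
    for name in CHECKPOINTS:
        if not (root / "model" / name).is_file():
            missing.append(f"model/{name}")
    return missing


if __name__ == "__main__":
    # Run from the repository root.
    for item in missing_artifacts("."):
        print("Missing:", item)
```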

Then, you can run the evaluation code for each task:

cd src

# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh

# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh

# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh

# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh

# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh

The metrics will be shown at the end of the terminal output as well as in scores.txt.
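
If you prefer to launch the five scripts above in one go, a small driver can run them in sequence and stop at the first failure. This is a hypothetical convenience sketch, not part of the repository. The function name `run_all_evaluations` is an assumption, and the scripts are invoked through `bash` so that the sketch does not depend on their executable bit.

```python
import subprocess

# Script names as listed in the Quick Start section, in the same order.
EVAL_SCRIPTS = [
    "eval_classification_fine.sh",
    "eval_classification_coarse.sh",
    "eval_link_prediction_retrieval.sh",
    "eval_link_prediction_reranking.sh",
    "eval_search.sh",
]


def run_all_evaluations(src_dir="src", runner=subprocess.run):
    """Run each evaluation script inside src_dir; raise on the first failure."""
    for script in EVAL_SCRIPTS:
        print("Running", script)
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(["bash", script], cwd=src_dir, check=True)


# Usage, from the repository root:
#     run_all_evaluations("src")
```

The `runner` parameter exists only so that the loop can be exercised without actually launching the scripts.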

Getting embeddings of your own data

If you have some documents (e.g., scientific papers) and want to get the embedding of each document using SciMult, we provide the following sample code for your reference:

cd src
python3.8 get_embedding.py
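
Once each document has an embedding, a common next step is to rank candidate documents against a query by vector similarity. The sketch below illustrates this with toy vectors. It assumes that embeddings are available as plain lists of floats and that the dot product is used as the score; neither is a confirmed detail of get_embedding.py. The names `dot` and `rank_by_dot_product` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real document embeddings.
    docs = {
        "paper_a": [0.9, 0.1, 0.0],
        "paper_b": [0.1, 0.8, 0.3],
        "paper_c": [0.5, 0.5, 0.5],
    }
    query = [1.0, 0.0, 0.0]
    for doc_id, score in rank_by_dot_product(query, docs):
        print(doc_id, round(score, 3))
```

For a large collection, a vectorized library or an approximate nearest-neighbor index would be a more practical choice than this pure-Python loop.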

PMC-Patients

NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard from that time can be found here.

To reproduce our reported performance on the "old" PMC-Patients Leaderboard:

cd src
./eval_pmc_patients.sh

The metrics will be shown at the end of the terminal output as well as in scores.txt. The similarity scores that we submitted to the leaderboard can be found at ../output/PMCPatientsPAR_test_out.json and ../output/PMCPatientsPPR_test_out.json.

For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.

SciDocs

To reproduce our performance on the SciDocs benchmark:

cd src
./eval_scidocs.sh

The output embedding files can be found at ../output/cls.jsonl and ../output/user-citation.jsonl. Then, run the adapted SciDocs evaluation code:

cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd SciDocs

# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

# run evaluation
python eval.py

The metrics will be shown at the end of the terminal output.
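
Before running the evaluation, it can be useful to sanity-check the embedding files mentioned above (cls.jsonl and user-citation.jsonl). The sketch below counts the records and collects the distinct embedding lengths. It assumes that each line is a JSON object with an "embedding" list, which is the format the original SciDocs evaluation expects; whether the files produced here match it exactly is an assumption. The function name `summarize_embeddings` is hypothetical.

```python
import json


def summarize_embeddings(path):
    """Count records and collect the distinct embedding lengths in a JSONL file."""
    count, dims = 0, set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumed field name: "embedding" (a list of floats).
            dims.add(len(record["embedding"]))
            count += 1
    return count, dims


# Usage, from the repository root:
#     print(summarize_embeddings("output/cls.jsonl"))
```

A single value in the returned set means every record has the same dimensionality; more than one value suggests a malformed file.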

Datasets

The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version you acknowledge that you have read and agreed to all the terms in this license.

Similar to TensorFlow Datasets or Hugging Face's datasets library, we only downloaded and prepared public datasets. We distribute these datasets in a specific format, but we do not vouch for their quality or fairness, nor do we claim that you have a license to use them. It remains your responsibility to determine whether you have permission to use each dataset under its license and to cite its rightful owner.

More details about each constituent dataset are as follows.

| Dataset | Folder | #Queries | #Candidates | Source | License |
| --- | --- | --- | --- | --- | --- |
| MAPLE (CS-Conference) | classification_fine/ | 261,781 | 15,808 | Link | ODC-By v1.0 |
| MAPLE (Chemistry-MeSH) | classification_fine/ | 762,129 | 30,194 | Link | ODC-By v1.0 |
| MAPLE (Geography) | classification_fine/ | 73,883 | 3,285 | Link | ODC-By v1.0 |
| MAPLE (Psychology) | classification_fine/ | 372,954 | 7,641 | Link | ODC-By v1.0 |
| SciDocs (MAG Fields) | classification_coarse/ | 25,001 | 19 | Link | CC BY 4.0 |
| SciDocs (MeSH Diseases) | classification_coarse/ | 23,473 | 11 | Link | CC BY 4.0 |
| SciDocs (Cite) | link_prediction_retrieval/ | 92,214 | 142,009 | Link | CC BY 4.0 |
| SciDocs (Co-cite) | link_prediction_retrieval/ | 54,543 | 142,009 | Link | CC BY 4.0 |
| PMC-Patients (PPR, Zero-shot) | link_prediction_retrieval/ | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PAR, Supervised) | pmc_patients/ | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PPR, Supervised) | pmc_patients/ | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
| SciDocs (Co-view) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
| SciDocs (Co-read) | scidocs/ | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
| SciDocs (Cite) | scidocs/ | 1,000 | reranking, 29.93 for each query on average | Link | CC BY 4.0 |
| SciDocs (Co-cite) | scidocs/ | 1,000 | reranking, 29.95 for each query on average | Link | CC BY 4.0 |
| Recommendation | link_prediction_reranking/ | 137 | reranking, 16.28 for each query on average | Link | N/A |
| SciRepEval-Search | search/ | 2,637 | reranking, 10.00 for each query on average | Link | ODC-By v1.0 |
| TREC-COVID in SciRepEval | search/ | 50 | reranking, 1386.36 for each query on average | Link | ODC-By v1.0 |
| TREC-COVID in BEIR | search/ | 50 | 171,332 | Link | Apache License 2.0 |
| SciFact | search/ | 1,109 | 5,183 | Link | Apache License 2.0, CC BY-NC 2.0 |
| NFCorpus | search/ | 3,237 | 3,633 | Link | Apache License 2.0 |

Models

Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.

Citation

If you find SciMult useful in your research, please cite the following paper:

@inproceedings{zhang2023pre,
  title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
  author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
  booktitle={Findings of EMNLP'23},
  pages={12259--12275},
  year={2023}
}
