This repository contains the dataset and code for the paper C-STS: Conditional Semantic Textual Similarity. [ArXiv]
To avoid the intentional/unintentional scraping of the C-STS dataset for pre-training LLMs, which could cause training data contamination and impact their evaluation, we adopt the following approach for our dataset release.
The dataset for C-STS is stored in an encrypted file named csts.tar.enc. To access the dataset, follow these steps:
- Request Access: Submit a request to obtain the decryption password by clicking here. You will receive an email response with the password immediately.
- Decrypt the Dataset: Once you have received the password via email, you can decrypt the csts.tar.enc file using the provided extract.sh script. Follow the instructions below:
  - Open a terminal and navigate to the data directory.
  - Run the following command, replacing <password> with the decryption password obtained via email:
    bash extract.sh csts.tar.enc <password>
  Given the correct password, this step generates three files, csts_train.csv, csts_validation.csv, and csts_test.csv, which are the unencrypted dataset splits.
- You can load the data using the datasets library with the following lines:
from datasets import load_dataset
dataset = load_dataset(
'csv',
data_files=
{
'train': 'data/csts_train.csv',
'validation': 'data/csts_validation.csv',
'test': 'data/csts_test.csv'
}
)
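Before loading, it can help to confirm that decryption actually produced all three splits. The following is a minimal sketch, not part of the repository: the helper name find_split_files is hypothetical, and it only assumes the file names listed in the extraction step above.

```python
from pathlib import Path

# File names taken from the extraction step above.
EXPECTED_SPLITS = {
    "train": "csts_train.csv",
    "validation": "csts_validation.csv",
    "test": "csts_test.csv",
}


def find_split_files(data_dir="data"):
    """Return a {split: path} mapping, raising if any decrypted file is missing.

    Hypothetical helper for illustration; it is not provided by this repository.
    """
    data_dir = Path(data_dir)
    paths = {split: data_dir / name for split, name in EXPECTED_SPLITS.items()}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            "Missing decrypted split(s): "
            + ", ".join(missing)
            + ". Run extract.sh in the data directory first."
        )
    return {split: str(p) for split, p in paths.items()}
```

The returned mapping has the same shape as the data_files argument in the snippet above, so it can be passed to load_dataset in its place.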
Important: By using this dataset, you agree to not publicly share its unencrypted contents or decryption password.
We provide the basic training scripts and utilities for finetuning and evaluating the models in the paper. The code is adapted from the HuggingFace Transformers library. Refer to the documentation for more details.
The current code supports finetuning any encoder-only model, using the cross_encoder, bi_encoder, or tri_encoder settings described in the paper.
You can finetune the models described in the paper using the run_sts.sh script. For example, to finetune the princeton-nlp/sup-simcse-roberta-base model on the C-STS dataset, run the following command:
MODEL=princeton-nlp/sup-simcse-roberta-base \
ENCODER_TYPE=bi_encoder \
LR=1e-5 \
WD=0.1 \
TRANSFORM=False \
OBJECTIVE=mse \
OUTPUT_DIR=output \
TRAIN_FILE=data/csts_train.csv \
EVAL_FILE=data/csts_validation.csv \
TEST_FILE=data/csts_test.csv \
bash run_sts.sh
See run_sts.sh for a full description of the available options and default values.
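To repeat the run above with several settings, the environment variables can be set from Python instead of the shell. This is a rough sketch under stated assumptions: build_env and run_finetuning are hypothetical helpers, the second learning rate and the per-run output directories are illustrative values only, and it assumes run_sts.sh reads exactly the variables shown in the example command.

```python
import os
import subprocess


def build_env(lr, output_dir, encoder_type="bi_encoder"):
    """Return environment overrides mirroring the example command above."""
    return {
        "MODEL": "princeton-nlp/sup-simcse-roberta-base",
        "ENCODER_TYPE": encoder_type,
        "LR": str(lr),
        "WD": "0.1",
        "TRANSFORM": "False",
        "OBJECTIVE": "mse",
        "OUTPUT_DIR": output_dir,
        "TRAIN_FILE": "data/csts_train.csv",
        "EVAL_FILE": "data/csts_validation.csv",
        "TEST_FILE": "data/csts_test.csv",
    }


def run_finetuning(overrides, dry_run=True):
    """Launch run_sts.sh with the overrides; with dry_run, only return the command."""
    command = ["bash", "run_sts.sh"]
    if not dry_run:
        # Merge the overrides into the current environment so PATH etc. survive.
        subprocess.run(command, env={**os.environ, **overrides}, check=True)
    return command


if __name__ == "__main__":
    for lr in ("1e-5", "2e-5"):  # hypothetical sweep values
        overrides = build_env(lr, output_dir=f"output/lr_{lr}")
        print(overrides["LR"], overrides["OUTPUT_DIR"], run_finetuning(overrides))
```

With dry_run left at its default, nothing is launched, which makes it easy to inspect the settings before committing GPU time.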
The script run_sts_fewshot.py can be used to evaluate large language models in a few-shot setting, with or without instructions. For example, to evaluate the google/flan-t5-xxl model on the C-STS dataset, run the following command:
python run_sts_fewshot.py \
--model_name_or_path google/flan-t5-xxl \
--k_shot 2 \
--prompt_name long \
--train_file data/csts_train.csv \
--validation_file data/csts_validation.csv \
--test_file data/csts_test.csv \
--output_dir output/flan-t5-xxl/k2_long \
--dtype tf32 \
--batch_size 4
To accommodate large models, run_sts_fewshot.py uses all visible GPUs to load the model with model parallelism. For smaller models, set CUDA_VISIBLE_DEVICES to the desired GPU IDs.
Run python run_sts_fewshot.py --help for a full description of additional options and default values.
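Restricting a smaller model to specific GPUs can be sketched as follows. This is an illustration rather than repository code: fewshot_command and gpu_env are hypothetical helpers, the flags are copied from the example command above, and google/flan-t5-small and its output directory are stand-in choices for a "smaller model".

```python
import os
import subprocess


def fewshot_command(model, k_shot, prompt_name, output_dir, batch_size=4):
    """Build the argument list for run_sts_fewshot.py using the flags shown above."""
    return [
        "python", "run_sts_fewshot.py",
        "--model_name_or_path", model,
        "--k_shot", str(k_shot),
        "--prompt_name", prompt_name,
        "--train_file", "data/csts_train.csv",
        "--validation_file", "data/csts_validation.csv",
        "--test_file", "data/csts_test.csv",
        "--output_dir", output_dir,
        "--dtype", "tf32",
        "--batch_size", str(batch_size),
    ]


def gpu_env(gpu_ids):
    """Copy the current environment and expose only the given GPU ids."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    # Stand-in model and output path, chosen for illustration only.
    command = fewshot_command(
        "google/flan-t5-small", 2, "long", "output/flan-t5-small/k2_long"
    )
    print(" ".join(command))
    # Uncomment to launch on GPU 0 only:
    # subprocess.run(command, env=gpu_env([0]), check=True)
```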
You can obtain scores for your model on the test set by submitting your predictions using the make_test_submission.py script as follows:
python make_test_submission.py your_email@email.com /path/to/your/predictions.json
This script expects the test predictions file to be in the format generated automatically by the scripts above; i.e.
{
"0": 1.0,
"1": 0.0,
...
"4731": 0.5
}
After submission, your results will be emailed to the address you provided, with the relevant filename in the subject line.
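If your predictions come from your own pipeline rather than the scripts above, a file in this format can be produced along the following lines. This is a minimal sketch, not repository code: save_predictions and check_predictions are hypothetical helpers, and it assumes the keys are the zero-based row indices of the test CSV in file order, which the example above suggests but does not state.

```python
import json


def save_predictions(scores, path):
    """Write scores as {"0": s0, "1": s1, ...}.

    Assumes keys are zero-based test-set row indices (an assumption,
    inferred from the example format above).
    """
    payload = {str(i): float(s) for i, s in enumerate(scores)}
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload


def check_predictions(path, expected_count=None):
    """Sanity-check a predictions file and return the number of entries."""
    with open(path) as f:
        payload = json.load(f)
    keys = sorted(int(k) for k in payload)
    if keys != list(range(len(keys))):
        raise ValueError("Keys must be consecutive integer strings starting at 0.")
    if not all(isinstance(v, (int, float)) for v in payload.values()):
        raise ValueError("Every prediction must be a number.")
    if expected_count is not None and len(keys) != expected_count:
        raise ValueError(f"Expected {expected_count} predictions, found {len(keys)}.")
    return len(keys)
```

Running check_predictions on the file before calling make_test_submission.py catches missing or non-numeric entries locally, before anything is sent.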
@misc{deshpande2023csts,
title={C-STS: Conditional Semantic Textual Similarity},
author={Ameet Deshpande and Carlos E. Jimenez and Howard Chen and Vishvak Murahari and Victoria Graf and Tanmay Rajpurohit and Ashwin Kalyan and Danqi Chen and Karthik Narasimhan},
year={2023},
eprint={2305.15093},
archivePrefix={arXiv},
primaryClass={cs.CL}
}