Inquiry About the Data Source for the Model Training Set #2
Comments
Hi @AnonymXXXXX, we did not use the simcse_nli data. Here is the process script:

```python
import os
import csv
import gzip
import json
import random
from tqdm import tqdm
from sentence_transformers import util
from datasets import load_dataset  # not used below

save_path = 'all_nli.B.jsonl'
nli_dataset_path = "data/AllNLI.tsv.gz"

# Download AllNLI (SNLI + MultiNLI combined) from sbert.net if not cached locally
if not os.path.exists(nli_dataset_path):
    util.http_get("https://sbert.net/datasets/AllNLI.tsv.gz", nli_dataset_path)

def add_to_samples(sent1, sent2, label):
    # Group hypotheses by premise and NLI label
    if sent1 not in train_data:
        train_data[sent1] = {"contradiction": set(), "entailment": set(), "neutral": set()}
    train_data[sent1][label].add(sent2)

train_data = {}
with gzip.open(nli_dataset_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["split"] == "train":
            sent1 = row["sentence1"].strip()
            sent2 = row["sentence2"].strip()
            add_to_samples(sent1, sent2, row["label"])
            # add_to_samples(sent2, sent1, row["label"])  # Also add the opposite direction

# Build (text, positive, negative) triplets: pair every entailment with
# every contradiction that shares the same premise
data = []
for sent1, others in tqdm(train_data.items()):
    negs = list(others['contradiction'])
    if not negs:
        continue
    poss = list(others['entailment'])
    if not poss:
        continue
    for pos in poss:
        for neg in negs:
            data.append({'text': sent1, 'positive': pos, 'negative': neg})

print('size:', len(data))
random.shuffle(data)

# Write one JSON object per line
with open(save_path, 'w', encoding='utf8') as writer:
    for obj in data:
        writer.write(json.dumps(obj, ensure_ascii=False) + '\n')
```

I know SimCSE also uses MultiNLI + SNLI, but I am not sure how …
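As a quick sanity check on the generated file (a minimal sketch added for illustration, not part of the original comment; it assumes the `all_nli.B.jsonl` path used above):

```python
import json

# Read the generated triplet file back and inspect its shape.
with open('all_nli.B.jsonl', encoding='utf8') as f:
    triplets = [json.loads(line) for line in f]

print(len(triplets))        # should match the 'size:' value printed by the script
print(sorted(triplets[0]))  # ['negative', 'positive', 'text']
```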
Sorry for that; we forgot to make the additional data public. Will do it later. We collect the additional data as shown in the process script above.
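For intuition about why this yields more rows than the raw NLI pair count (an illustrative note, not from the thread): the nested loop in the script pairs every entailment with every contradiction of the same premise, so a premise with m entailments and n contradictions contributes m × n triplets. A toy example with made-up sentences:

```python
# Toy demonstration of the cross-product pairing (hypothetical sentences).
entailments = ['A man is outside.', 'A person is outdoors.']
contradictions = ['A man is indoors.', 'Nobody is outside.', 'It is night.']

triplets = [
    {'text': 'A man walks in the park.', 'positive': pos, 'negative': neg}
    for pos in entailments
    for neg in contradictions
]
print(len(triplets))  # 2 x 3 = 6 triplets from a single premise
```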
Thanks for the details.
Hello,

I have been exploring your code and datasets as described in the README and downloaded the dataset used for training BeLLM from https://huggingface.co/datasets/SeanLee97/all_nli_angle_format_b/tree/main, named `SeanLee97/all_nli_angle_format_b`.

After analyzing the data, I noticed that this dataset contains 480,862 rows and 3 columns. In comparison, previous works like PromptEOL utilized the `simcse_nli` dataset (comprising SNLI and MNLI), which totals 275,602 rows x 3 columns.

I am curious about the composition of the `all_nli_angle_format_b` dataset and am wondering why there is such a significant increase in the amount of data. Could you please share some insights into how this dataset was compiled and what makes up the additional data?

Additionally, have you or your team tested the performance of PromptEOL with the larger dataset of 480,862 rows x 3 columns? I am interested in understanding how the increase in dataset size might influence the model's performance.
Thank you for your time and assistance. I look forward to your response!