Inquiry About the Data Source for the Model Training Set #2

Open · AnonymXXXXX opened this issue Apr 17, 2024 · 3 comments

Comments

@AnonymXXXXX

Hello,

I have been exploring the code and datasets described in the README, and I downloaded the dataset used to train BeLLM, SeanLee97/all_nli_angle_format_b, from https://huggingface.co/datasets/SeanLee97/all_nli_angle_format_b/tree/main.

After analyzing the data, I noticed that this dataset contains 480,862 rows and 3 columns. In comparison, previous works such as PromptEOL used the simcse_nli dataset (comprising SNLI and MNLI), which totals 275,602 rows x 3 columns.
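For reference, the row and column counts can be verified directly with the Hugging Face datasets library; this minimal check assumes the Hub's default split layout:

from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and print its splits,
# column names, and row counts.
ds = load_dataset("SeanLee97/all_nli_angle_format_b")
print(ds)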

I am curious about the composition of the all_nli_angle_format_b dataset and wonder why there is such a significant increase in the amount of data. Could you please share some insight into how this dataset was compiled and what makes up the additional data?

Additionally, have you or your team tested the performance of PromptEOL with the larger dataset of 480,862 rows x 3 columns? I am interested in understanding how the increase in dataset size might influence the model's performance.

Thank you for your time and assistance. I look forward to your response!

@SeanLee97 (Contributor) commented Apr 17, 2024

Hi @AnonymXXXXX, we did not use simcse_nli. We directly transformed AllNLI (including MultiNLI and SNLI) into triples. The AllNLI dataset is provided by sentence-transformers.

Here is the processing script:

import os
import csv
import gzip
import json
import random

from tqdm import tqdm
from sentence_transformers import util

save_path = 'all_nli.B.jsonl'
nli_dataset_path = "data/AllNLI.tsv.gz"

# Download AllNLI (SNLI + MultiNLI), provided by sentence-transformers,
# if it is not already cached locally.
if not os.path.exists(nli_dataset_path):
    util.http_get("https://sbert.net/datasets/AllNLI.tsv.gz", nli_dataset_path)


def add_to_samples(sent1, sent2, label):
    # Group each sentence2 under its sentence1, keyed by NLI label.
    if sent1 not in train_data:
        train_data[sent1] = {"contradiction": set(), "entailment": set(), "neutral": set()}
    train_data[sent1][label].add(sent2)


train_data = {}
with gzip.open(nli_dataset_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["split"] == "train":
            sent1 = row["sentence1"].strip()
            sent2 = row["sentence2"].strip()

            add_to_samples(sent1, sent2, row["label"])
            # add_to_samples(sent2, sent1, row["label"])  # Also add the opposite direction

# Build (text, positive, negative) triples: pair every entailment
# with every contradiction of the same premise.
data = []
for sent1, others in tqdm(train_data.items()):
    negs = list(others['contradiction'])
    if not negs:
        continue
    poss = list(others['entailment'])
    if not poss:
        continue
    for pos in poss:
        for neg in negs:
            data.append({'text': sent1, 'positive': pos, 'negative': neg})

print('size:', len(data))
random.shuffle(data)

# Write one JSON object per line.
with open(save_path, 'w') as writer:
    for obj in data:
        writer.write(json.dumps(obj, ensure_ascii=False) + '\n')
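Each line of the resulting all_nli.B.jsonl is one JSON triple; a quick sanity check (assuming the script above has been run):

import json

# Peek at the first triple in the generated file.
with open('all_nli.B.jsonl') as f:
    sample = json.loads(f.readline())
print(sample.keys())  # dict_keys(['text', 'positive', 'negative'])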

I know SimCSE also uses MultiNLI + SNLI, but I am not sure how simcse_nli was collected. We'd like to test it, but it might take some time since our computing resources are limited right now; it requires an A100 (80GB) to support large batch sizes.

@SeanLee97 (Contributor)

Sorry about that; we forgot to make the additional data public. We will release it later.

We collect the additional data as follows (a minimal sketch of steps 2-4 follows the list):

  1. collect all texts from MultiNLI + SNLI
  2. index all texts into Elasticsearch (ES)
  3. retrieve the top-30 most similar texts for each text using ES
  4. take those top-30 texts as hard negatives
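A minimal sketch of steps 2-4 using the official Elasticsearch Python client; the index name, field name, connection details, and the tiny texts list are illustrative assumptions, not the exact script we ran:

from elasticsearch import Elasticsearch

# Connection details and index/field names are assumptions for illustration.
es = Elasticsearch("http://localhost:9200")
index_name = "nli_texts"

# In practice this would be every text from MultiNLI + SNLI.
texts = ["A man is playing guitar.", "Someone plays an instrument."]

# Step 2: index all texts into ES.
for i, text in enumerate(texts):
    es.index(index=index_name, id=i, document={"text": text})
es.indices.refresh(index=index_name)

# Steps 3-4: retrieve the top-30 most similar texts for each text (BM25 match)
# and keep them as hard negatives, excluding the query text itself.
hard_negatives = {}
for text in texts:
    resp = es.search(index=index_name, query={"match": {"text": text}}, size=31)
    hits = [h["_source"]["text"] for h in resp["hits"]["hits"]]
    hard_negatives[text] = [t for t in hits if t != text][:30]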

@AnonymXXXXX (Author)

Thanks for the details.
In addition, should the version number of angle-emb in requirements.txt be 0.3.10 instead of 3.1.0?
