[Gen AI] allenai Model usage. #13

Open · 2 tasks
head-iie-vnr opened this issue Jun 30, 2024 · 5 comments

Comments

@head-iie-vnr (Contributor)

  • Create custom training data.
  • Train on the custom training data.
@head-iie-vnr (Contributor Author)

When I used a batch_size of 2, it crashed due to memory.
Even after bringing the initial memory state down to 13.5 GB free (1.6 GB used), it still crashed on hitting the 16 GB upper limit.

[Screenshot from 2024-06-30 06-56-20]

When I reduced the batch_size to 1, memory usage was manageable.
[Screenshot from 2024-06-30 07-04-36]

Special observation: each training iteration took about 20 seconds, and the same interval can be seen in the heartbeat pattern of the graph.
The lowest point marks the start of a new step (iteration).
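
As a side note, one way to watch this in real time is a small TrainerCallback that logs free system RAM after each step. This is only a minimal sketch, assuming psutil is installed and that system RAM (not GPU memory) is the limit being hit:

    import psutil
    from transformers import TrainerCallback

    class MemoryLoggingCallback(TrainerCallback):
        """Log available system RAM at the end of every training step."""

        def on_step_end(self, args, state, control, **kwargs):
            mem = psutil.virtual_memory()
            print(f"step {state.global_step}: "
                  f"{mem.available / 1024**3:.1f} GB free of {mem.total / 1024**3:.1f} GB")

    # Hypothetical usage: pass the callback to the Trainer built later in this thread.
    # trainer = Trainer(model=model, args=training_args, ..., callbacks=[MemoryLoggingCallback()])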

@head-iie-vnr (Contributor Author)

The training data contains 40 question-answer pairs.

The original context text contains about 900 words across 50 sentences.
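
For reference, here is a minimal sketch of the assumed shape of custom_qa_dataset.json: SQuAD-style nesting of data → paragraphs → context / qas / answers, which is what the conversion code in the next comment expects. The real file holds the 40 question-answer pairs over the ~900-word context.

    import json

    # Hypothetical miniature of custom_qa_dataset.json; field names follow the
    # SQuAD-style structure read by convert_to_squad_format in the next comment.
    example_dataset = {
        "data": [
            {
                "paragraphs": [
                    {
                        "context": "Mahatma Gandhi was born on 2 October 1869 in Porbandar.",
                        "qas": [
                            {
                                "question": "When was Gandhi born?",
                                "answers": [
                                    {"text": "2 October 1869", "answer_start": 27}
                                ],
                            }
                        ],
                    }
                ]
            }
        ]
    }

    with open("custom_qa_dataset.json", "w") as f:
        json.dump(example_dataset, f, indent=2)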

@head-iie-vnr (Contributor Author)

Code: Step-by-Step Explanation

  1. Importing Required Libraries:

    import json
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
    from datasets import DatasetDict, Dataset
    • json: For loading the custom dataset from a JSON file.
    • transformers: For using the Hugging Face library to handle tokenization, model loading, and training.
    • datasets: For managing the dataset in a format compatible with Hugging Face's transformers.
  2. Loading Custom Dataset:

    def load_custom_dataset(file_path):
        with open(file_path, 'r') as f:
            dataset_dict = json.load(f)
        return dataset_dict
    
    # Assuming your dataset file path is 'custom_qa_dataset.json'
    custom_dataset = load_custom_dataset('custom_qa_dataset.json')
    • This function reads a JSON file containing the custom QA dataset and loads it into a Python dictionary.
  3. Converting to SQuAD Format:

    def convert_to_squad_format(custom_dataset):
        contexts = []
        questions = []
        answers = []
        
        for data in custom_dataset["data"]:
            for paragraph in data["paragraphs"]:
                context = paragraph["context"]
                for qa in paragraph["qas"]:
                    question = qa["question"]
                    for answer in qa["answers"]:
                        contexts.append(context)
                        questions.append(question)
                        answers.append({
                            "text": answer["text"],
                            "answer_start": answer["answer_start"]
                        })
        
        return {
            "context": contexts,
            "question": questions,
            "answers": answers
        }
    
    squad_format_dataset = convert_to_squad_format(custom_dataset)
    • This function converts the custom dataset into a format similar to the SQuAD dataset format.
    • It extracts contexts, questions, and answers into separate lists and returns a dictionary with these lists.
  4. Creating Hugging Face Dataset:

    dataset = DatasetDict({"train": Dataset.from_dict(squad_format_dataset)})
    • This creates a Hugging Face Dataset from the structured data and puts it into a DatasetDict under the "train" key.
  5. Loading the Tokenizer and Model:

    model_name = "allenai/longformer-base-4096"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    • Loads the tokenizer and model for the Longformer, which is capable of handling long contexts up to 4096 tokens.
  6. Tokenizing the Dataset:

    def preprocess_function(examples):
        inputs = tokenizer(
            examples["question"],
            examples["context"],
            max_length=4096,
            truncation=True,
            padding="max_length",
            return_offsets_mapping=True,
        )
        offset_mapping = inputs.pop("offset_mapping")
        start_positions = []
        end_positions = []
    
        for i, answer in enumerate(examples["answers"]):
            start_char = answer["answer_start"]
            end_char = start_char + len(answer["text"])
            sequence_ids = inputs.sequence_ids(i)
    
            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1
    
            # If the answer is out of the context, label it (0, 0)
            if not (offset_mapping[i][context_start][0] <= start_char and offset_mapping[i][context_end][1] >= end_char):
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Bounds checks keep the scans inside the context window, so an
                # answer at the very edge of the context cannot push the index
                # past the tokenized sequence.
                start_idx = context_start
                while start_idx <= context_end and offset_mapping[i][start_idx][0] <= start_char:
                    start_idx += 1
                start_positions.append(start_idx - 1)
    
                end_idx = context_end
                while end_idx >= context_start and offset_mapping[i][end_idx][1] >= end_char:
                    end_idx -= 1
                end_positions.append(end_idx + 1)
    
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs
    
    tokenized_datasets = dataset.map(preprocess_function, batched=True, batch_size=2)
    • The preprocess_function tokenizes the questions and contexts.
    • It calculates the start and end positions of the answers in the tokenized context.
    • dataset.map(preprocess_function, batched=True, batch_size=2) applies the preprocessing function to the dataset in batches of size 2.
  7. Setting Up Training Arguments:

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=1,  # Reduced batch size
        per_device_eval_batch_size=1,  # Reduced batch size
        num_train_epochs=3,
        weight_decay=0.01,
        gradient_accumulation_steps=8,  # Use gradient accumulation
        fp16=True,  # Enable mixed precision training
    )
    • Configures the training parameters, such as output directory, learning rate, batch size, number of epochs, gradient accumulation, and mixed precision training.
  8. Initializing the Trainer:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["train"],
    )
    • Initializes the Trainer with the model, training arguments, and the tokenized dataset; note that the same "train" split is reused here as the evaluation set.
  9. Training the Model:

    trainer.train()
    • Trains the model using the specified training arguments and dataset.
  10. Saving the Fine-Tuned Model:

    model.save_pretrained("./fine-tuned-longformer")
    tokenizer.save_pretrained("./fine-tuned-longformer")
    • Saves the fine-tuned model and tokenizer to the specified directory.

Summary

The code loads a custom QA dataset, converts it to a format compatible with Hugging Face's transformers library, tokenizes the data, sets up training parameters, trains the Longformer model on the dataset, and finally saves the fine-tuned model and tokenizer.
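
The issue does not show the exact inference code behind the results in the next comment, but a minimal sketch of querying the saved model with the Hugging Face question-answering pipeline (assuming the ./fine-tuned-longformer directory from step 10 and a hypothetical context.txt holding the context passage) would look like this:

    from transformers import pipeline

    # Load the fine-tuned model and tokenizer saved in step 10.
    qa_pipeline = pipeline(
        "question-answering",
        model="./fine-tuned-longformer",
        tokenizer="./fine-tuned-longformer",
    )

    # Hypothetical file containing the ~900-word context passage.
    with open("context.txt") as f:
        context = f.read()

    result = qa_pipeline(question="When was Gandhi born?", context=context)
    print(result["answer"], result["score"])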

@head-iie-vnr (Contributor Author)

The results are not satisfactory.

Question: When was Gandhi born?
Answer: was born
Score: 0.0077830287627875805

Question: Where was Gandhi born?
Answer: was born

@head-iie-vnr (Contributor Author)

Trying a different model in Issue #14.
