[Gen AI] allenai Model usage. #13

Open · 2 tasks
head-iie-vnr opened this issue Jun 30, 2024 · 5 comments

Comments

@head-iie-vnr (Contributor)

  • Create custom training data.
  • Train on the custom training data.
@head-iie-vnr (Contributor Author)

When I used a batch_size of 2, it crashed due to memory.
Even after bringing the initial memory state down to 13.5 GB free (1.6 GB used), it still crashed on hitting the 16 GB upper limit.

[Screenshot from 2024-06-30 06-56-20]

When I reduced the batch_size to 1, memory usage was manageable.
[Screenshot from 2024-06-30 07-04-36]

Special observation: each training iteration took about 20 seconds, and the same interval can be seen in the heartbeat pattern of the graph.
The lowest point marks the start of a new step (iteration).
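
As a side note, one way to watch this in real time is a small TrainerCallback that logs free system RAM after each step. This is only a minimal sketch, assuming psutil is installed and that system RAM (not GPU memory) is the limit being hit:

    import psutil
    from transformers import TrainerCallback

    class MemoryLoggingCallback(TrainerCallback):
        """Log available system RAM at the end of every training step."""

        def on_step_end(self, args, state, control, **kwargs):
            mem = psutil.virtual_memory()
            print(f"step {state.global_step}: "
                  f"{mem.available / 1024**3:.1f} GB free of {mem.total / 1024**3:.1f} GB")

    # Hypothetical usage: pass the callback to the Trainer built later in this thread.
    # trainer = Trainer(model=model, args=training_args, ..., callbacks=[MemoryLoggingCallback()])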

@head-iie-vnr (Contributor Author)

The training data contains 40 question-answer pairs.

The original context text contains about 900 words across 50 sentences.
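
For reference, here is a minimal sketch of the assumed shape of custom_qa_dataset.json: SQuAD-style nesting of data → paragraphs → context / qas / answers, which is what the conversion code in the next comment expects. The real file holds the 40 question-answer pairs over the ~900-word context.

    import json

    # Hypothetical miniature of custom_qa_dataset.json; field names follow the
    # SQuAD-style structure read by convert_to_squad_format in the next comment.
    example_dataset = {
        "data": [
            {
                "paragraphs": [
                    {
                        "context": "Mahatma Gandhi was born on 2 October 1869 in Porbandar.",
                        "qas": [
                            {
                                "question": "When was Gandhi born?",
                                "answers": [
                                    {"text": "2 October 1869", "answer_start": 27}
                                ],
                            }
                        ],
                    }
                ]
            }
        ]
    }

    with open("custom_qa_dataset.json", "w") as f:
        json.dump(example_dataset, f, indent=2)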

@head-iie-vnr (Contributor Author)

Code: Step-by-Step Explanation

  1. Importing Required Libraries:

    import json
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
    from datasets import DatasetDict, Dataset
    • json: For loading the custom dataset from a JSON file.
    • transformers: For using the Hugging Face library to handle tokenization, model loading, and training.
    • datasets: For managing the dataset in a format compatible with Hugging Face's transformers.
  2. Loading Custom Dataset:

    def load_custom_dataset(file_path):
        with open(file_path, 'r') as f:
            dataset_dict = json.load(f)
        return dataset_dict
    
    # Assuming your dataset file path is 'custom_qa_dataset.json'
    custom_dataset = load_custom_dataset('custom_qa_dataset.json')
    • This function reads a JSON file containing the custom QA dataset and loads it into a Python dictionary.
  3. Converting to SQuAD Format:

    def convert_to_squad_format(custom_dataset):
        contexts = []
        questions = []
        answers = []
        
        for data in custom_dataset["data"]:
            for paragraph in data["paragraphs"]:
                context = paragraph["context"]
                for qa in paragraph["qas"]:
                    question = qa["question"]
                    for answer in qa["answers"]:
                        contexts.append(context)
                        questions.append(question)
                        answers.append({
                            "text": answer["text"],
                            "answer_start": answer["answer_start"]
                        })
        
        return {
            "context": contexts,
            "question": questions,
            "answers": answers
        }
    
    squad_format_dataset = convert_to_squad_format(custom_dataset)
    • This function converts the custom dataset into a format similar to the SQuAD dataset format.
    • It extracts contexts, questions, and answers into separate lists and returns a dictionary with these lists.
  4. Creating Hugging Face Dataset:

    dataset = DatasetDict({"train": Dataset.from_dict(squad_format_dataset)})
    • This creates a Hugging Face Dataset from the structured data and puts it into a DatasetDict under the "train" key.
  5. Loading the Tokenizer and Model:

    model_name = "allenai/longformer-base-4096"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    • Loads the tokenizer and model for the Longformer, which is capable of handling long contexts up to 4096 tokens.
  6. Tokenizing the Dataset:

    def preprocess_function(examples):
        inputs = tokenizer(
            examples["question"],
            examples["context"],
            max_length=4096,
            truncation=True,
            padding="max_length",
            return_offsets_mapping=True,
        )
        offset_mapping = inputs.pop("offset_mapping")
        start_positions = []
        end_positions = []
    
        for i, answer in enumerate(examples["answers"]):
            start_char = answer["answer_start"]
            end_char = start_char + len(answer["text"])
            sequence_ids = inputs.sequence_ids(i)
    
            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1
    
            # If the answer is out of the context, label it (0, 0)
            if not (offset_mapping[i][context_start][0] <= start_char and offset_mapping[i][context_end][1] >= end_char):
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Bounds checks keep the scans inside the context window, so an
                # answer at the very edge of the context cannot push the index
                # past the tokenized sequence.
                start_idx = context_start
                while start_idx <= context_end and offset_mapping[i][start_idx][0] <= start_char:
                    start_idx += 1
                start_positions.append(start_idx - 1)
    
                end_idx = context_end
                while end_idx >= context_start and offset_mapping[i][end_idx][1] >= end_char:
                    end_idx -= 1
                end_positions.append(end_idx + 1)
    
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs
    
    tokenized_datasets = dataset.map(preprocess_function, batched=True, batch_size=2)
    • The preprocess_function tokenizes the questions and contexts.
    • It calculates the start and end positions of the answers in the tokenized context.
    • dataset.map(preprocess_function, batched=True, batch_size=2) applies the preprocessing function to the dataset in batches of size 2.
  7. Setting Up Training Arguments:

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=1,  # Reduced batch size
        per_device_eval_batch_size=1,  # Reduced batch size
        num_train_epochs=3,
        weight_decay=0.01,
        gradient_accumulation_steps=8,  # Use gradient accumulation
        fp16=True,  # Enable mixed precision training
    )
    • Configures the training parameters, such as output directory, learning rate, batch size, number of epochs, gradient accumulation, and mixed precision training.
  8. Initializing the Trainer:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["train"],
    )
    • Initializes the Trainer with the model, training arguments, and the tokenized dataset; note that the same "train" split is reused here as the evaluation set.
  9. Training the Model:

    trainer.train()
    • Trains the model using the specified training arguments and dataset.
  10. Saving the Fine-Tuned Model:

    model.save_pretrained("./fine-tuned-longformer")
    tokenizer.save_pretrained("./fine-tuned-longformer")
    • Saves the fine-tuned model and tokenizer to the specified directory.

Summary

The code loads a custom QA dataset, converts it to a format compatible with Hugging Face's transformers library, tokenizes the data, sets up training parameters, trains the Longformer model on the dataset, and finally saves the fine-tuned model and tokenizer.
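
The issue does not show the exact inference code behind the results in the next comment, but a minimal sketch of querying the saved model with the Hugging Face question-answering pipeline (assuming the ./fine-tuned-longformer directory from step 10 and a hypothetical context.txt holding the context passage) would look like this:

    from transformers import pipeline

    # Load the fine-tuned model and tokenizer saved in step 10.
    qa_pipeline = pipeline(
        "question-answering",
        model="./fine-tuned-longformer",
        tokenizer="./fine-tuned-longformer",
    )

    # Hypothetical file containing the ~900-word context passage.
    with open("context.txt") as f:
        context = f.read()

    result = qa_pipeline(question="When was Gandhi born?", context=context)
    print(result["answer"], result["score"])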

@head-iie-vnr (Contributor Author)

The results are not satisfactory.

Question: When was Gandhi born?
Answer: was born
Score: 0.0077830287627875805

Question: Where was Gandhi born?
Answer: was born

@head-iie-vnr (Contributor Author)

Trying a different model in Issue #14.
