support more chatbot finetuning scenarios. (#1215)
lkk12014402 authored Jul 28, 2023
1 parent bd29731 commit 9694e55
Showing 3 changed files with 362 additions and 96 deletions.
81 changes: 74 additions & 7 deletions workflows/chatbot/fine_tuning/README.md
@@ -3,6 +3,13 @@ NeuralChat Fine-tuning

This example demonstrates how to finetune a pretrained large language model (LLM) on an instruction-following dataset to create NeuralChat, a chatbot that can conduct textual conversations. Given a textual instruction, NeuralChat responds with a textual response. This example has been validated on 4th Gen Intel® Xeon® processors (Sapphire Rapids).

## Validated Model List
| Pretrained model | Text Generation (Instruction) | Text Generation (ChatBot) | Summarization Tuning |
|------------------|-------------------------------|---------------------------|----------------------|
| LLaMA series     | ✅ | ✅ | ✅ |
| MPT series       | ✅ | ✅ | ✅ |
| FLAN-T5 series   | ✅ | NA | NA |

# Prerequisite

## 1. Environment
@@ -26,19 +33,23 @@ It should be noticed that the early version of LLama model's name in Transformer
The user can obtain the [released model](https://huggingface.co/google/flan-t5-xl) from Hugging Face.

## 3. Prepare Dataset
We select four kinds of datasets to conduct the finetuning process for different tasks.

1. Text Generation (General domain instruction): We use the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca) from Stanford University as the general domain dataset to fine-tune the model. This dataset is provided in the form of a JSON file, [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). In Alpaca, researchers manually crafted 175 seed tasks to guide `text-davinci-003` in generating 52K instruction-following examples for diverse tasks (a sample record is sketched after this list).

2. Text Generation (Domain-specific instruction): Inspired by Alpaca, we constructed a domain-specific dataset focusing on Business and Intel-related issues. We made minor modifications to the [prompt template](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) to proactively guide Alpaca in generating more Intel- and Business-related instruction data. The generated data can be found in `intel_domain.json`.

3. Text Generation (ChatBot): To finetune a chatbot, we use the chat-style dataset [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).

4. Summarization: An English-language dataset, [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail), containing just over 300k unique news articles written by journalists at CNN and the Daily Mail, is used for this task.
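For reference, each record in the instruction datasets above follows the Alpaca schema consumed by the preprocessing code (`instruction`/`input`/`output`); the record below is a hypothetical example, not taken from the dataset:

```python
# One Alpaca-style record (hypothetical content; real field names).
record = {
    "instruction": "Classify the sentiment of the following sentence.",
    "input": "The battery life on this laptop is fantastic.",  # may be empty ("")
    "output": "Positive",
}
```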

# Finetune

We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to finetune the LLM efficiently. Currently, FLAN-T5, LLaMA, and MPT are supported for finetuning.
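As a rough illustration of what the `--peft lora` option sets up (a minimal sketch using the `peft` library; the hyperparameter values below are illustrative assumptions, not the script's defaults):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

# Freeze the base model and train small low-rank adapter matrices instead.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the adapter matrices (illustrative)
    lora_alpha=16,    # scaling factor (illustrative)
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```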

## 1. Single Node Fine-tuning on Xeon SPR

**For FLAN-T5**, use the command line below for finetuning on the Alpaca dataset.

```bash
python finetune_seq2seq.py \
        ... \
        --peft lora
```

#### For LLaMA

- Use the command line below for finetuning on the Alpaca dataset.

```bash
python finetune_clm.py \
        ... \
        --no_cuda
```

- Use the command line below for finetuning a chatbot on the [Intel/openassistant-preprocessed](https://huggingface.co/datasets/Intel/openassistant-preprocessed) dataset.

```bash
python finetune_clm.py \
--model_name_or_path "decapoda-research/llama-7b-hf" \
--bf16 True \
--dataset_name "Intel/openassistant-preprocessed" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./llama_chatbot_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--no_cuda \
--special_tokens "<|im_start|>" "<|im_end|>"

# the script also supports other models, such as MPT.
```
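The `--special_tokens` arguments delimit conversation turns. The preprocessing code added in this commit serializes each training example roughly as follows (a sketch built from the templates defined in the new data-utility file; the question and answer text is hypothetical):

```python
# Prompt seen by the model: system header, user turn, then an open assistant turn.
prompt = (
    "<|im_start|>system\n"
    "- You are a helpful assistant chatbot trained by Intel.<|im_end|>\n"  # header abridged
    "<|im_start|>user\nWhat is LoRA?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
# Target the model is trained to produce, closed by the end-of-turn token.
target = "LoRA is a parameter-efficient finetuning method.<|im_end|>"
```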

- Use the command line below for the summarization scenario on the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

```bash
python finetune_clm.py \
--model_name_or_path "/models/llama-7b-hf" \
--bf16 True \
--dataset_name "cnn_dailymail" \
--dataset_config_name "3.0.0" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./llama_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--no_cuda

# the script also supports other models, such as MPT.
```
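A quick way to inspect the fields this task trains on, assuming the `datasets` library is installed (the `article`/`highlights` field names are the ones the preprocessing code reads):

```python
from datasets import load_dataset

# Download a one-example slice of the 3.0.0 configuration used above.
sample = load_dataset("cnn_dailymail", "3.0.0", split="train[:1]")[0]
print(sample["article"][:200])  # the news article (model input)
print(sample["highlights"])     # the journalist-written summary (training target)
```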

**For [MPT](https://huggingface.co/mosaicml/mpt-7b)**, use the command line below for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA supports MPT. MPT uses the `gpt-neox-20b` tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because it uses a custom architecture that is not yet part of the Hugging Face `transformers` package.

```bash
python finetune_clm.py \
        ...
```

@@ -382,4 +449,4 @@

For finetuning on SPR, adding the `--bf16` argument will speed up the finetuning process.
You can also use `--peft` to choose the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA; see https://github.com/huggingface/peft. Note that for MPT, only LoRA is supported.

Add the option **"--use_fast_tokenizer False"** when using the latest transformers if you encounter a failure with the LLaMA fast tokenizer; the `tokenizer_class` in `tokenizer_config.json` should be changed from `LLaMATokenizer` to `LlamaTokenizer`.
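For illustration, a minimal sketch of the loading behavior described above (model and tokenizer names as referenced in this README; this is not the exact code in `finetune_clm.py`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT uses a custom architecture not yet in the transformers package,
# so remote code must be trusted explicitly.
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# MPT has no tokenizer of its own; the gpt-neox-20b tokenizer is specified instead.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```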
@@ -0,0 +1,262 @@
import copy
import datasets
import re
from itertools import chain

IGNORE_INDEX = -100

ALPACA_PROMPT_DICT = {
    "prompt_with_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_without_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

conv_header = """<|im_start|>system
- You are a helpful assistant chatbot trained by Intel.
- You answer questions.
- You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- You are more than just an information source, you are also able to write poetry, short stories, and make jokes.<|im_end|>\n"""

user = "<|im_start|>user\n"
assistant = "<|im_start|>assistant\n"
end = "<|im_end|>"

summarization_suffix_template = "\nSummarize the highlights of this article.\n"

def create_alpaca(examples):
    # Build (source, target) prompt pairs from Alpaca-style records, picking the
    # template with or without an "### Input" section as appropriate.
    prompts = {}
    prompts["source"] = []
    prompts["target"] = []
    for example in examples:
        prompt_template = (
            ALPACA_PROMPT_DICT["prompt_with_input"]
            if example["input"] != ""
            else ALPACA_PROMPT_DICT["prompt_without_input"]
        )
        source = prompt_template.format_map(example)
        prompts["source"].append(source)
        prompts["target"].append(example["output"])
    return prompts
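# Example usage (hypothetical records; field names match those consumed above):
#
#     examples = [
#         {"instruction": "Translate to French.", "input": "cheese", "output": "fromage"},
#         {"instruction": "Name a prime number.", "input": "", "output": "7"},
#     ]
#     prompts = create_alpaca(examples)
#     # prompts["source"][0] uses "prompt_with_input",
#     # prompts["source"][1] uses "prompt_without_input".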


def tokenize_alpaca(tokenizer, data_args, finetune_args):
    def tokenize(prompt, add_eos_token=True):
        results = tokenizer(
            prompt,
            truncation=True,
            max_length=data_args.max_seq_length,
            padding=False,
            return_tensors=None,
        )
        # Append an EOS token to every sequence that was not truncated.
        for i in range(len(results["input_ids"])):
            if (
                results["input_ids"][i][-1] != tokenizer.eos_token_id
                and len(results["input_ids"][i]) < data_args.max_seq_length
                and add_eos_token
            ):
                results["input_ids"][i].append(tokenizer.eos_token_id)
                results["attention_mask"][i].append(1)
        results["labels"] = copy.deepcopy(results["input_ids"])
        results["input_id_len"] = [len(result) for result in results["input_ids"]]
        return results

    def preprocess_function(examples):
        st = [s + t for s, t in zip(examples["prompt_sources"], examples["prompt_targets"])]
        examples_tokenized = tokenize(st)
        input_ids = examples_tokenized["input_ids"]
        labels = examples_tokenized["labels"]
        if not finetune_args.train_on_inputs:
            # Mask the prompt tokens so the loss is computed on the response only.
            sources_tokenized = tokenize(examples["prompt_sources"], add_eos_token=False)
            for label, source_len in zip(labels, sources_tokenized["input_id_len"]):
                label[:source_len] = [IGNORE_INDEX] * source_len
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=examples_tokenized["attention_mask"],
        )

    return preprocess_function


def create_oasst(examples):
    # Flatten each multi-turn conversation into incremental (prompt, response)
    # pairs: pair k's prompt contains the system header plus the first k exchanges.
    prompts = {}
    prompts["prompt_sources"] = []
    prompts["prompt_targets"] = []

    for conv in examples:
        conv = conv["messages"]
        prompt = conv_header

        for j in range(0, len(conv) - 1, 2):
            u = conv[j]["content"]
            ass = conv[j + 1]["content"]
            prompt = prompt + user + u + end + '\n' + assistant
            response = ass + end
            prompts["prompt_sources"].append(prompt)
            prompts["prompt_targets"].append(response)

            prompt += response + '\n'
    return prompts
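# Example: a four-message conversation yields two training pairs, and the second
# prompt contains the full first exchange (hypothetical contents):
#
#     conversation = {"messages": [
#         {"content": "What is LoRA?"},                             # user
#         {"content": "A parameter-efficient finetuning method."},  # assistant
#         {"content": "Why is it efficient?"},                      # user
#         {"content": "It trains small low-rank adapters only."},   # assistant
#     ]}
#     pairs = create_oasst([conversation])
#     assert len(pairs["prompt_sources"]) == 2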

def truncate_sequences(sequences, max_length):
    # Drop whole sequences from the front (the oldest turns) until the total
    # token count fits within max_length.
    words_to_cut = sum(list(map(len, sequences))) - max_length
    if words_to_cut <= 0:
        return sequences

    while words_to_cut > 0 and len(sequences) > 0:
        words_to_cut -= len(sequences[0])
        sequences = sequences[1:]

    return sequences
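# Example: turns are dropped whole, oldest first, until the budget is met:
#
#     turns = [[1] * 10, [2] * 10, [3] * 10]      # 30 tokens across 3 turns
#     truncate_sequences(turns, max_length=25)    # -> [[2] * 10, [3] * 10]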

def tokenize_oasst(tokenizer, data_args, finetune_args):

    # special tokens
    assistant_tokens = tokenizer.tokenize(assistant)

    def preprocess_function(examples):

        instructions = [q.strip() for q in examples["prompt_sources"]]
        responses = [q.strip() for q in examples["prompt_targets"]]

        examples["input_ids"] = []
        examples["labels"] = []
        examples["attention_mask"] = []

        for instruction, response in zip(instructions, responses):
            # Split the serialized prompt back into the system header and the turns.
            header = re.findall(r"<\|im_start\|>system.*?<\|im_end\|>", instruction, re.DOTALL)[0]
            convs = re.findall(r"<\|im_start\|>.*?<\|im_end\|>", instruction, re.DOTALL)[1:]

            convs_tokens = [
                tokenizer.tokenize(conv) + tokenizer.tokenize("\n")
                for conv in convs
            ]
            header_tokens = tokenizer.tokenize(header) + tokenizer.tokenize("\n")

            max_input = data_args.max_source_length - len(header_tokens) - len(assistant_tokens)

            # Drop the oldest turns first so the prompt fits the source-length budget.
            truncated_convs = truncate_sequences(convs_tokens, max_input)

            if len(truncated_convs) == 0:
                truncated_convs = [convs_tokens[-1][:max_input - 1] + convs_tokens[-1][-1:]]

            prompt_tokens = [header_tokens] + truncated_convs + [assistant_tokens]
            prompt_ids = [tokenizer.convert_tokens_to_ids(prompt_token) for prompt_token in prompt_tokens]
            prompt_ids = list(chain(*prompt_ids))

            resp_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(response.strip()))
            # keep the last token and leave room for eos_token_id
            max_resp = data_args.max_seq_length - len(prompt_ids) - 1
            if len(resp_ids) > max_resp:
                resp_ids = resp_ids[:max_resp - 1] + resp_ids[-1:]

            input_ids = prompt_ids + resp_ids + [tokenizer.eos_token_id]
            if not finetune_args.train_on_inputs:
                # Mask the prompt so the loss covers only the assistant response.
                labels = [IGNORE_INDEX] * len(prompt_ids) + resp_ids + [tokenizer.eos_token_id]
            else:
                labels = prompt_ids + resp_ids + [tokenizer.eos_token_id]

            # padding
            input_len = len(input_ids)
            pad_len = data_args.max_seq_length - input_len
            input_ids = input_ids + [tokenizer.eos_token_id] * pad_len
            labels = labels + [IGNORE_INDEX] * pad_len
            attention_mask = [1] * input_len + [0] * pad_len

            assert len(input_ids) == data_args.max_seq_length
            assert len(prompt_ids) <= data_args.max_source_length
            assert len(labels) == len(input_ids) == len(attention_mask)

            examples["input_ids"].append(input_ids)
            examples["labels"].append(labels)
            examples["attention_mask"].append(attention_mask)

        return examples

    return preprocess_function

def tokenize_cnn(tokenizer, data_args, finetune_args):
    template_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(summarization_suffix_template))

    def preprocess_function(examples):

        articles = [q.strip() for q in examples["article"]]
        highlights = [q.strip() for q in examples["highlights"]]

        examples["input_ids"] = []
        examples["labels"] = []
        examples["attention_mask"] = []

        for article, highlight in zip(articles, highlights):
            # Truncate the article so the summarization instruction suffix always fits.
            max_input = data_args.max_source_length - len(template_ids)

            article_tokens = tokenizer.tokenize(article)[:max_input]
            prompt_ids = tokenizer.convert_tokens_to_ids(article_tokens) + template_ids

            # Leave room for eos_token_id at the end.
            max_resp = data_args.max_seq_length - len(prompt_ids) - 1
            resp_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(highlight))[:max_resp]

            input_ids = prompt_ids + resp_ids + [tokenizer.eos_token_id]
            if not finetune_args.train_on_inputs:
                # Mask the article and template so the loss covers only the summary.
                labels = [IGNORE_INDEX] * len(prompt_ids) + resp_ids + [tokenizer.eos_token_id]
            else:
                labels = prompt_ids + resp_ids + [tokenizer.eos_token_id]

            # padding
            input_len = len(input_ids)
            pad_len = data_args.max_seq_length - input_len
            input_ids = input_ids + [tokenizer.eos_token_id] * pad_len
            labels = labels + [IGNORE_INDEX] * pad_len
            attention_mask = [1] * input_len + [0] * pad_len

            assert len(input_ids) == data_args.max_seq_length
            assert len(prompt_ids) <= data_args.max_source_length
            assert len(labels) == len(input_ids) == len(attention_mask)

            examples["input_ids"].append(input_ids)
            examples["labels"].append(labels)
            examples["attention_mask"].append(attention_mask)

        return examples

    return preprocess_function


def preprocess_dataset(raw_datasets, tokenizer, data_args, finetune_args):
    # Dispatch on the dataset name to pick the matching prompt format and tokenizer.
    dataset_name = data_args.dataset_name if data_args.dataset_name is not None else data_args.train_file
    if "oasst" in dataset_name:
        # Chat-style data: flatten each conversation into (prompt, response) pairs first.
        new_datasets = datasets.DatasetDict()
        for key in ["train"]:
            prompts = create_oasst(raw_datasets[key])
            new_datasets[key] = datasets.Dataset.from_dict(prompts)

        preprocess_fn = tokenize_oasst(tokenizer, data_args, finetune_args)

        return new_datasets, preprocess_fn

    elif "cnn" in dataset_name:
        preprocess_fn = tokenize_cnn(tokenizer, data_args, finetune_args)
        return raw_datasets, preprocess_fn
    else:
        # default: use the Alpaca instruction template
        for key in raw_datasets:
            prompts = create_alpaca(raw_datasets[key])
            columns_to_be_removed = list(raw_datasets[key].features.keys())
            raw_datasets[key] = raw_datasets[key].add_column(
                "prompt_sources", prompts["source"]
            )
            raw_datasets[key] = raw_datasets[key].add_column(
                "prompt_targets", prompts["target"]
            )
            raw_datasets[key] = raw_datasets[key].remove_columns(columns_to_be_removed)

        preprocess_fn = tokenize_alpaca(tokenizer, data_args, finetune_args)

        return raw_datasets, preprocess_fn
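# End-to-end usage sketch (hypothetical argument objects standing in for the
# dataclasses that finetune_clm.py parses; attribute names match those accessed
# above, but the length values are illustrative assumptions):
#
#     from types import SimpleNamespace
#     from datasets import load_dataset
#     from transformers import AutoTokenizer
#
#     raw_datasets = load_dataset("Intel/openassistant-preprocessed")
#     tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf", use_fast=False)
#     data_args = SimpleNamespace(dataset_name="Intel/openassistant-preprocessed",
#                                 train_file=None, max_seq_length=512, max_source_length=384)
#     finetune_args = SimpleNamespace(train_on_inputs=False)
#
#     new_datasets, preprocess_fn = preprocess_dataset(raw_datasets, tokenizer,
#                                                      data_args, finetune_args)
#     tokenized = new_datasets["train"].map(preprocess_fn, batched=True,
#                                           remove_columns=new_datasets["train"].column_names)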