LORA fine-tuning with openlm-research/open_llama_7b as a plugin replacement for decapoda-research/llama-7b-hf #63

Open
gjmulder opened this issue Jun 28, 2023 · 23 comments

Comments

@gjmulder

Hi. Thanks for the open sourced models! This is a major step forward to the democratization of LLMs.

I'm trying to fine-tune openlm-research/open_llama_7b using LoRA.

I first tried the code and data at alpaca-lora but was getting evaluation losses about 10% higher than decapoda-research/llama-7b-hf.

Given that alpaca-lora was a very early attempt at LoRA training, I then tried the code and data at baize-chatbot. However, I'm still getting evaluation losses about 10% higher than decapoda-research/llama-7b-hf:
image

Assuming both models are approximately equivalent in terms of generative ability, I am wondering whether the tokenizer is the cause. I am using the flag use_fast=False, but I notice there are other significant differences between the two tokenizer_config.json files.

decapoda-research/llama-7b-hf:

{
   "bos_token":"",
   "eos_token":"",
   "model_max_length":1000000000000000019884624838656,
   "tokenizer_class":"LLaMATokenizer",
   "unk_token":""
}

openlm-research/open_llama_7b:

{
   "add_bos_token":true,
   "add_eos_token":false,
   "model_max_length":2048,
   "pad_token":null,
   "sp_model_kwargs":{
      
   },
   "tokenizer_class":"LlamaTokenizer",
   "clean_up_tokenization_spaces":false,
   "bos_token":{
      "__type":"AddedToken",
      "content":"<s>",
      "lstrip":false,
      "normalized":true,
      "rstrip":false,
      "single_word":false
   },
   "eos_token":{
      "__type":"AddedToken",
      "content":"</s>",
      "lstrip":false,
      "normalized":true,
      "rstrip":false,
      "single_word":false
   },
   "unk_token":{
      "__type":"AddedToken",
      "content":"<unk>",
      "lstrip":false,
      "normalized":true,
      "rstrip":false,
      "single_word":false
   }
}
@jorgemcgomes

It's not just the tokenizer config. The tokenizer vocabs seem to be very different. See #40 and https://github.com/openlm-research/open_llama/discussions/41. Therefore, any direct comparison of loss values needs to be taken with a grain of salt.

@gjmulder
Author

Would that imply that, if the normalised train and eval losses of the two models are approximately similar, the quality of the fine-tuning is similar too?

@jorgemcgomes

jorgemcgomes commented Jun 28, 2023

Looking at the perplexity scores in the discussion in #41, you have ppl=6.5 (loss=1.87) for Open and ppl=5.4 (loss=1.68) for Meta's. That works out to a difference of around 11% in loss values, which is very similar to what you report here in your fine-tuning experiments.
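The loss values in parentheses can be cross-checked against the perplexities, assuming the loss is the mean per-token cross-entropy in nats, so that loss = ln(ppl). This is a minimal sketch under that assumption; the function name is made up for illustration.

```python
import math

def loss_from_perplexity(ppl: float) -> float:
    """Mean per-token cross-entropy (in nats) implied by a perplexity."""
    return math.log(ppl)

open_loss = loss_from_perplexity(6.5)  # Open Llama perplexity quoted above
meta_loss = loss_from_perplexity(5.4)  # Meta Llama perplexity quoted above

# Relative gap between the two losses, measured against the lower one.
relative_gap = (open_loss - meta_loss) / meta_loss

print(f"open={open_loss:.3f} meta={meta_loss:.3f} gap={relative_gap:.1%}")
```

Under this assumption the gap comes out at roughly 11%, in line with the figure above.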

@gjmulder
Author

So maybe the tokenizer is simply about 10% worse across a wide range of semi-uncurated text, such as the data I am using for fine-tuning or for measuring perplexity. For many custom hand-curated evaluation sets the tokenizer may not have much effect, e.g. #40 is unlikely to impact scores when the evaluation data is cleanly formatted and single-spaced.

@eschaffn

I'm also having a hard time reproducing models that use the Facebook Llama as the base model. Using a slightly modified version of the Qlora code and the dataset here, I can't get results comparable to this.

Has anyone been able to successfully use OpenLLama as a drop-in replacement for Llama yet? It seems like tokenization is a big problem, as I get noisy/repetitive outputs, as if the model has a hard time generating stop tokens correctly.

@gjmulder
Author

gjmulder commented Jun 30, 2023

What are your losses looking like when fine-tuning FB llama versus Open Llama base models?

Below are my normalised loss plots. Data starts at step 500 to remove outliers. The FB model train and eval losses are normalised by subtracting the mean and dividing by the standard deviation. The two Open Llama train and eval losses are grouped and normalised as one data set, as they should be an apples-to-apples loss comparison, pre- and post-normalisation.

1. Default FB llama 7B model, default tokenizer from HF, and baize data set. Model is underfitting the data.
2. OpenLM Llama 7B model, trained on 1T tokens, no fast tokenizer, tokenizer initialized to add an EOS token but no BOS token. Model is fitting the data quite well.
3. OpenLM Llama 7B model, trained on 1T tokens, latest transformers (looks to fix the fast tokenizer issue), default OpenLM Llama tokenizer settings from HF. Model is overfitting somewhat.

image
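The normalisation described above can be sketched in Python as follows. This is an illustration and not the script that produced the plots: the function name and the sample numbers are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption.

```python
from statistics import mean, stdev

def zscore_runs(runs, min_step=500):
    """Normalise losses jointly across the given runs.

    `runs` maps a run name to a list of (step, loss) pairs. Points before
    `min_step` are dropped, then every remaining loss is shifted by the
    pooled mean and divided by the pooled sample standard deviation.
    """
    kept = {
        name: [(step, loss) for step, loss in points if step >= min_step]
        for name, points in runs.items()
    }
    pooled = [loss for points in kept.values() for _, loss in points]
    mu, sigma = mean(pooled), stdev(pooled)
    return {
        name: [(step, (loss - mu) / sigma) for step, loss in points]
        for name, points in kept.items()
    }

# Hypothetical numbers, only to show the call shape.
# The FB run is normalised on its own; the two Open Llama runs are pooled.
fb = zscore_runs({"fb": [(480, 1.9), (500, 1.2), (520, 1.1), (540, 1.0)]})
open_llama = zscore_runs({
    "open_run2": [(500, 1.4), (520, 1.3)],
    "open_run3": [(500, 1.5), (520, 1.2)],
})
```

Pooling the two Open Llama runs keeps their relative offset visible after scaling, whereas normalising each run separately would remove it.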

@eschaffn

eschaffn commented Jul 5, 2023

Sorry I don't have evaluation or loss curves to share, but generally OpenLLama models for me have repetition problems, repeating the same or very similar sequence until it reaches the token limit.

How do you configure your tokenizer in run 2? I'd like to try with the tokenizer initialization you mentioned.

@gjmulder
Author

gjmulder commented Jul 5, 2023

@eschaffn

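from transformers import LlamaTokenizer

# Run 2 settings: slow tokenizer, EOS appended, BOS not prepended.
tokenizer = LlamaTokenizer.from_pretrained(
    "openlm-research/open_llama_7b",
    add_eos_token=True,
    add_bos_token=False,
    use_fast=False
)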

The add_bos_token=False was actually an accident. I'd intended to set it to the open llama tokenizer_config.json default of True. I was using the latest commit of transformers, but given the concerns I had surrounding the tokenizer I thought I'd disable it just to make sure. Of course the behaviour I am seeing could be a function of the Baize fine-tuning dataset I am using.
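Since the BOS/EOS flags are easy to get wrong by accident, it may help to inspect what the tokenizer actually emits before training. The sketch below assumes a Hugging Face style tokenizer that is callable and exposes `bos_token_id` and `eos_token_id`; the helper name and the stub class (with made-up ids) are hypothetical and only exist so the example runs without downloading a model.

```python
def special_token_report(tokenizer, text="Hello world"):
    """Report whether encoding `text` starts with BOS and ends with EOS."""
    ids = tokenizer(text)["input_ids"]
    return {
        "starts_with_bos": bool(ids) and ids[0] == tokenizer.bos_token_id,
        "ends_with_eos": bool(ids) and ids[-1] == tokenizer.eos_token_id,
        "num_tokens": len(ids),
    }

class StubTokenizer:
    """Stand-in with made-up ids, mimicking add_bos_token=False, add_eos_token=True."""
    bos_token_id = 1
    eos_token_id = 2

    def __call__(self, text):
        # One fake id per whitespace-separated word, then EOS.
        word_ids = [10 + i for i, _ in enumerate(text.split())]
        return {"input_ids": word_ids + [self.eos_token_id]}

report = special_token_report(StubTokenizer())
print(report)
```

With a real tokenizer, pass the object returned by `LlamaTokenizer.from_pretrained(...)` in place of the stub.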

If you can attach any trainer_state.json files from checkpoints I can run them through the same normalisation/scaling scripts for comparison. Ideally using both the FB model and OpenLM model so we can do an apples-to-apples comparison for a different data set and QLora.

I tried qlora with decapoda-research/llama-7b-hf, but I am getting a lot of CUDA assertion errors. Googling the error suggests that the FB tokenizer is the issue this time. I suspect "model_max_length":1000000000000000019884624838656, as openlm-research/open_llama_7b works fine. 🤷‍♂️
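One way to test that suspicion is to override the oversized value before tokenizing. The sketch below is only an experiment aid and does not establish that `model_max_length` causes the CUDA assertions. The helper name and the `sane_limit` threshold are hypothetical; the fallback of 2048 is the value from the open_llama_7b config above, and the stub stands in for a real tokenizer object.

```python
def clamp_model_max_length(tokenizer, fallback=2048, sane_limit=1_000_000):
    """Replace an implausibly large model_max_length with `fallback`.

    Returns True if the value was changed.
    """
    if tokenizer.model_max_length > sane_limit:
        tokenizer.model_max_length = fallback
        return True
    return False

class StubTokenizer:
    """Stand-in carrying the value from the decapoda-research config."""
    model_max_length = 1000000000000000019884624838656

tok = StubTokenizer()
changed = clamp_model_max_length(tok)
print(changed, tok.model_max_length)
```

If the errors persist after the override, the cause is more likely elsewhere in the tokenizer or model config.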

EDIT: Fixed some typos above and should add that I am evaluating every 20 steps as it helps in better plotting given the noise in training loss, at the cost of about 20% increase in run time.

@eschaffn

eschaffn commented Jul 5, 2023

Thanks! { "best_metric": null, "best_model_checkpoint": null, "epoch": 3.887269193391642, "global_step": 750, "is_hyper_param_search": false, "is_local_process_zero": true, "is_world_process_zero": true, "log_history": [ { "epoch": 0.05, "learning_rate": 0.0002, "loss": 1.1005, "step": 10 }, { "epoch": 0.1, "learning_rate": 0.0002, "loss": 1.0258, "step": 20 }, { "epoch": 0.16, "learning_rate": 0.0002, "loss": 0.9423, "step": 30 }, { "epoch": 0.21, "learning_rate": 0.0002, "loss": 0.9069, "step": 40 }, { "epoch": 0.26, "learning_rate": 0.0002, "loss": 0.9545, "step": 50 }, { "epoch": 0.31, "learning_rate": 0.0002, "loss": 0.8517, "step": 60 }, { "epoch": 0.36, "learning_rate": 0.0002, "loss": 0.8792, "step": 70 }, { "epoch": 0.41, "learning_rate": 0.0002, "loss": 0.8445, "step": 80 }, { "epoch": 0.47, "learning_rate": 0.0002, "loss": 0.8476, "step": 90 }, { "epoch": 0.52, "learning_rate": 0.0002, "loss": 0.8802, "step": 100 }, { "epoch": 0.57, "learning_rate": 0.0002, "loss": 0.8343, "step": 110 }, { "epoch": 0.62, "learning_rate": 0.0002, "loss": 0.833, "step": 120 }, { "epoch": 0.67, "learning_rate": 0.0002, "loss": 0.8455, "step": 130 }, { "epoch": 0.73, "learning_rate": 0.0002, "loss": 0.8587, "step": 140 }, { "epoch": 0.78, "learning_rate": 0.0002, "loss": 0.8447, "step": 150 }, { "epoch": 0.83, "learning_rate": 0.0002, "loss": 0.8016, "step": 160 }, { "epoch": 0.88, "learning_rate": 0.0002, "loss": 0.8644, "step": 170 }, { "epoch": 0.93, "learning_rate": 0.0002, "loss": 0.813, "step": 180 }, { "epoch": 0.98, "learning_rate": 0.0002, "loss": 0.7903, "step": 190 }, { "epoch": 1.04, "learning_rate": 0.0002, "loss": 0.6204, "step": 200 }, { "epoch": 1.09, "learning_rate": 0.0002, "loss": 0.4882, "step": 210 }, { "epoch": 1.14, "learning_rate": 0.0002, "loss": 0.4679, "step": 220 }, { "epoch": 1.19, "learning_rate": 0.0002, "loss": 0.4926, "step": 230 }, { "epoch": 1.24, "learning_rate": 0.0002, "loss": 0.491, "step": 240 }, { "epoch": 1.3, "learning_rate": 
0.0002, "loss": 0.4793, "step": 250 }, { "epoch": 1.35, "learning_rate": 0.0002, "loss": 0.4967, "step": 260 }, { "epoch": 1.4, "learning_rate": 0.0002, "loss": 0.4926, "step": 270 }, { "epoch": 1.45, "learning_rate": 0.0002, "loss": 0.4923, "step": 280 }, { "epoch": 1.5, "learning_rate": 0.0002, "loss": 0.5025, "step": 290 }, { "epoch": 1.55, "learning_rate": 0.0002, "loss": 0.4964, "step": 300 }, { "epoch": 1.61, "learning_rate": 0.0002, "loss": 0.5149, "step": 310 }, { "epoch": 1.66, "learning_rate": 0.0002, "loss": 0.5019, "step": 320 }, { "epoch": 1.71, "learning_rate": 0.0002, "loss": 0.5348, "step": 330 }, { "epoch": 1.76, "learning_rate": 0.0002, "loss": 0.4985, "step": 340 }, { "epoch": 1.81, "learning_rate": 0.0002, "loss": 0.4917, "step": 350 }, { "epoch": 1.87, "learning_rate": 0.0002, "loss": 0.5106, "step": 360 }, { "epoch": 1.92, "learning_rate": 0.0002, "loss": 0.4962, "step": 370 }, { "epoch": 1.97, "learning_rate": 0.0002, "loss": 0.5116, "step": 380 }, { "epoch": 2.02, "learning_rate": 0.0002, "loss": 0.4312, "step": 390 }, { "epoch": 2.07, "learning_rate": 0.0002, "loss": 0.2957, "step": 400 }, { "epoch": 2.13, "learning_rate": 0.0002, "loss": 0.2881, "step": 410 }, { "epoch": 2.18, "learning_rate": 0.0002, "loss": 0.3076, "step": 420 }, { "epoch": 2.23, "learning_rate": 0.0002, "loss": 0.28, "step": 430 }, { "epoch": 2.28, "learning_rate": 0.0002, "loss": 0.3106, "step": 440 }, { "epoch": 2.33, "learning_rate": 0.0002, "loss": 0.294, "step": 450 }, { "epoch": 2.38, "learning_rate": 0.0002, "loss": 0.3072, "step": 460 }, { "epoch": 2.44, "learning_rate": 0.0002, "loss": 0.3086, "step": 470 }, { "epoch": 2.49, "learning_rate": 0.0002, "loss": 0.2905, "step": 480 }, { "epoch": 2.54, "learning_rate": 0.0002, "loss": 0.3143, "step": 490 }, { "epoch": 2.59, "learning_rate": 0.0002, "loss": 0.297, "step": 500 }, { "epoch": 2.64, "learning_rate": 0.0002, "loss": 0.3052, "step": 510 }, { "epoch": 2.7, "learning_rate": 0.0002, "loss": 0.3117, "step": 520 
}, { "epoch": 2.75, "learning_rate": 0.0002, "loss": 0.296, "step": 530 }, { "epoch": 2.8, "learning_rate": 0.0002, "loss": 0.3244, "step": 540 }, { "epoch": 2.85, "learning_rate": 0.0002, "loss": 0.3025, "step": 550 }, { "epoch": 2.9, "learning_rate": 0.0002, "loss": 0.3181, "step": 560 }, { "epoch": 2.95, "learning_rate": 0.0002, "loss": 0.309, "step": 570 }, { "epoch": 3.01, "learning_rate": 0.0002, "loss": 0.2853, "step": 580 }, { "epoch": 3.06, "learning_rate": 0.0002, "loss": 0.1806, "step": 590 }, { "epoch": 3.11, "learning_rate": 0.0002, "loss": 0.1732, "step": 600 }, { "epoch": 3.16, "learning_rate": 0.0002, "loss": 0.1614, "step": 610 }, { "epoch": 3.21, "learning_rate": 0.0002, "loss": 0.1905, "step": 620 }, { "epoch": 3.27, "learning_rate": 0.0002, "loss": 0.171, "step": 630 }, { "epoch": 3.32, "learning_rate": 0.0002, "loss": 0.1826, "step": 640 }, { "epoch": 3.37, "learning_rate": 0.0002, "loss": 0.1814, "step": 650 }, { "epoch": 3.42, "learning_rate": 0.0002, "loss": 0.1647, "step": 660 }, { "epoch": 3.47, "learning_rate": 0.0002, "loss": 0.1925, "step": 670 }, { "epoch": 3.52, "learning_rate": 0.0002, "loss": 0.1694, "step": 680 }, { "epoch": 3.58, "learning_rate": 0.0002, "loss": 0.1906, "step": 690 }, { "epoch": 3.63, "learning_rate": 0.0002, "loss": 0.1941, "step": 700 }, { "epoch": 3.68, "learning_rate": 0.0002, "loss": 0.1777, "step": 710 }, { "epoch": 3.73, "learning_rate": 0.0002, "loss": 0.2012, "step": 720 }, { "epoch": 3.78, "learning_rate": 0.0002, "loss": 0.179, "step": 730 }, { "epoch": 3.84, "learning_rate": 0.0002, "loss": 0.1929, "step": 740 }, { "epoch": 3.89, "learning_rate": 0.0002, "loss": 0.2022, "step": 750 }, { "epoch": 3.89, "step": 750, "total_flos": 7.608674280150139e+17, "train_loss": 0.4744114775657654, "train_runtime": 10810.3668, "train_samples_per_second": 8.88, "train_steps_per_second": 0.069 } ], "max_steps": 750, "num_train_epochs": 4, "total_flos": 7.608674280150139e+17, "trial_name": null, "trial_params": null }

Here's a trainer_state.json from a ~4-epoch training run on a length-filtered dataset, with any input sequences of >768 tokens filtered out.

I'll start one with https://huggingface.co/decapoda-research/llama-13b-hf today!

@gjmulder
Author

gjmulder commented Jul 5, 2023

Ideally I need the .json file attached here. You can change the file type to .txt if GitHub refuses to attach it. I also can't see the eval_loss logged in that, which is what I need to compare the two.
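For reference, a small Python sketch of how the train and eval losses can be pulled out of a trainer_state.json. It assumes the usual layout in which `log_history` entries carry either a `loss` or an `eval_loss` key next to `step`; the function name is hypothetical, and the eval entry in the sample is made up, since the log above contains none.

```python
import json

def split_losses(trainer_state):
    """Return (train, evals) lists of (step, loss) from a trainer_state dict."""
    train, evals = [], []
    for entry in trainer_state.get("log_history", []):
        if "loss" in entry:
            train.append((entry["step"], entry["loss"]))
        if "eval_loss" in entry:
            evals.append((entry["step"], entry["eval_loss"]))
    return train, evals

# Tiny inline sample: the first entry matches the log above,
# the eval entry is invented for illustration.
sample = json.loads("""
{"log_history": [
  {"epoch": 0.05, "learning_rate": 0.0002, "loss": 1.1005, "step": 10},
  {"epoch": 0.1, "eval_loss": 1.05, "step": 20},
  {"epoch": 3.89, "step": 750, "train_loss": 0.4744}
]}
""")
train, evals = split_losses(sample)
print(train, evals)
```

The final summary entry only has `train_loss`, so it is skipped by both branches.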

Here's how I'm running qlora.py (for 7B, so I'm using 8bit). --eval_steps 20 does cause a lot of evaluation overhead (adds about 20% to the run duration), but allows me to make detailed plots:

python qlora.py \
  --model_name_or_path $HF_USER/$BASE_MODEL \
  --dataset $DATA \
  --eval_dataset_size 2000 \
  --bits 8 \
  --evaluation_strategy steps \
  --logging_steps 20 \
  --eval_steps 20 \
  --save_steps 20 \
  --no_skip_memory_metrics \
  --output_dir /data/lora

I notice that decapoda-research/llama-13b-hf/blob/main/config.json has a lot more fields defined in it, so you might have better luck than I did with the llama-7b.

@eschaffn

eschaffn commented Jul 5, 2023

I've rerun that run with eval_steps = 20
trainer_state.txt

I'm currently running using https://huggingface.co/huggyllama/llama-13b since I ran into the same CUDA errors as you mentioned. This should be done in a few hours.

Also what would you suggest doing with these lines in the QLORA code:

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        # cache_dir=args.cache_dir,
        padding_side="right",
        use_fast=False, # Fast tokenizer giving issues.
        tokenizer_type='llama' if 'llama' in args.model_name_or_path else None, # Needed for HF name change
        use_auth_token=args.use_auth_token,
    )
    if tokenizer._pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if 'llama' in args.model_name_or_path or isinstance(tokenizer, LlamaTokenizer):
        # LLaMA tokenizer may not have correct special tokens set.
        # Check and add them if missing to prevent them from being parsed into different tokens.
        # Note that these are present in the vocabulary.
        # Note also that `model.config.pad_token_id` is 0 which corresponds to `<unk>` token.
        print('Adding special tokens.')
        tokenizer.add_special_tokens({
                "eos_token": tokenizer.convert_ids_to_tokens(model.config.eos_token_id),
                "bos_token": tokenizer.convert_ids_to_tokens(model.config.bos_token_id),
                "unk_token": tokenizer.convert_ids_to_tokens(
                    model.config.pad_token_id if model.config.pad_token_id != -1 else tokenizer.pad_token_id
                ),
        }) 

for training OpenLLama? I've left them as is for the first run, as well as for the FB Llama run.

@gjmulder
Author

gjmulder commented Jul 6, 2023

Using the defaults for OpenLlama 7B was a bit of an overfitting disaster. See below, but note that it is not an apples-to-apples comparison. I needed to revert to 4-bit Qlora as I got a CUDA out-of-memory error with 8-bit on my 24GB GPU, which is extremely frustrating.

Different data set (OAssist) and 4-bit Qlora instead of 8-bit Baize Lora. Since the datasets differ in size, the ratio of epochs to steps also differs, so I plotted per step and added the epoch in black text.

image

@gjmulder
Author

gjmulder commented Jul 6, 2023

> I've rerun that run with eval_steps = 20 trainer_state.txt
>
> I'm currently running using https://huggingface.co/huggyllama/llama-13b since I ran into the same CUDA errors as you mentioned. This should be done in a few hours.

EDIT: Misunderstood what model you used:

Your data is open_llama_13b-wizardlm-100000000. It shows similar training behaviour to what I am seeing with 4-bit Qlora. The y-axis is normalised as we're using different data sets (and models). It doesn't look like the model is the cause, as my Open Llama 7B w/Baize doesn't overfit that badly. Baize is a somewhat larger data set than WizardLM, but not by that much. I don't know the size of Open Assist:

$ cat ./baize-chatbot/data/alpaca_chat_data.json ./baize-chatbot/data/quora_chat_data.json ./baize-chatbot/data/stackoverflow_chat_data.json | wc -w
27207442
$ wc -w ./wizard_vicuna_70k_unfiltered/wizard_vicuna_dataset_unfiltered.json 
21921770 ./wizard_vicuna_70k_unfiltered/wizard_vicuna_dataset_unfiltered.json

Looks like a Qlora issue?

image

@gjmulder
Author

gjmulder commented Jul 6, 2023

Adding similar /huggyllama/llama-13b data to the above plot will confirm whether it is Qlora, and not Open Llama.

@eschaffn

eschaffn commented Jul 6, 2023

trainer_state.txt

Using FB Llama, I see the same overfitting issues.

The default LORA R value is 64 for QLORA. I've been running with r=32 but maybe this is causing the overfitting. The LORA paper (IIRC) uses r=2? What rank value are you using to finetune?

Edit: LORA paper uses r=4, baize uses r=8. I'm trying a run with r=4 and will update in a few hours.
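To get a feel for how much capacity each rank adds, the arithmetic can be sketched as below. LoRA learns a low-rank update B·A for a weight matrix, with B of shape (d_out, r) and A of shape (r, d_in), so each adapted matrix gains r·(d_in + d_out) trainable parameters. The example assumes square 4096×4096 projections (the 7B hidden size); which and how many matrices are adapted depends on the training script and is not modelled here.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

HIDDEN = 4096  # hidden size of the 7B LLaMA architecture

# Ranks mentioned in this thread: LoRA paper, Baize, this run, QLORA default.
for r in (4, 8, 32, 64):
    print(r, lora_extra_params(HIDDEN, HIDDEN, r))
```

The count scales linearly with r, so going from r=64 to r=4 cuts the adapter size per matrix by a factor of 16.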

@gjmulder
Author

gjmulder commented Jul 6, 2023

Scaled losses for Qlora. I was using the defaults. All three are massively over-fitting after one epoch, hence the "dot steps":

image

And here's the unscaled loss for the 13B models:

image
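One simple way to put a number on where over-fitting starts is to find the step with the lowest eval loss and look at how far later evaluations rise above it. This is a rough sketch with hypothetical helper names and made-up data points; it ignores noise in the eval loss, which a real analysis would need to smooth.

```python
def best_eval_step(eval_points):
    """Return the (step, loss) pair with the lowest eval loss."""
    return min(eval_points, key=lambda point: point[1])

def rise_after_best(eval_points):
    """List (step, loss - best_loss) for every eval logged after the best step."""
    best_step, best_loss = best_eval_step(eval_points)
    return [(step, loss - best_loss) for step, loss in eval_points if step > best_step]

# Made-up eval curve: improves until step 60, then climbs.
evals = [(20, 1.10), (40, 0.95), (60, 0.90), (80, 0.97), (100, 1.08)]
best = best_eval_step(evals)
rises = rise_after_best(evals)
print(best, rises)
```

Comparing the best step with the steps-per-epoch of each run shows whether the turning point falls near the first epoch boundary.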

@eschaffn

eschaffn commented Jul 6, 2023

Still overfitting with r=4
trainer_state.txt

Are you using Baize with input/output data format? Would you be able to share the changes you've made to the baize code?

It seems you are probably correct about it being a qlora issue as the divergence seems consistent.

@eschaffn

eschaffn commented Jul 6, 2023

Also my dataset is about 27k sequences after filtering out inputs with >768 tokens. I did this because of memory constraints with QLORA, but it seems that Baize is actually using less memory per GPU (with DDP via torchrun) in 8-bit than QLORA in 4-bit, which seems odd to me.

@gjmulder
Author

gjmulder commented Jul 7, 2023

> Still overfitting with r=4 trainer_state.txt

Unscaled, below. I assume the last run was with Open Llama 13B? The plots look suspiciously similar, but I confirmed I didn't accidentally duplicate one of your trainer_state.txt files.

image

> Are you using Baize with input/output data format? Would you be able to share the changes you've made to the baize code?

Here's my fork: gjmulder/baize-chatbot. I haven't changed much from the Baize defaults, except that I added a ./trainer.sh and parameterised a lot of the usual hyperparameters. My approach is:

  1. Configure ./trainer.sh with a config, initially identical to the code I forked
  2. ./trainer.sh generates a unique run id (the large number you see in the plot titles)
  3. git commit -am "<run number> notes on the config"
  4. Run lora_training_analysis.R, which finds the latest checkpoint, runs an rsync of the latest trainer_state.json and prints all the combined plots.
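For readers who don't use R, the "find the latest checkpoint" part of step 4 might look roughly like this in Python. This is not a translation of lora_training_analysis.R: the function name is hypothetical, the rsync and plotting steps are left out, and it assumes the usual `checkpoint-<step>` directory naming with a trainer_state.json inside each checkpoint.

```python
import json
from pathlib import Path

def latest_trainer_state(run_dir):
    """Load trainer_state.json from the highest-numbered checkpoint-<step> dir.

    Returns None when no usable checkpoint is found.
    """
    checkpoints = [
        path for path in Path(run_dir).glob("checkpoint-*")
        if path.is_dir()
        and path.name.split("-")[-1].isdigit()
        and (path / "trainer_state.json").is_file()
    ]
    if not checkpoints:
        return None
    # Compare step numbers as integers so checkpoint-100 beats checkpoint-20.
    newest = max(checkpoints, key=lambda path: int(path.name.split("-")[-1]))
    with open(newest / "trainer_state.json") as handle:
        return json.load(handle)
```

Calling this repeatedly while a run is in progress picks up each new checkpoint as it is written, which is what makes the semi-interactive comparison possible.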

My checkpoints are then associated with a git commit, and I can easily revert to any model run that in hindsight was my best to date. This all came about as the original Alpaca Lora code caused a bug with WandB. I've found it a lot more flexible than WandB as I can code any plot and comparison on the fly. The R code is written in such a way that it continually finds the latest checkpoint per run, so as the run proceeds I can semi-interactively see how it is performing relative to prior runs.

> It seems you are probably correct about it being a qlora issue as the divergence seems consistent.

Baize uses --lora_r 8 🤷‍♂️

> Also my dataset is about 27k sequences after filtering out inputs with >768 tokens. I did this because of memory constraints with QLORA, but it seems that Baize is actually using less memory per GPU (with DDP via torchrun) in 8-bit than QLORA in 4-bit, which seems odd to me.

The original Alpaca Lora code memory usage was always stable. Likewise with Baize.

I've tried a few implementations of Qlora, and I keep on seeing memory leaks when the checkpoints are being written. That's another reason to checkpoint often, as it means I get CUDA memory errors that much earlier. There's a script ./wgpu.sh I just added to my Baize fork that I use to watch GPU memory usage. I sit there watching it log per second so I can correlate exactly what the code is doing when it OOMs, namely the checkpointing.
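A per-second memory log of this kind can be sketched in Python as well. This is not the ./wgpu.sh script from the fork, just an illustration built on `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`; the function names are hypothetical and the values are reported in MiB.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_memory(csv_text):
    """Parse 'used, total' lines into a list of (used, total) ints, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def watch(interval_s=1.0, samples=5):
    """Print a timestamped reading every `interval_s` seconds (needs nvidia-smi)."""
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        print(time.strftime("%H:%M:%S"), parse_memory(out))
        time.sleep(interval_s)
```

Running `watch()` alongside training and lining up the timestamps with the trainer's log output shows whether the memory spikes coincide with checkpoint writes.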

@eschaffn

Thank you for this. I was running into the memory issues while checkpointing this weekend. I'm going to try implementing FSDP instead of DDP and then continue training!

@gjmulder
Author

@eschaffn Added you to my Baize repo fork if you'd like to collaborate.

I looked at using the WizardLM uncensored data set with Alpaca Lora, but after reviewing the code decided to find a better implementation. So far Baize looks to be a cleaner code base and doesn't OOM, but it had a lot of hyperparams hard coded, hence the refactoring I've done.

If you don't code in R I can get ChatGPT-4 to translate my R code to the Python equivalent for our analysis. Or move to WandB, which is likely the better solution long term.

@eschaffn

Sure, I'm taking a bit of a break from the training runs, maybe a couple of days, but I'd be happy to collaborate! I can send trainer_states.json to you manually but cannot hook up my machine to wandb.

@gjmulder
Author

@eschaffn feel free to branch and push your trainer_states.json in my repo in a subdir. I've opened an Issues and Discussions section for us to chat so we don't keep spamming open_llama here. Any interesting Open Llama results we can report back here 😄
