
[Bug] Tokenizer's BOS/EOS/PAD not set for inference #139

Closed
NanoCode012 opened this issue Jun 1, 2023 · 9 comments
Labels: bug (Something isn't working)

Comments

@NanoCode012
Collaborator

Following https://discord.com/channels/1104757954588196865/1111279858136383509/1113729100763381804, a user reported that the tokenizer's bos/eos/pad tokens are not set in inference mode.

We can fix this by just setting these, the same way the following code does:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/288fd62431be84a7112fd461feeb9322f1177d3c/scripts/finetune.py#L66-L68

We need to update this as Alpaca is not the only method now.

Depends on #64
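
For reference, that block sets Alpaca/LLaMA-style special tokens at inference time. A minimal sketch of the same shape (not the exact source; the model name and token strings here are only illustrative):

```python
from transformers import AutoTokenizer

# illustrative model; in practice the base model comes from the yaml config
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# LLaMA/Alpaca-style special tokens, hard-coded; this is what breaks for
# tokenizers such as Falcon's, which use different token strings
tokenizer.add_special_tokens(
    {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
)

# a pad token is commonly aliased to eos for batched inference
tokenizer.pad_token = tokenizer.eos_token
```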

NanoCode012 added the bug (Something isn't working) label on Jun 1, 2023
@utensil
Contributor

utensil commented Jun 1, 2023

Actually, it's also seen when fine-tuning Falcon:

Setting ds_accelerator to cuda (auto detect)
WARNING:root:`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
INFO:root:loading tokenizer... tiiuae/falcon-7b
Using bos_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using unk_token, but it is not set yet.
INFO:root:Loading prepared dataset from disk at last_run_prepared/31a4e867d804a957707db033c9abcd13...
INFO:root:Prepared dataset loaded from disk...
INFO:root:loading model and peft_config...
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:16<00:00,  8.49s/it]
INFO:root:converting PEFT model w/ prepare_model_for_int8_training
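
Those "Using bos_token, but it is not set yet." lines come from the Hugging Face tokenizer itself whenever an unset special-token attribute is accessed. A quick way to inspect this for Falcon (a sketch, assuming a recent transformers install):

```python
from transformers import AutoTokenizer

# trust_remote_code was required for Falcon at the time of this issue
tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)

# typically only eos_token ("<|endoftext|>") is defined out of the box
print(tok.special_tokens_map)

# accessing an unset token logs "Using bos_token, but it is not set yet."
# and returns None
print(tok.bos_token)
```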

@NanoCode012
Collaborator Author

@utensil, which dataset format are you using?

@utensil
Contributor

utensil commented Jun 2, 2023

@utensil, which dataset format are you using?

alpaca:chat

as in #132

@winglian
Collaborator

winglian commented Jun 2, 2023

We should probably change these to info level and move them to after we add the special tokens from the config:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L53-L56
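
A rough sketch of that ordering (illustrative names, not the actual axolotl code): apply the special tokens from the config first, then report anything still unset at info level.

```python
import logging

from transformers import AutoTokenizer

LOG = logging.getLogger(__name__)


def load_tokenizer(base_model, special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

    # 1) apply the special tokens from the config first
    if special_tokens:
        tokenizer.add_special_tokens(dict(special_tokens))

    # 2) only then check what is still missing, and log it at info level;
    #    special_tokens_map lists only the tokens that are actually set
    for name in ("bos_token", "eos_token", "pad_token", "unk_token"):
        if name not in tokenizer.special_tokens_map:
            LOG.info("%s is not set on the tokenizer", name)

    return tokenizer
```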

@winglian
Collaborator

winglian commented Jun 2, 2023

Doesn't https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L68-L70 already solve this?

@NanoCode012
Collaborator Author

Doesn't https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L68-L70 already solve this?

Hm, yes. I think the problem is due to the code block in the first post: it may overwrite the previous setting. I think just removing it is enough? The main work is in #64.

@NanoCode012
Collaborator Author

If we were to remove the block in the first post, we need to make sure the llama configs/tokenizers have those tokens added somewhere to prevent a regression.

@AngainorDev
Contributor

AngainorDev commented Jun 12, 2023

If I understand it right, the code from #180 should be fine for training as well, allowing you to define only the not-yet-defined tokens, with the config keeping precedence in the case of Falcon, for instance.
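
For illustration, that behavior could look roughly like this (a sketch of one reading of #180, not the PR's actual code): only the tokens listed in the config are touched, so values a tokenizer already ships with stay intact, and the config wins for whatever it does define.

```python
def apply_special_tokens(tokenizer, cfg_special_tokens=None):
    # Tokens the config does not mention are left exactly as they were
    # (e.g. Falcon's existing eos_token); tokens it does mention are set
    # from the config, overriding any previous value.
    if cfg_special_tokens:
        tokenizer.add_special_tokens(dict(cfg_special_tokens))
    return tokenizer


# e.g. a Falcon config would only need to supply the missing tokens
# (token strings here are just an example):
#   special_tokens:
#     bos_token: "<|endoftext|>"
#     pad_token: "<|endoftext|>"
```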

@NanoCode012
Collaborator Author

Yes, this should be closed. Thank you!
