
[Bug] Tokenizer's BOS/EOS/PAD not set for inference #139

Closed
NanoCode012 opened this issue Jun 1, 2023 · 9 comments
Labels: bug (Something isn't working)

Comments

@NanoCode012
Collaborator

Following https://discord.com/channels/1104757954588196865/1111279858136383509/1113729100763381804, a user reported that the tokenizer's bos/eos/pad tokens are not set in inference mode.

We can fix this by just setting these, the same way the following code does:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/288fd62431be84a7112fd461feeb9322f1177d3c/scripts/finetune.py#L66-L68

We need to update this as Alpaca is not the only method now.

Depends on #64
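
For reference, that block sets Alpaca/LLaMA-style special tokens at inference time. A minimal sketch of the same shape (not the exact source; the model name and token strings here are only illustrative):

```python
from transformers import AutoTokenizer

# illustrative model; in practice the base model comes from the yaml config
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# LLaMA/Alpaca-style special tokens, hard-coded; this is what breaks for
# tokenizers such as Falcon's, which use different token strings
tokenizer.add_special_tokens(
    {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
)

# a pad token is commonly aliased to eos for batched inference
tokenizer.pad_token = tokenizer.eos_token
```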

NanoCode012 added the bug (Something isn't working) label on Jun 1, 2023
@utensil
Contributor

utensil commented Jun 1, 2023

Actually, it's also seen when fine-tuning Falcon:

Setting ds_accelerator to cuda (auto detect)
WARNING:root:`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
INFO:root:loading tokenizer... tiiuae/falcon-7b
Using bos_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using unk_token, but it is not set yet.
INFO:root:Loading prepared dataset from disk at last_run_prepared/31a4e867d804a957707db033c9abcd13...
INFO:root:Prepared dataset loaded from disk...
INFO:root:loading model and peft_config...
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:16<00:00,  8.49s/it]
INFO:root:converting PEFT model w/ prepare_model_for_int8_training
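
Those "Using bos_token, but it is not set yet." lines come from the Hugging Face tokenizer itself whenever an unset special-token attribute is accessed. A quick way to inspect this for Falcon (a sketch, assuming a recent transformers install):

```python
from transformers import AutoTokenizer

# trust_remote_code was required for Falcon at the time of this issue
tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)

# typically only eos_token ("<|endoftext|>") is defined out of the box
print(tok.special_tokens_map)

# accessing an unset token logs "Using bos_token, but it is not set yet."
# and returns None
print(tok.bos_token)
```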

@NanoCode012
Collaborator Author

@utensil, which dataset format are you using?

@utensil
Contributor

utensil commented Jun 2, 2023

@utensil, which dataset format are you using?

alpaca:chat

as in #132

@winglian
Collaborator

winglian commented Jun 2, 2023

We should probably change these to info level and move them to after we add the special tokens from the config:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L53-L56
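
A rough sketch of that ordering (illustrative names, not the actual axolotl code): apply the special tokens from the config first, then report anything still unset at info level.

```python
import logging

from transformers import AutoTokenizer

LOG = logging.getLogger(__name__)


def load_tokenizer(base_model, special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

    # 1) apply the special tokens from the config first
    if special_tokens:
        tokenizer.add_special_tokens(dict(special_tokens))

    # 2) only then check what is still missing, and log it at info level;
    #    special_tokens_map lists only the tokens that are actually set
    for name in ("bos_token", "eos_token", "pad_token", "unk_token"):
        if name not in tokenizer.special_tokens_map:
            LOG.info("%s is not set on the tokenizer", name)

    return tokenizer
```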

@winglian
Collaborator

winglian commented Jun 2, 2023

Doesn't https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L68-L70 already solve this?

@NanoCode012
Collaborator Author

Doesn't https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L68-L70 already solve this?

Hm, yes. I think the problem is due to the code block in the first post: it may overwrite the previous setting. I think just removing it is enough? The main work is in #64.

@NanoCode012
Collaborator Author

If we were to remove the block in the first post, we need to make sure the llama configs/tokenizers have those tokens added somewhere to prevent a regression.

@AngainorDev
Contributor

AngainorDev commented Jun 12, 2023

If I understand it right, the code from #180 should be fine for training as well, allowing you to define only the not-yet-defined tokens, with the config keeping precedence in the case of Falcon, for instance.
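
For illustration, that behavior could look roughly like this (a sketch of one reading of #180, not the PR's actual code): only the tokens listed in the config are touched, so values a tokenizer already ships with stay intact, and the config wins for whatever it does define.

```python
def apply_special_tokens(tokenizer, cfg_special_tokens=None):
    # Tokens the config does not mention are left exactly as they were
    # (e.g. Falcon's existing eos_token); tokens it does mention are set
    # from the config, overriding any previous value.
    if cfg_special_tokens:
        tokenizer.add_special_tokens(dict(cfg_special_tokens))
    return tokenizer


# e.g. a Falcon config would only need to supply the missing tokens
# (token strings here are just an example):
#   special_tokens:
#     bos_token: "<|endoftext|>"
#     pad_token: "<|endoftext|>"
```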

@NanoCode012
Collaborator Author

Yes, this should be closed. Thank you!
