Formatting FIM data #22

Open
gojkoc54 opened this issue Sep 6, 2023 · 1 comment

gojkoc54 commented Sep 6, 2023

Hi,

I want to finetune my model on FIM-only data.
If I use this repo for FIM data formatting, it seems that a single chunk (i.e. a single element of ConstantLengthDataset) will frequently not contain all of the FIM components, and sometimes none of them, because long inputs have to be split into chunks.

Does this "hurt" FIM training? Would training benefit from formatting/splitting the data differently, so that all FIM components fit into a single chunk and are passed to the model together?

Thanks!
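
For illustration, one possible shape of the "different way of splitting" asked about above is to pack only whole FIM samples into each chunk and pad the rest. This is a minimal sketch, not code from this repo: the sentinel ids, `PAD`, and `pack_whole_samples` are hypothetical names, and skipping over-long samples is just one assumed policy.

```python
# Hypothetical token ids; a real tokenizer assigns its own values.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, EOS, PAD = 1, 2, 3, 0, -1


def pack_whole_samples(samples, seq_length, pad_id=PAD):
    """Greedily pack whole samples into chunks of seq_length, padding the rest.

    Samples longer than seq_length are skipped in this sketch; a pipeline
    could instead truncate the document before applying the FIM transform.
    """
    chunks, current = [], []
    for sample in samples:
        if len(sample) > seq_length:
            continue  # cannot fit intact, so it is dropped here
        if len(current) + len(sample) > seq_length:
            # Close the current chunk before it would split this sample.
            chunks.append(current + [pad_id] * (seq_length - len(current)))
            current = []
        current = current + sample
    if current:
        chunks.append(current + [pad_id] * (seq_length - len(current)))
    return chunks


# Four already-permuted samples in prefix/suffix/middle order.
samples = [
    [FIM_PREFIX] + [10] * 7 + [FIM_SUFFIX] + [12] * 3 + [FIM_MIDDLE] + [11] * 4 + [EOS],
    [FIM_PREFIX] + [10] * 3 + [FIM_SUFFIX] + [12] * 2 + [FIM_MIDDLE] + [11] + [EOS],
    [FIM_PREFIX] + [10] * 20 + [FIM_SUFFIX] + [12] * 4 + [FIM_MIDDLE] + [11] * 3 + [EOS],
    [FIM_PREFIX, 10, FIM_SUFFIX, 12, FIM_MIDDLE, 11, EOS],
]
packed = pack_whole_samples(samples, seq_length=24)
```

With this layout every chunk holds matching prefix, suffix, and middle sentinels, at the cost of padding tokens (which would normally be masked out of the loss) and of dropping samples that exceed `seq_length`. Whether that trade-off actually improves FIM training is not settled in this thread.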

hanlinGao commented Jan 5, 2024

I am also confused about this. It seems that fim.permute() returns samples of varying length, but the token stream is ultimately cut into seq_length pieces via all_token_ids[i: i + seq_length], which produces samples like <fim-prefix>xxxxxxx<fim-suffix>xxx that have no <fim-middle> or the content that follows it. Is this a deliberate trick for better generalization?
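
The effect described above can be reproduced with a small self-contained sketch. It is a simplified stand-in for the concatenate-then-slice behaviour, not the repo's actual ConstantLengthDataset: the sentinel ids and the helpers `make_psm_sample`, `chunk_stream`, and `has_complete_fim` are hypothetical.

```python
# Hypothetical sentinel ids; a real tokenizer assigns its own values.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, EOS = 1, 2, 3, 0


def make_psm_sample(prefix, middle, suffix):
    """Lay out one sample as <fim-prefix> P <fim-suffix> S <fim-middle> M <eos>."""
    return [FIM_PREFIX, *prefix, FIM_SUFFIX, *suffix, FIM_MIDDLE, *middle, EOS]


def chunk_stream(samples, seq_length):
    """Concatenate all samples, slice fixed-size windows, drop the short tail."""
    all_token_ids = [tok for sample in samples for tok in sample]
    return [
        all_token_ids[i: i + seq_length]
        for i in range(0, len(all_token_ids) - seq_length + 1, seq_length)
    ]


def has_complete_fim(chunk):
    """True if the chunk holds prefix, suffix, and middle sentinels in that order."""
    try:
        p = chunk.index(FIM_PREFIX)
        s = chunk.index(FIM_SUFFIX, p + 1)
        chunk.index(FIM_MIDDLE, s + 1)
        return True
    except ValueError:
        return False


# Three identical 18-token samples, sliced into 12-token chunks.
samples = [make_psm_sample([10] * 7, [11] * 4, [12] * 3) for _ in range(3)]
chunks = chunk_stream(samples, seq_length=12)
complete_flags = [has_complete_fim(c) for c in chunks]
```

In this toy setup the first chunk is exactly the <fim-prefix>xxxxxxx<fim-suffix>xxx pattern from the comment, and none of the four chunks contains a full prefix/suffix/middle triple, because the sample length (18) and seq_length (12) never line up.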
