Formatting FIM data #22

gojkoc54 · 2023-09-06T12:32:22Z

Hi,

I want to finetune my model on FIM-only data.
If I use this repo to format the FIM data, it seems that a single chunk (i.e. a single element of ConstantLengthDataset) will frequently not contain all of the FIM components, and sometimes none of them, because long inputs have to be split across chunks.

Does this "hurt" the FIM training? Would it benefit from a different way of formatting/splitting the data so that all FIM components fit into a single chunk (so that they get passed to the model together)?
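
To make the alternative concrete, here is a rough sketch of packing whole FIM samples into fixed-length chunks instead of slicing one long token stream. This is not code from this repo: `pack_whole_samples`, `pad_id`, and the choice to skip samples longer than `seq_length` are all assumptions made for illustration, and the integer lists stand in for already tokenized, already FIM-permuted samples.

```python
from typing import Iterable, List


def pack_whole_samples(
    samples: Iterable[List[int]], seq_length: int, pad_id: int
) -> List[List[int]]:
    """Greedily pack complete samples into chunks of exactly seq_length tokens.

    Hypothetical helper: a sample is never split across two chunks. Unused
    space at the end of a chunk is filled with pad_id.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    for sample in samples:
        if len(sample) > seq_length:
            # Assumption: samples that cannot fit in any chunk are skipped.
            continue
        if len(current) + len(sample) > seq_length:
            # The next sample does not fit, so pad and close the current chunk.
            chunks.append(current + [pad_id] * (seq_length - len(current)))
            current = []
        current.extend(sample)
    if current:
        chunks.append(current + [pad_id] * (seq_length - len(current)))
    return chunks


toy_samples = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10] * 9]
packed = pack_whole_samples(toy_samples, seq_length=8, pad_id=0)
```

The trade-off in a scheme like this is that some positions are spent on padding (which would normally be masked out of the loss), whereas slicing a concatenated stream wastes no tokens but can separate the FIM components.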

Thanks!

hanlinGao · 2024-01-05T07:29:50Z

I am also confused about this. It seems that fim.permute() returns samples of different lengths, but the token stream is then cut into seq_length pieces via all_token_ids[i: i + seq_length], which produces samples like <fim-prefix>xxxxxxx<fim-suffix>xxx with no <fim-middle> or the content that follows it. Is this a trick for better generalization?
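
A toy reproduction of the effect, as a sketch only: the tokens are plain strings rather than real token ids, `<eos>` is an assumed separator, `chunk_stream` and `missing_fim_tokens` are hypothetical helpers, and dropping the trailing partial chunk is an assumption about how the slicing behaves.

```python
from typing import List

FIM_TOKENS = ("<fim-prefix>", "<fim-suffix>", "<fim-middle>")


def chunk_stream(all_token_ids: List[str], seq_length: int) -> List[List[str]]:
    """Slice a concatenated token stream into full chunks of seq_length.

    Assumption: a trailing partial chunk is dropped.
    """
    last_start = len(all_token_ids) - seq_length
    return [
        all_token_ids[i : i + seq_length]
        for i in range(0, last_start + 1, seq_length)
    ]


def missing_fim_tokens(chunk: List[str]) -> List[str]:
    """Return the FIM sentinel tokens that do not appear in this chunk."""
    return [tok for tok in FIM_TOKENS if tok not in chunk]


# One FIM-permuted sample (11 tokens), repeated twice in the stream.
sample = [
    "<fim-prefix>", "p1", "p2", "p3",
    "<fim-suffix>", "s1", "s2",
    "<fim-middle>", "m1", "m2",
    "<eos>",
]
stream = sample * 2
chunks = chunk_stream(stream, seq_length=6)

for idx, chunk in enumerate(chunks):
    print(idx, chunk, "missing:", missing_fim_tokens(chunk))
```

With these toy values, none of the chunks contains all three sentinel tokens, which matches the truncated samples described above.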
