I want to fine-tune my model on FIM-only data.
If I use this repo for FIM data formatting, it seems it could frequently happen that a single chunk (i.e., a single element of `ConstantLengthDataset`) doesn't contain all of the FIM components (or sometimes contains none of them), because long inputs need to be chunked.
Does this "hurt" FIM training? Would it benefit from a different way of formatting/splitting the data, so that all FIM components fit into a single chunk (and are passed to the model together)?
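To make the concern concrete, here is a small sketch (not the repo's actual code; the toy tokenizer and sentinel names are illustrative) of how fixed-length packing can split a FIM-formatted example so that individual chunks miss some, or all, of the sentinel tokens:

```python
import re

# Two FIM-formatted "documents", already permuted into prefix-suffix-middle order.
docs = [
    "<fim-prefix>aaaa<fim-suffix>bb<fim-middle>cc",
    "<fim-prefix>dddd<fim-suffix>ee<fim-middle>ff",
]

# Pretend tokenization: each sentinel is one token, each other char is one token.
def toy_tokenize(text):
    return re.findall(r"<fim-(?:prefix|suffix|middle)>|.", text)

# Concatenate all documents into one token stream, then slice it into
# fixed-size chunks, mimicking all_token_ids[i : i + seq_length].
eod = "<|endoftext|>"
stream = []
for d in docs:
    stream.extend(toy_tokenize(d))
    stream.append(eod)

seq_length = 7
chunks = [stream[i : i + seq_length] for i in range(0, len(stream), seq_length)]

for c in chunks:
    present = [t for t in ("<fim-prefix>", "<fim-suffix>", "<fim-middle>") if t in c]
    print(present)
```

With these toy numbers, the first chunk has a prefix and suffix but no middle, and the last chunk contains no FIM sentinel at all, which is exactly the situation described above.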
Thanks!
I am also confused about this. It seems that the method `fim.permute()` returns samples of different lengths, but they are ultimately chunked to `seq_length` via `all_token_ids[i : i + seq_length]`, producing samples like `<fim-prefix>xxxxxxx<fim-suffix>xxx` that contain no `<fim-middle>` token or the content that should follow it. Is this a trick for better generalization?
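For comparison, the alternative the original question asks about could look roughly like this: a hypothetical helper (not from this repo; the function name, sentinel tokens, and padding scheme are assumptions) that applies the FIM permutation per example and fits each example into a single fixed-length sample, so all three sentinels always land in the same training sample:

```python
import random

def fim_permute_whole_example(tokens, seq_length, pad_token, rng=random):
    """Split one document into prefix/middle/suffix and emit a single
    fixed-length sample containing all three FIM sentinels."""
    tokens = tokens[: seq_length - 3]          # reserve room for 3 sentinels
    if len(tokens) < 3:
        return None                            # too short to split
    # Pick two distinct cut points, keeping each part non-empty.
    lo, hi = sorted(rng.sample(range(1, len(tokens)), 2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    sample = (["<fim-prefix>"] + prefix
              + ["<fim-suffix>"] + suffix
              + ["<fim-middle>"] + middle)
    sample += [pad_token] * (seq_length - len(sample))  # pad instead of packing
    return sample

sample = fim_permute_whole_example(list("abcdefgh"), 16, "<pad>",
                                   rng=random.Random(0))
```

The trade-off is that truncating and padding each example wastes training tokens, whereas the packed `ConstantLengthDataset` approach keeps every position trainable at the cost of sometimes splitting FIM components across chunks.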