lm-eval-harness WikiText bug #46
Comments
Any help on this would be appreciated (a hacky suggestion would work too!). I am experimenting with architectures where the block mask isn't valid, so perplexity will be a decent metric to track different architectures. I used the usual GPT fixes, but I don't think it's quite right.
This is the fix I had to apply; I kept the try-except check on the original method to ensure it doesn't cause issues elsewhere (note that this was in lm-eval: lm_eval/api/metrics.py). It could likely also be fixed by changing loglikelihood_rolling in lingua/apps/main/eval.py here:
but I avoided that solution so as not to mess with the overall framework. However, I must say: wikitext2 word perplexity in less than 400 steps on a 24M param model with fineweb_edu_10bt_shuffled is 4.84 (started at ~15). Not sure if that is reasonable, so I am uncertain whether this is the right solution.
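The actual patch wasn't preserved in this thread, so the snippet below is only a minimal sketch of what a try-except-guarded weighted-mean perplexity in the style of lm_eval/api/metrics.py could look like. The function names mirror lm-eval's weighted_mean / weighted_perplexity, but the fallback branch is an assumption about the "hacky" fix, not the poster's actual code.

```python
import math

def weighted_mean(items):
    # items are expected to be (loglikelihood, weight) pairs,
    # e.g. (sum of token logprobs, word count) for wikitext word perplexity
    try:
        a, b = zip(*items)
        return sum(a) / sum(b)
    except TypeError:
        # Assumed fallback: if the model returned bare floats instead of
        # (loglikelihood, weight) tuples, fall back to a plain mean so the
        # metric still runs. This branch is a guess, not the real fix.
        return sum(items) / len(items)

def weighted_perplexity(items):
    return math.exp(-weighted_mean(items))
```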
+1, the fix works for me, though fixing this in the actual repo would be nice.
Glad to hear the fix works for you, but for word perplexity, did you also observe that it was surprisingly low on WK2? (I am getting around 2.5 perplexity on a subset with 1024 seq-len, with a 25M param model), which sounds wrong. Follow-up on this by the authors of lingua would also be very much appreciated (@BadrYoubiIdrissi -- sorry to bother, but this is the only way I can evaluate my method due to weird attention masking.) 😅
I have only tried with Mamba models. I pretrained a 130M parameter model on 7.5B tokens of FineWeb-Edu and got around 28.17 word perplexity, which I think is comparable to GPT-2 performance. The 2.5 seems very low, and there is definitely something weird going on there.
Hmm, even the base transformer gives very low perplexity. I will test it out more thoroughly and maybe pull in Mamba models for comparison to see if I can fix it.
I think this was a sequence-length issue -- @Hprairie, if you are evaluating at 8192 seq-len, you might still be having this issue, just not noticing it. Specifically, the sequences are capped at the max (the eval seq-len you specify), but the true length of the document is used when calculating the weighted mean. Very few wikitext2 docs have length > 8192, but with the earlier fix you might end up underestimating perplexity, as it would take loglikelihood_on_8192 / true_doc_len, where true_doc_len can be > 8192. This is done in apps/main/generate.py:
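The referenced snippet from apps/main/generate.py wasn't included in the thread, so here is just a toy illustration of the mismatch described above (the numbers are made up): the log-likelihood is computed over at most the eval seq-len tokens, but the weight used in the weighted mean is the uncapped document length, which drags the reported perplexity down.

```python
import math

# Hypothetical numbers for one long wikitext2 document
eval_seq_len = 8192
true_doc_len = 12000                      # length of the full document
ll_on_first_8192 = -8192 * math.log(40)   # log-likelihood of the truncated span only
                                          # (i.e. a true per-unit perplexity of 40)

# Mismatched weighting (score over 8192 units, weight = full length):
ppl_mismatched = math.exp(-ll_on_first_8192 / true_doc_len)   # ~12.4, too optimistic

# Consistent weighting (score and weight both over the truncated span):
ppl_consistent = math.exp(-ll_on_first_8192 / eval_seq_len)   # = 40.0
print(ppl_mismatched, ppl_consistent)
```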
So, the full solution (I think) is, in apps/main/eval.py, def loglikelihood_rolling:
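Again, the actual diff wasn't posted; the sketch below only illustrates the idea under the assumption that loglikelihood_rolling returns (loglikelihood, weight) pairs and that the weight should be capped to the span that was actually scored. Names such as self.tokenizer, self.max_seq_len, and self.score_tokens are placeholders for illustration, not lingua's real API.

```python
def loglikelihood_rolling(self, requests):
    """Illustrative sketch: use the capped length as the weight."""
    results = []
    for req in requests:
        (text,) = req.args
        tokens = self.tokenizer.encode(text)      # placeholder tokenizer call
        capped = tokens[: self.max_seq_len]       # truncate to the eval seq-len
        ll = self.score_tokens(capped)            # placeholder: sum of token logprobs
        # Report the *capped* length, not len(tokens), so the downstream
        # weighted mean divides by the same span the model actually scored.
        results.append((ll, len(capped)))
    return results
```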
and then a try-except (to support older code, in case I am making a mistake).
With the fix above, I am getting reasonable perplexities (> 36.9) in the early stages of training. Do let me know if you think my approach is wrong... Thanks!
Ahh, I see. Yeah, I was evaluating with 4096, so that's probably why I was getting higher results. Good catch, though: are you sure that this is calculating word perplexity now and not token perplexity? I'll take a deeper dive into this, though this definitely seems like something that should be fixed in lingua.
Hmm, I see. I think this might in this case be token perplexity? Though it reports word perplexity; I haven't had the time to investigate in much depth. Further, I actually see some interesting behavior in my training runs... the loss always decreases till the end, but the word perplexity I am evaluating does not decrease; in fact, it starts increasing. The grad norm goes down exponentially, and then slowly starts rising towards the end of training. Did you observe similar effects?
It might be useful for the authors of the framework to know that I did not observe this issue with the dclm_baseline_1.0_shuffled dataset (1 out of 10 global shards); this only happens on fineweb_edu_10bt_shuffled. Not sure why.
Hmmm, that is super weird, and I haven't noticed that in my training logs. I have been using fineweb_edu_10bt_shuffled also. Are you training a model with context length equal to your evaluation size? I will also say that a length of 8192 words is hard to compare with the training context seqlen, i.e. 1024 words doesn't imply 1024 tokens (it could be fewer or more).
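As a quick sanity check (not from the thread), one can compare word and token counts directly with a GPT-2-style BPE tokenizer; the exact ratio depends on the tokenizer and the text, so the assumption here is just that the model uses something GPT-2-like:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Word counts and token counts rarely match for BPE tokenizers."
# Whitespace-split word count vs. BPE token count; typically more tokens than words.
print(len(text.split()), len(tok.encode(text)))
```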
That makes sense. In my case I train and eval both on 1024 tokens, but I'm happy this isn't happening on dclm, so I'll stick with that for now. grad_norm still increases by the end, but I saw the same in the wandb logs for OLMo 7B, so perhaps it isn't a problem.
Just in case it helps, I think this was a data-processing misunderstanding; I detail this in issue #55. I think I was only training on one chunk of the JSON files, which means many epochs on a small dataset when using fineweb-edu, but this doesn't happen with dclm because the chunks are bigger.
Wow, I didn't see that issue; I will have to rerun some experiments now. Thanks for the heads-up. It seems like I should just keep everything in a single chunk to ensure that everything is trained on.
Yes, I have asked the authors to explicitly post this to avoid such issues; it's easy to miss. How are you evaluating the LLM? Most of the evaluations are failing for me, I think due to bugs in the multi-choice benchmark integration with lm-eval. At the moment my only evaluation metric is wikitext perplexity.
Hello,
I am trying to use the harness for the WikiText task, and I get the following error
I am unsure how to fix it on my end...
Thank you for open-sourcing the library