
lm-eval-harness WikiText bug #46

Open
akhauriyash opened this issue Nov 4, 2024 · 15 comments

Comments

@akhauriyash

Hello,

I am trying to use the harness for the WikiText task, and I get the following error:


0: INFO    24-11-04 14:47:26.234773 - 0:00:52 - Building contexts for wikitext on rank 0...                                             
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 528.63it/s]
0: INFO    24-11-04 14:47:26.355134 - 0:00:52 - Running loglikelihood_rolling requests
0: INFO    24-11-04 14:47:32.996628 - 0:00:59 - Killing async data process 407418 ...
0: INFO    24-11-04 14:47:33.010259 - 0:00:59 - Async dataloader cleaned up
[rank0]: Traceback (most recent call last):                 
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 652, in <module>
[rank0]:     main()                                                  
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 648, in main
[rank0]:     train(cfg)          
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 560, in train
[rank0]:     launch_eval(eval_args)
[rank0]:   File "/scratch/ya255/lingua/apps/main/eval.py", line 185, in launch_eval
[rank0]:     results = simple_evaluate(wrap, **asdict(cfg.harness))
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)                                                                                                    
[rank0]:            ^^^^^^^^^^^^^^^^^^^                             
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator.py", line 301, in simple_evaluate
[rank0]:     results = evaluate(
[rank0]:               ^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator.py", line 599, in evaluate
[rank0]:     task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator_utils.py", line 104, in calculate_aggregate_metric
[rank0]:     self.agg_metrics[metric_key] = agg_fn(items)
[rank0]:                                    ^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/api/metrics.py", line 43, in weighted_perplexity
[rank0]:     return math.exp(-weighted_mean(items))
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/api/metrics.py", line 406, in weighted_mean
[rank0]:     return sum(a) / sum(b)
[rank0]:            ^^^^^^
[rank0]: TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

I am unsure how to fix it on my end...

Thank you for open-sourcing the library

@akhauriyash
Author

Any help on this would be appreciated (a hacky suggestion would work too!) -- I am experimenting with architectures where the block mask isn't valid, so perplexity would be a decent metric for tracking different architectures.

I tried the usual GPT-suggested fixes, but I don't think they're quite right.

@akhauriyash
Author

akhauriyash commented Nov 9, 2024

 def weighted_mean(items):
     a, b = zip(*items)
-    return sum(a) / sum(b)
+    try:
+        wmean = sum(a) / sum(b)
+    except:
+        wmean = sum([x[0] for x in a]) / sum(b)
+    return wmean
 

This is the fix I had to apply; I kept the try/except around the original method to make sure it doesn't cause issues elsewhere.

(Note that this was in lm-eval: lm_eval/api/metrics.py.)

This can likely be fixed properly by changing loglikelihood_rolling in lingua/apps/main/eval.py here:

        for ll in lls:
            results.append((ll.sum().item(),)) # <-- This tuple seems incorrect.

but I avoided that solution so as not to mess with the overall framework.
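For anyone else hitting this, here is a minimal sketch of why the aggregation blows up. The values are made up, and the second element of each pair is, as far as I can tell, the task's word count -- only the shapes matter:

# weighted_mean receives one (loglikelihood, weight) pair per document, but
# with lingua's loglikelihood_rolling the loglikelihood arrives as a 1-tuple.
items = [((-123.4,), 245), ((-98.7,), 180)]

a, b = zip(*items)   # a = ((-123.4,), (-98.7,)),  b = (245, 180)
print(sum(b))        # fine: 425
print(sum(a))        # TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

Returning a plain float from loglikelihood_rolling (or patching weighted_mean as above) makes the sum well-defined.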

However, I must say -- WikiText-2 word perplexity after fewer than 400 steps on a 24M-param model trained on fineweb_edu_10bt_shuffled is 4.84 (it started at ~15). I am not sure that is reasonable, so I am uncertain whether this is the right solution.

@Hprairie

+1, the fix works for me, though fixing this in the actual repo would be nice.

@akhauriyash
Author

akhauriyash commented Nov 13, 2024

Glad to hear the fix works for you, but for word perplexity, did you also observe that it was quite low on WikiText-2? (I am getting around 2.5 perplexity on a subset with 1024 seq-len, with a 25M-param model), which sounds wrong. Follow-up on this by the authors of lingua would also be very much appreciated (@BadrYoubiIdrissi -- sorry to bother you, but this is the only way I can evaluate my method due to weird attention masking.) 😅

@Hprairie

I have only tried this with Mamba models. I pretrained a 130M-parameter model on 7.5B tokens of FineWeb-Edu and got around 28.17 word perplexity, which I think is comparable to GPT-2 performance. The 2.5 seems very low, and there is definitely something weird going on there.

@akhauriyash
Author

Hmm, even the base transformer gives very low perplexity. I will test it out more thoroughly and maybe pull in Mamba models for comparison to see if I can fix it.

@akhauriyash
Author

I think this was a sequence-length issue -- @Hprairie, if you are evaluating at 8192 seq-len, you might still be having this issue, just not noticing it.

Specifically, sequences are capped at the max (the eval seq-len you specify), but the true length of the document is used when calculating the weighted mean. There may be very few WikiText-2 docs longer than 8192, but with the earlier fix you can end up underestimating perplexity, since it computes loglikelihood_on_8192 / true_doc_len, where true_doc_len can be > 8192.
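A rough back-of-the-envelope sketch of the effect (hypothetical numbers, not from an actual run):

import math

# Hypothetical long document: 10,000 "true" tokens, but only the last 8192
# are actually scored because of the max_prompt_len cap quoted below.
true_doc_len = 10_000
scored_len = 8192
avg_nll_per_token = 3.6                           # made-up average negative log-likelihood

loglikelihood = -avg_nll_per_token * scored_len   # summed over the 8192 scored tokens

# Dividing by the number of tokens actually scored:
ppl_correct = math.exp(-loglikelihood / scored_len)           # exp(3.6)  ~ 36.6

# Dividing by the true document length (what the earlier fix effectively does):
ppl_underestimated = math.exp(-loglikelihood / true_doc_len)  # exp(2.95) ~ 19.1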

The capping is done in apps/main/generate.py:

        prompts = [p[-max_prompt_len:] for p in prompts]

So, the full solution (I think) is:

In apps/main/eval.py, in def loglikelihood_rolling:

        for ll in lls:
            results.append((ll.sum().item(), len(ll)))

and then a try/except in weighted_mean (to keep supporting the old format, in case I am making a mistake):


def weighted_mean(items):
    try:
        # New format from the eval.py change above: each item looks like
        # ((loglikelihood, num_scored_tokens), task weight, e.g. word count),
        # so aggregate the loglikelihood over the tokens actually scored.
        clean_items = [x[0] for x in items]
        a, b = zip(*clean_items)
        wmean = sum(a) / sum(b)
    except:
        # Original lm-eval format: each item is a plain (loglikelihood, weight) pair.
        a, b = zip(*items)
        try:
            wmean = sum(a) / sum(b)
        except:
            # Fallback for the earlier hacky fix, where loglikelihoods were 1-tuples.
            wmean = sum([x[0] for x in a]) / sum(b)
    return wmean

With the fix above, I am getting reasonable perplexities (> 36.9) in the early stages of training; do let me know if you think my approach is wrong... Thanks!

@Hprairie

Ahh, I see, yeah I was evaluating with 4096, so that's probably why I was getting higher results. Good catch, though -- are you sure that this is calculating word perplexity now and not token perplexity? I'll take a deeper dive into this, though this definitely seems like something that should be fixed in lingua.

@akhauriyash
Author

akhauriyash commented Nov 15, 2024

Hmm, I see. I think this might be token perplexity in this case? Though it reports word perplexity; I haven't had the time to investigate in much depth.
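For reference, a rough sketch of the distinction (the numbers are made up; the ratio of tokens to whitespace-split words depends on the tokenizer):

import math

total_neg_loglikelihood = 9000.0   # made-up sum of -log p over the eval set
num_tokens = 3000                  # tokens the model was actually scored on
num_words = 2400                   # whitespace-split words in the same documents

token_ppl = math.exp(total_neg_loglikelihood / num_tokens)  # exp(3.0)  ~ 20.1
word_ppl = math.exp(total_neg_loglikelihood / num_words)    # exp(3.75) ~ 42.5

# Same model, same data: word perplexity is higher whenever there are more
# tokens than words, so the two numbers are not directly comparable.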

Further, I actually see some interesting behavior in my training runs... the loss keeps decreasing until the end, but the word perplexity I am evaluating does not decrease -- in fact, it starts increasing. The grad norm goes down exponentially and then slowly starts rising towards the end of training.

Did you observe similar effects?

[screenshot: training curves for loss, word perplexity, and grad norm]

@akhauriyash
Author

akhauriyash commented Nov 16, 2024

It might be useful for the authors of the framework to know that I did not observe this issue with the dclm_baseline_1.0_shuffled dataset (1 out of 10 global shards); it only happens on fineweb_edu_10bt_shuffled. Not sure why.

@Hprairie

Hmmm, that is super weird; I haven't noticed that in my training logs, and I have also been using fineweb_edu_10bt_shuffled. Are you training a model with context length equal to your evaluation size? I will also say that a length of 8192 words is hard to compare with the training context seq-len, i.e. 1024 words doesn't imply 1024 tokens (it could be fewer or more).

@akhauriyash
Author

That makes sense. In my case I train and eval both on 1024 tokens, but I'm happy this isn't happening on dclm, so I'll stick with that for now. grad_norm still increases by the end, but I saw the same in the wandb logs for OLMo-7B, so perhaps it isn't a problem.

@akhauriyash
Author

Just in case it helps: I think this was a data-processing misunderstanding, which I detail in issue #55. I think I was only training on one chunk of the JSON files, which means many epochs over a small dataset when using fineweb-edu; this doesn't happen with dclm because the chunks are bigger.

@Hprairie

Wow, I didn't see that issue; I will have to rerun some experiments now. Thanks for the heads-up. It seems like I should just keep everything in a single chunk to ensure that all the data is trained on.

@akhauriyash
Author

akhauriyash commented Nov 25, 2024

Yes, I have asked the authors to document this explicitly to avoid such issues; it's easy to miss.

How are you evaluating the LLM? Most of the evaluations are failing for me, I think due to bugs in the multiple-choice benchmark integration with lm-eval.

At the moment my only evaluation metric is WikiText perplexity...
