
lm-eval-harness WikiText bug #46

Open
akhauriyash opened this issue Nov 4, 2024 · 15 comments

Comments

@akhauriyash

Hello,

I am trying to use the harness for the WikiText task, and I get the following error:


0: INFO    24-11-04 14:47:26.234773 - 0:00:52 - Building contexts for wikitext on rank 0...                                             
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 528.63it/s]
0: INFO    24-11-04 14:47:26.355134 - 0:00:52 - Running loglikelihood_rolling requests
0: INFO    24-11-04 14:47:32.996628 - 0:00:59 - Killing async data process 407418 ...
0: INFO    24-11-04 14:47:33.010259 - 0:00:59 - Async dataloader cleaned up
[rank0]: Traceback (most recent call last):                 
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 652, in <module>
[rank0]:     main()                                                  
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 648, in main
[rank0]:     train(cfg)          
[rank0]:   File "/scratch/ya255/lingua/apps/main/train.py", line 560, in train
[rank0]:     launch_eval(eval_args)
[rank0]:   File "/scratch/ya255/lingua/apps/main/eval.py", line 185, in launch_eval
[rank0]:     results = simple_evaluate(wrap, **asdict(cfg.harness))
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)                                                                                                    
[rank0]:            ^^^^^^^^^^^^^^^^^^^                             
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator.py", line 301, in simple_evaluate
[rank0]:     results = evaluate(
[rank0]:               ^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator.py", line 599, in evaluate
[rank0]:     task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/evaluator_utils.py", line 104, in calculate_aggregate_metric
[rank0]:     self.agg_metrics[metric_key] = agg_fn(items)
[rank0]:                                    ^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/api/metrics.py", line 43, in weighted_perplexity
[rank0]:     return math.exp(-weighted_mean(items))
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ya255/.conda/envs/lingua_241023/lib/python3.11/site-packages/lm_eval/api/metrics.py", line 406, in weighted_mean
[rank0]:     return sum(a) / sum(b)
[rank0]:            ^^^^^^
[rank0]: TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

I am unsure how to fix it on my end...

Thank you for open-sourcing the library

@akhauriyash
Author

Any help on this would be appreciated (a hacky suggestion would work too!) -- I am experimenting with architectures where the block mask isn't valid, so perplexity would be a decent metric for tracking different architectures.

I tried the usual GPT-suggested fixes, but I don't think they're quite right.

@akhauriyash
Author

akhauriyash commented Nov 9, 2024

 def weighted_mean(items):
     a, b = zip(*items)
-    return sum(a) / sum(b)
+    try:
+        wmean = sum(a) / sum(b)
+    except:
+        wmean = sum([x[0] for x in a]) / sum(b)
+    return wmean
 

This is the fix I had to apply; I kept the try/except around the original method to make sure it doesn't cause issues elsewhere.

(Note that this was in lm-eval: lm_eval/api/metrics.py.)

This can likely be fixed properly by changing loglikelihood_rolling in lingua/apps/main/eval.py here:

        for ll in lls:
            results.append((ll.sum().item(),)) # <-- This tuple seems incorrect.

but I avoided that solution so as not to mess with the overall framework.
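For anyone else hitting this, here is a minimal sketch of why the aggregation blows up. The values are made up, and the second element of each pair is, as far as I can tell, the task's word count -- only the shapes matter:

# weighted_mean receives one (loglikelihood, weight) pair per document, but
# with lingua's loglikelihood_rolling the loglikelihood arrives as a 1-tuple.
items = [((-123.4,), 245), ((-98.7,), 180)]

a, b = zip(*items)   # a = ((-123.4,), (-98.7,)),  b = (245, 180)
print(sum(b))        # fine: 425
print(sum(a))        # TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

Returning a plain float from loglikelihood_rolling (or patching weighted_mean as above) makes the sum well-defined.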

However, I must say -- WikiText-2 word perplexity after fewer than 400 steps on a 24M-param model trained on fineweb_edu_10bt_shuffled is 4.84 (it started at ~15). I am not sure that is reasonable, so I am uncertain whether this is the right solution.

@Hprairie

+1, the fix works for me, though fixing this in the actual repo would be nice.

@akhauriyash
Author

akhauriyash commented Nov 13, 2024

Glad to hear the fix works for you, but for word perplexity, did you also observe that it was quite low on WikiText-2? (I am getting around 2.5 perplexity on a subset with 1024 seq-len, with a 25M-param model), which sounds wrong. Follow-up on this by the authors of lingua would also be very much appreciated (@BadrYoubiIdrissi -- sorry to bother you, but this is the only way I can evaluate my method due to weird attention masking.) 😅

@Hprairie

I have only tried this with Mamba models. I pretrained a 130M-parameter model on 7.5B tokens of FineWeb-Edu and got around 28.17 word perplexity, which I think is comparable to GPT-2 performance. The 2.5 seems very low, and there is definitely something weird going on there.

@akhauriyash
Author

Hmm, even the base transformer gives very low perplexity. I will test it out more thoroughly and maybe pull in Mamba models for comparison to see if I can fix it.

@akhauriyash
Author

I think this was a sequence-length issue -- @Hprairie, if you are evaluating at 8192 seq-len, you might still be having this issue, just not noticing it.

Specifically, sequences are capped at the max (the eval seq-len you specify), but the true length of the document is used when calculating the weighted mean. There may be very few WikiText-2 docs longer than 8192, but with the earlier fix you can end up underestimating perplexity, since it computes loglikelihood_on_8192 / true_doc_len, where true_doc_len can be > 8192.
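A rough back-of-the-envelope sketch of the effect (hypothetical numbers, not from an actual run):

import math

# Hypothetical long document: 10,000 "true" tokens, but only the last 8192
# are actually scored because of the max_prompt_len cap quoted below.
true_doc_len = 10_000
scored_len = 8192
avg_nll_per_token = 3.6                           # made-up average negative log-likelihood

loglikelihood = -avg_nll_per_token * scored_len   # summed over the 8192 scored tokens

# Dividing by the number of tokens actually scored:
ppl_correct = math.exp(-loglikelihood / scored_len)           # exp(3.6)  ~ 36.6

# Dividing by the true document length (what the earlier fix effectively does):
ppl_underestimated = math.exp(-loglikelihood / true_doc_len)  # exp(2.95) ~ 19.1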

The capping is done in apps/main/generate.py:

        prompts = [p[-max_prompt_len:] for p in prompts]

So, the full solution (I think) is:

In apps/main/eval.py, in def loglikelihood_rolling:

        for ll in lls:
            results.append((ll.sum().item(), len(ll)))

and then a try/except in weighted_mean (to keep supporting the old format, in case I am making a mistake):


def weighted_mean(items):
    try:
        # New format from the eval.py change above: each item looks like
        # ((loglikelihood, num_scored_tokens), task weight, e.g. word count),
        # so aggregate the loglikelihood over the tokens actually scored.
        clean_items = [x[0] for x in items]
        a, b = zip(*clean_items)
        wmean = sum(a) / sum(b)
    except:
        # Original lm-eval format: each item is a plain (loglikelihood, weight) pair.
        a, b = zip(*items)
        try:
            wmean = sum(a) / sum(b)
        except:
            # Fallback for the earlier hacky fix, where loglikelihoods were 1-tuples.
            wmean = sum([x[0] for x in a]) / sum(b)
    return wmean

With the fix above, I am getting reasonable perplexities (> 36.9) in the early stages of training; do let me know if you think my approach is wrong... Thanks!

@Hprairie

Ahh, I see, yeah I was evaluating with 4096, so that's probably why I was getting higher results. Good catch, though -- are you sure that this is calculating word perplexity now and not token perplexity? I'll take a deeper dive into this, though this definitely seems like something that should be fixed in lingua.

@akhauriyash
Author

akhauriyash commented Nov 15, 2024

Hmm, I see. I think this might be token perplexity in this case? Though it reports word perplexity; I haven't had the time to investigate in much depth.
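For reference, a rough sketch of the distinction (the numbers are made up; the ratio of tokens to whitespace-split words depends on the tokenizer):

import math

total_neg_loglikelihood = 9000.0   # made-up sum of -log p over the eval set
num_tokens = 3000                  # tokens the model was actually scored on
num_words = 2400                   # whitespace-split words in the same documents

token_ppl = math.exp(total_neg_loglikelihood / num_tokens)  # exp(3.0)  ~ 20.1
word_ppl = math.exp(total_neg_loglikelihood / num_words)    # exp(3.75) ~ 42.5

# Same model, same data: word perplexity is higher whenever there are more
# tokens than words, so the two numbers are not directly comparable.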

Further, I actually see some interesting behavior in my training runs... the loss keeps decreasing until the end, but the word perplexity I am evaluating does not decrease -- in fact, it starts increasing. The grad norm goes down exponentially and then slowly starts rising towards the end of training.

Did you observe similar effects?

[screenshot: training curves for loss, word perplexity, and grad norm]

@akhauriyash
Author

akhauriyash commented Nov 16, 2024

It might be useful for the authors of the framework to know that I did not observe this issue with the dclm_baseline_1.0_shuffled dataset (1 out of 10 global shards); it only happens on fineweb_edu_10bt_shuffled. Not sure why.

@Hprairie

Hmmm, that is super weird; I haven't noticed that in my training logs, and I have also been using fineweb_edu_10bt_shuffled. Are you training a model with context length equal to your evaluation size? I will also say that a length of 8192 words is hard to compare with the training context seq-len, i.e. 1024 words doesn't imply 1024 tokens (it could be fewer or more).

@akhauriyash
Author

That makes sense. In my case I train and eval both on 1024 tokens, but I'm happy this isn't happening on dclm, so I'll stick with that for now. grad_norm still increases by the end, but I saw the same in the wandb logs for OLMo-7B, so perhaps it isn't a problem.

@akhauriyash
Author

Just in case it helps: I think this was a data-processing misunderstanding, which I detail in issue #55. I think I was only training on one chunk of the JSON files, which means many epochs over a small dataset when using fineweb-edu; this doesn't happen with dclm because the chunks are bigger.

@Hprairie

Wow, I didn't see that issue; I will have to rerun some experiments now. Thanks for the heads-up. It seems like I should just keep everything in a single chunk to ensure that all the data is trained on.

@akhauriyash
Author

akhauriyash commented Nov 25, 2024

Yes, I have asked the authors to document this explicitly to avoid such issues; it's easy to miss.

How are you evaluating the LLM? Most of the evaluations are failing for me, I think due to bugs in the multiple-choice benchmark integration with lm-eval.

At the moment my only evaluation metric is WikiText perplexity...
