I failed to reproduce the Llama2-7b-4k (w/o SFT) in the paper #17

Open
WNQzhu opened this issue Jul 31, 2024 · 1 comment
Comments

WNQzhu commented Jul 31, 2024

Hi, I failed to reproduce the Llama2-7b-4k (w/o SFT) results reported in the paper.

Here are our results:

| Methods | Tokens | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction | Avg |
|---|---|---|---|---|---|---|---|---|
| (L-Eval) Llama2-7b-4k (w/o SFT) | 4k | 20.05 | 2.0 | 28.71 | 24.53 | 0.00 | 40.62 | 19.31 |
| (Ours) Llama2-7b-4k (w/o SFT) | 4k | 15.26 | 19.0 | 30.69 | 13.01 | 3.33 | 35.93 | 19.54 |

Here is our experimental setting:
We modified the llama2-chat-test.py file, disabled the NTK parameters, and used Llama2-7b to run the evaluation, invoked like this:

```
python3 Baselines/llama2-chat-test.py \
    --scale 7b \
    --max_length 4k \
    --metric exam_eval
```
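For reference, a minimal sketch of what disabling NTK-style RoPE scaling might look like when loading the base model; this assumes the Hugging Face transformers API and is not the repo's actual code (the model path and kwargs are illustrative):

```python
# Hypothetical sketch (not the repo's code): load base Llama2-7b
# with no NTK-aware RoPE scaling, via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # base model, not -chat
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    rope_scaling=None,  # leave RoPE unscaled, i.e. NTK extension off
)
```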

What could be the reason for this? Should I adjust the prompt or other parameters?

@ChenxinAn-fdu
Collaborator

I did not use the chat format of Llama2-chat to test the base model. The prompt is very simple:
```
long ctx \nQ: instruction \nA:
```
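A minimal sketch of that prompt construction, assuming `context` and `instruction` come from an L-Eval example (the function and variable names here are illustrative, not from the repo):

```python
def build_prompt(context: str, instruction: str) -> str:
    # Plain completion-style prompt for the base model: the long
    # context, then the question, then "A:" for the model to complete.
    # No [INST]...[/INST] chat template as used for Llama2-chat.
    return f"{context}\nQ: {instruction}\nA:"
```

Evaluating a base (non-SFT) model through the chat template could plausibly shift exam-style scores, which may account for part of the gap you observed.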
