In this GitHub repository, I present a baseline solution for the Elementary Math Solving task from the ZaloAI Challenge 2023. Built on the mathematical reasoning capabilities of the Deepseek-math model, this approach reaches about 80% accuracy on the competition's private test set (82% in the best configuration listed in the results table).
git clone https://github.com/dinhquy-nguyen-1704/ZaloAI2023-Elementary-Math-Solving.git
cd ZaloAI2023-Elementary-Math-Solving
pip install -r requirements.txt
huggingface-cli login
wandb login
I fine-tune the model using only the competition's training set of just over 1,000 samples.
To rerun the fine-tuning code, execute the following command.
python main.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>
You can also find the fine-tuned model I've trained at [🤗 Models] and the merged version at [🤗 Models].
To run inference with a fine-tuned model on any elementary math multiple-choice question, use the following commands.
Chain of Thought:
python inference_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>
Few-shot Chain of Thought:
python inference_few_shot_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>
You can also run inference directly with the model I've fine-tuned.
Chain of Thought:
python inference_cot.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-v2
Few-shot Chain of Thought:
python inference_few_shot_cot.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-v2
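The difference between the two prompting modes can be illustrated with a minimal sketch. The helper names, the instruction wording, and the example format below are all assumptions made for illustration; they are not taken from `inference_cot.py` or `inference_few_shot_cot.py`, whose actual prompts may differ.

```python
from typing import Dict, Sequence


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render a question followed by its lettered options, one per line."""
    # Assumption: each choice already carries its letter, e.g. "A. 12".
    return "\n".join([f"Question: {question}", *choices])


def build_cot_prompt(
    question: str,
    choices: Sequence[str],
    examples: Sequence[Dict] = (),
) -> str:
    """Build a Chain of Thought prompt.

    With no examples this is a plain CoT prompt; passing worked examples
    (hypothetical dicts with "question", "choices", "solution") makes it few-shot.
    """
    parts = []
    for ex in examples:
        solved = format_question(ex["question"], ex["choices"])
        parts.append(f"{solved}\nSolution: {ex['solution']}")
    # Hypothetical instruction text asking for step-by-step reasoning.
    target = format_question(question, choices)
    parts.append(
        f"{target}\n"
        "Reason step by step, then give the final answer as A, B, C or D.\n"
        "Solution:"
    )
    return "\n\n".join(parts)


choices = ["A. 4", "B. 5", "C. 6", "D. 7"]
cot_prompt = build_cot_prompt("What is 2 + 3?", choices)
few_shot_prompt = build_cot_prompt(
    "What is 2 + 3?",
    choices,
    examples=[{
        "question": "What is 1 + 1?",
        "choices": ["A. 1", "B. 2", "C. 3", "D. 4"],
        "solution": "1 + 1 = 2, so the answer is B.",
    }],
)
```

In this sketch the few-shot variant only prepends solved examples, so it spends more prompt tokens per question than the plain CoT variant.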
To evaluate the accuracy of the model on the private test set, run one of the following commands:
Chain of Thought:
python evaluate_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name> --max_new_tokens <max new tokens>
Few-shot Chain of Thought:
python evaluate_few_shot_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name> --max_new_tokens <max new tokens>
You can also substitute your own model for mine by passing your own values for --hf_account and --model_hf_name.
Chain of Thought with vLLM:
You can also evaluate with vLLM, using the merged version of the model. With vLLM, evaluating all 332 questions in the test set takes about 30 minutes, compared with about 4 hours without it. The trade-off is a slight drop in the quality of the model's answers.
python evaluate_vllm.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-vllm --max_new_tokens 2048
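For readers unfamiliar with vLLM, batched generation can be sketched as follows. This is not the content of `evaluate_vllm.py`: the function names are hypothetical, greedy decoding (`temperature=0.0`) is an assumption, and the repository id is assumed to be the account and model name joined by a slash. Running `generate_with_vllm` requires vLLM and a suitable GPU.

```python
from typing import List


def repo_id(hf_account: str, model_hf_name: str) -> str:
    """Join the two CLI values into a Hugging Face repo id (assumed layout)."""
    return f"{hf_account}/{model_hf_name}"


def generate_with_vllm(
    prompts: List[str],
    model: str,
    max_new_tokens: int = 2048,
) -> List[str]:
    """Generate one completion per prompt in a single batched vLLM call."""
    # Imported lazily so this module still loads on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    # Assumption: greedy decoding; the repository may use other settings.
    params = SamplingParams(temperature=0.0, max_tokens=max_new_tokens)
    outputs = llm.generate(prompts, params)
    # Each result holds a list of candidate completions; keep the first.
    return [out.outputs[0].text for out in outputs]


merged_model = repo_id("quynguyen1704", "deepseek-math-7b-rl-zaloai-vllm")
```

Passing all prompts to `generate` at once lets vLLM schedule them together, which is the main reason a batched run can be much faster than generating question by question.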
The following table summarizes the results of the model after fine-tuning. When the model runs out of tokens before producing a final answer (A, B, C or D), the answer is recorded as E.
Model | Max_new_tokens | Prompt | Note | Accuracy |
---|---|---|---|---|
deepseek-math-7b-rl | 500 | CoT | | 67% |
deepseek-math-7b-rl | 1024 | CoT | | 82% |
deepseek-math-7b-rl | 1024 | Few-shot CoT | | 80% |
deepseek-math-7b-rl | 2048 | CoT | vLLM | 80% |
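The fallback to answer E and the accuracy figures can be illustrated with a short sketch. The regular expression below is an assumption about how a final answer might be phrased ("answer is B", "Answer: (D)", or `\boxed{C}`); the repository's evaluation scripts may parse the output differently.

```python
import re
from typing import Sequence

FALLBACK = "E"  # recorded when no final A-D answer appears in the output

# Hypothetical answer formats: "answer is B", "Answer: (D)", "\boxed{C}".
# Only the word "answer" is case-insensitive; the choice letter must be uppercase.
ANSWER_PATTERN = re.compile(r"(?:(?i:answer)\s*(?:is|:)\s*\(?|\\boxed\{)([ABCD])\b")


def extract_choice(generated: str) -> str:
    """Return the last explicitly stated choice, or 'E' if none is found."""
    matches = ANSWER_PATTERN.findall(generated)
    return matches[-1] if matches else FALLBACK


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold label; 'E' never matches A-D."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


outputs = [
    "3 + 2 = 5, so the answer is B.",
    "First we compute 3 x 4 = 12, then we",  # truncated before a final answer
    "The result is \\boxed{C}",
    "Answer: (D)",
]
predictions = [extract_choice(text) for text in outputs]
score = accuracy(predictions, ["B", "A", "C", "A"])
```

Because a truncated generation maps to E, it always counts as wrong, which is how a small `max_new_tokens` budget lowers the measured accuracy.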
Deepseek-Math-7B-RL is a powerful LLM with strong mathematical reasoning capabilities in English, Chinese, and Vietnamese. However, it still has some drawbacks:
- With max_new_tokens = 500, there are many questions in the private dataset where the model doesn't have enough tokens to generate a final answer.
- With max_new_tokens = 1024, inference is slow, averaging about 40 to 60 seconds per question.
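As a rough consistency check, the per-question latency can be scaled to the 332-question test set mentioned earlier. The sketch below is only arithmetic on figures stated in this README; the variable names are illustrative.

```python
NUM_QUESTIONS = 332  # size of the private test set


def total_hours(seconds_per_question: float, num_questions: int = NUM_QUESTIONS) -> float:
    """Total wall-clock time in hours for a sequential evaluation run."""
    return seconds_per_question * num_questions / 3600


low = total_hours(40)   # 13,280 s, about 3.7 hours
high = total_hours(60)  # 19,920 s, about 5.5 hours

# The vLLM run takes about 30 minutes for the same 332 questions.
vllm_seconds_per_question = 30 * 60 / NUM_QUESTIONS  # about 5.4 s
```

The 40 to 60 second range therefore corresponds to roughly 3.7 to 5.5 hours per full evaluation, in line with the 4 hours reported for the run without vLLM. The vLLM run used max_new_tokens = 2048, so the two per-question figures are not a like-for-like comparison.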