Skip to content

Latest commit

 

History

History
80 lines (77 loc) · 4.6 KB

File metadata and controls

80 lines (77 loc) · 4.6 KB

ZaloAI2023-Elementary-Math-Solving

1. Introduction

In this GitHub repository, I present a baseline solution for the Elementary Math Solving task from the ZaloAI Challenge 2023. Leveraging the powerful mathematical reasoning capabilities of the Deepseek-math model, this approach achieves an impressive 80% accuracy on the competition's private test set.

VQA System

2. Getting Started

git clone https://github.com/dinhquy-nguyen-1704/ZaloAI2023-Elementary-Math-Solving.git
cd ZaloAI2023-Elementary-Math-Solving
pip install -r requirements.txt
huggingface-cli login
wandb login

3. Finetune

I only utilize a dataset of over 1000 training samples from the competition to fine-tune the model.

To rerun the fine-tuning code, you can execute the following command line.

python main.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>

You can also find the fine-tuned model I've trained at [🤗 Models] and the merged version at [🤗 Models].

4. Inference

To infer a fine-tuned model with any elementary math multiple-choice question, you can run the following commands.

Chain of Thought:

python inference_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>

Few-shot Chain of Thought:

python inference_few_shot_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name>

You can absolutely use the model I've fine-tuned for inference as well.

Chain of Thought:

python inference_cot.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-v2

Few-shot Chain of Thought:

python inference_few_shot_cot.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-v2

5. Evaluate

To evaluate the accuracy of the model on the private test set, you can run the following command:

Chain of Thought:

python evaluate_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name> --max_new_tokens <max new tokens>

Few-shot Chain of Thought:

python evaluate_few_shot_cot.py --hf_account <HuggingFace account> --model_hf_name <HuggingFace model's name> --max_new_tokens <max new tokens>

You can also completely replace my model with yours and give it a try.

Chain of Thought with vLLM:

You can also evaluate with vLLM, through the model I merged here. With vLLM, the entire evaluation process with 332 questions in the test set will take about 30 minutes, compared to 4 hours when not using it. However, in return, the quality of the model's answers will be slightly reduced.

python evaluate_vllm.py --hf_account quynguyen1704 --model_hf_name deepseek-math-7b-rl-zaloai-vllm --max_new_tokens 2048

6. Results

The following table summarizes the results of the model after fine-tuning. For questions where the model does not have enough tokens to generate the final answer (A, B, C or D), answer E will be output.

Model Max_new_tokens Prompt Note Accuracy
deepseek-math-7b-rl 500 CoT 67%
deepseek-math-7b-rl 1024 CoT 82%
deepseek-math-7b-rl 1024 Few-shot CoT 80%
deepseek-math-7b-rl 2048 CoT vLLM 80%

7. Limitations

Deepseek-Math-7B-RL is a powerful LLM model with strong mathematical reasoning capabilities in English, Chinese, and Vietnamese. However, there are still certain drawbacks:

  • With max_new_tokens = 500, there are many questions in the private dataset where the model doesn't have enough tokens to generate a final answer.
  • With max_new_tokens = 1024, the inference time for each question is quite long, averaging about 40s - 60s per question.

8. References