banksy23/XCoder

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, particularly in benchmarks like HumanEval and MBPP. To tackle this, we introduce the XCoder-Complexity-Scorer, which controls code instruction-tuning data quality across three key dimensions: instruction complexity, response quality, and diversity. We also trained a Unit Test Model to generate unit test programs for each candidate solution. On this basis, we developed XCoder, a family of models fine-tuned from LLaMA3. Alongside the XCoder-80K Dataset, we release XCoder-8B and XCoder-70B. Our experiments show that XCoder achieves state-of-the-art performance with less training data, validating our data strategy.

📖 Paper • 🤖️ XCoder-8B Model • 🤖️ XCoder-70B Model • 🤗 XCoder-80K Dataset
• 👉 XCoder-Complexity-Scorer • 👉 Unit Test Model


🕊 Detailed Resources.

📃 Read our Paper on arXiv.

📚 Get our Dataset on huggingface.

🕊 Try our Coder: Get XCoder-8B from huggingface or modelscope.

🕊 Try our Coder: Get XCoder-70B from huggingface or modelscope.

🐬 We trained a model to score the complexity of each instruction: Get Complexity Scorer from huggingface or modelscope. You can use the complexity inference file to infer the complexity of the query in each turn. Thanks to deita!

🐋 We trained a model to generate unit test programs for each candidate solution: Get Unit Test Model from huggingface or modelscope.


😃 Motivations & Key Findings.

The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, especially on benchmarks like HumanEval and MBPP. To address this, we introduce the Test Leakage Indicator (TLI), which identifies and cleans high-leakage data. We also evaluate on cleaner benchmarks, LiveCodeBench and BigCodeBench, using filtered data on LLaMA3. We release our high-quality

Our findings reveal that some widely used datasets, like Magicoder-Evol-Instruct, are less reliable than previously thought. Inspired by alignment and mathematical data selection works, we select training data based on instruction complexity, code pass rate, and diversity. With just 40K examples, our model XCoder matches top performance and surpasses prior results at 80K.

Beyond cleaner data, we aim to redefine what makes a good Code Instruction Tuning dataset by analyzing previous works through XCoder's three key dimensions; see 🎉🎉 New Insights For Code Instruction Data Synthesis.

Click here if you are curious about some leaked cases.

If you wish to assess the complexity of a query, you can follow these steps:

# Scorer comes from the repository's complexity inference file
from complexity import Scorer

model_name_or_path = "banksy235/XCoder-Complexity-Scorer"
scorer = Scorer(model_name_or_path, is_vllm=True)

# Score a single query
query = "Your query"
complexity_score = scorer.infer_complexity(query)

If your data has multiple turns, you can score it turn by turn without history. For example, if the data is

[{"role": "user", "value": "query1"}, {"role": "assistant", "value": "response1"}, {"role": "user", "value": "query2"}, {"role": "assistant", "value": "response2"}]

apply the scorer to each user query separately:

complexity_score = [scorer.infer_complexity(query1), scorer.infer_complexity(query2)]
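The turn-by-turn scoring above can be wrapped in a small helper. This is a minimal sketch: `score_user_turns` is a hypothetical name, not part of the repository. It assumes only that the scorer exposes `infer_complexity(query)` as shown earlier and that each turn uses the `role` and `value` keys from the example. A stand-in scoring function is used here so the snippet runs without loading the model.

```python
from typing import Callable, Dict, List


def score_user_turns(
    conversation: List[Dict[str, str]],
    infer_complexity: Callable[[str], float],
) -> List[float]:
    """Score every user turn on its own, ignoring the dialogue history."""
    return [
        infer_complexity(turn["value"])
        for turn in conversation
        if turn["role"] == "user"
    ]


conversation = [
    {"role": "user", "value": "query1"},
    {"role": "assistant", "value": "response1"},
    {"role": "user", "value": "query2"},
    {"role": "assistant", "value": "response2"},
]

# Stand-in scorer for illustration only (query length as a fake score).
# In practice, pass scorer.infer_complexity instead.
fake_infer_complexity = lambda query: float(len(query))

scores = score_user_turns(conversation, fake_infer_complexity)
print(scores)  # [6.0, 6.0]
```

With the real scorer loaded, the call would be `score_user_turns(conversation, scorer.infer_complexity)`.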

🐬 Use TLI to detect the extent of data leakage in your training set.

python3 compute_TLI.py \
  --train_data_path {train_dataset} \
  --test_data_path {test_dataset} \
  --key_train {key name of the instruction in the training data JSON} \
  --key_test {key name of the instruction in the test data JSON} \
  --only_analysis true
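
To give a feel for what an n-gram based leakage check measures, here is a minimal sketch. It is not the implementation in `compute_TLI.py`: the function names, the whitespace tokenization, the Jaccard overlap, and the default `n=2` are all assumptions made for illustration. Each training instruction is compared with every test instruction, its highest overlap is kept, and the per-sample scores are averaged.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lower-cased, whitespace-tokenized n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def leakage_scores(train: List[str], test: List[str], n: int = 2) -> List[float]:
    """For each training instruction, its highest overlap with any test instruction."""
    return [max((overlap(t, s, n) for s in test), default=0.0) for t in train]


def leakage_indicator(train: List[str], test: List[str], n: int = 2) -> float:
    """Average of the per-sample maximum overlaps (hypothetical aggregate)."""
    scores = leakage_scores(train, test, n)
    return sum(scores) / len(scores) if scores else 0.0


train_instructions = [
    "write a function to add two numbers",
    "sort a list of strings by length",
]
test_instructions = ["write a function to add two numbers"]

print(leakage_scores(train_instructions, test_instructions))     # [1.0, 0.0]
print(leakage_indicator(train_instructions, test_instructions))  # 0.5
```

Training samples with a high per-sample score are the ones worth inspecting or removing before fine-tuning.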

🌠 What open-source data do we collect?

We construct a data pool that includes many open-source code instruction fine-tuning datasets. The specific datasets are listed in the table below:

| Dataset | Data Size | Instruction Source | Response Source |
|---|---|---|---|
| Code-290k-ShareGPT-Vicuna-Clean | 289k | - | - |
| CodeExercise-Python-27k | 27k | GPT | GPT |
| CodeUp | 19k | GPT(Self-Instruct) | GPT |
| Glaive-code-assistant-v3 | 950k | Glaive | Glaive |
| oa_leet_10k | 23k | - | - |
| Code-Alpaca | 20k | GPT(Self-Instruct) | GPT |
| Codefuse-Evol-Instruct-Clean | 66k | GPT(Evol-Instruct) | GPT |
| DolphCoder | 79k | GPT(Evol-Instruct) | GPT |
| Magicoder-Evol-Instruct-Clean | 110k | GPT(Evol-Instruct) | GPT |
| Magicoder-OSS-Instruct | 75k | GPT(OSS-Instruct) | GPT |
| CommitPackFT | 702k | GitHub | GitHub |
| StarCoder-Self-Align | 50k | StarCoder2(OSS-Instruct) | StarCoder2 |
| Leet10k_alpaca | 10k | - | - |
| Code-Feedback-Clean | 66k | GPT | GPT |

  • A "Clean" suffix indicates that the original dataset contains data leakage; we use the cleaned version.

🔑 Data Selection Method For XCoder

Illustration of our data selection approach.

XCoder selects good samples based on three dimensions: instruction complexity, response quality, and instruction diversity.

  • Instruction complexity: Users expect a Code LLM to write increasingly complex programs. We therefore train a Complexity Scorer to measure the complexity of each sample.
  • Response quality: We use the number of passed test cases as a measure of code quality. We train a Unit Test Model to generate a unit test program for each sample. Compared with asking a language model to judge code correctness directly, executing test cases provides real execution feedback and judges correctness more accurately.
  • Instruction diversity: As a general principle, an advanced LLM should be able to handle a wide variety of human requests. We use a Diversity-based Sampling method to ensure the diversity of the selected data.
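
The three dimensions above can be combined in a simple greedy selection loop. The following is a minimal sketch under stated assumptions, not the repository's selection code: the `Sample` fields, the ranking key (complexity first, then pass rate), the cosine distance, the threshold `tau`, and the function names are all hypothetical. Samples are ranked by score, and a sample is kept only if its instruction embedding is far enough from every sample already chosen.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    instruction: str
    complexity: float            # e.g. output of a complexity scorer
    pass_rate: float             # fraction of generated unit tests passed
    embedding: Sequence[float]   # instruction embedding from any encoder


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def select(pool: List[Sample], budget: int, tau: float) -> List[Sample]:
    """Greedy selection: best-scored samples first, skipping near-duplicates."""
    ranked = sorted(pool, key=lambda s: (s.complexity, s.pass_rate), reverse=True)
    chosen: List[Sample] = []
    for sample in ranked:
        if len(chosen) >= budget:
            break
        # Keep the sample only if it is farther than tau from everything chosen so far.
        if all(cosine_distance(sample.embedding, c.embedding) > tau for c in chosen):
            chosen.append(sample)
    return chosen


pool = [
    Sample("implement an LRU cache", 5.0, 1.0, [1.0, 0.0]),
    Sample("implement an LRU cache class", 4.0, 1.0, [1.0, 0.01]),
    Sample("parse a CSV file", 3.0, 0.5, [0.0, 1.0]),
]
selected = select(pool, budget=2, tau=0.1)
print([s.instruction for s in selected])
# ['implement an LRU cache', 'parse a CSV file']
```

The second sample is skipped because its embedding is almost identical to the first, which is the effect diversity-based sampling is meant to achieve.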

🎖 Performance

| Dataset | Size | LiveCodeBench Pass@1 | LiveCodeBench Easy-Pass@1 | BigCodeBench Pass@1 | HumanEval Base-Pass@1 | HumanEval Plus-Pass@1 |
|---|---|---|---|---|---|---|
| Code-Alpaca | 20k | 0.0 | 0.0 | 11.9 | 30.5 | 25.6 |
| StarCoder2-Self-Align | 50k | 9.5 | 24.7 | 14.5 | 37.8 | 34.8 |
| Codefuse-Evol-Instruct* | 66k | 12.3 | 33.1 | 25.4 | 59.1 | 53.7 |
| Magicoder-OSS-Instruct | 75k | 12.8 | 33.8 | 22.0 | 54.3 | 50.0 |
| Magicoder-Evol-Instruct* | 100k | 13.0 | 34.5 | 21.8 | 65.9 | 59.8 |
| Code-Feedback | 64k | 14.8 | 38.0 | 27.0 | 56.7 | 51.8 |
| XCoder | 40k | 16.5 | 43.7 | 27.4 | 54.9 | 50.6 |
| XCoder | 80k | 16.8 | 43.7 | 29.6 | 57.3 | 53.0 |
  • * means that the original dataset may have data leakage, so we perform n-gram decontamination.

🎉 New Insights For Code Instruction Data Synthesis

We analyze XCoder's data composition, reassess various data sources, and gain new insights into data synthesis. Our key findings:

  • Complexity: Training models to assess instruction complexity outperforms heuristic methods. Evol-Instruct is effective for enhancing complexity, especially with longer, multi-round contexts.
  • Quality: Test case execution provides better feedback for judging code correctness than model-based heuristics. Stronger models also yield higher-quality synthesized data.
  • Diversity: Diverse instruction tuning is crucial. Real-world data sampling leads to better diversity than expanding instructions from fixed seeds.
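
The Quality finding rests on executing test cases rather than asking a model to judge correctness. A minimal sketch of that idea follows. It assumes test cases are plain Python `assert` statements and that each one is run together with the candidate solution in a fresh interpreter; the function names and the timeout are hypothetical, and the repository's own pipeline may differ. Untrusted code should be run in a proper sandbox, which this sketch does not provide.

```python
import subprocess
import sys
from typing import List


def passes(solution: str, test: str, timeout: float = 5.0) -> bool:
    """Run the solution plus one test in a fresh interpreter; exit code 0 means pass."""
    program = solution + "\n\n" + test + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A test that hangs counts as a failure.
        return False
    return result.returncode == 0


def pass_rate(solution: str, tests: List[str]) -> float:
    """Fraction of test cases the candidate solution passes."""
    if not tests:
        return 0.0
    return sum(passes(solution, t) for t in tests) / len(tests)


candidate = "def add(a, b):\n    return a + b"
unit_tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 1",  # deliberately wrong expectation, so it fails
]
print(pass_rate(candidate, unit_tests))  # 0.5
```

A pass rate computed this way can serve as the response-quality signal when ranking candidate samples.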
Click here if you are curious about the data composition of XCoder.

Citation

Please kindly cite our paper if it helps your research:

@misc{wang2024codellmsperformempowering,
      title={How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data}, 
      author={Yejie Wang and Keqing He and Dayuan Fu and Zhuoma Gongque and Heyang Xu and Yanxu Chen and Zhexu Wang and Yujia Fu and Guanting Dong and Muxi Diao and Jingang Wang and Mengdi Zhang and Xunliang Cai and Weiran Xu},
      year={2024},
      eprint={2409.03810},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2409.03810}, 
}
