
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, particularly on benchmarks like HumanEval and MBPP. To tackle this, we introduce a data selection strategy that controls code instruction-tuning data quality across three key dimensions: instruction complexity, response quality, and diversity. We train the XCoder-Complexity-Scorer to measure instruction complexity and a Unit Test Model to generate unit test programs for each candidate solution. On this basis, we developed XCoder, a family of models fine-tuned from LLaMA3. Alongside the XCoder-80K Dataset, we release XCoder-8B and XCoder-70B. Our experiments show that XCoder achieves state-of-the-art performance with less training data, validating our data strategy.

📖 Paper • 🤖️ XCoder-8B Model • 🤖️ XCoder-70B Model • 🤗 XCoder-80K Dataset
• 👉 XCoder-Complexity-Scorer • 👉 Unit Test Model

🕊 Detailed Resources.

📃 Read our Paper on arXiv.

📚 Get our Dataset on huggingface.

🕊 Try our Coder: Get XCoder-8B from huggingface or modelscope.

🕊 Try our Coder: Get XCoder-70B from huggingface or modelscope.

🐬 We train a model to score the complexity of each instruction: Get Complexity Scorer from huggingface or modelscope. You can use the complexity inference file to infer the complexity of the query in each turn. Thanks to deita!

🐋 We trained a model to generate unit test programs for each candidate solution: Get Unit Test Model from huggingface or modelscope.

😃 Motivations & Key Findings.

The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, especially on benchmarks like HumanEval and MBPP. To address this, we introduce the Test Leakage Indicator (TLI), which identifies high-leakage data so that it can be cleaned. We also evaluate on the cleaner benchmarks LiveCodeBench and BigCodeBench, training LLaMA3 on the filtered data. We release our high-quality

Our findings reveal that some widely used datasets, like Magicoder-Evol-Instruct, are less reliable than previously thought. Inspired by alignment and mathematical data selection works, we select training data based on instruction complexity, code pass rate, and diversity. With just 40K examples, our model XCoder matches top performance and surpasses prior results at 80K.

Beyond cleaner data, we aim to redefine what makes a good Code Instruction Tuning dataset by analyzing previous works through XCoder's three key dimensions. See 🎉🎉 New Insights For Code Instruction Data Synthesis.

Click here, if you are curious about some leaked cases.

If you wish to assess the complexity of a query, you can follow these steps:

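from complexity import Scorer  # complexity.py from this repository

# Load the released Complexity Scorer checkpoint
model_name_or_path = "banksy235/XCoder-Complexity-Scorer"
scorer = Scorer(model_name_or_path, is_vllm=True)

# Score a single instruction
query = "Your query"
complexity_score = scorer.infer_complexity(query)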

If your data has multiple turns, you can score it turn by turn without history. For example, if the data is

[{"role": "user", "value": "query1"}, {"role": "assistant", "value": "response1"}, {"role": "user", "value": "query2"}, {"role": "assistant", "value": "response2"}]

apply the scorer to each user query separately:

complexity_score = [scorer.infer_complexity("query1"), scorer.infer_complexity("query2")]
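For longer conversations, the per-turn scoring can be wrapped in a small helper. The following is a minimal sketch, not part of the repository: `score_user_turns` is a hypothetical name, and it assumes the `role`/`value` layout shown above. It takes the scoring function as an argument, so in practice you would pass `scorer.infer_complexity`.

```python
from typing import Callable, Dict, List


def score_user_turns(
    conversation: List[Dict[str, str]],
    score_fn: Callable[[str], float],
) -> List[float]:
    """Score every user turn on its own, ignoring conversation history."""
    return [
        score_fn(turn["value"])
        for turn in conversation
        if turn["role"] == "user"
    ]


if __name__ == "__main__":
    conversation = [
        {"role": "user", "value": "query1"},
        {"role": "assistant", "value": "response1"},
        {"role": "user", "value": "query2"},
        {"role": "assistant", "value": "response2"},
    ]

    # Stand-in scorer for demonstration only (query length);
    # replace it with scorer.infer_complexity in real use.
    def dummy_score(query: str) -> float:
        return float(len(query))

    print(score_user_turns(conversation, dummy_score))
```

Passing the scorer in as a callable keeps the helper independent of how the model is loaded.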

🐬 Use TLI to detect the extent of data leakage in your training set.

python3 compute_TLI.py \
  --train_data_path {train_dataset} \
  --test_data_path {test_dataset} \
  --key_train {key name of the instruction in the training data JSON} \
  --key_test {key name of the instruction in the test data JSON} \
  --only_analysis true
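
The internals of `compute_TLI.py` are not described here, so the block below is only an illustrative sketch of an n-gram overlap leakage measure, not the script's actual logic. It assumes whitespace tokenization and 3-grams, and the names `ngrams` and `leakage_score` are hypothetical. For each training instruction it takes the largest fraction of its n-grams shared with any single test instruction, then averages that value over the training set.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leakage_score(train: List[str], test: List[str], n: int = 3) -> float:
    """Average, over training samples, of the best n-gram overlap with any test sample."""
    test_grams = [ngrams(t, n) for t in test]
    per_sample = []
    for instruction in train:
        grams = ngrams(instruction, n)
        if not grams:
            # Instruction shorter than n tokens: treat as no measurable overlap.
            per_sample.append(0.0)
            continue
        best = max(
            (len(grams & tg) / len(grams) for tg in test_grams),
            default=0.0,
        )
        per_sample.append(best)
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


if __name__ == "__main__":
    train = [
        "write a function that adds two numbers",
        "sort a list of strings",
    ]
    test = ["write a function that adds two numbers"]
    print(leakage_score(train, test))
```

A higher value under this sketch means more training instructions closely mirror some test instruction. For real measurements, rely on the script itself.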

🌠 What open-source data do we collect?

We construct a data pool that includes many open-source code instruction fine-tuning datasets. The specific datasets are listed in the table below:

Dataset	Data Size	Instruction Source	Response Source
Code-290k-ShareGPT-Vicuna-Clean	289k	-	-
CodeExercise-Python-27k	27k	GPT	GPT
CodeUp	19k	GPT(Self-Instruct)	GPT
Glaive-code-assistant-v3	950k	Glaive	Glaive
oa_leet_10k	23k	-	-
Code-Alpaca	20k	GPT(Self-Instruct)	GPT
Codefuse-Evol-Instruct-Clean	66k	GPT(Evol-Instruct)	GPT
DolphCoder	79k	GPT(Evol-Instruct)	GPT
Magicoder-Evol-Instruct-Clean	110k	GPT(Evol-Instruct)	GPT
Magicoder-OSS-Instruct	75k	GPT(OSS-Instruct)	GPT
CommitPackFT	702k	GitHub	GitHub
StarCoder-Self-Align	50k	StarCoder2(OSS-Instruct)	StarCoder2
Leet10k_alpaca	10k	-	-
Code-Feedback-Clean	66k	GPT	GPT

A "Clean" suffix indicates that the original dataset contains data leakage; we use the cleaned version.

🔑 Data Selection Method For XCoder

Illustration of our data selection approach.

XCoder selects good samples based on three dimensions: instruction complexity, response quality, and instruction diversity.

Instruction complexity: Users expect a Code LLM to write increasingly complex programs. Thus, we train a Complexity Scorer to measure the complexity of each sample.
Response quality: We use the number of passed test cases as a measure of code quality. We train a Unit Test Model to generate a unit test program for each sample. Compared with asking a language model to judge code correctness directly, executing test cases provides real execution feedback and judges correctness more reliably.
Instruction diversity: As a general principle, an advanced LLM should be able to handle various requests from humans. We use a diversity-based sampling method to ensure the diversity of the selected data.
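
One way the three dimensions could be combined is sketched below. This is an illustration under stated assumptions, not the XCoder implementation: the `Sample` fields, the ranking by `complexity * pass_rate`, the cosine-distance threshold, and the function names are all hypothetical choices made for the example. Candidates are ranked by complexity and quality, then added greedily only if they are far enough from everything already selected.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    instruction: str
    complexity: float            # e.g. a Complexity Scorer output
    pass_rate: float             # fraction of generated unit tests passed, in [0, 1]
    embedding: Sequence[float]   # instruction embedding from any encoder (assumed given)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def select(pool: List[Sample], budget: int, min_distance: float = 0.1) -> List[Sample]:
    """Greedy selection: best-scored first, skipping near-duplicates of chosen samples."""
    # Assumed ranking rule: product of complexity and pass rate.
    ranked = sorted(pool, key=lambda s: s.complexity * s.pass_rate, reverse=True)
    chosen: List[Sample] = []
    for cand in ranked:
        if len(chosen) >= budget:
            break
        if all(
            cosine_distance(cand.embedding, c.embedding) >= min_distance
            for c in chosen
        ):
            chosen.append(cand)
    return chosen


if __name__ == "__main__":
    pool = [
        Sample("a", 5.0, 1.0, [1.0, 0.0]),
        Sample("b", 4.0, 1.0, [1.0, 0.0]),  # same direction as "a": filtered out
        Sample("c", 3.0, 0.5, [0.0, 1.0]),
    ]
    print([s.instruction for s in select(pool, budget=2)])
```

The threshold trades dataset size against redundancy: a larger `min_distance` rejects more near-duplicate instructions.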

🎖 Performance

Dataset	Size	LiveCodeBench Pass@1	LiveCodeBench Easy-Pass@1	BigCodeBench Pass@1	HumanEval Base-Pass@1	HumanEval Plus-Pass@1
Code-Alpaca	20k	0.0	0.0	11.9	30.5	25.6
StarCoder2-Self-Align	50k	9.5	24.7	14.5	37.8	34.8
Codefuse-Evol-Instruct*	66k	12.3	33.1	25.4	59.1	53.7
Magicoder-OSS-Instruct	75k	12.8	33.8	22.0	54.3	50.0
Magicoder-Evol-Instruct*	100k	13.0	34.5	21.8	65.9	59.8
Code-Feedback	64k	14.8	38.0	27.0	56.7	51.8
XCoder	40k	16.5	43.7	27.4	54.9	50.6
XCoder	80k	16.8	43.7	29.6	57.3	53.0

* indicates that the original dataset may have data leakage; we apply n-gram decontamination to it.

🎉 New Insights For Code Instruction Data Synthesis

We analyze XCoder's data composition, reassess various data sources, and gain new insights into data synthesis. Our key findings:

Complexity: Training models to assess instruction complexity outperforms heuristic methods. Evol-Instruct is effective for enhancing complexity, especially with longer, multi-round contexts.
Quality: Test case execution provides better feedback for judging code correctness than model-based heuristics. Stronger models also yield higher-quality synthesized data.
Diversity: Diverse instruction tuning is crucial. Real-world data sampling leads to better diversity than expanding instructions from fixed seeds.

Click here, if you are curious about the data composition of XCoder

Citation

Please kindly cite our paper if it helps your research:

@misc{wang2024codellmsperformempowering,
      title={How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data}, 
      author={Yejie Wang and Keqing He and Dayuan Fu and Zhuoma Gongque and Heyang Xu and Yanxu Chen and Zhexu Wang and Yujia Fu and Guanting Dong and Muxi Diao and Jingang Wang and Mengdi Zhang and Xunliang Cai and Weiran Xu},
      year={2024},
      eprint={2409.03810},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2409.03810}, 
}
