GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning
- [07/31/2023]: Release the model weights.
- [07/26/2023]: Release the tech report.
Welcome to the repository of GrammarGPT.
This is the implementation repository for our NLPCC 2023 Shared Task 1 system, which achieved third place.
Here is a list of what has been released:
- The 1k training samples, 65% of which were generated by ChatGPT, with the rest manually annotated.
- The code for training and inference.
- You can find more details about the data and model in our technical report.
We introduce GrammarGPT, an open-source LLM, to preliminarily explore the potential of such models for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collect ungrammatical sentences from publicly available websites and manually correct them. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors.
This table shows the six main types of grammatical errors made by native Chinese speakers, which fall into two groups: those with (w/) clues and those without (w/o) clues. The incorrect sentences are fluent and in line with the habits of native Chinese speakers, yet they do not conform to Chinese grammar, which makes them more difficult to correct. We use ChatGPT-generated data for grammatical errors with clues and human-annotated data for grammatical errors without clues.
Grammatical errors with clues are easy to detect and correct by recognizing the specific clues. For example, "more than" and "about" used together lead to a redundant component, "the cause" and "caused by" used together lead to structural confusion, and "prompting" and "pace" used together lead to improper collocation. Conversely, we can construct ungrammatical sentences by inserting these clues into grammatical sentences. We instruct ChatGPT to generate ungrammatical sentences that meet our requirements by providing these clues, which were collected from public websites.
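The clue-guided generation described above can be sketched as a small prompt builder. This is a minimal illustration, not the prompt used in the technical report: the function name `build_clue_prompt`, its parameters, the prompt wording, and the Chinese clue strings (hypothetical renderings of "more than" and "about") are all assumptions.

```python
def build_clue_prompt(clue_a: str, clue_b: str, error_type: str, num_pairs: int = 3) -> str:
    """Build an instruction asking an LLM for clue-based ungrammatical sentences.

    The wording is hypothetical; it is not the prompt used by the authors.
    """
    return (
        f"Write {num_pairs} pairs of Chinese sentences. "
        f"In each pair, the first sentence must be ungrammatical because it uses "
        f"'{clue_a}' and '{clue_b}' together, causing the error type '{error_type}'. "
        f"The second sentence must be the corrected version. "
        f"Keep both sentences fluent and natural for native speakers."
    )


# Assumed clue pair for the "redundant component" error type.
prompt = build_clue_prompt("超过", "左右", "redundant component")
print(prompt)
```

The resulting string would be sent to a chat model of your choice; the returned sentence pairs still need to be checked before they are used as training data.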
For grammatical errors without clues, we collected ungrammatical sentences from public websites (16 sources) and manually annotated them.
Native Chinese grammatical errors are often subtle and are rarely located at named entities. We therefore adopt a strategy of substituting the named entities in the parallel data with similar ones (synonyms), which leaves the errors themselves unchanged.
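The substitution strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper `augment_pair`, the hand-written `entity_map`, and the example sentences are hypothetical, and a real pipeline would need a named-entity recognizer and a source of similar entities.

```python
import re


def augment_pair(source: str, target: str, entity_map: dict[str, str]) -> tuple[str, str]:
    """Swap named entities that occur in BOTH sentences for similar ones.

    Applying the same substitution to the ungrammatical source and the
    corrected target keeps the grammatical error itself unchanged.
    """
    shared = [e for e in entity_map if e in source and e in target]
    if not shared:
        return source, target
    # Longest entities first, so a short entity never splits a longer one.
    shared.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in shared))

    def swap(match: re.Match) -> str:
        return entity_map[match.group(0)]

    # A single regex pass avoids chained replacements (A -> B -> C).
    return pattern.sub(swap, source), pattern.sub(swap, target)


# Hypothetical parallel pair: the source contains a redundant component.
source = "小明的成绩超过了大约九十分。"
target = "小明的成绩超过了九十分。"
new_source, new_target = augment_pair(source, target, {"小明": "小红"})
print(new_source)
print(new_target)
```

Only the entity changes in both sentences, so the augmented pair exhibits the same error and the same correction as the original pair.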
To fine-tune the model, run `python finetuning.py`.
To run inference, run `python generate.py`.
Our work is inspired by the following works, including but not limited to:
- Bloom: https://huggingface.co/bigscience/bloom
- Self-instruct: https://github.com/yizhongw/self-instruct
- LLMZoo: https://github.com/FreedomIntelligence/LLMZoo
Without these, this repository would not have been possible.
@inproceedings{fan2023grammargpt,
title={GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning},
author={Fan, Yaxin and Jiang, Feng and Li, Peifeng and Li, Haizhou},
booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
pages={69--80},
year={2023},
organization={Springer}
}
We are from the School of Data Science, the Chinese University of Hong Kong, Shenzhen (CUHKSZ), and the Shenzhen Research Institute of Big Data (SRIBD).
The first author is a visiting student from Soochow University, and we welcome aspiring individuals to join our group and contribute to the new era of LLMs.