yxfanSuda/GrammarGPT

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

✨ Latest News

⚡ Introduction

Welcome to the repository of GrammarGPT.

This is the implementation repository for our submission to NLPCC 2023 Shared Task 1, which achieved third place.

Here is a list of what has been released:

  • The 1k training samples, 65% of which were generated by ChatGPT; the rest were manually annotated.
  • The code for training and inference.
  • More details about the data and model can be found in our technical report.

💭 Overview

We introduce GrammarGPT, an open-source LLM, to preliminarily explore the potential of such models for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors.

📚 Construction of Hybrid Dataset

This table shows the six main types of grammatical errors made by native Chinese speakers, which fall into two categories: those with (w/) and those without (w/o) clues. The incorrect sentences are fluent and consistent with the habits of native Chinese speakers, yet they do not conform to Chinese grammar, which makes them more difficult to correct. We used ChatGPT-generated data for grammatical errors with clues and human-annotated data for those without clues.

ChatGPT-generated Data

Grammatical errors with clues are easy to detect and correct once the specific clues are recognized. For example, using "more than" and "about" together produces a redundant component, using "The cause" and "caused by" together produces structural confusion, and using "prompting" and "pace" together produces an improper collocation. Conversely, we can construct ungrammatical sentences by inserting these clues into grammatical sentences. We therefore instruct ChatGPT to generate ungrammatical sentences that meet our requirements by providing it with clues collected from public websites.
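As a rough illustration of the clue-guided idea, the sketch below assembles an instruction that asks a chat model to inject a given pair of clues into a grammatical sentence. The `CLUES` table, the `build_prompt` function, and the prompt wording are all assumptions made for this example; they are not the prompts used in the paper, and no model is actually called.

```python
# Hypothetical clue table: each error type maps to a pair of clue phrases
# that make a sentence ungrammatical when used together (taken from the
# examples in the paragraph above).
CLUES = {
    "redundant component": ("more than", "about"),
    "structural confusion": ("The cause", "caused by"),
    "improper collocation": ("prompting", "pace"),
}


def build_prompt(error_type: str, sentence: str) -> str:
    """Build an instruction asking a chat model to inject the given clues.

    The wording is illustrative only; it is not the prompt from the paper.
    """
    first, second = CLUES[error_type]
    return (
        f"Rewrite the grammatical sentence below so that it contains a "
        f"'{error_type}' error, by using the clues '{first}' and "
        f"'{second}' together. Keep the sentence fluent.\n"
        f"Sentence: {sentence}"
    )


prompt = build_prompt(
    "structural confusion",
    "The delay was caused by heavy rain.",
)
print(prompt)
```

The returned string would then be sent to ChatGPT, and the model's output paired with the original grammatical sentence to form one parallel training sample.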

Human-annotated Data

For grammatical errors without clues, we collected ungrammatical sentences from public websites and manually annotated them.

Error-invariant Augmentation

Native Chinese grammatical errors are often subtle and rarely occur at the positions of named entities. Therefore, we adopt a strategy of substituting the named entities in the parallel data with similar ones (synonyms).
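A minimal sketch of this substitution is shown below, assuming a hand-written table of interchangeable entities. The `SIMILAR_ENTITIES` table and the `augment_pair` function are hypothetical names introduced for illustration; the repository may identify and replace entities differently. The key point is that the same replacement is applied to both sides of a parallel pair, so the grammatical error itself is left untouched.

```python
import random

# Hypothetical table of named entities and similar entities that can stand
# in for them. A real pipeline would likely derive this from an NER step.
SIMILAR_ENTITIES = {
    "Beijing": ["Shanghai", "Guangzhou"],
    "Li Ming": ["Wang Fang", "Zhang Wei"],
}


def augment_pair(source, target, similar=SIMILAR_ENTITIES, rng=None):
    """Return new (source, target) pairs with named entities swapped.

    An entity is replaced only if it appears in both sentences, and the same
    replacement is used on both sides so the pair stays aligned.
    """
    rng = rng or random.Random()
    augmented = []
    for entity, candidates in similar.items():
        if entity in source and entity in target:
            replacement = rng.choice(candidates)
            augmented.append(
                (
                    source.replace(entity, replacement),
                    target.replace(entity, replacement),
                )
            )
    return augmented


# Ungrammatical source and its correction (English stand-ins for clarity).
source = "The cause of the delay in Beijing was caused by heavy rain."
target = "The delay in Beijing was caused by heavy rain."

pairs = augment_pair(source, target, rng=random.Random(0))
for new_source, new_target in pairs:
    print(new_source)
    print(new_target)
```

Because only the entity changes, the clue pair ("The cause" and "caused by") survives in the augmented source sentence, which is what makes the augmentation error-invariant.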

🚀 Training

python finetuning.py

🧐 Inferencing

python generate.py

😀 Acknowledgement

Our work is inspired by the following works, including but not limited to:

Without them, this repository would not exist.

Citation

@inproceedings{fan2023grammargpt,
  title={GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning},
  author={Fan, Yaxin and Jiang, Feng and Li, Peifeng and Li, Haizhou},
  booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
  pages={69--80},
  year={2023},
  organization={Springer}
}

We are from the School of Data Science, the Chinese University of Hong Kong, Shenzhen (CUHKSZ), and the Shenzhen Research Institute of Big Data (SRIBD).

The first author is a visiting student from Soochow University, and we welcome aspiring individuals to join our group and contribute to the new era of LLMs.
