A simple yet effective training-free and prompt-free approach to Chinese Spelling Correction based on Large Language Models.
This repository provides an implementation of the paper A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models.
News
- 2024/12/09: We won 1st place in the Kingsoft Office 2024 Algorithm Challenge: Chinese Text Correction Competition (金山办公2024算法挑战赛-中文文本智能校对大赛), with this codebase serving as a key module of our solution. Notably, our solution achieved an $F_{0.5}$ score 2.02 points higher than the second-place team.
- 2024/09/20: Our paper was accepted to the EMNLP 2024 main conference.
Requirements
- torch>=2.0.1
- transformers>=4.27.0
- xformers==0.0.21
- accelerate
- bitsandbytes
- sentencepiece
- pypinyin
- pypinyin-dict
- opencc-python-reimplemented
- modelscope (optional, for model download from modelscope)
- streamlit (optional, for demo app)
- uvicorn (optional, for RESTful API server)
- fastapi (optional, for RESTful API server)
- loguru (optional, for RESTful API server)
- sse_starlette (optional, for RESTful API server)
You can set up the environment by running:
bash scripts/set_enviroment.sh
This will automatically create a virtual environment and install the required packages.
For better performance, you can install flash-attn:
pip install flash-attn --no-build-isolation
Warning
A user reported that running Qwen2 or Qwen2.5 family models without flash-attn leads to unexpected errors: specifically, the corrector gets stuck in the beam search process.
To avoid this issue, either install flash-attn or set torch_dtype=torch.bfloat16 in the LMCorrector class.
We strongly recommend flash-attn, which also significantly reduces memory usage and speeds up inference.
The code will automatically download the model from the Hugging Face model hub if it is not found in the local cache.
We provide a simple Python API for the corrector:
from lmcsc import LMCorrector
import torch
corrector = LMCorrector(
    model="Qwen/Qwen2.5-0.5B",
    config_path="configs/default_config.yaml",
    torch_dtype=torch.bfloat16,  # the default torch_dtype is torch.float16, which leads to unexpected errors with Qwen2 or Qwen2.5 family models when flash-attn is not installed
)
outputs = corrector("完善农产品上行发展机智。")
print(outputs)
# [('完善农产品上行发展机制。',)]
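The printed result above is a list of tuples. A minimal sketch for pulling out the first candidate of each entry, assuming the list holds one tuple per input sentence and that the first element of each tuple is the preferred correction (the helper name `top_corrections` is ours and is not part of `lmcsc`):

```python
from typing import List, Sequence, Tuple


def top_corrections(outputs: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return the first candidate from each entry of the corrector's output."""
    return [candidates[0] for candidates in outputs]


# Sample copied from the printed output above; no model is loaded here.
sample_outputs = [("完善农产品上行发展机制。",)]
print(top_corrections(sample_outputs))
```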
Stream mode is also available:
outputs = corrector("完善农产品上行发展机智。", stream=True)
for output in outputs:
    print(output[0][0], end="\r", flush=True)
print()
We also provide a RESTful API server for the corrector:
python api_server.py \
    --model "Qwen/Qwen2.5-0.5B" \
    --host 127.0.0.1 \
    --port 8000 \
    --workers 1 \
    --bf16 # use bf16 to avoid unexpected errors when using Qwen2 or Qwen2.5 family models without flash-attn.
You can use curl to test the RESTful API server.
# Default
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "完善农产品上行发展机智。"}'
# > {"id":"","object":"correction","choices":[{"index":0,"message":{"content":"完善农产品上行发展机制。"}}],"created":1727058762}
# Stream
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "完善农产品上行发展机智。", "stream": "True"}'
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展模式。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: [DONE]
# Correction with contexts
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "未挨前兆", "contexts": "患者提问:", "stream": "True"}'
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"未"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"未挨"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: [DONE]
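In the sample streams above, each `correction.chunk` event carries the full text decoded so far rather than an incremental piece, and the stream ends with a `[DONE]` sentinel. A minimal client-side sketch for reading such a stream, assuming the events arrive as `data: ...` lines exactly as shown (the function name `iter_stream_contents` is hypothetical, and the sample lines are copied from the output above instead of coming from a live server):

```python
import json
from typing import Iterable, Iterator


def iter_stream_contents(lines: Iterable[str]) -> Iterator[str]:
    """Yield the `content` string of each streamed event until `[DONE]`."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        yield event["choices"][0]["delta"]["content"]


sample = [
    'data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"未挨"},"index":0}],"created":1727058762}',
    "",
    'data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}',
    "data: [DONE]",
]

final = None
for content in iter_stream_contents(sample):
    final = content  # every chunk holds the whole text so far, so keep the last one
print(final)
```

Against a running server, the same generator could consume the decoded lines of the HTTP response body in place of `sample`.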
We provide a demo application for our approach. To run the demo:
- Ensure you have installed the streamlit package.
- Run the following command:
streamlit run demo.py
By default, the demo uses Qwen/Qwen2.5-0.5B, which can run on a V100 GPU with 32GB memory. You can switch to another model in the demo's sidebar or by modifying default_model in configs/demo_app_config.yaml.
The sidebar also allows you to adjust the n_beam, alpha, and use_faithfulness_reward parameters.
Several examples are provided in the sidebar, including a long sentence with 1866 characters.
The experiments on the datasets mentioned in the paper can be run with the following command:
python -u run.py \
    --input-file <input-file> \
    --path <path> \
    --model-name <model-name> \
    --n-observed-chars <n-observed-chars> \
    --n-beam <n-beam> \
    --batch-size <batch-size> \
    --alpha <alpha> \
    --use-faithfulness-reward
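When launching several runs from a script, it can help to assemble the command programmatically. The sketch below builds the argument list from the flags shown above; the helper name `build_run_command` and every value passed to it are placeholders chosen for illustration, not settings recommended by the paper.

```python
import shlex
from typing import List


def build_run_command(
    input_file: str,
    path: str,
    model_name: str,
    n_observed_chars: int,
    n_beam: int,
    batch_size: int,
    alpha: float,
    use_faithfulness_reward: bool = True,
) -> List[str]:
    """Assemble the argument list for run.py using the flags shown above."""
    cmd = [
        "python", "-u", "run.py",
        "--input-file", input_file,
        "--path", path,
        "--model-name", model_name,
        "--n-observed-chars", str(n_observed_chars),
        "--n-beam", str(n_beam),
        "--batch-size", str(batch_size),
        "--alpha", str(alpha),
    ]
    if use_faithfulness_reward:
        cmd.append("--use-faithfulness-reward")
    return cmd


# Placeholder values for illustration only; they are not recommended settings.
cmd = build_run_command(
    input_file="data/example.txt",
    path="results/example",
    model_name="Qwen/Qwen2.5-0.5B",
    n_observed_chars=8,
    n_beam=8,
    batch_size=16,
    alpha=2.5,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.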
Before running, you need to preprocess each sentence pair into the following format, one pair per line:
[src] [tgt]
[src] [tgt]
[src] [tgt]
where [src] and [tgt] are the source and target sentences, respectively, separated by a tab character (\t).
The data preparation steps can be found in scripts/download_datasets.sh. This script downloads the datasets from their original sources, which are hosted on raw.githubusercontent.com and Google Drive, and preprocesses them into the required format.
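For custom data that the download script does not cover, the tab-separated format described above is easy to produce by hand. The following is a minimal sketch, not the repository's own preprocessing code; the helper names `write_pairs` and `read_pairs` are hypothetical, and the example pair is taken from the API example earlier in this README.

```python
import os
import tempfile
from typing import Iterable, List, Tuple


def write_pairs(pairs: Iterable[Tuple[str, str]], out_path: str) -> None:
    """Write one `src<TAB>tgt` line per sentence pair."""
    with open(out_path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            for text in (src, tgt):
                if "\t" in text or "\n" in text:
                    raise ValueError("sentences must not contain tabs or newlines")
            f.write(f"{src}\t{tgt}\n")


def read_pairs(in_path: str) -> List[Tuple[str, str]]:
    """Read the file back, checking that each line has exactly two fields."""
    pairs = []
    with open(in_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(f"line {line_no}: expected 2 fields, got {len(fields)}")
            pairs.append((fields[0], fields[1]))
    return pairs


example_pairs = [("完善农产品上行发展机智。", "完善农产品上行发展机制。")]
with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "example.txt")
    write_pairs(example_pairs, tsv_path)
    loaded_pairs = read_pairs(tsv_path)
print(loaded_pairs == example_pairs)
```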
Supported models
- GPT2
- Baichuan2
- Qwen1.5
- Qwen2
- Qwen2.5
- InternLM2
TODO
- Enable insert and delete operations (almost done).
- Top-k voting for better performance.
- Package the code into a library.
- Speed up the inference process.
- Refactor the code to be compatible with vLLM (Long term plan).
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License - see the LICENSE file for details.
If you find this work useful, please consider citing:
@inproceedings{zhou-etal-2024-simple,
title = "A Simple yet Effective Training-free Prompt-free Approach to {C}hinese Spelling Correction Based on Large Language Models",
author = "Zhou, Houquan and
Li, Zhenghua and
Zhang, Bo and
Li, Chen and
Lai, Shaopeng and
Zhang, Ji and
Huang, Fei and
Zhang, Min",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.966",
pages = "17446--17467",
abstract = "This work proposes a simple training-free prompt-free approach to leverage large language models (LLMs) for the Chinese spelling correction (CSC) task, which is totally different from all previous CSC approaches. The key idea is to use an LLM as a pure language model in a conventional manner. The LLM goes through the input sentence from the beginning, and at each inference step, produces a distribution over its vocabulary for deciding the next token, given a partial sentence. To ensure that the output sentence remains faithful to the input sentence, we design a minimal distortion model that utilizes pronunciation or shape similarities between the original and replaced characters. Furthermore, we propose two useful reward strategies to address practical challenges specific to the CSC task. Experiments on five public datasets demonstrate that our approach significantly improves LLM performance, enabling them to compete with state-of-the-art domain-general CSC models.",
}