## Original Dataset

We surveyed existing publicly available legal datasets and compiled approximately 600k records. These primarily cover criminal and civil cases, and also include cases related to constitutional law, social law, economic law, and other legal areas.

| Dataset | Size | Domain | Task | Metric |
| --- | --- | --- | --- | --- |
| CAIL2018 | 196k | Criminal | Multi-classification | Acc, F1 |
| CAIL-2019-ER | 69k | Civil | Multi-classification | Acc, F1 |
| CAIL-2021-IE | 5k | Criminal | Named entity recognition | F1, P, R |
| Criminal-S | 77k | Criminal | Multi-classification | Acc, P, R, F1 |
| MLMN | 1k | Criminal | Multi-classification | P, R, F1 |
| MSJudeg | 70k | Civil | Multi-classification | F1 |
| CAIL2019-SCM | 9k | Civil | Classification | Acc |
| CAIL2020-AM | 815 | Criminal, Civil | Multiple-choice questions | Acc |
| JEC-QA | 20k | - | Multiple-choice questions | Acc |
| CAIL-2020-TS | 9k | Civil | Text summarization | ROUGE |
| CAIL-2022-TS | 6k | - | Text summarization | ROUGE |
| AC-NLG | 67k | Civil | Text generation | ROUGE, BLEU |
| CJRC | 10k | Criminal, Civil | Reading comprehension | ROUGE, BLEU |
| CrimeKgAssitant | 52k | - | Question answering | ROUGE, BLEU |
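
Most of the classification-style tasks above are scored with accuracy (Acc), precision (P), recall (R), and F1. A minimal sketch of how these scores might be computed, assuming predictions and gold labels are available as parallel lists (the label values below are purely illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold labels and model predictions for a charge-prediction task.
gold = ["theft", "fraud", "theft", "intentional injury"]
pred = ["theft", "theft", "theft", "intentional injury"]

acc = accuracy_score(gold, pred)
# Macro-averaged P/R/F1; micro or weighted averaging is also common in practice.
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```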
  1. CAIL2018: The criminal judgment prediction dataset from CAIL2018 aims to predict the relevant legal provisions, the charges against the defendant, and the length of the defendant's sentence based on the factual descriptions in criminal legal documents.

  2. CAIL-2019-ER: The element recognition dataset from CAIL-2019 requires systems to judge each sentence in judicial documents and identify key case elements. The task covers three civil domains: marriage and family, labor disputes, and loan contracts.

  3. CAIL-2021-IE: Information extraction involves tasks such as named entity recognition and relation extraction. This task focuses on fraud cases and requires precise extraction of key information such as suspects, items involved, and criminal facts.

  4. Criminal-S: Each sample in this dataset involves a single charge; the task is to predict the charge handed down by the judge based on the fact-finding section of the case.

  5. MLMN: This task divides sentencing outcomes into five categories by term length: no criminal punishment, detention, imprisonment for less than 1 year, imprisonment for 1 year or more but less than 3 years, and imprisonment for 3 years or more but less than 10 years. It covers traffic accident and intentional injury cases, predicting the category of the defendant's sentence from the legal documents.

  6. MSJudeg: This task involves civil data on private lending disputes and aims to predict the judge's verdict based on the case facts and the plaintiff's claims.

  7. CAIL2019-SCM: This task involves measuring the similarity of legal documents. Specifically, given the title and factual description of each document, participants need to find the most similar document in a candidate set for each query document (a simple retrieval baseline is sketched after this list).

  8. CAIL2020-AM: This task aims to extract interacting argument pairs between the defense and the prosecution in judgment documents, i.e., points of contention.

  9. JEC-QA: As a dataset of objective questions from the national judicial examination, it contains 7,775 knowledge-driven questions and 13,297 case-analysis questions, each presented as a single-answer or multiple-answer multiple-choice question.

  10. CAIL-2020-TS: This task involves generating judicial summary texts from the original judgment documents (a minimal ROUGE sketch is shown after this list).

  11. CAIL-2022-TS: This task involves generating correct, complete, and concise summaries of legal public opinion from the original public-opinion texts.

  12. AC-NLG: This task involves civil data related to private lending and aims to predict relevant court reasoning texts based on factual descriptions of cases.

  13. CJRC: As a dataset for judicial reading comprehension, it contains 10,000 cases and 50,000 question-answer pairs, aiming to provide a substantive understanding of legal documents and answer related questions.

  14. CrimeKgAssitant: A dataset of real Chinese lawyer consultations, cleaned by LAW-GPT, resulting in 52k single-turn question-answer pairs.
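
For the similar-case matching task (CAIL2019-SCM, item 7), one simple way to frame the problem is as retrieval over a candidate set. The sketch below is only an illustrative baseline, assuming TF-IDF over character n-grams and cosine similarity; it is not the method used in the competition, and the example texts are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented query and candidate fact descriptions (not taken from CAIL2019-SCM).
query = "原告与被告签订借款合同，被告逾期未归还借款"
candidates = [
    "被告向原告借款十万元，到期后拒不归还",
    "被告驾驶机动车发生交通事故，造成原告受伤",
]

# Character n-grams as a rough stand-in for Chinese word segmentation.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
matrix = vectorizer.fit_transform([query] + candidates)

scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
best = int(scores.argmax())
print(f"Most similar candidate: {best} (score={scores[best]:.3f})")
```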
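
The generation-oriented tasks above (CAIL-2020-TS, CAIL-2022-TS, AC-NLG, CJRC, CrimeKgAssitant) are scored with overlap metrics such as ROUGE and BLEU. A minimal hand-rolled ROUGE-1 sketch is given below to illustrate the idea; a real evaluation would use an established ROUGE implementation and proper Chinese tokenization, and the example strings are invented:

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """ROUGE-1: unigram overlap between a reference and a candidate text.
    Character-level tokens are used here as a stand-in for word segmentation."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f1": f1}

# Invented reference summary and system output.
print(rouge_1("被告人因盗窃罪被判处有期徒刑一年", "被告人犯盗窃罪，判处有期徒刑一年"))
```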

## Instruction Datasets

Following LAiW, a comprehensive benchmark framework for Chinese legal LLMs consisting of 14 basic tasks, we constructed the Legal Instruction Tuning dataset (LIT). The dataset is split with a train/valid/test ratio of 7/1/2. The train.jsonl and valid.jsonl files are used for model training, while test.jsonl serves as the LAiW evaluation set to guide and advance the development and evaluation of LLMs. At the current stage, only the test set test.jsonl for each task is publicly available; it can be downloaded from LAiW.
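
A minimal sketch of the 7/1/2 split described above, assuming each task's records sit in a single all.jsonl file with one JSON object per line (the input file name and this script are assumptions for illustration, not part of the released data):

```python
import json
import random

# Hypothetical input file holding all records for one task, one JSON object per line.
with open("all.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(records)

n = len(records)
n_train, n_valid = int(0.7 * n), int(0.1 * n)
splits = {
    "train.jsonl": records[:n_train],
    "valid.jsonl": records[n_train:n_train + n_valid],
    "test.jsonl": records[n_train + n_valid:],  # remaining ~20%
}

for name, rows in splits.items():
    with open(name, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
```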