## Original Dataset

We surveyed existing publicly available legal datasets and compiled approximately 600k records. These primarily cover criminal and civil cases, and also include cases related to constitutional law, social law, economic law, and other legal areas.

| Dataset | Size | Domain | Task | Metric |
| --- | --- | --- | --- | --- |
| CAIL2018 | 196k | Criminal | Multi-classification | Acc, F1 |
| CAIL-2019-ER | 69k | Civil | Multi-classification | Acc, F1 |
| CAIL-2021-IE | 5k | Criminal | Named entity recognition | F1, P, R |
| Criminal-S | 77k | Criminal | Multi-classification | Acc, P, R, F1 |
| MLMN | 1k | Criminal | Multi-classification | P, R, F1 |
| MSJudeg | 70k | Civil | Multi-classification | F1 |
| CAIL2019-SCM | 9k | Civil | Classification | Acc |
| CAIL2020-AM | 815 | Criminal, Civil | Multiple-choice questions | Acc |
| JEC-QA | 20k | - | Multiple-choice questions | Acc |
| CAIL-2020-TS | 9k | Civil | Text summarization | ROUGE |
| CAIL-2022-TS | 6k | - | Text summarization | ROUGE |
| AC-NLG | 67k | Civil | Text generation | ROUGE, BLEU |
| CJRC | 10k | Criminal, Civil | Reading comprehension | ROUGE, BLEU |
| CrimeKgAssitant | 52k | - | Question answering | ROUGE, BLEU |
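
Most of the classification-style tasks above are scored with accuracy (Acc), precision (P), recall (R), and F1. A minimal sketch of how these scores might be computed, assuming predictions and gold labels are available as parallel lists (the label values below are purely illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold labels and model predictions for a charge-prediction task.
gold = ["theft", "fraud", "theft", "intentional injury"]
pred = ["theft", "theft", "theft", "intentional injury"]

acc = accuracy_score(gold, pred)
# Macro-averaged P/R/F1; micro or weighted averaging is also common in practice.
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```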
  1. CAIL2018: The criminal judgment prediction dataset from CAIL2018 aims to predict the relevant legal provisions, the charges against the defendant, and the length of the defendant's sentence based on the factual descriptions in criminal legal documents.

  2. CAIL-2019-ER: The element recognition dataset from CAIL-2019 requires systems to judge each sentence in judicial documents and identify key case elements. The task covers three civil domains: marriage and family, labor disputes, and loan contracts.

  3. CAIL-2021-IE: Information extraction involves tasks such as named entity recognition and relation extraction. This task focuses on fraud cases and requires precise extraction of key information such as suspects, items involved, and criminal facts.

  4. Criminal-S: Each sample in this dataset involves a single charge; the task is to predict the charge handed down by the judge based on the fact-finding section of the case.

  5. MLMN: This task divides sentencing outcomes into five categories by term length: no criminal punishment, detention, imprisonment for less than 1 year, imprisonment for 1 year or more but less than 3 years, and imprisonment for 3 years or more but less than 10 years. It covers traffic accident and intentional injury cases, predicting the category of the defendant's sentence from the legal documents.

  6. MSJudeg: This task involves civil data on private lending disputes and aims to predict the judge's verdict based on the case facts and the plaintiff's claims.

  7. CAIL2019-SCM: This task involves measuring the similarity of legal documents. Specifically, given the title and factual description of each document, participants need to find the most similar document in a candidate set for each query document (a simple retrieval baseline is sketched after this list).

  8. CAIL2020-AM: This task aims to extract interacting argument pairs between the defense and the prosecution in judgment documents, i.e., points of contention.

  9. JEC-QA: As a dataset of objective questions from the national judicial examination, it contains 7,775 knowledge-driven questions and 13,297 case-analysis questions, each presented as a single-answer or multiple-answer multiple-choice question.

  10. CAIL-2020-TS: This task involves generating judicial summary texts from the original judgment documents (a minimal ROUGE sketch is shown after this list).

  11. CAIL-2022-TS: This task involves generating correct, complete, and concise summaries of legal public opinion from the original public-opinion texts.

  12. AC-NLG: This task involves civil data related to private lending and aims to predict relevant court reasoning texts based on factual descriptions of cases.

  13. CJRC: As a dataset for judicial reading comprehension, it contains 10,000 cases and 50,000 question-answer pairs, aiming to provide a substantive understanding of legal documents and answer related questions.

  14. CrimeKgAssitant: A dataset of real Chinese lawyer consultations, cleaned by LAW-GPT, resulting in 52k single-turn question-answer pairs.
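
For the similar-case matching task (CAIL2019-SCM, item 7), one simple way to frame the problem is as retrieval over a candidate set. The sketch below is only an illustrative baseline, assuming TF-IDF over character n-grams and cosine similarity; it is not the method used in the competition, and the example texts are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented query and candidate fact descriptions (not taken from CAIL2019-SCM).
query = "原告与被告签订借款合同，被告逾期未归还借款"
candidates = [
    "被告向原告借款十万元，到期后拒不归还",
    "被告驾驶机动车发生交通事故，造成原告受伤",
]

# Character n-grams as a rough stand-in for Chinese word segmentation.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
matrix = vectorizer.fit_transform([query] + candidates)

scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
best = int(scores.argmax())
print(f"Most similar candidate: {best} (score={scores[best]:.3f})")
```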
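
The generation-oriented tasks above (CAIL-2020-TS, CAIL-2022-TS, AC-NLG, CJRC, CrimeKgAssitant) are scored with overlap metrics such as ROUGE and BLEU. A minimal hand-rolled ROUGE-1 sketch is given below to illustrate the idea; a real evaluation would use an established ROUGE implementation and proper Chinese tokenization, and the example strings are invented:

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """ROUGE-1: unigram overlap between a reference and a candidate text.
    Character-level tokens are used here as a stand-in for word segmentation."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f1": f1}

# Invented reference summary and system output.
print(rouge_1("被告人因盗窃罪被判处有期徒刑一年", "被告人犯盗窃罪，判处有期徒刑一年"))
```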

## Instruction Datasets

Following LAiW, a comprehensive benchmark framework for Chinese legal LLMs consisting of 14 basic tasks, we constructed the Legal Instruction Tuning dataset (LIT). The dataset is split with a train/valid/test ratio of 7/1/2. The train.jsonl and valid.jsonl files are used for model training, while test.jsonl serves as the LAiW evaluation set to guide and advance the development and evaluation of LLMs. At the current stage, only the test set test.jsonl for each task is publicly available; it can be downloaded from LAiW.
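
A minimal sketch of the 7/1/2 split described above, assuming each task's records sit in a single all.jsonl file with one JSON object per line (the input file name and this script are assumptions for illustration, not part of the released data):

```python
import json
import random

# Hypothetical input file holding all records for one task, one JSON object per line.
with open("all.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(records)

n = len(records)
n_train, n_valid = int(0.7 * n), int(0.1 * n)
splits = {
    "train.jsonl": records[:n_train],
    "valid.jsonl": records[n_train:n_train + n_valid],
    "test.jsonl": records[n_train + n_valid:],  # remaining ~20%
}

for name, rows in splits.items():
    with open(name, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
```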