No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks


Languages

Chinese

Disclaimer

This repository and its contents are provided for academic and educational purposes only. None of the material constitutes financial, legal, or investment advice. No warranty, express or implied, is given regarding the accuracy, completeness, or usefulness of the content. The authors and contributors are not responsible for any errors, omissions, or consequences arising from the use of the information in this repository. Users should exercise their own judgment and consult professional advisors before making any financial, legal, or investment decisions. Use of any software or information contained in this repository is entirely at the user's own risk.

By using or accessing information in this repository, you agree to indemnify, defend, and hold harmless the authors, contributors, and any affiliated organizations or individuals from any and all claims or damages.


News

📢 Update (Date: 2024/03/12)

🐹 We are excited to share our paper, "No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks".


Overview

ICE-PIXIU is our comprehensive framework, comprising the first cross-lingual bilingual financial instruction dataset (ICE-FIND), a large language model (ICE-INTENT), and an evaluation benchmark (ICE-FLARE). ICE-PIXIU combines a variety of Chinese classification, extraction, reasoning, and prediction financial NLP tasks, strengthening training and performance to address the shortcomings of Chinese financial NLP. At the same time, it integrates a series of translated and original English datasets, enriching the breadth and depth of bilingual financial modeling. The framework offers unrestricted access to multiple model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an expert-annotated evaluation benchmark encompassing 10 NLP tasks and 20 bilingual-specific tasks. Our comprehensive evaluation highlights the advantages of combining these bilingual datasets, particularly for translation tasks and for leveraging original English data, enhancing both language flexibility and analytical acumen in financial contexts.

Key Features

  • Bilingual Capability: ICE-INTENT, a component of ICE-PIXIU, excels in Chinese-English bilingual abilities, crucial for global financial data processing and analysis.
  • Diverse Data: ICE-PIXIU combines various Chinese classification, extraction, reasoning, and prediction NLP tasks, strengthening training and performance to address shortcomings in Chinese financial NLP.
  • Expert Prompts: ICE-PIXIU offers a set of diverse, high-quality, expert-annotated prompts and adopts similar fine-tuning instructions to enhance understanding of financial tasks.
  • Multilingual: ICE-PIXIU extends its capabilities by incorporating translation tasks and English datasets, thereby strengthening its bilingual training and application.
  • Cross-lingual Evaluation: ICE-PIXIU introduces ICE-FLARE, a rigorous cross-lingual evaluation benchmark ensuring consistent model performance across different language contexts.
  • Openness: ICE-PIXIU adopts an open-access approach, offering resources to the research community to foster collaborative development in financial NLP.

ICE-FIND: Chinese-English Cross-lingual Instruction Data

With its distinct data types, financial tasks, and data sources spanning the Chinese and English bilingual domains, ICE-PIXIU serves different user groups across a variety of financial scenarios.

(Figure: Sunburst chart showing the distribution of ICE-PIXIU data across language capabilities, data types, financial NLP tasks, specific financial tasks, and datasets.)

Evaluation Data
All our evaluation datasets can be found here.

Datasets (Evaluation Test)

  • Sentiment Analysis (FinSA)
  • Semantic Matching (FinSM)
  • News Classification (FinNC)
  • Negative Judgment (FinNJ)
  • Answer Selection (FinAS)
  • Relationship Extraction (FinRE)
  • Headline Classification (FinHC)
  • Credit Classification (FinCC)
  • Hawkish-Dovish Classification (FinDC)
  • Event Detection (FinED)
  • Entity Recognition (FinER)
  • Question Answering (FinQA)
  • Stock Prediction (FinSP)
  • Text Summarization (FinTS)

Adding Cross-Lingual Datasets

Data Summary Table: Detailed information on the raw Chinese-English bilingual multi-task financial instruction and evaluation data, including language capability (Lang), data type (D_T), NLP task (NLP_T), specific task (S_T), dataset name, raw data size (Raw), instruction data size (Instruction), evaluation data size (Evaluation), data source, and license.

| Lang | D_T | NLP_T | S_T | Dataset | Raw | Instruction | Evaluation | Data Source | License |
|------|-----|-------|-----|---------|-----|-------------|------------|-------------|---------|
| ZH | DLC | ZH-CLS | FinSA | FE | 18,177 | 18,177 | 2,020 | social texts | Public |
| ZH | DLC | ZH-CLS | FinSA | StockB | 9,812 | 9,812 | 1,962 | social texts | Apache-2.0 |
| ZH | DLC | ZH-CLS | FinSM | BQC | 120,000 | 110,000 | 10,000 | bank service logs | Public |
| ZH | DLC | ZH-CLS | FinSM | AFQMC | 38,650 | 38,650 | 4,316 | online chat service | Apache-2.0 |
| ZH | DLC | ZH-CLS | FinNC | NL | 7,955 | 7,955 | 884 | news articles | Public |
| ZH | DLC | ZH-CLS | FinNC | NL2 | 7,955 | 7,955 | 884 | news articles | Public |
| ZH | DLC | ZH-CLS | FinNJ | NSP | 4,499 | 4,499 | 500 | social texts | Public |
| ZH | DLC | ZH-CLS | FinAS | FinevalF | 1,115 | 1,115 | 222 | financial exam | Apache-2.0 |
| ZH | DLC | ZH-CLS | FinRE | RE | 14,973 | 14,973 | 1,489 | news, entity pairs | Public |
| ZH | DLC | ZH-PRE | FinSP | StockA | 14,769 | 14,769 | 1,477 | news, historical prices | Public |
| ZH | DLE | ZH-EXT | FinQA | QA | 22,375 | 22,375 | 2,469 | QA pairs of news | Public |
| ZH | DLE | ZH-EXT | FinER | CNER | 1,685 | 1,685 | 337 | financial reports | Public |
| ZH | DLE | ZH-EXT | FinED | 19CCKS | 156,834 | 14,674 | 2,936 | social texts | CC BY-SA 4.0 |
| ZH | DLE | ZH-EXT | FinED | 20CCKS | 372,810 | 45,796 | 9,159 | news, reports | CC BY-SA 4.0 |
| ZH | DLE | ZH-EXT | FinED | 21CCKS | 8,000 | 7,000 | 1,400 | news, reports | CC BY-SA 4.0 |
| ZH | DLE | ZH-EXT | FinED | 22CCKS | 109,555 | 59,143 | 11,829 | news, reports | CC BY-SA 4.0 |
| ZH | DLE | ZH-GEN | FinTS | NA | 32,400 | 32,400 | 3,600 | news, announcements | Public |
| ZH | DTT | ZH-TRA | FinSA | CFPB | 4,845 | 4,838 | 970 | economic news | MIT License |
| ZH | DTT | ZH-TRA | FinSA | CFiQA-SA | 1,173 | 1,143 | 233 | news headlines, tweets | MIT License |
| ZH | DTT | ZH-TRA | FinSP | CACL | 27,056 | 2,555 | 511 | tweets, historical prices | MIT License |
| ZH | DTT | ZH-TRA | FinSP | CBigdata | 7,167 | 798 | 159 | tweets, historical prices | MIT License |
| ZH | DTT | ZH-TRA | FinSP | CCIKM | 4,970 | 431 | 86 | tweets, historical prices | MIT License |
| ZH | DTT | ZH-TRA | FinHC | CHeadlines | 102,708 | 10,256 | 2,051 | news headlines | MIT License |
| ZH | DTT | ZH-TRA | FinQA | CEnQA | 8,281 | 668 | 133 | earnings reports | MIT License |
| ZH | DTT | ZH-TRA | FinQA | CConvFinQA | 12,594 | 1,189 | 237 | earnings reports | MIT License |
| EN | DTE | EN-CLS | FinSA | FPB | 4,845 | 4,845 | 970 | economic news | CC BY-SA 3.0 |
| EN | DTE | EN-CLS | FinSA | FiQA-SA | 1,173 | 1,173 | 235 | news headlines, tweets | Public |
| EN | DTE | EN-CLS | FinHC | Headlines | 11,412 | 102,708 | 20,547 | news headlines | CC BY-SA 3.0 |
| EN | DTE | EN-CLS | FinCC | German | 1,000 | 1,000 | 200 | credit records | CC BY-SA 4.0 |
| EN | DTE | EN-CLS | FinCC | Australian | 690 | 690 | 139 | credit records | CC BY-SA 4.0 |
| EN | DTE | EN-PRE | FinSP | ACL18 | 27,053 | 27,053 | 3,720 | tweets, historical prices | MIT License |
| EN | DTE | EN-PRE | FinSP | BigData22 | 7,164 | 7,164 | 1,472 | tweets, historical prices | Public |
| EN | DTE | EN-PRE | FinSP | CIKM18 | 4,967 | 4,967 | 1,143 | tweets, historical prices | Public |
| EN | DTE | EN-EXT | FinER | NER | 609 | 609 | 98 | financial agreements | CC BY-SA 3.0 |
| EN | DTE | EN-REA | FinQA | EnQA | 8,281 | 8,281 | 1,147 | earnings reports | MIT License |
| EN | DTE | EN-REA | FinQA | ConvFinQA | 3,458 | 12,594 | 1,490 | earnings reports | MIT License |
| EN | DOF | EN-DOF | FinER | Finer-Ord | 1,075 | - | 1,075 | news articles | CC BY-SA 4.0 |
| EN | DOF | EN-DOF | FinTS | ECTSUM | 495 | - | 495 | earnings call transcripts | Public |
| EN | DOF | EN-DOF | FinTS | EDTSUM | 2,000 | - | 2,000 | news articles | Public |
| EN | DOF | EN-DOF | FinDC | FOMC | 496 | - | 496 | FOMC transcripts | CC BY-SA 4.0 |

ICE-INTERN: Bilingual Financial Large Language Model

During fine-tuning, we employed QLoRA, a parameter-efficient fine-tuning technique, with a uniform sequence length of 2048 tokens. We used the AdamW optimizer with an initial learning rate of 5e-5, a weight decay of 1e-5, and a warm-up over the first 1% of total steps. All models underwent one round of fine-tuning on eight A100 40GB GPUs with a batch size of 24, using consistent hyperparameter settings.
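For illustration, the sketch below shows how such a QLoRA configuration could be assembled with the Hugging Face transformers, peft, and bitsandbytes libraries. The base model ID, LoRA rank/alpha, and output directory are placeholders of ours, not the repository's actual training script:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model, as in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("BASE_MODEL_ID",  # placeholder model ID
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters; rank and alpha here are illustrative choices
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Hyperparameters taken from the paragraph above
training_args = TrainingArguments(
    output_dir="qlora-finetune",     # placeholder
    per_device_train_batch_size=3,   # 3 per GPU x 8 GPUs = global batch size 24
    num_train_epochs=1,              # one round of fine-tuning
    learning_rate=5e-5,              # initial learning rate
    weight_decay=1e-5,
    warmup_ratio=0.01,               # warm-up over 1% of total steps
    bf16=True,
)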

ICE-FLARE: Cross-Language Financial Evaluation Benchmark

To enable comparative analysis with other general-purpose large models (including Baichuan, ChatGPT, Qwen, etc.) and financial large models, we selected a series of tasks and metrics covering various aspects of financial natural language processing and financial forecasting.

Tasks

| Data | Task | Raw | Data Types | Modalities | License | Paper |
|------|------|-----|------------|------------|---------|-------|
| AFQMC | Semantic Matching | 38,650 | Question Data, Dialogue | Text | Apache-2.0 | [1] |
| corpus | Semantic Matching | 120,000 | Question Data, Dialogue | Text | Public | [2] |
| stockA | Stock Classification | 14,769 | News, Historical Prices | Text, Time Series | Public | [3] |
| Fineval | Multiple Choice | 1,115 | Financial Exams | Text | Apache-2.0 | [4] |
| NL | News Classification | 7,955 | News Reports | Text | Public | [5] |
| NL2 | News Classification | 7,955 | News Reports | Text | Public | [5] |
| NSP | Negative News Judgment | 4,499 | News, Social Media Text | Text | Public | [5] |
| RE | Relation Extraction | 14,973 | News, Entity Pairs | Text | Public | [5] |
| FE | Sentiment Analysis | 18,177 | Financial Social Media Text | Text | Public | [5] |
| stockB | Sentiment Analysis | 9,812 | Financial Social Media Text | Text | Apache-2.0 | [6] |
| QA | Financial Q&A | 22,375 | Financial News Announcements | Text, Tables | Public | [5] |
| NA | Text Summarization | 32,400 | News Articles, Announcements | Text | Public | [5] |
| 19CCKS | Event Subject Extraction | 156,834 | News Reports | Text | CC BY-SA 4.0 | [7] |
| 20CCKS | Event Subject Extraction | 372,810 | News Reports | Text | CC BY-SA 4.0 | [8] |
| 21CCKS | Event Causal Relationship Extraction | 8,000 | News Reports | Text | CC BY-SA 4.0 | [9] |
| 22CCKS | Event Subject Extraction | 109,555 | News Reports | Text | CC BY-SA 4.0 | [10] |
| CNER | Named Entity Recognition | 1,685 | News Reports | Text | Public | [11] |
| CFPB | Sentiment Analysis | 4,845 | News | Text | MIT License | [12] |
| CFIQASA | Sentiment Analysis | 1,173 | News Headlines, Tweets | Text | MIT License | [12] |
| CHeadlines | News Headline Classification | 11,412 | News Headlines | Text | MIT License | [12] |
| CBigData | Stock Trend Prediction | 7,164 | Tweets, Historical Prices | Text, Time Series | MIT License | [12] |
| CACL | Stock Trend Prediction | 27,053 | Tweets, Historical Prices | Text, Time Series | MIT License | [12] |
| CCIKM | Stock Trend Prediction | 4,967 | Tweets, Historical Prices | Text, Time Series | MIT License | [12] |
| CFinQA | Financial Q&A | 14,900 | Earnings Reports | Text, Tables | MIT License | [12] |
| CConvFinQA | Multi-Turn Q&A | 48,364 | Earnings Reports | Text, Tables | MIT License | [12] |
  1. Xu L, Hu H, Zhang X, et al. CLUE: A Chinese language understanding evaluation benchmark[J]. arXiv preprint arXiv:2004.05986, 2020.
  2. Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018. The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4946–4951, Brussels, Belgium. Association for Computational Linguistics.
  3. Jinan Zou, Haiyao Cao, Lingqiao Liu, Yuhao Lin, Ehsan Abbasnejad, and Javen Qinfeng Shi. 2022. Astock: A New Dataset and Automated Stock Trading based on Stock-specific News Analyzing Model. In Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), pages 178–186, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  4. Zhang L, Cai W, Liu Z, et al. FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models[J]. arXiv preprint arXiv:2308.09975, 2023.
  5. Lu D, Liang J, Xu Y, et al. BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark[J]. arXiv preprint arXiv:2302.09432, 2023.
  6. https://huggingface.co/datasets/kuroneko5943/stock11
  7. https://www.biendata.xyz/competition/ccks_2019_4/
  8. https://www.biendata.xyz/competition/ccks_2020_4_1/
  9. https://www.biendata.xyz/competition/ccks_2021_task6_2/
  10. https://www.biendata.xyz/competition/ccks2022_eventext/
  11. Jia C, Shi Y, Yang Q, et al. Entity enhanced BERT pre-training for Chinese NER[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 6384-6396.
  12. Xie Q, Han W, Zhang X, et al. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance[J]. arXiv preprint arXiv:2306.05443, 2023.

Benchmark Evaluation Environment Deployment

Local Installation

git clone https://github.com/chancefocus/PIXIU.git --recursive
cd PIXIU
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]

Docker Image
sudo bash scripts/docker_run.sh

The above command starts a Docker container; you can modify docker_run.sh to suit your environment. A pre-built image is available and can be pulled with sudo docker pull tothemoon/pixiu:latest.

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --network host \
    --env https_proxy=$https_proxy \
    --env http_proxy=$http_proxy \
    --env all_proxy=$all_proxy \
    --env HF_HOME=$hf_home \
    -it [--rm] \
    --name pixiu \
    -v $pixiu_path:$pixiu_path \
    -v $hf_home:$hf_home \
    -v $ssh_pub_key:/root/.ssh/authorized_keys \
    -w $workdir \
    $docker_user/pixiu:$tag \
    [--sshd_port 2201 --cmd "echo 'Hello, world!' && /bin/bash"]

Parameter Description:

  • [] indicates optional parameters
  • HF_HOME: Hugging Face cache directory
  • sshd_port: The SSHD port of the container. You can run ssh -i private_key -p $sshd_port root@$ip to connect to the container. The default is 22001.
  • --rm: Remove the container upon exit (i.e., CTRL + D).

Automated Task Evaluation

Before evaluation, please download the BART checkpoint to src/metrics/BARTScore/bart_score.pth.
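For reference, the checkpoint is typically loaded through the BARTScore library's BARTScorer class. The sketch below shows common usage; the path assumes the location given above, and the toy inputs are placeholders:

# Assumes the BARTScore code (src/metrics/BARTScore) is on the Python path
from bart_score import BARTScorer

bart_scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
bart_scorer.load(path="src/metrics/BARTScore/bart_score.pth")  # fine-tuned weights downloaded above
scores = bart_scorer.score(["generated summary"], ["reference summary"], batch_size=4)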

To perform automatic evaluation, please follow the instructions below:

  1. Hugging Face Transformers

To evaluate models hosted on the Hugging Face Hub (e.g., ICE-INTERN-Full-7B), please use this command:

python eval.py \
    --model "hf-causal-llama" \
    --model_args "use_accelerate=True,pretrained=chancefocus/finma-7b-full,tokenizer=chancefocus/finma-7b-full,use_fast=False" \
    --tasks "flare_ner,flare_sm_acl,flare_fpb"

For more details, please refer to the lm_eval documentation.

  2. Commercial API

Please note that for tasks such as NER, automatic evaluation is based on specific patterns. This may not extract relevant information in zero-shot settings, resulting in performance that is relatively lower than previous human-annotated results.

export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python eval.py \
    --model gpt-4 \
    --tasks flare_zh_fe,flare_cner,flare_sm_acl

  3. Self-Hosted Evaluation

To run the inference backend, please execute the following command:

bash scripts/run_interface.sh

Predefined Task Metrics

| Task | Metric | Illustration |
|------|--------|--------------|
| Classification | Accuracy | The ratio of correctly predicted observations to the total observations: correct predictions / total observations. |
| Classification | F1 Score | The harmonic mean of precision and recall, balancing the two. The score ranges from 0 to 1, where 1 indicates perfect precision and recall and 0 is the worst case. We also provide "weighted" and "macro" versions of the F1 score. |
| Classification | Missing Ratio | The proportion of responses that did not return any of the options given in the task. |
| Classification | Matthews Correlation Coefficient (MCC) | A metric for assessing the quality of binary classification, ranging from -1 to +1. A score of +1 indicates perfect prediction, 0 no better than random chance, and -1 completely opposite predictions. |
| Sequence Labeling | F1 Score | The entity-level F1 score computed by the "seqeval" library, a robust evaluation metric that requires a complete match of entity spans and types between predicted and ground-truth entities. True positives (TP) are correctly predicted entities, false positives (FP) are predicted entities whose span or type does not match, and false negatives (FN) are ground-truth entities that were missed. These quantities yield precision, recall, and the F1 score as their harmonic mean. |
| Sequence Labeling | Label F1 Score | Evaluates model performance solely on the correctness of predicted labels, without considering entity spans. |
| Relation Extraction | Precision | The proportion of correctly predicted relations among all predicted relations: TP / (TP + FP). |
| Relation Extraction | Recall | The proportion of correctly predicted relations among all actual relations: TP / (TP + FN). |
| Relation Extraction | F1 Score | The harmonic mean of precision and recall, best at 1 (both precision and recall perfect) and worst at 0. |
| Extractive and Abstractive Summarization | Rouge-L | Assesses the longest common subsequence (LCS) between the system summary and the reference summary. The LCS naturally captures sentence-level structural similarity and identifies the longest co-occurring in-sequence n-gram. |
| Extractive and Abstractive Summarization | Rouge-N | Measures the overlap of N-grams (contiguous sequences of N items) between the system-generated and reference summaries. ROUGE-1 and ROUGE-2 are typically used for unigram and bigram overlap, respectively. |
| Question Answering | EMACC | Evaluates the exact match between the model-generated answer and the reference answer: an answer is considered correct only if it matches the reference exactly. |
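To make these definitions concrete, the following sketch computes several of the classification and sequence-labeling metrics with the scikit-learn and seqeval libraries on toy labels of our own; the benchmark's actual evaluation code may differ in detail:

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from seqeval.metrics import f1_score as entity_f1_score

# Classification: accuracy and weighted/macro F1 on toy labels
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]
print(accuracy_score(y_true, y_pred))                # correct predictions / total observations
print(f1_score(y_true, y_pred, average="weighted"))  # weighted F1
print(f1_score(y_true, y_pred, average="macro"))     # macro F1

# Binary classification: Matthews correlation coefficient in [-1, +1]
print(matthews_corrcoef([1, 0, 1, 1], [1, 0, 0, 1]))

# Sequence labeling: entity-level F1 via seqeval, which requires exact span and type matches
true_tags = [["B-ORG", "I-ORG", "O", "B-PER", "O"]]
pred_tags = [["B-ORG", "I-ORG", "O", "O", "O"]]
print(entity_f1_score(true_tags, pred_tags))  # ORG entity is a TP; the missed PER counts as a FN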

Sample Input and Output for Specific Tasks

For specific financial tasks such as Question Answering (FinQA), Text Summarization (FinTS), Relation Extraction (FinRE), Event Detection (FinED), Entity Recognition (FinER), and News Classification (FinNC), sample LLM inputs and responses are shown below (queries and responses translated from the original Chinese).

| Task | Query | Baichuan-7B | GPT-4 | ICE-INTERN-full-7B |
|------|-------|-------------|-------|--------------------|
| FinQA | Pudong Construction announced that its subsidiaries, Shanghai Pudong New Area Construction (Group) Co., Ltd. and Shanghai Puxing Road & Bridge Construction Engineering Co., Ltd., won bids for several major engineering projects, with a total bid value of 1.094 billion yuan. Which entities are the subjects of the bid-winning event? | Pudong Construction, Puxing Road & Bridge, Pudong Construction (Group) | The subject of the bid-winning event is Shanghai Pudong New Area Construction (Group) Co., Ltd. | Shanghai Pudong New Area Construction (Group) Co., Ltd.; Shanghai Puxing Road & Bridge Construction Engineering Co., Ltd. |
| FinTS | Markets widely expect the Bank of Japan to keep its ultra-loose monetary policy unchanged, but sources say Tuesday's economic outlook report may raise the inflation forecast for the fiscal year beginning in April. Rising energy and raw-material costs have pushed wholesale prices to a 40-year high, and companies have begun passing the higher costs on to consumers. The recent mild rise in inflation has been driven mainly by higher input costs and a weak yen rather than strong domestic demand. BOJ Governor Haruhiko Kuroda has said that price rises should be accompanied by strong wage growth. Summarize the above financial report in a short abstract that captures its central point; keep the answer as brief as possible. | BOJ Governor Haruhiko Kuroda has said that price rises should be accompanied by strong wage growth. | The BOJ is expected to raise its inflation forecast from April, even though markets broadly expect monetary policy to remain ultra-loose. Rising costs have pushed up wholesale prices and companies have passed them on to consumers, but inflation is driven mainly by higher input costs and a weak yen rather than strong domestic demand. | Sources: the BOJ may raise its inflation forecast for the fiscal year beginning in April. |
| FinRE | New Hope Group's sales are several times those of the listed company. Backed by the "Hope" agricultural empire, the capital market has long expected the group to inject agricultural assets into New Hope Co.; if the group's asset consolidation begins, New Hope Co.'s shares… Analyze the relationship of the entity pair (New Hope Co., New Hope Group) in the text above. | 被收购 (was acquired) | 被注资 (received capital injection) | 自己 (self) |
| FinED | Dazhihui (601519) is under investigation for information-disclosure violations; the "Internet Plus" hype may be coming to an end; executives at Golden Eagle and several other fund companies have resigned, and fund investors worry about performance volatility. Identify the event type and its subject in the financial report above, and output them in the format "event type, event subject". | Cashed out about 150 million yuan | Trading violation, Dazhihui | Information-disclosure violation, Dazhihui |
| FinER | Ms. Sun Yueying, a non-executive director of the Company, no longer serves as chief accountant of China COSCO Shipping Corporation Limited or as chairman of COSCO Finance Co., Ltd. List the named entities for persons ('PER'), organizations ('ORG'), or locations ('LOC') in the financial report above, following the format "entity name, entity type". | China COSCO Shipping Corporation Limited, ORG | COSCO Shipping Development Co., Ltd., ORG | Sun Yueying, PER; China COSCO Shipping Corporation Limited, ORG; COSCO Finance Co., Ltd., ORG |
| FinNC | WTI crude's gain rebounded to 0.5%, now quoted at $75.58 per barrel. Classify this financial report: to which of the categories ['China', 'Foreign', 'International', 'Company', 'Industry', 'Broad Market', 'Economy', 'Policy', 'Politics', 'Futures', 'Bonds', 'Real Estate', 'Foreign Exchange', 'Cryptocurrency', 'COVID-19', 'Energy'] does it belong? | Output unrelated to the content | International, Energy | International Futures |

Citation

If you use ICE-PIXIU in your project, please cite our article.



License

ICE-PIXIU is licensed under the Apache License. For details, please refer to the license file in this repository.
