Before we start, we need to download the dataset and the database from Google Drive and put them in the dataset/ directory.
The file structure should then look like this:
.
├── BULL-cn
├── BULL-en
├── README.md
├── database_cn
├── database_en
└── get_dev.py
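The layout above can be checked before moving on. The following is a minimal sketch, not part of the repository: the function name `check_dataset_layout` is hypothetical, and it assumes the entries in the tree sit directly under the directory passed as the first argument (dataset by default).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository): verifies that every
# entry from the tree listing exists under the given directory.
check_dataset_layout() {
    local root="${1:-dataset}"   # directory to inspect, defaults to "dataset"
    local missing=0
    local entry
    for entry in BULL-cn BULL-en README.md database_cn database_en get_dev.py; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry" >&2
            missing=1
        fi
    done
    # Exit status 0 if everything was found, 1 if anything is missing.
    return "$missing"
}

# Usage: check_dataset_layout dataset && echo "dataset layout looks complete"
```

Missing entries are reported on stderr, so the check can gate the later steps without cluttering their output.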
Next, we need to preprocess the dataset:
bash scripts/preprocessing_finsql.sh
Then train the Cross-Encoder model:
bash scripts/train_text2sql_schema_item_classifier_finsql.sh
Finally, use the Cross-Encoder model to generate predictions for the dev set:
bash scripts/generate_text2sql_dataset_finsql.sh
After preprocessing the dataset, we perform hybrid data augmentation:
bash scripts/hybrid_augmentation.sh
Then we train the LLM:
bash ds_sft.sh
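For convenience, the commands above could be chained in a single wrapper that stops at the first failing step. This is only a sketch under assumptions: the function name `run_finsql_pipeline` and the `DRY_RUN` variable are hypothetical, and it assumes the scripts are run from the repository root in the order they are listed in this section.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the repository): runs the steps from this
# section in order and aborts as soon as one of them fails.
run_finsql_pipeline() {
    local step
    for step in \
        scripts/preprocessing_finsql.sh \
        scripts/train_text2sql_schema_item_classifier_finsql.sh \
        scripts/generate_text2sql_dataset_finsql.sh \
        scripts/hybrid_augmentation.sh \
        ds_sft.sh
    do
        echo "==> bash $step"
        # With DRY_RUN=1 the steps are only printed, not executed.
        if [ "${DRY_RUN:-0}" != "1" ]; then
            bash "$step" || { echo "step failed: $step" >&2; return 1; }
        fi
    done
}

# Usage: DRY_RUN=1 run_finsql_pipeline   # list the steps without running them
```

Running with `DRY_RUN=1` first is a cheap way to confirm the order before starting the training steps.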