mRAT-SQL-FIT - A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
Code and model from the paper published in Springer Nature's International Journal of Information Technology; here is the SharedIt link.
Code and model from our BRACIS 2021 paper published in Springer Lecture Notes in Computer Science; here is the pre-print on arXiv.
Based on RAT-SQL+GAP: GitHub. Paper: AAAI 2021 paper.
mRAT-SQL+GAP is a multilingual version of RAT-SQL+GAP, which started with the Portuguese language. The code, dataset, and results are available here.
Go to the directory where you want to clone the repository:
git clone https://github.com/C4AI/gap-text2sql
cd gap-text2sql/mrat-sql-gap
Go to your browser and download:
https://drive.google.com/uc?id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
Put the spider.zip file into the directory: gap-text2sql/mrat-sql-gap
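If you prefer the command line, gdown (installed during the environment setup below) can fetch the same file; a minimal sketch, run from the directory above the clone:

```bash
# Download spider.zip from Google Drive into gap-text2sql/mrat-sql-gap.
cd gap-text2sql/mrat-sql-gap
gdown "https://drive.google.com/uc?id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0" -O spider.zip
```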
conda create --name mtext2sql python=3.7
conda activate mtext2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -U "huggingface_hub[cli]"
pip install hf-transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
pip install gdown
conda install -c conda-forge jsonnet
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
conda install jupyter notebook
conda install -c conda-forge jupyter_contrib_nbextensions
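Before moving on, it is worth checking that PyTorch sees the GPU and that the NLTK data downloaded correctly; a quick sanity check:

```bash
# Confirm PyTorch, CUDA visibility, and the NLTK resources installed above.
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import nltk; nltk.data.find('corpora/stopwords'); nltk.data.find('tokenizers/punkt'); print('NLTK data OK')"
```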
Just run the script below; it will copy the datasets. The original version of the Spider dataset is distributed under the CC BY-SA 4.0 license. The modified versions (translated to Portuguese, Spanish, and French, plus double-size (English and Portuguese) and quad-size (English, Portuguese, Spanish, and French)) of train_spider.json, train_others.json, and dev.json are distributed under the same CC BY-SA 4.0 license, respecting ShareAlike.
chmod +x setup.sh
./setup.sh
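To confirm the copy worked, list the dataset directories; a quick check, assuming setup.sh unpacks everything under data/ (the data/spider-en path is used in the preprocessing step below):

```bash
# The Spider variants should now be under data/.
ls data/
```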
The models and checkpoints include big files (gigabytes), so if you have enough disk space you can run all the shell scripts below. To understand how things work, run just BART_large.sh first, then run the others.
./BART_large.sh
./mBART50MtoM-large.sh
./mT5_large.sh
./BERTimbau-base.sh
./BERTimbau-large.sh
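These scripts pull several gigabytes, so a quick disk-usage check helps confirm that everything landed; a sketch, assuming the scripts place the trained checkpoints under logdir/ (the directory used for inference below):

```bash
# Each model's checkpoints live in their own subdirectory of logdir/.
du -sh logdir/*
```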
Now the environment is ready for training (fine-tuning) and inference. Training is very slow: more than 60 hours for BART, BERTimbau, and mBART50, and more than 28 hours for mT5. Therefore I recommend testing the environment with inference first.
This preprocessing step is necessary both for inference and for training, and it will take some time, maybe 40 minutes. I will use the script for BART, but you can use the others; look in the directory experiments/spider-configs.
python run.py preprocess experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet
You can see the processed files in the path:
data/spider-en/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink
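Note that this directory name contains commas and equals signs, so quote it in the shell, for example:

```bash
# Quote the path: the commas and '=' are part of the directory name.
ls "data/spider-en/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink"
```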
I will use the script for BART again.
Note: we run inference using the already-trained checkpoint (in the logdir directory), as defined in experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet:
logdir: "logdir/BART-large-en-train",
and
eval_steps: [40300],
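If you want to evaluate a different checkpoint, change eval_steps (and, for another model, logdir) in the config. A minimal sketch of an in-place edit, where the step value 41000 is only a hypothetical example (use a step that actually exists in your logdir):

```bash
# Switch evaluation to another saved step (41000 is a hypothetical example).
sed -i 's/eval_steps: \[40300\]/eval_steps: [41000]/' \
  experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet
```

Then run the evaluation: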
python run.py eval experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet
You then get the inference results and evaluation results in the paths:
ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step40300.infer
and
ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step40300.eval
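The .eval file is the evaluation report; assuming it is JSON, as in the upstream RAT-SQL code, you can pretty-print it to inspect the per-difficulty breakdown:

```bash
# Pretty-print the evaluation report (format assumed JSON, as in upstream RAT-SQL).
python -m json.tool ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step40300.eval | head -n 40
```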
Execute this only if you really want to fine-tune the model; it will take a long time. But if you have a good machine available and want to see different checkpoints in the logdir, go ahead.
python run.py train experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet
You then get the training checkpoints in the path:
logdir/BART-large-en-train
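Checkpoints are written periodically during training, so you can watch the log directory to track progress, e.g.:

```bash
# List the newest files first to see the latest checkpoint.
ls -lt logdir/BART-large-en-train | head
```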
The checkpoints are available here (ESM = Exact Set Matching accuracy).
Paper mRAT-SQL+GAP - Multilingual version of RAT-SQL+GAP
- BART-large trained in English (ESM all: 0.718)
  - Checkpoint: 40300
    - Inference - English: ESM all: 0.718 - Baseline
- BERTimbau-base trained in Portuguese
  - Checkpoint: 24100
    - Inference - Portuguese: ESM all: 0.417
- mBART50MtoM-large trained in English
  - Checkpoint: 23100
    - Inference - English: ESM all: 0.651
- mBART50MtoM-large trained in Portuguese
  - Checkpoint: 39100
    - Inference - Portuguese: ESM all: 0.588
- mBART50MtoM-large trained in English and Portuguese (together)
  - Checkpoint: 41000
    - Inference - English: ESM all: 0.664
    - Inference - Portuguese: ESM all: 0.595 - best inference in Portuguese
  - Checkpoint: 21100
    - Inference - English: ESM all: 0.678 - best inference in English
    - Inference - Portuguese: ESM all: 0.581
Future work of the paper mRAT-SQL+GAP
- BERTimbau-large trained in Portuguese
  - Checkpoint: 40100
    - Inference - Portuguese: ESM all: 0.418
- mBART50MtoM-large trained in English, Portuguese, Spanish, and French (together) - best inferences only
  - Checkpoint: 39100
    - Inference - English: ESM all: 0.696
  - Checkpoint: 42100
    - Inference - Portuguese: ESM all: 0.626
    - Inference - Spanish: ESM all: 0.628
  - Checkpoint: 44100
    - Inference - French: ESM all: 0.649
Paper mRAT-SQL-FIT
- mT5-large trained in English, 51K steps
  - Checkpoint: 50100
    - Inference - English: ESM all: 0.684
- mT5-large trained in English, Portuguese, Spanish, and French (together), 51K steps - best inferences only
  - Checkpoint: 51100
    - Inference - English: ESM all: 0.715
  - Checkpoint: 42100
    - Inference - Portuguese: ESM all: 0.680
  - Checkpoint: 50100
    - Inference - Spanish: ESM all: 0.660
    - Inference - French: ESM all: 0.672
- mT5-large trained in English, Portuguese, Spanish, and French (together), 120K steps - best inferences only
  - Checkpoint: 77500
    - Inference - English: ESM all: 0.718
  - Checkpoint: 85500
    - Inference - Portuguese: ESM all: 0.675
  - Checkpoint: 76500
    - Inference - Spanish: ESM all: 0.675
  - Checkpoint: 67500
    - Inference - French: ESM all: 0.681
- mT5-large trained in English, Portuguese, Spanish, and French (together), FIT, 120K steps - best inferences only
  - Checkpoint: 105100
    - Inference - English (simplemma.load_data('en','pt','es','fr')): ESM all: 0.735 - best inference in English
    - Inference - English (simplemma.load_data('en')): ESM all: 0.736 - best inference in English
  - Checkpoint: 102100
    - Inference - Portuguese: ESM all: 0.687
  - Checkpoint: 114100
    - Inference - Spanish: ESM all: 0.689
    - Inference - French: ESM all: 0.698
- mT5-large trained in English, Portuguese, Spanish, and French (together), 2048TKs, 480K steps - inference in English only
  - Checkpoint: 290100
    - Inference - English: ESM all: 0.697
Other Best Results
- T5-v1_1-large trained in English, FIT, 150K steps
  - Checkpoint: 150300
    - Inference - English: ESM all: 0.736
- mT5-large trained in English, Portuguese, Spanish, and French (together) + non-linear data augmentation by rules for extra questions (3enr-3ptr-3esr-3frr), FIT, 150K steps
  - Checkpoint: 128100
    - Inference - English: ESM all: 0.726
  - Checkpoint: 125100
    - Inference - Portuguese: ESM all: 0.698
    - Inference - French: ESM all: 0.700
  - Checkpoint: 136100
    - Inference - Spanish: ESM all: 0.691
All intermediate result files are in the inference-results directory.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.