This is the official code release of the following paper:
Xi Victoria Lin, Richard Socher and Caiming Xiong. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing. Findings of EMNLP 2020.
Cross-domain tabular semantic parsing (X-TSP) is the task of predicting the executable structured query language given a natural language question issued to some database. The model may or may not have seen the target database during training.
This library implements
- A strong sequence-to-sequence based cross-domain text-to-SQL semantic parser that achieved state-of-the-art performance on two widely used benchmark datasets: Spider and WikiSQL.
- A set of SQL processing tools for parsing, tokenizing and validating SQL queries, adapted from the Moz SQL Parser.
The parser can be adapted to learn mappings from text to other structured query languages such as SOQL by modifying the formal langauge pre-processing and post-processing modules.
Our model takes a natural language utterance and a database (schema + field picklists) as input, and generates SQL queries as token sequences. We apply schema-guided decoding and post-processing to make sure the final output is executable.
- Preprocessing: We concatenate the serialized database schema with the utterance to form a tagged sequence. A fuzzy string matching algorithm is used to identify picklist items mentioned in the utterance. The mentioned picklist items are appended to the corresponding field name in the tagged sequence.
- Translating: The hybrid sequence is passed through the BRIDGE model, which output raw program sequences with probability scores via beam search.
- Postprocessing: The raw program sequences are passed through a SQL checker, which verifies its syntactical correctness and schema consistency. Sequences that failed to pass the checker are discarded from the output.
Our implementation has been tested using Pytorch 1.7 and Cuda 11.0 with a single GPU.
git clone https://github.com/salesforce/TabularSemanticParsing
cd TabularSemanticParsing
pip install torch torchvision
python3 -m pip install -r requirements.txt
export PYTHONPATH=`pwd` && python -m nltk.downloader punkt
Download the official data release and unzip the folder. Manually merge spider/train_spider.json
with spider/train_others.json
into a single file spider/train.json
.
mv spider data/
# Data Repair (more details in section 4.3 of paper)
python3 data/spider/scripts/amend_missing_foreign_keys.py data/spider
./experiment-bridge.sh configs/bridge/spider-bridge-bert-large.sh --process_data 0
Download the official data release.
wget https://github.com/salesforce/WikiSQL/raw/master/data.tar.bz2
tar xf data.tar.bz2 -C data && mv data/data data/wikisql1.1
./experiment-bridge.sh configs/bridge/wikisql-bridge-bert-large.sh --process_data 0
The processed data will be stored in a separate pickle file.
Train the model using the following commands. The checkpoint of the best model will be stored in a directory specified by the hyperparameters in the configuration file.
./experiment-bridge.sh configs/bridge/spider-bridge-bert-large.sh --train 0
./experiment-bridge.sh configs/bridge/wikisql-bridge-bert-large.sh --train 0
Decode SQL predictions from pre-trained models. The following commands run inference with the checkpoints stored in the directory specified by the hyperparameters in the configuration file.
./experiment-bridge.sh configs/bridge/spider-bridge-bert-large.sh --inference 0
./experiment-bridge.sh configs/bridge/wikisql-bridge-bert-large.sh --inference 0
Note:
- Add the
--test
flag to the above commands to obtain the test set evaluation results on the corresponding dataset. This flag is invalid for Spider, as its test set is hidden. - Add the
--checkpoint_path [path_to_checkpoint_tar_file]
flag to decode using a checkpoint that's not stored in the default location. - Evaluation metrics will be printed out at the end of decoding. The WikiSQL evaluation takes some time because it computes execution accuracy.
To decode with model ensemble, first list the checkpoint directories of the individual models in the ensemble model configuration file, then run the following command(s).
./experiment-bridge.sh configs/bridge/spider-bridge-bert-large.sh --ensemble_inference 0
You can interact with a pre-trained checkpoint through the commandline using the following commands:
./experiment-bridge.sh configs/bridge/spider-bridge-bert-large.sh --demo 0 --demo_db [db_name] --checkpoint_path [path_to_checkpoint_tar_file]
To change the hyperparameters and other experiment set up, start from the configuration files.
Download pre-trained checkpoints here:
URL | E-SM | EXE |
https://drive.google.com/file/d/1dlrUdGMLvvvfR3kNVy76H12rR7gr4DXI/view?usp=sharing | 70.1 | 68.2 |
mv bridge-spider-bert-large-ems-70-1-exe-68-2.tar.gz model
gunzip model/bridge-spider-bert-large-ems-70-1-exe-68-2.tar.gz
Download cached SQL execution order to normal order mappings:
URL |
https://drive.google.com/file/d/1vk14iR4V_f5x4e17MAaL_L8T9wgjcKCy/view?usp=sharing |
Why this cache? The overhead of converting thousands of SQL queries from execution order to normal order is large, so we cached the conversion for Spider dev set in our experiments. Without using the cache inference on the dev set will be slow. The model still runs fast for individual queries without using a cache.
mv dev.eo.pred.restored.pkl.gz data/spider
gunzip data/spider/dev.eo.pred.restored.pkl.gz
If you find the resource in this repository helpful, please cite
@inproceedings{LinRX2020:BRIDGE,
author = {Xi Victoria Lin and Richard Socher and Caiming Xiong},
title = {Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: Findings, {EMNLP} 2020, November 16-20, 2020},
year = {2020}
}
The parser has been integrated in the Photon web demo: http://naturalsql.com/. Please visit our website to test it live and try it on your own databases!