NL QL data augmentation

This repository contains all code related to our data augmentation approach for NL-to-SQL tasks.

Components overview

Data folder /data

Each subfolder under data holds the data for one dataset. Within each dataset subfolder, the subdirectories contain the following:

  • data_aug for the data augmentation files used by the scripts in src/data_augmentation
  • generative for the data augmentation files used by the scripts in src/synthetic_data
  • handmade_training_data for the generated data used to train and evaluate ValueNet
  • original for the original database schemata

Script folder /src

The folder src holds all scripts for the data augmentation. Its subfolders serve the following purposes:

  • data_augmentation for the data augmentation pipeline based on seed data, including training of the CRITIC model

  • synthetic_data for the data augmentation pipeline based on the data generative schema

  • intermediate_representation, preprocessing and spider for all AST-related helper files

  • tools for all other helper files

Usage

Install dependencies and prerequisites

  1. Run pip install -r requirements.txt to install all required dependencies

  2. Sign up at openai.com and create an API key at https://beta.openai.com/account/api-keys

  3. Add the API key to a .env file in the root folder of the project, as the line shown below. If this file does not exist yet, create it.

OPENAI_API_KEY=<api_key_from_openai>
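
The scripts that call the OpenAI API are expected to read this key from the environment. A minimal sketch of how the variable can be loaded, assuming the python-dotenv package and the legacy openai client (the repository's own scripts may load it differently):

    # Minimal sketch: load the key from .env and pass it to the (legacy) openai client.
    import os
    from dotenv import load_dotenv  # pip install python-dotenv
    import openai

    load_dotenv()  # reads .env from the current working directory
    openai.api_key = os.getenv("OPENAI_API_KEY")
    assert openai.api_key, "OPENAI_API_KEY is not set in .env"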

Data augmentation based on shuffling on the AST, with analytics-based re-ranking via sentence transformers

  1. Prepare generative schema

    • Run the script shown below:
    python3 src/tools/transform_generative_schema.py --data_path <datapath>

    where <datapath> is the dataset path under data that contains the subfolder data/<datapath>/generative and the original schema data/<datapath>/original/tables.json

  2. Add more query types to src/synthetic_data/common_query_types.py if necessary -- find the query types with

    python3 src/synthetic_data/group_paris_to_find_templates.py \
    --data_containing_semql <training_or_dev_json_file>

    -- then add the query types found to src/synthetic_data/common_query_types.py

  3. Generate data

    • Write your own generation script, similar to src/synthetic_data/generate_synthetical_data_cordis (TODO: generalize this step for all datasets)
  4. Re-rank the data and generate the handmade training data (an illustrative sketch of the embedding-based re-ranking follows this list):

    python3 src/synthetic_data/apply_sentence_embedding_reranking.py \
    --input_data_path <input_data_path> \
    --output_file <output_path> \
    --db_id <db_id>
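
The exact criterion used by apply_sentence_embedding_reranking.py is not documented here; the following is only an illustrative sketch of the general idea, assuming the sentence-transformers package, the all-MiniLM-L6-v2 model and a hypothetical rerank helper: candidate questions are embedded and ordered by cosine similarity to a reference question.

    # Illustrative sketch (not the repository's script): rank generated questions
    # by cosine similarity of their sentence embeddings to a reference question.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence-embedding model works

    def rerank(reference_question, candidate_questions, top_k=5):
        """Return the top_k candidates closest to the reference in embedding space."""
        ref_emb = model.encode(reference_question, convert_to_tensor=True)
        cand_emb = model.encode(candidate_questions, convert_to_tensor=True)
        scores = util.cos_sim(ref_emb, cand_emb)[0]  # one similarity score per candidate
        ranked = sorted(zip(candidate_questions, scores.tolist()),
                        key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]

    ranked = rerank("How many projects started in 2020?",
                    ["Count the projects that began in 2020.",
                     "List all projects.",
                     "How many projects have a start year of 2020?"])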

Hyperparameters and hardware specifications in evaluation experiments

Evaluation on ValueNet

Hyperparameter                 Selected value
Pretrained model of encoder    bart-base
Seed                           90
Dropout                        0.3
Optimizer                      Adam
Learning rate base             0.001
Beam size                      1
Clip grad                      5
Accumulated iterations         4
Batch size                     4
Column attention               affine
Number of epochs               100
Hidden size                    300
Attention vector size          300
Learning rate connection       0.0001
Max grad norm                  1
Column embedding size          300
Column pointer                 true
Learning rate of Transformer   2e-5
Max sequence length            1024
Scheduler gamma                0.5
Type embedding size            128
Action embedding size          128
Sketch loss weight             1
Decode max time step           80
Loss epoch threshold           70

Hardware specification         Value
CPU count                      8
GPU count                      1
GPU type                       NVIDIA V100
Total running time             ca. 12 days

Evaluation on T5-Large

Hyperparameter           Selected value
Pretrained model         T5-Large
Dropout                  0.1
Learning rate base       0.0001
Optimizer                Adafactor
Clip grad                5
Accumulated iterations   4
Batch size               4
Gradient accumulation    135
Max sequence length      512
Number of steps          6500

Hardware specification   Value
CPU count                8
GPU count                1
GPU type                 NVIDIA A100
Total running time       ca. 16 days

Evaluation on SmBoP

Hyperparameter           Selected value
Pretrained model         GraPPa-Large
Dropout                  0.1
Learning rate base       0.000186
Optimizer                Adam
RAT layers               8
Beam size                30
Batch size               16
Gradient accumulation    4
Max sequence length      512
Max steps                60000
Early stopping           5 epochs

Hardware specification   Value
CPU count                8
GPU count                1
GPU type                 NVIDIA T4
Total running time       ca. 26 days

Contributing

License
