AnnotatedTables: A Large Tabular Dataset with Language Model Annotations

Installation

conda create -n annotatedtables python=3.10
conda activate annotatedtables
pip install tqdm jsonlines pytest pytest-pycharm networkx prodict records babel tabulate openai
pip install ray openai rai-sdk datasets accelerate fuzzywuzzy python-Levenshtein kaggle wrapt-timeout-decorator
conda install pandas sqlalchemy

Project structure

./sql contains the dataset construction code for SQL annotation and SQL-to-Rel translation
./llm contains the files needed to query ChatGPT and use the large language model
./rel contains files needed to run Rel code to evaluate SQL-to-Rel translation accuracy with execution accuracy
./worksheets contains scripts for reproducing the figures and tables in the paper
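
Execution accuracy is commonly computed by running a reference query and its translation and checking whether the two results agree. The sketch below illustrates one such definition, assuming the results have already been fetched as lists of rows and are compared as unordered multisets. The function names are hypothetical and are not taken from ./rel; the repository's own evaluation may differ.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

# One pair per query: (rows from the reference SQL, rows from the translation).
ResultPair = Tuple[Iterable[Sequence], Iterable[Sequence]]


def results_match(expected: Iterable[Sequence], actual: Iterable[Sequence]) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order."""
    return Counter(tuple(r) for r in expected) == Counter(tuple(r) for r in actual)


def execution_accuracy(pairs: Iterable[ResultPair]) -> float:
    """Fraction of query pairs whose results match (0.0 for an empty input)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    matches = sum(results_match(expected, actual) for expected, actual in pairs)
    return matches / len(pairs)
```

Comparing multisets rather than sets means that a translation returning duplicate rows the reference does not return counts as a mismatch.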

How to run the project and build my own AnnotatedTables?

Open ./sql/synth.py; the main entry point is SynthesisPipeline.run_from_step(). Depending on the pipeline step you choose, you can catalog the Kaggle datasets, download them, get the schema and example row descriptions, synthesize SQL queries, filter the SQL queries by execution, translate SQL to Rel with few-shot prompting, evaluate Rel execution accuracy, and so on.
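
To illustrate what filtering SQL queries by execution can look like, here is a minimal sketch that keeps only the queries that run without error. It assumes an in-memory SQLite database from the Python standard library; the function name, table, and columns are made up for the example, and the pipeline in ./sql/synth.py may use a different engine and different checks.

```python
import sqlite3
from typing import List


def filter_executable(queries: List[str], setup_sql: str) -> List[str]:
    """Return the queries that execute without error on a fresh in-memory database.

    `setup_sql` creates and populates the tables the queries refer to.
    The queries are assumed to be read-only (SELECT statements).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        kept = []
        for query in queries:
            try:
                conn.execute(query).fetchall()
            except sqlite3.Error:
                # Syntax errors, unknown columns, etc.: drop the query.
                continue
            kept.append(query)
        return kept
    finally:
        conn.close()


# Hypothetical table and candidate queries, for illustration only.
SETUP = """
CREATE TABLE patients (age INTEGER, glucose REAL);
INSERT INTO patients VALUES (50, 148.0), (31, 85.0);
"""
CANDIDATES = [
    "SELECT AVG(glucose) FROM patients",
    "SELECT missing_col FROM patients",
    "SELEC age FROM patients",
]
VALID = filter_executable(CANDIDATES, SETUP)
```

Only the first candidate survives: the second refers to a column that does not exist and the third has a syntax error.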

Note that the Kaggle datasets you catalog may differ from ours, since new datasets are added to Kaggle every day.

Data

The data can be found on Zenodo: https://zenodo.org/records/11626802
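
If you prefer to inspect the record programmatically, the following sketch lists the files attached to it. It assumes the Zenodo REST endpoint `https://zenodo.org/api/records/<id>` and a JSON response whose `files` entries carry `key`, `size`, and `links.self`; verify this against the current Zenodo API documentation before relying on it. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Tuple

RECORD_URL = "https://zenodo.org/records/11626802"


def record_api_url(record_url: str) -> str:
    """Map a Zenodo record page URL to the (assumed) REST API URL."""
    record_id = record_url.rstrip("/").rsplit("/", 1)[-1]
    return f"https://zenodo.org/api/records/{record_id}"


def list_files(record: dict) -> List[Tuple[str, int, str]]:
    """Extract (file name, size in bytes, download link) from a record payload."""
    return [
        (entry["key"], entry["size"], entry["links"]["self"])
        for entry in record.get("files", [])
    ]


def fetch_record(record_url: str = RECORD_URL) -> dict:
    """Download and decode the record metadata (requires network access)."""
    with urllib.request.urlopen(record_api_url(record_url)) as response:
        return json.load(response)
```

Calling `list_files(fetch_record())` would then return one tuple per file in the record.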

