Spider: Yale Semantic Parsing and Text-to-SQL Challenge #650

Open · 1 task
irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
Algorithms (Sorting, Learning or Classifying. All algorithms go here.) · dataset (public datasets and embeddings) · Models (LLM and ML model repos and links) · Papers (Research papers) · Research (personal research notes for a topic)

Comments

irthomasthomas (Owner) commented:

Spider: Yale Semantic Parsing and Text-to-SQL Challenge

DESCRIPTION:
Spider 1.0
Yale Semantic Parsing and Text-to-SQL Challenge

What is Spider?
Feb. 5th, 2024: We will no longer accept submissions for Spider 1.0 evaluations or update its leaderboard. Look forward to the release of Spider 2.0, a more realistic and challenging benchmark in the era of LLMs, expected this March. Stay tuned!
Spider is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in the train and test sets. To do well on it, systems must generalize not only to new SQL queries but also to new database schemas.

Why do we call it "Spider"?
Because our dataset is complex and cross-domain, like a spider crawling across multiple complex nests (databases with many foreign keys).
Related works: DS-1000, Binder, UnifiedSKG, multi-turn SParC, and conversational CoSQL text-to-SQL tasks.

News
02/05/2024
We will no longer accept submissions for Spider 1.0 evaluations or update its leaderboard. The test set of Spider 1.0 has already been released (check the Spider dataset link below). Look forward to the release of Spider 2.0, a more realistic and challenging benchmark in the era of LLMs, expected this March. Stay tuned!
08/10/2023 Please check out XLang language model agents!
05/27/2023 Please check out Dr.Spider, a robustness evaluation benchmark based on Spider, from AWS AI Lab for studying robustness in semantic parsing!
11/20/2022 Please check out our recent work DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. Please check out examples, data, and code on the DS-1000 project site!!
10/18/2022 Please check out our recent work Binder: a simple but SOTA neural-symbolic framework built on GPT-3 Codex and a SQL/Python interpreter. It injects GPT-3 Codex API calls into programming languages! Please check out the Binder demo, code, paper, and video on the Binder project site!!
01/18/2022 Please check out our recent work UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. We open-sourced simple but SOTA/strong models for 21 tasks including text-to-SQL! Please check out our code in the UnifiedSKG repo!!
03/11/2021 Please check out a nice work from Google Research (including new Spider splits) for studying compositional generalization in semantic parsing!
11/15/2020 We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. Please find the evaluation code here. Also, note that test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
08/03/2020 Corrected "column_name" and "column_name_original" mismatches in 2 dbs ("scholar" and "formula_1") in tables.json, and reparsed SQL queries (this only affects some models (e.g. RATSQL) which use our parsed SQL as the SQL input). Please download the Spider dataset from this page again.
06/07/2020 We corrected some annotation errors and label mismatches (not errors) in Spider dev and test sets (~4% of dev examples updated, click here for more details). Please download the Spider dataset from this page again.
01/16/2020 For value prediction (in order to compute the execution accuracy), your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (database content is available), or 3) generate numbers (e.g. 3 in "LIMIT 3").
9/24/2019 (Min et al., EMNLP 2019) translated Spider to Chinese! Check out the Chinese challenge page.
5/17/2019 Our paper SParC: Cross-Domain Semantic Parsing in Context with Salesforce Research was accepted to ACL 2019! It introduces the context-dependent version of the Spider challenge: SParC!
5/17/2019 Please report any annotation errors here; we really appreciate your help and will update the data release this summer!
1/14/2019 The submission tutorial is out!
12/17/2018 We updated 7 sqlite database files (issue 14). Please download the Spider dataset from this page again.
10/25/2018 The evaluation script and results were updated (issue 5). Please download the latest versions of the script and papers. Also, please follow instructions in issue 3 to generate the latest SQL parsing results (fixed a bug).

Why Spider?
As the spider chart on the project page shows, Spider 1.0 is distinct from most previous semantic parsing tasks because:
ATIS, Geo, Academic: Each of them contains only a single database with a limited number of SQL queries, and the exact same SQL queries appear in the train and test splits.
WikiSQL: The numbers of SQL queries and tables are significantly larger, but all SQL queries are simple, and each database is only a single simple table without any foreign keys.
Spider 1.0 spans the largest area in the chart, making it the first complex and cross-domain semantic parsing and text-to-SQL dataset! Read more on the blog post.

Getting Started
The data is split into training, development, and test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license) from the Spider website.
Details of the baseline models and the evaluation script can be found on the Spider GitHub repository.
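If you prefer to work with the data programmatically, here is a minimal sketch using the Hugging Face datasets library. It assumes the community mirror published on the Hub under the dataset id "spider" and its db_id, question, and query fields; the official website download remains the canonical source.

from datasets import load_dataset  # pip install datasets

# Load the community mirror of Spider from the Hugging Face Hub (assumed dataset id: "spider").
spider = load_dataset("spider")

example = spider["train"][0]
print(example["db_id"])      # database the question is asked against
print(example["question"])   # natural-language question
print(example["query"])      # gold SQL annotation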

Data Examples
Some examples look like the following:
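The example figures from the original page are not reproduced here. A representative entry, with field names as in the released JSON files and illustrative values, looks like this:

example = {
    "db_id": "concert_singer",                   # target database
    "question": "How many singers do we have?",  # natural-language question
    "query": "SELECT count(*) FROM singer",      # annotated gold SQL
}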

Have Questions or Want to Contribute?
Ask us questions on our GitHub issues page or contact Tao Yu, Rui Zhang, or Michihiro Yasunaga.
We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.

Acknowledgement
We thank Graham Neubig, Tianze Shi, Catherine Finegan-Dollak, and the anonymous reviewers for their precious comments on this project. Also, we thank Pranav Rajpurkar for giving us the permission to build this website based on SQuAD.
Our team at the summit of the East Rock park in New Haven (The pose is "NLseq2SQL"):

Leaderboard - Execution with Values
Our current models do not predict values in SQL conditions, so we do not provide execution accuracies for them. However, we encourage you to include value prediction in future submissions. For value prediction, your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (database content is available), or 3) generate numbers (e.g. 3 in "LIMIT 3"). Note: test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
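To make the value-prediction requirement concrete, here is a small illustration (the question/SQL pair is hypothetical and not taken from the dataset):

# Illustration of the value sources described above (hypothetical example):
question = "What are the names of the 3 youngest singers from France?"
predicted_sql = "SELECT name FROM singer WHERE country = 'France' ORDER BY age ASC LIMIT 3"
# 'France' -> copied from the question text (and checkable against database content)
# 3        -> a generated number (the "LIMIT 3" case)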

Rank | Date | Model | Team (Reference) | Test
1 | Nov 2, 2023 | MiniSeek | Anonymous (code and paper coming soon) | 91.2
1 | Aug 20, 2023 | DAIL-SQL + GPT-4 + Self-Consistency | Alibaba Group (Gao and Wang et al., 2023), code | 86.6
2 | Aug 9, 2023 | DAIL-SQL + GPT-4 | Alibaba Group (Gao and Wang et al., 2023), code | 86.2
3 | Oct 17, 2023 | DPG-SQL + GPT-4 + Self-Correction | Anonymous (code and paper coming soon) | 85.6
4 | Apr 21, 2023 | DIN-SQL + GPT-4 | University of Alberta (Pourreza et al., 2023), code | 85.3
5 | Jul 5, 2023 | Hindsight Chain of Thought with GPT-4 | Anonymous (code and paper coming soon) | 83.9
6 | Jun 1, 2023 | C3 + ChatGPT + Zero-Shot | Zhejiang University & Hundsun (Dong et al., 2023), code | 82.3
7 | Jul 5, 2023 | Hindsight Chain of Thought with GPT-4 and Instructions | Anonymous (code and paper coming soon) | 80.8
8 | Feb 7, 2023 | RESDSQL-3B + NatSQL (DB content used) | Renmin University of China (Li et al., AAAI'23), code | 79.9
9 | Nov 21, 2022 | SeaD + PQL (DB content used) | Anonymous | 78.5
10 | Apr 21, 2023 | DIN-SQL + CodeX | University of Alberta (Pourreza et al., 2023), code | 78.2
11 | Aug 10, 2023 | T5-3B + NatSQL + Token Preprocessing (DB content used) | George Mason University & MIT (Rai et al., ACL'23), code | 78.0
12 | Sep 14, 2022 | CatSQL + GraPPa (DB content used) | Anonymous | 78.0
13 | Sep 13, 2022 | Graphix-3B + PICARD (DB content used) | Alibaba DAMO & HKU STAR & SIAT (Li et al., AAAI'23), code | 77.6
14 | Sep 1, 2022 | SHiP + PICARD (DB content used) | AWS AI Labs (Zhao et al., 2022) | 76.6
15 | Apr 4, 2023 | RASAT + NatSQL + Reranker (DB content used) | Anonymous (paper coming soon) | 76.5
16 | Dec 15, 2022 | N-best List Rerankers + PICARD (DB content used) | Alexa AI (Zeng et al., IEEE SLT 2023) | 75.9
17 | Jun 4, 2022 | RASAT + PICARD (DB content used) | SJTU LUMIA & Netmind.AI (Qi et al., EMNLP'22), code | 75.5
18 | May 8, 2022 | T5-SR (DB content used) | Anonymous | 75.2
19 | Aug 12, 2022 | RESDSQL + T5-1.1-lm100k-xl (DB content used) | Anonymous | 75.1
20 | Jul 14, 2021 | T5-3B + PICARD (DB content used) | Element AI, a ServiceNow company (Scholak et al., EMNLP'21), code | 75.1
21 | Aug 12, 2022 | RESDSQL + T5-1.1-lm100k-large (DB content used) | Anonymous | 74.8
22 | May 18, 2022 | SeaD + SP (DB content used) | Anonymous | 74.1
23 | May 4, 2021 | RATSQL + GAP + NatSQL (DB content used) | Queen Mary University of London (Gan et al., EMNLP Findings'21), code | 73.3
24 | Aug 10, 2021 | T5-Base + NatSQL + Token Preprocessing (DB content used) | George Mason University & MIT (Rai et al., ACL'23), code | 71.1
25 | Mar 10, 2021 | SmBoP + GraPPa (DB content used) | Tel-Aviv University & Allen Institute for AI (Rubin and Berant, NAACL'21), code | 71.1
26 | Aug 5, 2021 | RaSaP + ELECTRA (DB content used) | Ant Group, ZhiXiaoBao & Ada (Huang et al., 2021) | 70.0
27 | Nov 24, 2020 | BRIDGE v2 + BERT (ensemble) (DB content used) | Salesforce Research (Lin et al., EMNLP Findings'20), code | 68.3
28 | Jan 16, 2021 | COMBINE (DB content used) | Novelis.io Research (Youssef et al., 2021) | 68.2
29 | Jul 22, 2022 | T5QL-Base (DB content used) | Anonymous | 66.8
30 | Nov 24, 2020 | BRIDGE v2 + BERT (DB content used) | Salesforce Research (Lin et al., EMNLP Findings'20), code | 64.3
31 | May 30, 2020 | AuxNet + BART (DB content used) | Anonymous | 62.6
32 | May 30, 2020 | BRIDGE + BERT (DB content used) | Salesforce Research (Lin et al., EMNLP Findings'20), code | 59.9
33 | May 20, 2020 | GAZP + BERT (DB content used) | University of Washington & Facebook AI Research (Zhong et al., EMNLP'20) | 53.5

URL: Spider Website

Suggested labels

irthomasthomas (Owner, Author) commented:

Related issues

#546: [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

Details (similarity score: 0.9): [[2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/abs/2304.11015)

[2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

DESCRIPTION:

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

Mohammadreza Pourreza, Davood Rafiei

There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the performance of LLMs in the reasoning process, we study how decomposing the task into smaller sub-tasks can be effective. In particular, we show that breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into LLMs can be an effective approach for significantly improving their performance. Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set.
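As a rough illustration of the decomposition idea described above (a hedged sketch only, not the authors' released implementation; the stage breakdown follows the abstract, and call_llm is a hypothetical stand-in for a prompted LLM call):

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API; replace with a real client."""
    raise NotImplementedError

def text_to_sql(question: str, schema: str) -> str:
    # 1) Solve a smaller sub-task first, e.g. identify relevant schema elements.
    links = call_llm(f"List the tables/columns relevant to: {question}\nSchema: {schema}")
    # 2) Classify how complex the target SQL is, to pick an appropriate prompt.
    difficulty = call_llm(f"Classify the SQL needed for: {question}\nRelevant schema: {links}")
    # 3) Generate SQL conditioned on the sub-task outputs.
    sql = call_llm(f"Write SQL for: {question}\nSchema: {schema}\nRelevant: {links}\nType: {difficulty}")
    # 4) Self-correction: ask the model to fix mistakes in its own draft.
    return call_llm(f"Fix any mistakes in this SQL for '{question}':\n{sql}")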

URL: https://arxiv.org/abs/2304.11015

Suggested labels

{'label-name': 'Text-to-SQL', 'label-description': 'Focuses on generating SQL queries from natural language text.', 'confidence': 76.74}

#649: README.md · PipableAI/pip-sql-1.3b at main

Details (similarity score: 0.88): [README.md · PipableAI/pip-sql-1.3b at main](https://huggingface.co/PipableAI/pip-sql-1.3b/blob/main/README.md?code=true)

README.md · PipableAI/pip-sql-1.3b at main

DESCRIPTION:

  • license: apache-2.0
  • datasets:
    • PipableAI/pip-txt-to-sql-spider-bird-dataset
  • language:
    • en
  • metrics:
    • accuracy
  • tags:
    • sql
    • code
    • text2sql
    • instruction_tuned
    • basemodel
    • jax
    • pytorch
    • tensorflow
    • text-generation-inference
  • library_name: transformers
  • pipeline_tag: text-generation
  • widget:
    • text: "CREATE TABLE system(JobID: String,GID: String, UID: String, Start:Time(yyyy/mm/dd), End: Time,ElapsedRaw: Time, CPUTimeRAW: Time,NCPUS: Number,NNodes: Number, NodeList: List, State:String, Timelimit: Time);Get UID and job id for Jobs that started on Jan 20 , 2023 ended on feb 14 2023 and has job id 20"
      example_title: "example"

pipSQL-1.3b

pipableAi

colab_notebook

What have we built?

A 1.3B-parameter SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks.
This is a distilled model built on the deepseek base model.

How did we build it?

We used softmax cross-entropy and a modified form of policy gradient, along with a Q loss, optimized in an EM setup.
The model card includes a plot of the loss behaviour in this setup (image not reproduced here).

Benchmarking:

For benchmarking we use Semantic Evaluation for Text-to-SQL with Distilled Test Suites, an officially accepted evaluation framework for Spider, SParC, and CoSQL proposed by a research team from Yale and Berkeley.
The benchmark contains 2,200 test data points.
Here is the link to run the evaluation:

Test Suite SQL Eval

model         | easy | medium | hard | extra
sqlcoder-7b-2 | 72.0 | 58.0   | 40.6 | 37.3
pipSQL-1.3b   | 78.5 | 57.5   | 42.1 | 28.3
pipSQL-7b     | 63.0 | 40.0   | 30.2 | 25.0
sqlcoder-7b   | 60.6 | 48.2   | 28.3 | 20.4
gpt-3.5       | 58.8 | 44.7   | 31.0 | 28.4

We have also benchmarked it on the Defog eval, which contains 200 test data points handpicked by the Defog team.
Here is the link to it:

Defog SQL-Eval

The results are shown as a figure in the model card (image not reproduced here).

License

The model is open source under the Apache 2.0 license.

Usage

Installation

pip install transformers

Prompt

prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""

PyTorch

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# `prompt` is the <schema>/<question> string built in the Prompt section above.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Keep only the SQL between the <sql> and </sql> tags.
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])

Flax

from transformers import FlaxAutoModelForCausalLM, AutoTokenizer

model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b", from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# `prompt` is the <schema>/<question> string built in the Prompt section above.
inputs = tokenizer(prompt, return_tensors="jax")
outputs = model.generate(**inputs, max_new_tokens=200).sequences
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])

TensorFlow

from transformers import TFAutoModelForCausalLM, AutoTokenizer

model = TFAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b", from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# `prompt` is the <schema>/<question> string built in the Prompt section above.
inputs = tokenizer(prompt, return_tensors="tf")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])

Examples

Schema

CREATE TABLE Products (
  product_id number,
  parent_product_id number,
  product_name text,
  product_price number,
  product_color text,
  product_size text,
  product_description text);
CREATE TABLE Customers (
  customer_id number,
  gender_code text,
  customer_first_name text,
  customer_middle_initial text,
  customer_last_name text,
  email_address text,
  login_name text,
  login_password text,
  phone_number text,
  address_line_1 text,
  town_city text,
  county text,
  country text);
CREATE TABLE Customer_Payment_Methods (
  customer_id number,
  payment_method_code text);
CREATE TABLE Invoices (
  invoice_number number,
  invoice_status_code text,
  invoice_date time);
CREATE TABLE Orders (
  order_id number,
  customer_id number,
  order_status_code text,
  date_order_placed time);
CREATE TABLE Order_Items (
  order_item_id number,
  product_id number,
  order_id number,
  order_item_status_code text);
CREATE TABLE Shipments (
  shipment_id number,
  order_id number,
  invoice_number number,
  shipment_tracking_number text,
  shipment_date time);
CREATE TABLE Shipment_Items (
  shipment_id number,
  order_item_id number);

Questions

What are the email address, town and county of the customers who are of the least common gender?

SELECT email_address ,  town_city ,  county FROM customers GROUP BY gender_code ORDER BY count(*) ASC LIMIT 1

What are the product price and the product size of the products whose price is above average?

SELECT product_price ,  product_size FROM products WHERE product_price  > (SELECT avg(product_price) FROM products)

Which customers did not make any orders? List the first name, middle initial and last name.

SELECT T1.customer_first_name ,  T1.customer_middle_initial ,  T1.customer_last_name FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)

Team

Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya

URL: PipableAI/pip-sql-1.3b

Suggested labels

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

Details (similarity score: 0.86): [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional: stream just the "arxiv" subset instead
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by filtering, cleaning, and augmenting publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

  • Open-Web (Common Crawl Foundation Terms of Use)
  • Books: the_pile_books3 license and pg19 license
  • ArXiv Terms of Use
  • Wikipedia License
  • StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#333: Paper Digest: NeurIPS-2023 Highlights (Full List)

Details (similarity score: 0.86): [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html)

Paper Digest: NeurIPS 2023 Highlights

https://www.paperdigest.org

1, Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick; Jane Dwivedi-Yu; Roberto Dessi; Roberta Raileanu; Maria Lomeli; Eric Hambro; Luke Zettlemoyer; Nicola Cancedda; Thomas Scialom;
Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View
Highlight: In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.

2, Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark;
Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View
Highlight: Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement.

3, Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena
Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric Xing; Hao Zhang; Joseph Gonzalez; Ion Stoica;
Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   View
Highlight: To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them.

Suggested labels

{ "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }

#309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"

Details (similarity score: 0.85): [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval)

HumanEval: Hand-Written Evaluation Set

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

Installation

Make sure to use python 3.7 or later:

$ conda create -n codex python=3.7
$ conda activate codex
Check out and install this repository:

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
Usage

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:

{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.

Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
To evaluate the samples, run

$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
This script provides more fine-grained information in a new file ending in <input_path>_results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed".

As a quick sanity-check, the example samples should yield 0.5 pass@1.

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}
Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=<comma-separated-values>. For other options, see

$ evaluate_functional_correctness --help
However, we recommend that you use the default values for the rest.

Known Issues

While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.

malloc: can't allocate region
Citation

Please cite using the following bibtex entry:

@Article{chen2021codex,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Suggested labels

{ "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }

#644: cohereai_classify table | CohereAI plugin | Steampipe Hub

Details (similarity score: 0.85): [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)

TITLE: cohereai_classify table | CohereAI plugin | Steampipe Hub

DESCRIPTION:
steampipe plugin install mr-destructive/cohereai

cohereai_classify
cohereai_detect_language
cohereai_detokenize
cohereai_embed
cohereai_generation
cohereai_summaraize
cohereai_summarize
cohereai_tokenize


Table: cohereai_classify

Get classification for a given input strings and examples.

Notes:

  • inputs is a list of strings to classify (max 96 strings).
  • examples is a list of {"text": "apple", "label": "fruit"} structures of type Example.
  • A minimum of 2 examples must be provided; the maximum is 2500, with each example at most 512 tokens.

Examples

Basic classification with given set of inputs and examples

select
  classification
from
  cohereai_classify
where
  inputs = '["apple", "blue", "pineapple"]'
  and examples = '[{"text": "apple", "label": "fruit"}, {"text": "green", "label": "color"}, {"text": "grapes", "label": "fruit"}, {"text": "purple", "label": "color"}]';

Classification with specific settings(model, preset)

select
  classification
from
  cohereai_classify
where
  settings = '{"model": "embed-multilingual-v2.0"}'
  and inputs = '["Help!", "Call me when you can"]'
  and examples = '[{"text": "Help!", "label": "urgent"}, {"text": "SOS", "label": "urgent"}, {"text": "Call me when you can", "label": "not urgent"}, {"text": "Talk later?", "label": "not urgent"}]';

Email Spam Classification

select
  classification
from
  cohereai_classify
where
  inputs = '["Confirm your email address", "hey i need u to send some $"]'
  and examples = '[{"label": "Spam", "text": "Dermatologists don't like her!"}, {"label": "Spam", "text": "Hello, open to this?"}, {"label": "Spam", "text": "I need help please wire me $1000 right now"}, {"label": "Spam", "text": "Hot new investment, don't miss this!"}, {"label": "Spam", "text": "Nice to know you ;)"}, {"label": "Spam", "text": "Please help me?"}, {"label": "Not spam", "text": "Your parcel will be delivered today"}, {"label": "Not spam", "text": "Review changes to our Terms and Conditions"}, {"label": "Not spam", "text": "Weekly sync notes"}, {"label": "Not spam", "text": "Re: Follow up from today's meeting"}, {"label": "Not spam", "text": "Pre-read for tomorrow"}]';

Schema for cohereai_classify

Name           | Type             | Operators | Description
_ctx           | jsonb            |           | Steampipe context in JSON form, e.g. connection_name.
classification | text             |           | The classification results for the given input text(s).
confidence     | double precision |           | The confidence score of the classification.
examples       | text             |           | The example text classified.
id             | text             |           | The ID of the classification.
inputs         | text             |           | The input text that was classified.
labels         | jsonb            |           | The labels of the classification.
settings       | jsonb            |           | Settings is a JSONB object that accepts any of the classify API request parameters.

URL: cohereai_classify table | CohereAI plugin | Steampipe Hub

Suggested labels
