Break is a human annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases. This repository contains the Break dataset along with information on the exact data format.
For more details, check out our TACL paper "Break It Down: A Question Understanding Benchmark" and website.
For the code and models presented in our paper, see our repository at: https://github.com/tomerwolgithub/Break.
- Key Links
  - Break Dataset: Download
  - Paper: "Break It Down: A Question Understanding Benchmark"
  - Models Code: https://github.com/tomerwolgithub/Break
  - Leaderboard:
    - Break: Leaderboard
    - Break High-Level: Leaderboard
  - Evaluator Code: https://github.com/allenai/break-evaluator
  - Website: https://allenai.github.io/Break/
  - Huggingface `nlp` library: https://huggingface.co/datasets/break_data
- 7/04/2020: Break is now part of the HuggingFace `nlp` library, see details.
- 4/10/2020: Pretrained QDMR Parsing models are now available.
- 4/02/2020: New AI2 leaderboards for Break and Break High-Level.
- 2/26/2020: Our paper's entire codebase is now available.
- 1/31/2020: The entire codebase and official leaderboard will be released soon.
- 1/31/2020: The full Break dataset has been released!
- The Break dataset contains questions from the following 10 datasets:
  - Semantic Parsing: Academic, ATIS, GeoQuery, Spider
  - Visual Question Answering: CLEVR-humans, NLVR2
  - Reading Comprehension (and KB-QA): ComQA, ComplexWebQuestions, DROP, HotpotQA
- `QDMR`: Contains questions over text, images and databases annotated with their Question Decomposition Meaning Representation. In addition to the train, dev and (hidden) test sets, we provide `lexicon_tokens` files. For each question, the lexicon file contains the set of valid tokens that could potentially appear in its decomposition (Section 3).
- `QDMR high-level`: Contains questions annotated with the high-level variant of QDMR. These decompositions are exclusive to Reading Comprehension tasks (Section 2). `lexicon_tokens` files are also provided.
- `logical-forms`: Contains questions and QDMRs annotated with full logical-forms of QDMR operators + arguments. Full logical-forms were inferred by the annotation-consistency algorithm described in Section 4.3.
- QDMR & QDMR high-level:
  - train.csv, dev.csv, test.csv:
    - `question_id`: The Break question id, of the format `[ORIGINAL DATASET]_[original split]_[original id]`. E.g., `NLVR2_dev_dev-1049-1-1` is from the NLVR2 dev split, with its NLVR2 id being `dev-1049-1-1`.
    - `question_text`: Original question text.
    - `decomposition`: The annotated QDMR of the question, its steps delimited by `;`. E.g., `return flights ;return #1 from washington ;return #2 to boston ;return #3 in the afternoon`.
    - `operators`: List of tagged QDMR operators for each step. QDMR operators are fully described in Section 2 of the paper. The 14 potential operators are: `select, project, filter, aggregate, group, superlative, comparative, union, intersection, discard, sort, boolean, arithmetic, comparison`. Unidentified operators are tagged with `None`.
    - `split`: The Break dataset split of the example, train / dev / test.
  - train_lexicon_tokens.json, dev_lexicon_tokens.json, test_lexicon_tokens.json:
    - `"source"`: The source question.
    - `"allowed_tokens"`: The set of valid lexicon tokens that can appear in the QDMR of the question. For the method used to generate lexicon tokens see here.
- logical-forms:
  - train.csv, dev.csv, test.csv:
    - `question_id`: Same as before.
    - `question_text`: Same as before.
    - `decomposition`: Same as before.
    - `program`: List of QDMR operators and arguments that the original QDMR was mapped to. E.g., for the QDMR `return citations ;return #1 of Making database systems usable ;return number of #2`, its program is `[ SELECT['citations'], FILTER['#1', 'of Making database systems usable'], AGGREGATE['count', '#2'] ]`.
    - `operators`: Same as before.
    - `split`: Same as before.
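As an illustration of the `question_id` and `decomposition` fields described above, the following is a minimal sketch of how they might be parsed. The function names `parse_question_id` and `parse_decomposition` are hypothetical and not part of the Break codebase. The sketch assumes that the dataset name and split contain no underscore and that references to earlier steps always take the form `#<number>`.

```python
import re


def parse_question_id(question_id):
    """Split an id of the form [ORIGINAL DATASET]_[original split]_[original id].

    Assumption: only the first two underscores are separators, so the
    original id may itself contain underscores.
    """
    dataset, split, original_id = question_id.split("_", 2)
    return dataset, split, original_id


def parse_decomposition(decomposition):
    """Return a list of (step_text, referenced_step_numbers) pairs.

    Steps are delimited by ';'. References such as '#2' point to the
    result of an earlier step.
    """
    steps = [s.strip() for s in decomposition.split(";") if s.strip()]
    return [(step, [int(n) for n in re.findall(r"#(\d+)", step)]) for step in steps]


# Examples taken from the field descriptions above.
dataset, split, original_id = parse_question_id("NLVR2_dev_dev-1049-1-1")
steps = parse_decomposition(
    "return flights ;return #1 from washington ;"
    "return #2 to boston ;return #3 in the afternoon"
)
for number, (text, refs) in enumerate(steps, start=1):
    print(number, text, refs)
```

Each step here refers only to the step directly before it, but a QDMR step may refer to any earlier step, which is why the references are returned as a list.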
Break question decomposition datasets:
Data | Examples | Train | Dev | Test |
---|---|---|---|---|
QDMR | 60,150 | 44,321 (73.7%) | 7,760 (12.9%) | 8,069 (13.4%) |
QDMR High-level | 23,828 | 17,503 (73.5%) | 3,130 (13.1%) | 3,195 (13.4%) |
logical-forms (QDMR) | 59,823 | 44,098 (73.7%) | 7,719 (12.9%) | 8,006 (13.4%) |
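The split sizes in the table above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the counts are copied from the table, and the helper name `split_percentages` is hypothetical.

```python
# (train, dev, test) sizes copied from the table above.
splits = {
    "QDMR": (44321, 7760, 8069),
    "QDMR High-level": (17503, 3130, 3195),
    "logical-forms (QDMR)": (44098, 7719, 8006),
}


def split_percentages(train, dev, test):
    """Return the total size and each split's share, rounded to one decimal."""
    total = train + dev + test
    shares = tuple(round(100 * n / total, 1) for n in (train, dev, test))
    return total, shares


totals = {name: split_percentages(*sizes) for name, sizes in splits.items()}

# QDMR and QDMR High-level together give the 83,978 examples quoted for Break.
overall = totals["QDMR"][0] + totals["QDMR High-level"][0]
print(totals, overall)
```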
QDMR annotations by original dataset:
Data | Examples | Train | Dev | Test |
---|---|---|---|---|
Academic | 195 | 195 | 0 | 0 |
ATIS | 4,906 | 4,042 | 457 | 407 |
GeoQuery | 877 | 547 | 50 | 280 |
Spider | 7,982 | 6,955 | 502 | 525 |
CLEVR-humans | 13,935 | 9,453 | 2,215 | 2,267 |
NLVR2 | 13,517 | 9,915 | 1,805 | 1,797 |
ComQA | 5,520 | 3,546 | 988 | 986 |
ComplexWebQuestions | 2,988 | 1,985 | 475 | 528 |
DROP | 10,230 | 7,683 | 1,268 | 1,279 |
QDMR High-level annotations by original dataset:
Data | Examples | Train | Dev | Test |
---|---|---|---|---|
ComplexWebQuestions | 2,991 | 1,988 | 475 | 528 |
DROP | 10,262 | 7,705 | 1,273 | 1,284 |
HotpotQA-hard | 10,575 | 7,810 | 1,382 | 1,383 |
@article{Wolfson2020Break,
title={Break It Down: A Question Understanding Benchmark},
author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
journal={Transactions of the Association for Computational Linguistics},
year={2020},
}
You can also access Break as part of the HuggingFace `nlp` library:

```python
# Install the library first, e.g. in a notebook cell: !pip install nlp
from nlp import load_dataset

dataset = load_dataset('break_data', 'QDMR-high-level')
# dataset = load_dataset('break_data', 'QDMR')
```
Break is referenced here and can be browsed online as part of a simple viewer.
More details on the options and usage for this library can be found on the `nlp` repository at https://github.com/huggingface/nlp.
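Once an example is in hand, its steps can be lined up with their tagged operators. The sketch below rests on several assumptions: that each example is a dict-like row with the same `decomposition` and `operators` fields as the CSV files, and that `operators` may arrive serialized as a Python-style list string. The helper `steps_with_operators` and the operator tags in `sample_row` are illustrative only and are not taken from the dataset.

```python
import ast
from collections import Counter


def steps_with_operators(example):
    """Pair each QDMR step with its tagged operator."""
    steps = [s.strip() for s in example["decomposition"].split(";") if s.strip()]
    operators = example["operators"]
    # Assumption: the operator list may be stored as a string such as
    # "['select', 'filter']"; parse it safely if so.
    if isinstance(operators, str):
        operators = ast.literal_eval(operators)
    if len(operators) != len(steps):
        raise ValueError("expected exactly one operator per step")
    return list(zip(operators, steps))


# Hypothetical row: the decomposition is the README example, while the
# operator tags are assumed for illustration.
sample_row = {
    "decomposition": (
        "return flights ;return #1 from washington ;"
        "return #2 to boston ;return #3 in the afternoon"
    ),
    "operators": "['select', 'filter', 'filter', 'filter']",
}

pairs = steps_with_operators(sample_row)
operator_counts = Counter(op for op, _ in pairs)
print(pairs)
print(operator_counts)
```

The same function could be applied to rows read from `train.csv` with `csv.DictReader`, provided the column names match those listed in the data format.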