allenai/Break

Break: A Question Understanding Benchmark

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases. This repository contains the Break dataset along with a description of the exact data format.

For more details, check out our TACL paper "Break It Down: A Question Understanding Benchmark" and our website.
For the code and models presented in our paper, see our repository at: https://github.com/tomerwolgithub/Break.

Changelog

  • 7/04/2020 Break is now part of the HuggingFace nlp library; see details.
  • 4/10/2020 Pretrained QDMR Parsing models are now available.
  • 4/02/2020 New AI2 leaderboards for Break and Break High-Level.
  • 2/26/2020 Our paper's entire codebase is now available.
  • 1/31/2020 The entire codebase and official leaderboard will be released soon.
  • 1/31/2020 The full Break dataset has been released!

Question Answering Datasets

Data Description

Datasets

  • QDMR: Contains questions over text, images and databases annotated with their Question Decomposition Meaning Representation. In addition to the train, dev and (hidden) test sets we provide lexicon_tokens files. For each question, the lexicon file contains the set of valid tokens that could potentially appear in its decomposition (Section 3).
  • QDMR high-level: Contains questions annotated with the high-level variant of QDMR. These decompositions are exclusive to Reading Comprehension tasks (Section 2). lexicon_tokens files are also provided.
  • logical-forms: Contains questions and QDMRs annotated with full logical-forms of QDMR operators + arguments. Full logical-forms were inferred by the annotation-consistency algorithm described in Section 4.3.

Data Format

  • QDMR & QDMR high-level:
    • train.csv, dev.csv, test.csv:
      • question_id: The Break question id, in the format [ORIGINAL DATASET]_[original split]_[original id]. E.g., NLVR2_dev_dev-1049-1-1 is from the NLVR2 dev split, and its NLVR2 id is dev-1049-1-1.
      • question_text: Original question text.
      • decomposition: The annotated QDMR of the question, its steps delimited by ;. E.g., return flights ;return #1 from washington ;return #2 to boston ;return #3 in the afternoon.
      • operators: List of tagged QDMR operators for each step. QDMR operators are fully described in Section 2 of the paper. The 14 possible operators are: select, project, filter, aggregate, group, superlative, comparative, union, intersection, discard, sort, boolean, arithmetic, comparison. Unidentified operators are tagged with None.
      • split: The Break dataset split of the example, train / dev / test.
    • train_lexicon_tokens.json, dev_lexicon_tokens.json, test_lexicon_tokens.json:
      • "source": The source question.
      • "allowed_tokens": The set of valid lexicon tokens that can appear in the QDMR of the question. For the method used to generate lexicon tokens see here.
  • logical-forms:
    • train.csv, dev.csv, test.csv:
      • question_id: Same as before.
      • question_text: Same as before.
      • decomposition: Same as before.
      • program: List of QDMR operators and arguments that the original QDMR was mapped to. E.g., for the QDMR return citations ;return #1 of Making database systems usable ;return number of #2, the program is [ SELECT['citations'], FILTER['#1', 'of Making database systems usable'], AGGREGATE['count', '#2'] ].
      • operators: Same as before.
      • split: Same as before.
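
The question_id and decomposition fields described above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the Break codebase: the helper names (parse_question_id, split_decomposition, step_references) are hypothetical, and it assumes that dataset and split names contain no underscore and that references to earlier steps are always written as # followed by a number.

```python
import re
from typing import List, NamedTuple


class QuestionId(NamedTuple):
    dataset: str
    split: str
    original_id: str


def parse_question_id(question_id: str) -> QuestionId:
    """Split '[ORIGINAL DATASET]_[original split]_[original id]' into parts.

    Assumption: dataset and split names contain no underscore. The original
    id may contain underscores, so only the first two underscores split.
    """
    dataset, split, original_id = question_id.split("_", 2)
    return QuestionId(dataset, split, original_id)


def split_decomposition(decomposition: str) -> List[str]:
    """Return the QDMR steps, which are delimited by ';' in the CSV field."""
    return [step.strip() for step in decomposition.split(";") if step.strip()]


def step_references(step: str) -> List[int]:
    """Return the numbers of the earlier steps a step refers to (e.g. '#2')."""
    return [int(ref) for ref in re.findall(r"#(\d+)", step)]


# Examples taken from the field descriptions above.
qid = parse_question_id("NLVR2_dev_dev-1049-1-1")
steps = split_decomposition(
    "return flights ;return #1 from washington ;return #2 to boston ;"
    "return #3 in the afternoon"
)
for number, step in enumerate(steps, start=1):
    print(number, step, step_references(step))
```

Each step after the first refers back to the previous one, so the references form a simple chain here; other decompositions may reference several earlier steps in a single step.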

Data Statistics

Break question decomposition datasets:

Data | Examples | Train | Dev | Test
QDMR | 60,150 | 44,321 (73.7%) | 7,760 (12.9%) | 8,069 (13.4%)
QDMR High-level | 23,828 | 17,503 (73.5%) | 3,130 (13.1%) | 3,195 (13.4%)
logical-forms (QDMR) | 59,823 | 44,098 (73.7%) | 7,719 (12.9%) | 8,006 (13.4%)
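
As a quick sanity check, the totals and split percentages in the table above can be recomputed from the per-split counts. The sketch below copies the counts by hand from the table; the function name split_shares is hypothetical.

```python
from typing import Dict, List, Tuple

# (train, dev, test) counts copied from the table above.
SPLIT_COUNTS: Dict[str, Tuple[int, int, int]] = {
    "QDMR": (44321, 7760, 8069),
    "QDMR High-level": (17503, 3130, 3195),
    "logical-forms (QDMR)": (44098, 7719, 8006),
}


def split_shares(counts: Tuple[int, int, int]) -> Tuple[int, List[float]]:
    """Return the total and each split's share in percent (one decimal)."""
    total = sum(counts)
    return total, [round(100 * count / total, 1) for count in counts]


for name, counts in SPLIT_COUNTS.items():
    total, shares = split_shares(counts)
    print(f"{name}: total={total}, train/dev/test %={shares}")
```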

QDMR annotations by original dataset:

Data | Examples | Train | Dev | Test
Academic | 195 | 195 | 0 | 0
ATIS | 4,906 | 4,042 | 457 | 407
GeoQuery | 877 | 547 | 50 | 280
Spider | 7,982 | 6,955 | 502 | 525
CLEVR-humans | 13,935 | 9,453 | 2,215 | 2,267
NLVR2 | 13,517 | 9,915 | 1,805 | 1,797
ComQA | 5,520 | 3,546 | 988 | 986
ComplexWebQuestions | 2,988 | 1,985 | 475 | 528
DROP | 10,230 | 7,683 | 1,268 | 1,279

QDMR High-level annotations by original dataset:

Data | Examples | Train | Dev | Test
ComplexWebQuestions | 2,991 | 1,988 | 475 | 528
DROP | 10,262 | 7,705 | 1,273 | 1,284
HotpotQA-hard | 10,575 | 7,810 | 1,382 | 1,383

Reference

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

HuggingFace nlp library

You can also access Break as part of the HuggingFace nlp library:

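# Notebook cell syntax; from a shell, run `pip install nlp` instead.
!pip install nlp
from nlp import load_dataset
# Load the high-level QDMR configuration.
dataset = load_dataset('break_data', 'QDMR-high-level')
# Or load the full QDMR configuration instead:
# dataset = load_dataset('break_data', 'QDMR')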

Break is referenced here and can be browsed online as part of a simple viewer.
More details on the options and usage for this library can be found on the nlp repository at https://github.com/huggingface/nlp.