
LOFT: A 1 Million+ Token Long-Context Benchmark

This repository houses the resources for LOFT, the Long Context Frontiers benchmark, introduced in the research paper Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?. LOFT consists of 6 long-context task categories spanning retrieval, multi-hop compositional reasoning, and more, totaling 30+ datasets and 4 modalities.

We've provided links to download many of the text datasets in LOFT, evaluation code, and code to regenerate some of the datasets that we do not fully release. We also provide an example prompt in PROMPT_EXAMPLE.txt showing how Corpus-in-Context (CiC) prompting can be done for the text retrieval task.
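As a rough illustration of the idea (not the actual format in PROMPT_EXAMPLE.txt), CiC prompting for text retrieval places the entire corpus directly in the prompt and asks the model to return the ID of the relevant document. The instruction wording and ID formatting below are hypothetical:

```python
# Sketch of Corpus-in-Context (CiC) prompting for text retrieval.
# See PROMPT_EXAMPLE.txt for the real prompt format; the field names
# and instructions here are illustrative only.

def build_cic_prompt(corpus, query):
    """Put the whole corpus in the prompt, then ask for the relevant doc ID."""
    lines = [
        "You will be given a corpus of documents and a query.",
        "Answer with the ID of the most relevant document.",
        "",
        "Corpus:",
    ]
    for doc_id, text in corpus:
        lines.append(f"ID: {doc_id} | TEXT: {text}")
    lines += ["", f"Query: {query}", "Answer:"]
    return "\n".join(lines)

corpus = [
    ("doc0", "The Eiffel Tower is in Paris."),
    ("doc1", "The Colosseum is in Rome."),
]
prompt = build_cic_prompt(corpus, "Which city has the Colosseum?")
```

In a long-context setting the corpus portion of this prompt can grow to a million or more tokens, which is exactly the regime LOFT is designed to stress.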

Install the dependencies in requirements.txt to use this repository.

Future Releases

  • Multi-modal data and evaluation code.
  • Task-specific prompts.

Releases

  • [6/29/24]: Release of the evaluation code for text tasks and code to regenerate some of the LOFT datasets.
  • [6/20/24]: Initial release with links to download many of the LOFT text datasets.

Dataset Creation via Infilling

For many of the datasets, we release the complete set of queries and corpus used in the LOFT paper via the links in the Datasets table. For a small subset, we require the user to first download the dataset using the links in the Datasets table, then run preprocess.py, which downloads the original dataset and infills the missing fields in the queries and corpus files. Datasets that require infilling have a ✅ in the Infilling Needed? column.

For example, FIQA for text retrieval requires infilling. To infill the FIQA dataset, first download the ZIP file and unzip. Then run:

python preprocess.py \
  --input_dir path/to/unzipped/fiqa \
  --dataset fiqa

Evaluation

To evaluate predictions:

python run_evaluation.py \
  --answer_file_path path/to/queries.jsonl \
  --pred_file_path path/to/preds.jsonl \
  --task_type <task_type>

We provide example queries and predictions files in evaluation/example_predictions/. For example, to run evaluation on the RAG Natural Questions example predictions:

python run_evaluation.py \
  --answer_file_path evaluation/example_predictions/rag_nq/queries.jsonl \
  --pred_file_path evaluation/example_predictions/rag_nq/preds.jsonl \
  --task_type rag

The task_type values are defined in evaluation/__init__.py. Each task_type outputs several metric scores. To see which task_type to use for each dataset, as well as the primary evaluation metric reported in the paper, consult the Datasets table.
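For intuition about the primary RAG metric, subspan_em checks whether a gold answer appears as a span of the (normalized) model prediction. The exact implementation lives in the evaluation code; the normalization below is a simplified sketch in the style of common QA metrics:

```python
import re
import string

# Simplified sketch of subspan exact match; the actual metric is
# implemented in the evaluation code and may differ in details.

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def subspan_em(prediction, gold_answers):
    """Return 1.0 if any gold answer is a subspan of the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(gold) in pred for gold in gold_answers))
```

For example, a prediction of "The capital is Madrid, Spain." scores 1.0 against the gold answer "Spain" under this sketch, since the normalized gold string occurs inside the normalized prediction.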

Evaluation expects a prediction file in JSON Lines format, where each line has the following structure:

{"qid": "test103", "num_turns": 1, "model_outputs": [["Spain"]]}

  • qid: QID of the prediction corresponding to an entry in the queries file.
  • num_turns: Number of turns for the QID. This is 1 except for multi-turn datasets (TopiOCQA and SParC).
  • model_outputs: The model predictions extracted as a list. We leave it to the user of LOFT to extract the model predictions into the right structure.

The required structure of the model_outputs field differs slightly for each task_type. See evaluation/example_predictions/ to understand how to format the predictions file.
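As an illustration, a predictions file with this structure can be produced with a few lines of Python. The QIDs and answers below are made up:

```python
import json

# Illustrative only: write a predictions file in the structure the
# evaluation script expects. These QIDs and answers are fabricated.
predictions = [
    {"qid": "test103", "num_turns": 1, "model_outputs": [["Spain"]]},
    {"qid": "test104", "num_turns": 1, "model_outputs": [["Paris", "Lyon"]]},
]

with open("preds.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")

# Reading back: one JSON object per line.
with open("preds.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Note that model_outputs is a list of lists: the outer list has one entry per turn, and each inner list holds that turn's extracted answers (one answer for most tasks, several for multi-target tasks).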

Datasets

| Task | Dataset | Description | Task Type | Primary Metric | Infilling Needed? | Download |
|---|---|---|---|---|---|---|
| Text Retrieval | ArguAna | Argument Retrieval | retrieval | recall@1 | - | Link |
| Text Retrieval | FEVER | Fact Checking | retrieval | recall@1 | - | Link |
| Text Retrieval | FIQA | Question Answering | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | MS MARCO | Web Search | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | NQ | Question Answering | retrieval | recall@1 | - | Link |
| Text Retrieval | Quora | Duplication Detection | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | SciFact | Citation Prediction | retrieval | recall@1 | - | Link |
| Text Retrieval | Touché-2020 | Argument Retrieval | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | TopiOCQA | Multi-turn QA | retrieval | recall@1 | ✅ | Coming Soon |
| Text Retrieval | HotPotQA | Multi-hop QA | retrieval | mrecall@2 | - | Link |
| Text Retrieval | MuSiQue | Multi-hop QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QAMPARI | Multi-target QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QUEST | Multi-target QA | retrieval | mrecall@3 | - | Link |
| Visual Retrieval | Flickr30k | Image Retrieval | - | - | ✅ | Coming Soon |
| Visual Retrieval | MS COCO | Image Retrieval | - | - | ✅ | Coming Soon |
| Visual Retrieval | OVEN | Image-text Retrieval | - | - | - | Coming Soon |
| Visual Retrieval | MSR-VTT | Video Retrieval | - | - | ✅ | Coming Soon |
| Audio Retrieval | FLEURS-en | Audio Retrieval | - | - | - | Coming Soon |
| Audio Retrieval | FLEURS-es | Audio Retrieval | - | - | - | Coming Soon |
| Audio Retrieval | FLEURS-fr | Audio Retrieval | - | - | - | Coming Soon |
| Audio Retrieval | FLEURS-hi | Audio Retrieval | - | - | - | Coming Soon |
| Audio Retrieval | FLEURS-zh | Audio Retrieval | - | - | - | Coming Soon |
| RAG | NQ | Question Answering | rag | subspan_em | - | Link |
| RAG | TopiOCQA | Multi-turn QA | rag | subspan_em | ✅ | Coming Soon |
| RAG | HotPotQA | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | MuSiQue | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | QAMPARI | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| RAG | QUEST | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| SQL | Spider | Single-turn SQL | sql | exec_acc | - | Link |
| SQL | SParC | Multi-turn SQL | sql | exec_acc | - | Link |
| Many-Shot ICL | BBH-date | Multiple-choice QA | - | - | - | Coming Soon |
| Many-Shot ICL | BBH-salient | Multiple-choice QA | - | - | - | Coming Soon |
| Many-Shot ICL | BBH-tracking7 | Multiple-choice QA | - | - | - | Coming Soon |
| Many-Shot ICL | BBH-web | Multiple-choice QA | - | - | - | Coming Soon |
| Many-Shot ICL | LIB-dialogue | Classification | - | - | ✅ | Coming Soon |

Citing this work

@article{Lee2024LongContext,
  title={Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?},
  author={Jinhyuk Lee and Anthony Chen and Zhuyun Dai and Dheeru Dua and Devendra Singh Sachan and Michael Boratko and Yi Luan and Sébastien M. R. Arnold and Vincent Perot and Siddharth Dalmia and Hexiang Hu and Xudong Lin and Panupong Pasupat and Aida Amini and Jeremy R. Cole and Sebastian Riedel and Iftekhar Naim and Ming-Wei Chang and Kelvin Guu},
  journal={ArXiv},
  year={2024},
  volume={abs/2406.13121},
  url={https://arxiv.org/abs/2406.13121}
}

License and disclaimer

Copyright 2024 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Individual tasks may be subject to copyright and licensing from their respective owners; please see the individual download files for details.

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.
