Scripts and data links for M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, and Luke Zettlemoyer.
Update: The data is currently hosted on HuggingFace at https://huggingface.co/datasets/machelreid/m2d2!
To load the dataset use the following steps:
pip install --upgrade datasets
import datasets
dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL") # replace cs.CL with the domain of your choice
print(dataset['train'][0]['text'])
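Some domains are large; as a minimal sketch (not from the repo), you can stream a split with the standard `datasets` streaming API instead of downloading it in full:

```python
import datasets

# Stream the training split of one domain without downloading it entirely.
stream = datasets.load_dataset("machelreid/m2d2", "cs.CL", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:200])  # peek at the first few documents
    if i >= 2:
        break
```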
We're currently exploring ways to host this large amount of data online in an accessible manner, so please stay tuned! If you would like access sooner, feel free to reach out at {machelreid}-{at}-{google-dot-com}.
Feel free to download the test sets for all domains at this Google Drive link, or via gdown:
#!/bin/bash
# install and/or upgrade gdown with pip
pip install --upgrade gdown
# Download M2D2 test sets
gdown "1U5wki_V-IFQy733HC6NO5ZuM2jaOaw8y"
tar -xvzf m2d2_test_sets.tar.gz
# File structure
# m2d2_test_sets/
# ├─ DOMAIN_AA/
# │ ├─ test.txt
# ├─ DOMAIN_AB/
# │ ├─ test.txt
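Once extracted, the test sets can be iterated over directly. The snippet below is an illustrative sketch (not part of the repo) that walks the directory structure above and counts lines per domain:

```python
from pathlib import Path

# Walk m2d2_test_sets/ and report the number of lines in each domain's test.txt.
root = Path("m2d2_test_sets")
for domain_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    test_file = domain_dir / "test.txt"
    if test_file.is_file():
        num_lines = sum(1 for _ in test_file.open(encoding="utf-8"))
        print(f"{domain_dir.name}: {num_lines} lines")
```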
Find scripts for finetuning language models in lm_scripts/adapt.sh. We also provide a meta-script, lm_scripts/generate_multiple.sh, for generating finetuning scripts across multiple domains given an input file listing directories of domain-specific data (each directory should contain a train.txt and a valid.txt). Respective instructions/parameters are included in each file.
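For illustration only, the sketch below mirrors what such a meta-script might do: read a file listing domain directories, check that each contains train.txt and valid.txt, and emit one adapt.sh invocation per domain. The file names domain_dirs.txt and run_all_domains.sh, and the adapt.sh argument order, are assumptions; see lm_scripts/generate_multiple.sh for the actual logic and parameters.

```python
from pathlib import Path

domain_list = Path("domain_dirs.txt")  # hypothetical: one domain directory per line
commands = []
for line in domain_list.read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if not line:
        continue
    domain_dir = Path(line)
    # Each domain directory is expected to contain train.txt and valid.txt.
    if (domain_dir / "train.txt").exists() and (domain_dir / "valid.txt").exists():
        commands.append(f"bash lm_scripts/adapt.sh {domain_dir}")  # assumed argument order

# Write a single driver script that finetunes on every listed domain.
Path("run_all_domains.sh").write_text("\n".join(commands) + "\n", encoding="utf-8")
```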
For validation on multiple files, we also include lm_scripts/validate_on_multiple_files.py, which computes perplexity given a file listing evaluation text files and a model checkpoint.
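As a rough sketch of the same idea (this is not the repo's script, and it assumes a Hugging Face causal LM checkpoint rather than whatever format the released scripts expect), perplexity over a list of evaluation files can be computed like this:

```python
import math
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # replace with your finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).eval()

def perplexity(path: str, block_size: int = 1024) -> float:
    # Tokenize the whole file, then score it in fixed-size blocks.
    ids = tokenizer(Path(path).read_text(encoding="utf-8"), return_tensors="pt").input_ids[0]
    nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.numel(), block_size):
            chunk = ids[start : start + block_size].unsqueeze(0)
            if chunk.numel() < 2:
                break
            out = model(chunk, labels=chunk)   # model shifts labels internally
            n = chunk.numel() - 1              # number of predicted tokens
            nll += out.loss.item() * n
            n_tokens += n
    return math.exp(nll / n_tokens)

# eval_files.txt is a hypothetical file listing one evaluation text file per line.
for eval_file in Path("eval_files.txt").read_text(encoding="utf-8").splitlines():
    if eval_file.strip():
        print(eval_file.strip(), perplexity(eval_file.strip()))
```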
For Wikipedia data collection, we include scripts for data dump processing (data_scripts/wiki/get_data), ontology gathering (data_scripts/wiki/ontology), and generating splits (data_scripts/wiki/split_generation).
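As a purely illustrative sketch of split generation (the actual implementation lives in data_scripts/wiki/split_generation; the input file name, seed, and 90/5/5 ratio here are assumptions), one could shuffle documents and write train/valid/test files like this:

```python
import random
from pathlib import Path

# documents.txt is a hypothetical input with one document per line.
docs = Path("documents.txt").read_text(encoding="utf-8").splitlines()
random.Random(0).shuffle(docs)  # fixed seed for reproducibility

n = len(docs)
splits = {
    "train.txt": docs[: int(0.9 * n)],
    "valid.txt": docs[int(0.9 * n) : int(0.95 * n)],
    "test.txt": docs[int(0.95 * n) :],
}
for name, lines in splits.items():
    Path(name).write_text("\n".join(lines) + "\n", encoding="utf-8")
```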
To be uploaded with documentation