# Dataset

Two datasets (SPNLG and Wiki) can be downloaded from https://drive.google.com/drive/folders/1FsNlFh2aUbuBl45zEjgvAXDkp_e4hQmV?usp=sharing

## Statistics

|       | Train (Paired) | Train (Raw) | Valid (Paired) | Valid (Raw) | Test (Paired) |
|-------|----------------|-------------|----------------|-------------|---------------|
| SPNLG | 14k            | 150k        | 21k            | /           | 21k           |
| Wiki  | 84k            | 842k        | 73k            | 43k         | 73k           |

## How did we get the datasets?

- SPNLG
  - The dataset comes from the sentence-planning-NLG dataset, which describes restaurant information and consists of 3 CSV files.
  - We aggregate all 3 CSV files and split them into train:valid:test = 8:1:1, with paired:raw = 1:10 within the training set.
- Wiki
  - The dataset is constructed from both the Wiki-Bio dataset and the Wikipedia Person and Animal dataset.
  - We use the same valid and test sets as Wiki-Bio.
  - For the training set, we randomly select 84k samples from Wiki-Bio-train as paired data. We use the remaining sentences in Wiki-Bio-train and the person descriptions from the Wikipedia Person and Animal dataset as raw data (842k in total).
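The SPNLG split described above (train:valid:test = 8:1:1, then paired:raw = 1:10 inside the training set) can be sketched as follows. This is a minimal illustration, not the repository's actual script; the function name and the in-memory list of rows are assumptions.

```python
import random

def split_dataset(rows, seed=0):
    """Hypothetical sketch: shuffle aggregated rows, hold out 10% each
    for valid and test, then mark 1 of every 11 remaining training rows
    as 'paired' so that paired:raw = 1:10."""
    rng = random.Random(seed)
    rows = rows[:]          # avoid mutating the caller's list
    rng.shuffle(rows)
    n = len(rows)
    n_valid = n // 10       # 10% valid
    n_test = n // 10        # 10% test
    valid = rows[:n_valid]
    test = rows[n_valid:n_valid + n_test]
    train = rows[n_valid + n_test:]
    n_paired = len(train) // 11   # paired:raw = 1:10 within train
    return {
        "train_paired": train[:n_paired],
        "train_raw": train[n_paired:],
        "valid": valid,
        "test": test,
    }
```

For 1,100 aggregated rows this yields 110 valid, 110 test, and a training set of 80 paired plus 800 raw rows, matching the 1:10 paired-to-raw ratio.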

Related links: