joshua-decoder/fisher-callhome-corpus

The Fisher and CALLHOME Spanish--English Speech Translation Corpus

The Fisher and CALLHOME Spanish--English Speech Translation Corpus contains English reference translations and speech recognizer output (in various forms) that complement the LDC Fisher and CALLHOME Spanish audio and transcripts. Together, they make a four-way parallel dataset whose goal is to further research in Spanish--English speech translation.

For a complete description of this corpus, see the following paper, and please cite it if you use the corpus in your own published research. A copy can be found in the doc/ directory.

@inproceedings{post2013improved,
  Title = {Improved Speech-to-Text Translation with the {F}isher and {C}allhome {S}panish--{E}nglish Speech Translation Corpus},
  Author = {Post, Matt and Kumar, Gaurav and Lopez, Adam and Karakos, Damianos and Callison-Burch, Chris and Khudanpur, Sanjeev},
  Booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
  Year = {2013},
  Address = {Heidelberg, Germany},
  Month = {December}
}

The mapping/ directory contains files corresponding to our data splits. Each line in these files contains a reference to an LDC transcript file and the line numbers within it.

The corpus/ directory houses the various pieces of the corpus. Each subdirectory contains (a) a single Spanish side and (b) either one (for Fisher training and all CALLHOME data) or four (for Fisher test sets) English references. The Spanish side always has the extension ".es" and is one of (a) the LDC transcript, (b) Kaldi ASR output, (c) Kaldi lattice output, or (d) lattice oracle paths.

Due to licensing restrictions, we cannot include the LDC Spanish transcripts with this dataset. We have, however, provided scripts that will construct our data splits. To build these, first define the environment variables $LDC2010T04 and $LDC96T17 to point to your LDC2010T04 and LDC96T17 installations, respectively. Then run:

make
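Because the build depends on both variables, a quick pre-flight check can catch a missing or mistyped path before make runs. The sketch below is not part of the release: the function name check_ldc_env is hypothetical, and the example paths in the comments are placeholders for wherever your LDC data lives.

```shell
# Hypothetical helper (not shipped with the corpus): verify that the two
# LDC variables named in this README are set and point at directories.
#
# Example setup, with placeholder paths:
#   export LDC2010T04=/path/to/LDC2010T04
#   export LDC96T17=/path/to/LDC96T17
check_ldc_env() {
    rc=0
    for var in LDC2010T04 LDC96T17; do
        # Look up the value of the variable whose name is stored in $var.
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set" >&2
            rc=1
        elif [ ! -d "$dir" ]; then
            echo "$var does not point to a directory: $dir" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With the function defined in your shell, `check_ldc_env && make` starts the build only when both variables resolve to existing directories.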

You can also run the two scripts directly by manually listing the directories, e.g.,

./bin/build_fisher.sh $LDC2010T04
./bin/build_callhome.sh $LDC96T17

Either way, you should end up with the following generated files, in addition to the files included with the data release:

corpus/ldc/fisher_train.es
corpus/ldc/fisher_dev.es
corpus/ldc/fisher_dev2.es
corpus/ldc/fisher_test.es
corpus/ldc/callhome_train.es
corpus/ldc/callhome_devtest.es
corpus/ldc/callhome_evltest.es
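
To confirm that the build produced everything listed above, a small check along the following lines may be useful. It is only a sketch: the function name check_generated_files is hypothetical, and treating an empty file as a failure is an assumption rather than something the build scripts promise.

```shell
# Hypothetical helper (not shipped with the corpus): report any of the
# seven generated Spanish files that is missing or empty.
# Usage: check_generated_files [ROOT]   (ROOT defaults to the current directory)
check_generated_files() {
    root="${1:-.}"
    missing=0
    for name in fisher_train fisher_dev fisher_dev2 fisher_test \
                callhome_train callhome_devtest callhome_evltest; do
        file="$root/corpus/ldc/$name.es"
        # -s is true only for a file that exists and has nonzero size.
        if [ ! -s "$file" ]; then
            echo "missing or empty: $file"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of problem files (0 means all present).
    return "$missing"
}
```

Run from the top of the repository, `check_generated_files` prints nothing and returns 0 when all seven files are in place.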
