wammar/transliterator

scripts for training a transliterator using a list of transliteration pairs.

dependencies:

configurations:

an example configuration file, ruen-config.tape, is provided. The following variables are mandatory:

  • ducttape_output output directory
  • transliterator_home root of the transliterator's repository
  • all_oovs source-language words which need to be transliterated (e.g. a test set)
  • char_lm kenlm-compiled language model of target language characters. An English character language model is provided
  • transliteration_pairs src-tgt transliterations, one per line, formatted as SOURCE LANGUAGE ||| CEURSE LAUNJE
  • m2m_maxX maximum length of a source-language character sequence which corresponds to one character in the target language
  • m2m_maxY maximum length of a target-language character sequence which corresponds to one character in the source language
  • nprocs number of processors to use for training
  • wammar_utils_dir root of this repository
  • m2m_aligner path to m2m aligner
  • cdec_dir path to cdec decoder
  • DelX: yes means that some characters in the source language may be deleted
  • DelY: yes means that some characters in the target language may be deleted
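
The transliteration_pairs format described above can be checked before training with a small script. The following is a minimal sketch, not part of this repository: the function name `parse_pairs` is hypothetical, and it assumes that blank lines may be skipped and that each non-blank line contains exactly one `|||` separator.

```python
from typing import Iterable, Iterator, Tuple

SEPARATOR = "|||"


def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) tuples from 'SOURCE ||| TARGET' lines.

    Hypothetical helper: blank lines are skipped (an assumption), and any
    line without exactly one '|||' separator raises ValueError.
    """
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no pair
        parts = line.split(SEPARATOR)
        if len(parts) != 2:
            raise ValueError(
                f"line {line_number}: expected exactly one '|||' separator: {raw!r}"
            )
        source, target = (part.strip() for part in parts)
        if not source or not target:
            raise ValueError(f"line {line_number}: empty source or target: {raw!r}")
        yield source, target


if __name__ == "__main__":
    # The sample line is the one given in the variable list above.
    sample = ["SOURCE LANGUAGE ||| CEURSE LAUNJE\n"]
    for source, target in parse_pairs(sample):
        print(f"{source!r} -> {target!r}")
```

To check a real file, pass an open file handle, e.g. `list(parse_pairs(open(path, encoding="utf-8")))`; a malformed line is reported with its line number.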

example usage:

ducttape translit.tape -C ruen-config.tape -p Full -y

todos:

  • use mpi_adagrad_optimize instead of mpi_flex_optimize
  • rewrite convert-alignments-to-cdec-format.py

disclaimer:

scripts are still under development and may be unstable. please do contact me if anything does not work.

if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf