Make force alignment accessible from pocketsphinx_batch and the ps_decoder API #144

Merged
merged 1 commit into from
Sep 9, 2018

dhdaines
Contributor

@dhdaines dhdaines commented Sep 9, 2018

This provides a simple (maybe too simple) API for force alignment, plus a command-line interface for it via pocketsphinx_batch. It works like any other kind of search:

ps_set_align(decoder, name, text); /* name is whatever you like; text is the transcription */
ps_set_search(decoder, name);
/* now decode as usual, and get the alignment using ps_seg_iter() */

pocketsphinx_batch gains three options, -alignctl, -aligndir and -alignext. Respectively, they give the control file listing the transcription files (one file per utterance), the directory containing those files, and their file extension.

The transcription is expected to be whitespace-separated tokens. The <s> and </s> tokens are added for you, which may or may not be the right thing to do (perhaps we should add them only if they aren't already present).
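The "add them only if they aren't present" alternative could look roughly like the sketch below. It is an illustration only: wrap_transcript is a hypothetical helper, not a function in pocketsphinx, and it uses nothing beyond the C standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WS " \t\r\n"

/* Hypothetical helper, not part of pocketsphinx: copy `text` into `out`,
 * adding "<s>" and "</s>" only where they are missing.
 * Returns the length written, or -1 if `out` is too small. */
int wrap_transcript(const char *text, char *out, size_t outsz)
{
    assert(text != NULL && out != NULL);

    /* Trim surrounding whitespace so the checks below see the tokens. */
    text += strspn(text, WS);
    size_t len = strlen(text);
    while (len > 0 && strchr(WS, text[len - 1]) != NULL)
        len--;

    /* A marker only counts if it is a whole token, not a prefix or suffix. */
    int has_start = len >= 3 && strncmp(text, "<s>", 3) == 0
        && (len == 3 || strchr(WS, text[3]) != NULL);
    int has_end = len >= 4 && strncmp(text + len - 4, "</s>", 4) == 0
        && (len == 4 || strchr(WS, text[len - 5]) != NULL);

    int n = snprintf(out, outsz, "%s%.*s%s",
                     has_start ? "" : "<s> ", (int)len, text,
                     has_end ? "" : " </s>");
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}
```

With this, both "hello world" and "<s> hello world </s>" come out as "<s> hello world </s>", so a transcript that already carries the markers is not wrapped a second time.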

This only produces word alignments, even though the search is capable of more, because that is all the ps_seg_iter interface exposes. We should probably fix that. In the near term I will add TextGrid output to the batch interface so we can get the phone segmentation that way, and also be drop-in compatible with the Montreal Forced Aligner.
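For a sense of what TextGrid output involves, here is a minimal sketch that writes a single word tier in Praat's long text format. It is not the implementation planned for pocketsphinx_batch. The word_seg_t record and both functions are hypothetical, the frame rate is passed in by the caller (100 frames per second is the usual default), and the end frame is assumed to be inclusive, which is my understanding of what ps_seg_frames() reports.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one aligned word. Frame indices are assumed
 * inclusive, as I believe ps_seg_frames() reports them. */
typedef struct {
    const char *word;
    int start_frame;
    int end_frame;
} word_seg_t;

/* Time in seconds at which frame `frame` starts, at `frate` frames/second. */
double frame_to_sec(int frame, int frate)
{
    return (double)frame / frate;
}

/* Write one "words" interval tier in Praat's long TextGrid text format.
 * Assumes segments are contiguous and start at frame 0.
 * Returns 0 on success, -1 on a write error. */
int write_textgrid(FILE *fh, const word_seg_t *segs, int n, int frate)
{
    assert(fh != NULL && frate > 0 && (n == 0 || segs != NULL));
    double xmax = n > 0 ? frame_to_sec(segs[n - 1].end_frame + 1, frate) : 0.0;

    fprintf(fh, "File type = \"ooTextFile\"\nObject class = \"TextGrid\"\n\n");
    fprintf(fh, "xmin = 0\nxmax = %.3f\ntiers? <exists>\nsize = 1\nitem []:\n", xmax);
    fprintf(fh, "    item [1]:\n        class = \"IntervalTier\"\n"
                "        name = \"words\"\n        xmin = 0\n        xmax = %.3f\n"
                "        intervals: size = %d\n", xmax, n);
    for (int i = 0; i < n; i++) {
        fprintf(fh, "        intervals [%d]:\n            xmin = %.3f\n"
                    "            xmax = %.3f\n            text = \"",
                i + 1, frame_to_sec(segs[i].start_frame, frate),
                frame_to_sec(segs[i].end_frame + 1, frate));
        /* Praat escapes a double quote inside a string by doubling it. */
        for (const char *c = segs[i].word; *c; c++) {
            if (*c == '"')
                fputc('"', fh);
            fputc(*c, fh);
        }
        fprintf(fh, "\"\n");
    }
    return ferror(fh) ? -1 : 0;
}
```

A phone tier would be a second item of the same shape, which is why this format is a natural way around the word-only limit of the segment iterator.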

…imal command line interface in pocketsphinx_batch which allows you to do force alignment of transcripts. Test it out on test/data/librivox, also there is a unit test.
@nshmyrev nshmyrev merged commit f54f6c3 into cmusphinx:master Sep 9, 2018
@nshmyrev
Contributor

nshmyrev commented Sep 9, 2018

Welcome back, David!

@dhdaines
Contributor Author

dhdaines commented Sep 10, 2018 via email

@dhdaines
Contributor Author

Hi! I realized that this branch doesn't do quite what a user would expect from force alignment. The issue is that the state_align search wasn't designed for force alignment: it just aligns a state sequence to a feature sequence. I originally wrote it for two purposes:

  • To obtain exact phone alignments from ASR output
  • To collect state occupation counts for speaker adaptation

So there are some things that force alignment should do which it doesn't, specifically:

  • It won't insert optional silences between words or at the start or end of the utterance
  • It won't choose between alternate pronunciations

On the other hand, I have gotten very good results using FSG search for force alignment at the word level, because it is already equipped to handle optional silences and alternate pronunciations.
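As a rough illustration of the FSG approach, the sketch below turns a transcript into a linear grammar: one state per word boundary and one transition per word. build_linear_fsg is a hypothetical helper, and the output follows the sphinxbase FSG text format as I understand it (FSG_BEGIN, NUM_STATES, START_STATE, FINAL_STATE, TRANSITION, FSG_END). No silence or alternate-pronunciation arcs are written, on the assumption that FSG search supplies those itself, as described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WS " \t\r\n"

/* Hypothetical helper: format a linear FSG that accepts exactly the
 * whitespace-separated words of `text`, in order, into `out`.
 * Returns the number of words, or -1 if `out` is too small. */
int build_linear_fsg(const char *name, const char *text, char *out, size_t outsz)
{
    assert(name != NULL && text != NULL && out != NULL);

    /* First pass: count the words so the header can state the state count. */
    int nwords = 0;
    for (const char *p = text + strspn(text, WS); *p; p += strspn(p, WS)) {
        p += strcspn(p, WS);
        nwords++;
    }

    int n = snprintf(out, outsz,
                     "FSG_BEGIN %s\nNUM_STATES %d\nSTART_STATE 0\nFINAL_STATE %d\n",
                     name, nwords + 1, nwords);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    size_t used = (size_t)n;

    /* Second pass: word i is the only arc from state i to state i + 1. */
    const char *p = text + strspn(text, WS);
    for (int i = 0; i < nwords; i++) {
        size_t wlen = strcspn(p, WS);
        n = snprintf(out + used, outsz - used, "TRANSITION %d %d 1.0 %.*s\n",
                     i, i + 1, (int)wlen, p);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
        p += wlen;
        p += strspn(p, WS);
    }

    n = snprintf(out + used, outsz - used, "FSG_END\n");
    if (n < 0 || (size_t)n >= outsz - used)
        return -1;
    return nwords;
}
```

For the transcript "hello world" this yields three states and two transitions. Because the grammar admits only that one word sequence, decoding against it amounts to aligning the transcript.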

The other issue is that VAD and noise removal must be turned off for force alignment, because otherwise the output timestamps won't necessarily correspond to the input.
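A toy model makes the timestamp problem concrete. The sketch below is deliberately simplified and is not how sphinxbase implements VAD: it assumes frames flagged as non-speech are simply dropped before the search sees them. kept_frame_map is a hypothetical helper that records which input frame each surviving frame came from.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model of silence removal: out_to_in[k] receives the input frame index
 * of the k-th frame that survives. Returns the number of surviving frames.
 * `out_to_in` must have room for `n_in` entries. */
int kept_frame_map(const int *is_speech, int n_in, int *out_to_in)
{
    assert(is_speech != NULL && out_to_in != NULL);
    int k = 0;
    for (int i = 0; i < n_in; i++) {
        if (is_speech[i])
            out_to_in[k++] = i;
    }
    return k;
}
```

If the input frames are flagged {0, 0, 0, 1, 1, 0, 0, 1}, the search sees three frames, and its frames 0, 1 and 2 correspond to input frames 3, 4 and 7. A word boundary reported at frame 2 is therefore not at frame 2 of the original audio, and the offset grows with every stretch of removed silence.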

Since the state alignment search isn't useful on its own, I would like to change the "alignment" interface in pocketsphinx_batch to do traditional force alignment with FSG search. In addition, I would probably add something to force noise removal and VAD off if -adcin is enabled, because this behaviour is very unexpected for the user in this case (even if it is super useful for ASR).
