Allow token list as CTC decoder input #2112
Conversation
if tokens == None:
    tokens = get_asset_path("decoder/tokens.txt")
is this branch necessary?
this just sets the default behavior when the tokens parameter is not passed in to this function (as in the test_shape test), without using a function call in the function header
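For context on why the branch avoids a call in the header: Python evaluates default argument expressions once, at function definition time, so a `None` sentinel defers the call until it is actually needed. A minimal sketch, with a hypothetical `default_tokens_path` helper standing in for `get_asset_path`:

```python
calls = []

def default_tokens_path():
    # hypothetical stand-in for get_asset_path("decoder/tokens.txt")
    calls.append(1)
    return "decoder/tokens.txt"

# A default expression in the header runs once, when the def executes:
def load_eager(tokens=default_tokens_path()):
    return tokens

# The None sentinel pattern defers the helper call to each invocation
# that actually needs the default:
def load_lazy(tokens=None):
    if tokens is None:
        tokens = default_tokens_path()
    return tokens
```

Defining `load_eager` already invokes the helper; `load_lazy` only calls it when no tokens are supplied.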
@@ -177,7 +177,7 @@ def kenlm_lexicon_decoder(

 Args:
     lexicon (str): lexicon file containing the possible words
-    tokens (str): file containing valid tokens
+    tokens (str or List[str]): file or list containing valid tokens
not in the scope of the pr, but is there some reference that we can cite to help users understand how tokens files should be formatted?
good point, this is somewhat shown in the tutorial but could definitely be improved. I'm thinking either a README section (with docstrings linking to it), directly in the docstrings, or in more detail in a tutorial, if you have any thoughts?
I think we can get rid of loading from the file path. `List[str]` is simple to construct; users can format data that way after loading it from a file in a different format. Say, it's easy to construct from a lexicon.
I am looking at the changes to the tests, but keeping the path-like type increases the test cost. Specifically, I think test_construct_decoder is getting redundant, yet is not enough without test_index_to_tokens.
there is a case from the flashlight README that cannot be done using `List[str]`: they mention "If two tokens are on the same line in the tokens file, they are mapped to the same index for training/decoding". I'm not sure of specific use cases for this, but it may be good to keep the option open?
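For context, the flashlight convention quoted above can be parsed into a token-to-index map in a few lines, with tokens on the same line sharing an index. A rough sketch of the format (not the PR's actual code):

```python
def parse_tokens_file(lines):
    """Map each token to its line index.

    Tokens appearing on the same line share an index, per the
    flashlight README convention quoted above.
    """
    token_to_index = {}
    for idx, line in enumerate(lines):
        for tok in line.split():
            token_to_index[tok] = idx
    return token_to_index
```

The `lines` argument can be an open file handle or any iterable of strings, so the same helper works for both a tokens file and in-memory data.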
Huh🤔 so in the case of ASR, this is to allow one phoneme to be mapped to different tokens.
Okay, then let's keep it.
> I'm thinking either in some README section (w/ docstrings linking to it), directly in the docstrings, or in more detail in a tutorial if you have any thoughts?
yeah that all seems reasonable. i think if the format is simple enough to describe concisely, it'd be ideal to include it directly in the docstrings for ease of access. otherwise, linking to a readme sounds good
re: the file loading — i guess if the file format is standardized and something users are accustomed to, it'd be helpful to allow for directly loading files of that format. otherwise, it seems like pulling out the file loading logic would yield a more minimalist/flexible solution. the one-index-to-many-tokens case wouldn't necessarily require file loading — we could account for it with a list of lists or dictionary input
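The list-of-lists idea mentioned above could be normalized alongside plain lists with something like the following (purely illustrative, not the API that landed in the PR):

```python
from typing import Dict, List, Union

def build_token_index(tokens: List[Union[str, List[str]]]) -> Dict[str, int]:
    """Build a token -> index map without file loading.

    A plain string entry gets its own index; all tokens in an inner
    list share one index, covering the one-index-to-many-tokens case.
    """
    token_to_index = {}
    for idx, entry in enumerate(tokens):
        group = [entry] if isinstance(entry, str) else entry
        for tok in group:
            token_to_index[tok] = idx
    return token_to_index
```

This keeps file parsing out of the decoder factory entirely while still supporting the grouped-token behavior of the flashlight tokens file.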
@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM
Summary: Additionally accept list of tokens as CTC decoder input. This makes it possible to directly pass in something like `bundles.get_labels()` into the decoder factory function instead of requiring a separate tokens file. Pull Request resolved: pytorch#2112 Reviewed By: hwangjeff, nateanl, mthrok Differential Revision: D33352909 Pulled By: carolineechen fbshipit-source-id: 6d22072e34f6cd7c6f931ce4eaf294ae4cf0c5cc
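The dual input type the PR adds can be handled by a small normalization step inside the factory function; a sketch of the idea, with a hypothetical helper name (not necessarily how torchaudio implements it):

```python
from typing import List, Union

def get_tokens(tokens: Union[str, List[str]]) -> List[str]:
    """Accept either a path to a tokens file (one token per line)
    or an already-constructed sequence such as bundle.get_labels()."""
    if isinstance(tokens, str):
        # treat a string as a file path and read one token per line
        with open(tokens) as f:
            return [line.strip() for line in f if line.strip()]
    return list(tokens)
```

With a normalizer like this, `kenlm_lexicon_decoder(lexicon, tokens=bundle.get_labels(), ...)` and `kenlm_lexicon_decoder(lexicon, tokens="tokens.txt", ...)` can share the same downstream code path.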