
Allow token list as CTC decoder input #2112

Closed

Conversation

carolineechen
Contributor

Additionally accept list of tokens as CTC decoder input. This makes it possible to directly pass in something like bundles.get_labels() into the decoder factory function instead of requiring a separate tokens file.

Comment on lines 19 to 20
if tokens == None:
tokens = get_asset_path("decoder/tokens.txt")
Contributor


is this branch necessary?

Contributor Author


this just sets the default behavior when the `tokens` parameter is not passed to this function (as in the `test_shape` test), without putting a function call in the function header
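The distinction matters because Python evaluates default-argument expressions exactly once, at function definition time, whereas the `None` sentinel defers the work until a call actually needs it. A small self-contained illustration (the names below are made up for the example; `default_tokens_path` stands in for `get_asset_path`):

```python
calls = {"count": 0}

def default_tokens_path():
    # Stand-in for a call like get_asset_path("decoder/tokens.txt")
    calls["count"] += 1
    return "decoder/tokens.txt"

# A call in the header runs once, when the function is *defined*:
def load_eager(tokens=default_tokens_path()):
    return tokens

# The None-sentinel pattern defers the call until it is needed:
def load_lazy(tokens=None):
    if tokens is None:
        tokens = default_tokens_path()
    return tokens
```

With the sentinel pattern, callers who pass their own `tokens` never pay for (or depend on) the default asset lookup.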

@@ -177,7 +177,7 @@ def kenlm_lexicon_decoder(

Args:
lexicon (str): lexicon file containing the possible words
tokens (str): file containing valid tokens
tokens (str or List[str]): file or list containing valid tokens
Contributor


not in the scope of this PR, but is there some reference we can cite to help users understand how tokens files should be formatted?

Contributor Author


good point; this is somewhat shown in the tutorial but could definitely be improved. I'm thinking either a README section (with docstrings linking to it), documenting it directly in the docstrings, or covering it in more detail in a tutorial, if you have any thoughts?

Collaborator

@mthrok Dec 29, 2021


I think we can get rid of loading from the file path.
List[str] is simple to construct; users can format their data that way after loading it from a file in a different format. Say, it's easy to construct from the lexicon.
I am looking at the changes to the tests, and keeping the path-like type increases the test cost.
Specifically, I think test_construct_decoder is getting redundant, yet it is not enough without test_index_to_tokens.

Contributor Author


there is a case from the flashlight README that can't be done using List[str], where they mention "If two tokens are on the same line in the tokens file, they are mapped to the same index for training/decoding." I'm not sure of specific use cases for this, but it may be good to keep the option open?
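For reference, the flashlight convention quoted above (tokens sharing a line share an index) can be captured in a few lines; `parse_tokens` is a hypothetical name for illustration, not part of the PR:

```python
def parse_tokens(lines):
    """Map each token to its line index; tokens appearing on the
    same line share an index (flashlight tokens-file convention)."""
    token_to_index = {}
    for idx, line in enumerate(lines):
        for tok in line.split():
            token_to_index[tok] = idx
    return token_to_index
```

A flat List[str] can only express a one-to-one token-to-index mapping, which is why the shared-index case needs either the file format or a richer in-memory structure.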

Collaborator


Huh🤔 so in the case of ASR, this is to allow one phoneme to be mapped to different tokens.

Okay, then let's keep it.

Contributor


> I'm thinking either in some README section (w/ docstrings linking to it), directly in the docstrings, or in more detail in a tutorial if you have any thoughts?

yeah, that all seems reasonable. I think if the format is simple enough to describe concisely, it'd be ideal to include it directly in the docstrings for ease of access; otherwise, linking to a README sounds good.

re: the file loading, I guess if the file format is standardized and something users are accustomed to, it'd be helpful to allow directly loading files of that format. Otherwise, pulling out the file-loading logic would yield a more minimalist, flexible solution. The one-index-to-many-tokens case wouldn't necessarily require file loading; we could account for it with a list-of-lists or dictionary input.
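As a sketch of that alternative (hypothetical helper, not part of this PR), a list-of-lists input can express the shared-index case without any file I/O:

```python
def tokens_to_index(tokens):
    """Build a token -> index map from List[str] or a mixed
    List[List[str]]; a nested list assigns all of its tokens
    the same index, mirroring the tokens-file convention."""
    mapping = {}
    for idx, entry in enumerate(tokens):
        group = [entry] if isinstance(entry, str) else entry
        for tok in group:
            mapping[tok] = idx
    return mapping
```

This keeps the factory function agnostic to where the tokens came from, while still supporting many-tokens-per-index.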

@facebook-github-bot
Contributor

@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Member

@nateanl left a comment


LGTM

xiaohui-zhang pushed a commit to xiaohui-zhang/audio that referenced this pull request May 4, 2022
Summary:
Additionally accept list of tokens as CTC decoder input. This makes it possible to directly pass in something like `bundles.get_labels()` into the decoder factory function instead of requiring a separate tokens file.

Pull Request resolved: pytorch#2112

Reviewed By: hwangjeff, nateanl, mthrok

Differential Revision: D33352909

Pulled By: carolineechen

fbshipit-source-id: 6d22072e34f6cd7c6f931ce4eaf294ae4cf0c5cc
mthrok pushed a commit to mthrok/audio that referenced this pull request Dec 13, 2022