Allow token list as CTC decoder input #2112
Conversation
if tokens == None:
    tokens = get_asset_path("decoder/tokens.txt")
is this branch necessary?
this just sets the default behavior when the tokens parameter is not passed in to this function (as in the test_shape test), without using a function call in the function header
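For context on why the branch avoids a call in the header: Python evaluates default argument expressions once, at function definition time, so a `None` sentinel defers the call until it is actually needed. A minimal sketch, with a hypothetical `default_tokens_path` helper standing in for `get_asset_path`:

```python
calls = []

def default_tokens_path():
    # hypothetical stand-in for get_asset_path("decoder/tokens.txt")
    calls.append(1)
    return "decoder/tokens.txt"

# A default expression in the header runs once, when the def executes:
def load_eager(tokens=default_tokens_path()):
    return tokens

# The None sentinel pattern defers the helper call to each invocation
# that actually needs the default:
def load_lazy(tokens=None):
    if tokens is None:
        tokens = default_tokens_path()
    return tokens
```

Defining `load_eager` already invokes the helper; `load_lazy` only calls it when no tokens are supplied.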
@@ -177,7 +177,7 @@ def kenlm_lexicon_decoder(

 Args:
     lexicon (str): lexicon file containing the possible words
-    tokens (str): file containing valid tokens
+    tokens (str or List[str]): file or list containing valid tokens
not in the scope of the pr, but is there some reference that we can cite to help users understand how tokens files should be formatted?
good point, this is somewhat shown in the tutorial but could definitely be improved. I'm thinking either a README section (with docstrings linking to it), directly in the docstrings, or in more detail in a tutorial, if you have any thoughts?
I think we can get rid of loading from the file path. `List[str]` is simple to construct; users can format data that way after loading it from a file in a different format. Say, it's easy to construct from a lexicon.
I am looking at the changes to the tests, but keeping the path-like type increases the test cost. Specifically, I think test_construct_decoder is getting redundant, yet is not enough without test_index_to_tokens.
there is a case from the flashlight README that cannot be done using `List[str]`: they mention "If two tokens are on the same line in the tokens file, they are mapped to the same index for training/decoding". I'm not sure of specific use cases for this, but it may be good to keep the option open?
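For context, the flashlight convention quoted above can be parsed into a token-to-index map in a few lines, with tokens on the same line sharing an index. A rough sketch of the format (not the PR's actual code):

```python
def parse_tokens_file(lines):
    """Map each token to its line index.

    Tokens appearing on the same line share an index, per the
    flashlight README convention quoted above.
    """
    token_to_index = {}
    for idx, line in enumerate(lines):
        for tok in line.split():
            token_to_index[tok] = idx
    return token_to_index
```

The `lines` argument can be an open file handle or any iterable of strings, so the same helper works for both a tokens file and in-memory data.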
Huh🤔 so in the case of ASR, this is to allow one phoneme to be mapped to different tokens.
Okay, then let's keep it.
> I'm thinking either in some README section (w/ docstrings linking to it), directly in the docstrings, or in more detail in a tutorial if you have any thoughts?
yeah that all seems reasonable. i think if the format is simple enough to describe concisely, it'd be ideal to include it directly in the docstrings for ease of access. otherwise, linking to a readme sounds good
re: the file loading — i guess if the file format is standardized and something users are accustomed to, it'd be helpful to allow for directly loading files of that format. otherwise, it seems like pulling out the file loading logic would yield a more minimalist/flexible solution. the one-index-to-many-tokens case wouldn't necessarily require file loading — we could account for it with a list of lists or dictionary input
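The list-of-lists idea mentioned above could be normalized alongside plain lists with something like the following (purely illustrative, not the API that landed in the PR):

```python
from typing import Dict, List, Union

def build_token_index(tokens: List[Union[str, List[str]]]) -> Dict[str, int]:
    """Build a token -> index map without file loading.

    A plain string entry gets its own index; all tokens in an inner
    list share one index, covering the one-index-to-many-tokens case.
    """
    token_to_index = {}
    for idx, entry in enumerate(tokens):
        group = [entry] if isinstance(entry, str) else entry
        for tok in group:
            token_to_index[tok] = idx
    return token_to_index
```

This keeps file parsing out of the decoder factory entirely while still supporting the grouped-token behavior of the flashlight tokens file.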
@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM
Summary: Additionally accept list of tokens as CTC decoder input. This makes it possible to directly pass in something like `bundles.get_labels()` into the decoder factory function instead of requiring a separate tokens file. Pull Request resolved: pytorch#2112 Reviewed By: hwangjeff, nateanl, mthrok Differential Revision: D33352909 Pulled By: carolineechen fbshipit-source-id: 6d22072e34f6cd7c6f931ce4eaf294ae4cf0c5cc
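The dual input type the PR adds can be handled by a small normalization step inside the factory function; a sketch of the idea, with a hypothetical helper name (not necessarily how torchaudio implements it):

```python
from typing import List, Union

def get_tokens(tokens: Union[str, List[str]]) -> List[str]:
    """Accept either a path to a tokens file (one token per line)
    or an already-constructed sequence such as bundle.get_labels()."""
    if isinstance(tokens, str):
        # treat a string as a file path and read one token per line
        with open(tokens) as f:
            return [line.strip() for line in f if line.strip()]
    return list(tokens)
```

With a normalizer like this, `kenlm_lexicon_decoder(lexicon, tokens=bundle.get_labels(), ...)` and `kenlm_lexicon_decoder(lexicon, tokens="tokens.txt", ...)` can share the same downstream code path.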