Support duplicated tokens in Vocabulary #87

Open
ThomasKluiters opened this issue May 29, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

ThomasKluiters commented May 29, 2024

Feature Description

Currently, the dictionary cannot handle duplicate entries. It would be useful if this were supported, possibly through a flag that lets users opt in to allowing duplicates.
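To make the request concrete, here is a minimal sketch of what such an opt-in flag could look like. The `Vocabulary` class, the `allow_duplicates` parameter, and the method names are all hypothetical and only illustrate the idea; they are not Flashlight's actual dictionary API.

```python
class Vocabulary:
    """Hypothetical token/index mapping; NOT Flashlight's real dictionary class."""

    def __init__(self, allow_duplicates=False):
        # Assumed opt-in flag: duplicates are rejected unless this is True.
        self.allow_duplicates = allow_duplicates
        self._tokens = []    # index -> token
        self._indices = {}   # token -> list of indices where it appears

    def add(self, token):
        """Append a token and return its index."""
        if token in self._indices and not self.allow_duplicates:
            raise ValueError(f"duplicate token: {token!r}")
        index = len(self._tokens)
        self._tokens.append(token)
        self._indices.setdefault(token, []).append(index)
        return index

    def indices_of(self, token):
        """Return every index the token was registered under."""
        return list(self._indices.get(token, []))

    def token_at(self, index):
        """Return the token stored at the given index."""
        return self._tokens[index]
```

With `allow_duplicates=True`, adding "Is" twice would yield two distinct indices that both map back to the same token string, while the default keeps the current strict behaviour.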

Use Case

When using code-switched tokenizers (like the 'AggregateTokenizer' in NeMo), the same token may appear twice in the vocabulary. For example, "Is" in Dutch and "Is" in English. In general, we observe better word error rates with code-switched (aggregate) tokenizers than with single tokenizers.
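As a rough illustration of how duplicates arise, the sketch below assumes the aggregate vocabulary is built by concatenating per-language token lists. That layout is an assumption made for the example, not a description of NeMo's implementation, and the function names are hypothetical.

```python
def aggregate_vocabularies(vocabs):
    """Concatenate per-language token lists into one aggregate list.

    `vocabs` maps a language code to its token list. The result is a list of
    (language, token) pairs; the position in the list is the aggregate ID.
    """
    aggregate = []
    for lang, tokens in vocabs.items():
        for token in tokens:
            aggregate.append((lang, token))
    return aggregate


def find_duplicates(aggregate):
    """Return {token: [aggregate IDs]} for tokens that occur more than once."""
    positions = {}
    for index, (_, token) in enumerate(aggregate):
        positions.setdefault(token, []).append(index)
    return {tok: ids for tok, ids in positions.items() if len(ids) > 1}


# Toy vocabularies, made up for illustration only.
vocabs = {"nl": ["Is", "het"], "en": ["Is", "the"]}
aggregate = aggregate_vocabularies(vocabs)
duplicates = find_duplicates(aggregate)  # "Is" appears under two aggregate IDs
```

A dictionary that rejects duplicates cannot represent both entries for "Is", which is the limitation this issue asks to lift.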

Additional Context

I would be happy to implement this feature if it is something the Flashlight team would be open to!

ThomasKluiters added the enhancement (New feature or request) label May 29, 2024