Possible Feature: Use lambda function for out_of_vocabulary_token_option #23

Open
keshprad opened this issue Jul 3, 2021 · 3 comments

Comments

@keshprad
Contributor

keshprad commented Jul 3, 2021

Let me know what you think of allowing users to specify their own lambda function if they aren't satisfied with the built-in out_of_vocabulary_token_option choices.
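
To make the proposal concrete, here is a minimal sketch of what passing a callable could look like. It does not use the library's real internals: `KNOWN_CASINGS`, `true_case`, and the `oov_handler` parameter are hypothetical names invented for illustration, and the title-case default is an assumption.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the library's learned vocabulary (illustration only).
KNOWN_CASINGS: Dict[str, str] = {"i": "I", "live": "live", "in": "in", "paris": "Paris"}


def true_case(tokens: List[str], oov_handler: Callable[[str], str] = str.title) -> List[str]:
    """Restore casing token by token; unknown tokens are passed to oov_handler."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in KNOWN_CASINGS:
            # Token is in the vocabulary: use its stored casing.
            result.append(KNOWN_CASINGS[key])
        else:
            # Token is out of vocabulary: defer to the user-supplied callable.
            result.append(oov_handler(token))
    return result


# A user who wants unknown tokens upper-cased supplies their own lambda.
cased = true_case("i live in zorbland".split(), oov_handler=lambda t: t.upper())
print(cased)
```

The callable receives the raw token and returns the string to emit, so any policy (lowercase, title case, a lookup in another resource) fits the same slot.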

I can work on this in my fork and create a PR.

@daltonfury42
Owner

daltonfury42 commented Jul 3, 2021

Yes, we can do that.

I would prefer extracting the logic to a member function out_of_vocabulary_handler and adding instructions to the readme on how users can override it with their own custom implementation.

What do you think?
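
A rough sketch of the override approach, under stated assumptions: `SketchTrueCaser` is a simplified, hypothetical stand-in and not the library's actual class, and only the method name `out_of_vocabulary_handler` comes from this thread. The title-case default is also an assumption.

```python
class SketchTrueCaser:
    """Hypothetical, simplified true-caser used only to illustrate the override."""

    def __init__(self, known_casings):
        # Maps a lowercase token to its preferred casing.
        self.known_casings = dict(known_casings)

    def out_of_vocabulary_handler(self, token: str) -> str:
        # Default policy (assumed here): title-case unknown tokens.
        return token.title()

    def get_true_case(self, sentence: str) -> str:
        cased = []
        for token in sentence.split():
            key = token.lower()
            if key in self.known_casings:
                cased.append(self.known_casings[key])
            else:
                # All out-of-vocabulary logic lives in one overridable method.
                cased.append(self.out_of_vocabulary_handler(token))
        return " ".join(cased)


class LowercaseOOVTrueCaser(SketchTrueCaser):
    """User-defined subclass that lowercases unknown tokens instead."""

    def out_of_vocabulary_handler(self, token: str) -> str:
        return token.lower()


vocab = {"i": "I", "like": "like"}
default_result = SketchTrueCaser(vocab).get_true_case("i like hip-hop")
custom_result = LowercaseOOVTrueCaser(vocab).get_true_case("i like hip-hop")
print(default_result)
print(custom_result)
```

Compared with passing a lambda, the subclass keeps the public call signature unchanged, at the cost of requiring users to define a class.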

@keshprad
Contributor Author

keshprad commented Jul 3, 2021

Yes, that's good. I'll work up an implementation.

In addition to out_of_vocabulary, an out_of_dictionary option could also be useful in a later update. This could be an early-stage way to differentiate between names and words that are simply not in the vocabulary.

For example:
"hip-hop" is not in vocabulary, but is certainly a word. I would want it in lowercase.

However, my name (Keshav) is not in the vocabulary and won't be found in a dictionary. I'd want to capitalize "Keshav."

This certainly won't work for all names, as some names are also words in the dictionary, e.g., "Trump".
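
The idea can be sketched as a two-stage lookup. Everything here is hypothetical: `VOCABULARY` and `DICTIONARY` are toy stand-ins (a real version would need an actual word list), and `handle_token` is an illustrative name, not an existing function.

```python
# Toy stand-ins; a real implementation would load an actual vocabulary and word list.
VOCABULARY = {"i": "I", "like": "like", "and": "and"}
DICTIONARY = {"hip-hop", "trump"}  # lowercase dictionary words


def handle_token(token: str) -> str:
    key = token.lower()
    if key in VOCABULARY:
        # In vocabulary: use the stored casing.
        return VOCABULARY[key]
    if key in DICTIONARY:
        # Out of vocabulary but a dictionary word: keep it lowercase.
        return key
    # Out of dictionary as well: assume it is a name and capitalize it.
    return key.capitalize()


result = [handle_token(t) for t in "keshav and trump like hip-hop".split()]
print(result)
```

The output also shows the limitation mentioned above: "trump" stays lowercase because it is a dictionary word, even when it is meant as a name.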

@keshprad
Contributor Author

keshprad commented Jul 3, 2021

Another thing to consider:
If the first word is classified as "out_of_vocabulary", should we capitalize it, or just go along with the user's out_of_vocabulary_token_option?

Currently, it is the latter; however, I think we should capitalize it.
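
A small sketch of the two behaviours side by side. The function, the toy `VOCABULARY`, and the `capitalize_first_oov` flag are all hypothetical; the option names "title", "lower", and "as-is" are assumed to mirror the existing out_of_vocabulary_token_option values.

```python
VOCABULARY = {"is": "is", "here": "here"}  # toy vocabulary for illustration


def true_case(tokens, oov_option="lower", capitalize_first_oov=True):
    cased = []
    for index, token in enumerate(tokens):
        key = token.lower()
        if key in VOCABULARY:
            cased.append(VOCABULARY[key])
        elif index == 0 and capitalize_first_oov:
            # Proposed behaviour: a sentence-initial unknown token is capitalized
            # regardless of the user's out-of-vocabulary option.
            cased.append(key.capitalize())
        elif oov_option == "title":
            cased.append(key.title())
        elif oov_option == "lower":
            cased.append(key)
        else:
            # "as-is": leave the token untouched.
            cased.append(token)
    return cased


tokens = "zorbland is here".split()
proposed = true_case(tokens, oov_option="lower", capitalize_first_oov=True)
current = true_case(tokens, oov_option="lower", capitalize_first_oov=False)
print(proposed)
print(current)
```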
