-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add option to omit removing n-grams with space #76
Conversation
I'd much prefer an option to keep spaces than having to replace my spaces with a special character that is otherwise not used. If you do not want to add this change, please at least consider properly documenting this behaviour of removing n-grams with spaces in them, I just spent way too much time figuring out what was going on.
I understand the frustration but let's take a step back first. Could you show what issue it is exactly that you are facing? Why is it important for your use case to have this change? That way, it is easier for me to understand why this change might be necessary or if there are other ways to support it. Generally, these kinds of changes are first discussed in an open issue but it's alright for me to discuss it further here. |
I think the basic use-case is there:
but i can certainly provide some more details. I am matching some input from a scanned document against a list of materials, that can be rather cryptic. I had somewhat unexpected behaviours but I finally figured something was definitely off when "PE" matched with "PE 1" with 100% as "first match" when there was actually also "PE" in the list, which made me go to look at the code and find out. This could be addressed by adding 1-grams, but I don't think it's a good solution. The main issue is certainly that this behaviour is not documented. Once I know what is going on it's not a big thing to simply pre-process the input accordingly, i.e. replace spaces (or any whitespace, really) with some special character ("_" is a good candidate, but is already used so I might use "#" or whatever). But I don't think this is a good solution either, it feels much more like a workaround than a configuration. If word order matters, the n-grams with space simply should not be ignored... Side note: not having this option would make it complicated to work-around to "enable" spaces when Lastly: I think the |
Thank you for sharing the extensive description. I can imagine wanting to keep the spaces in certain use cases as they might relate to the words/characters that you want to compare. Seeing as this is a relatively low-impact change, I'll go ahead and merge this. For any other suggestions, it would be best to open up a new issue. |
I'd much prefer an option to keep spaces than having to replace my spaces with a special character that is otherwise not used.
If you do not want to add this change, please at least consider properly documenting this behaviour of removing n-grams with spaces in them, I just spent way too much time figuring out what was going on.
I also removed the
use_knn=False
in the example, since this option does not exist (anymore?)