-
Notifications
You must be signed in to change notification settings - Fork 875
Remove query processing and default processor #150
Remove query processing and default processor #150
Conversation
…ts to not give false failures if scorer processes query to empty string
Looking again there's probably an optimization in there to only run full_process on the query once if the scorer is going to run that anyway, but would need to change all the scorers to accept a "query_already_processed" param or something, and still be independent of whatever processor is passed in, maybe not worthwhile? |
I'm not sure how I feel about this changeset in general. You've removed/changed existing tests, so it's kinda hard to suss out whether your fix has unintended consequences. |
Can you add just the one new test at least and make it pass that however you see fit? New pull request with just that branch? The existing tests were modified to not depend on processor as the processor passed in should have no effect on what those tests are looking for. The scoring functions can not depend on any behavior from the passed in processor, whatever it may do, and if they do then they're broken. Also why default processor has to be None, so it won't (correctly) fail the tests once that change is made. Also have adjusted the tests to reflect the default processing behavior of each scorer. |
Also note that this issue causes the current tests to pass incorrectly. If you change the query in the testWithProcessor test from query = "new york mets vs chicago cubs" to say.. query = "p new york mets vs chicago cubs" then it will no longer pass, while it should if it had been processed correctly. |
Ok I added back in test_fuzzywuzzy_pytest.py, and also some extra code in process.py so that the warning it tests for will get thrown appropriately. It scorer will apply processing, it will check that processing to see if it will result in an empty string. |
One of the reasons the input was being processed was because the default processor was causing so many issues but it seemed (from a previous comment) that they did not want to change it. I moved the defaults into variables at the top to make them more visible and easier to replace because I thought that it might be necessary. I don't really like the WRatio scorer as it gives some inconsistent results depending on string lengths but I didn't feel that I could go and change it or the default processor at the time. It does make sense not to process the query for the use case you've given. For the use case I had clearing up both query AND choices was more useful (my use was changing characters with diacritical marks to those without). It also followed the existing design somewhat - before my changes if you didn't specify a scorer but did specify a processor your processor would be ignored (WRatio runs full process on query and choice). Whatever the outcome I would suggest there should be an example added to the readme to make the behaviour clear. |
Right on, mostly I use it with token_set_ratio for my use case, basically searching through large lists of long and highly variable model numbers that token_set_ratio seems to be particularly good at compared to anything else I've found. Our processor will send something like model name + model number to the scorer, and then we'll reference the results by guid. Would be happy to write an example or two if it helps. Always specify both processor and scorer which is probably (hopefully) why I never noticed the issue with processor getting ignored. |
Could we have both a choice_processor and text_processor param? And specify it as text_processor is a function in form of string > string and runs on both, and choice_processor is a function in the form of object > string and runs only on choice? |
If they want to do this I see two options you could go for. Either a boolean process_query which decides whether to run the processor on the query or a separate query_processor parameter. I think I prefer the boolean option. I would also avoid changing the name of the current parameter as that's likely to break more things. |
Agree on not changing the current parameter, and doing the boolean option if it was added. Would have the additional caveat of having to say if process_query = True, processor must accept string input, but at least the behavior would be clear. Still think there should be no default processor unless all of the standalone scoring functions have that same default, but not as big of an issue. Would be very easy to use a processor like lambda x: x[0] though without realizing that it would then no longer do the default cleanup routines. |
Went with the bool in my js version, but instead of it meaning to run the processor, its used to run the full_process function exclusively, and the processor is totally separate. Could also do it how fuse.js does where you have a keys parameter instead of needing a function to access fields, though not something I personally feel like messing with at the moment. Also added in some optimizations out of personal necessity, fwiw, where you can pass in tokens to the scorers to save from recalculating every time. Adds complexity but our current JavaScript version is substantially faster than our old version with that enabled. |
I agree that there should be some separate parameter for extracting/parsing text from choices. In my case, I want to perform a search on a list of objects with the 'name' attribute. Currently, to do so, I pass Looking at standard library, there is a |
@MelomanCool pull requests welcome. |
Added test case to check that using a processor of the form lambda x: x['key'] doesn't fail.
Set default_processor to None and do not run processor on query.
Remove test case that checked if string reduced to 0 by processor, as string will no longer be processed.
Adjusted test cases in test_fuzzywuzzy_hypothesis to not use a processor, but not fail if scorer reduces query to empty string. If the user supplies a processor that modifies the choice so that it is no longer an exact match for the query, not finding an exact match would be the expected behavior.
Saw some relevant discussion around issues #77 and #141 etc. If processor doesn't run on the query, which I feel it CAN NOT, then processor must also default to None to avoid unexpected behaviors.
(also have a separate branch that only adds the new test case fwiw)