Remove query processing and default processor #150

nol13 · 2016-12-04T20:38:06Z

Added test case to check that using a processor of the form lambda x: x['key'] doesn't fail.

Set default_processor to None and do not run processor on query.

Remove test case that checked if string reduced to 0 by processor, as string will no longer be processed.

Adjusted test cases in test_fuzzywuzzy_hypothesis to not use a processor, but not fail if scorer reduces query to empty string. If the user supplies a processor that modifies the choice so that it is no longer an exact match for the query, not finding an exact match would be the expected behavior.

Saw some relevant discussion around issues #77 and #141 etc. If processor doesn't run on the query, which I feel it CAN NOT, then processor must also default to None to avoid unexpected behaviors.

(also have a separate branch that only adds the new test case fwiw)
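The failure mode described above can be reproduced without the library. What follows is a minimal sketch, not FuzzyWuzzy's actual code: `rank` and its `also_process_query` flag are hypothetical stand-ins for the extraction loop, and `difflib.SequenceMatcher` stands in for a scorer. It only illustrates why a key-extracting processor cannot also be applied to a plain-string query.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (assumption): 0-100 similarity computed with difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def rank(query, choices, processor=None, also_process_query=False):
    """Hypothetical extraction loop, not the library's implementation."""
    if processor is None:
        processor = lambda x: x
    if also_process_query:
        # Behaviour under discussion: the processor also runs on the query.
        query = processor(query)
    scored = [(choice, similarity(query, processor(choice))) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


choices = [{"key": "new york mets"}, {"key": "boston red sox"}]
by_key = lambda x: x["key"]

# Processor applied to choices only: the string query is left alone.
best_choice, best_score = rank("new york mets", choices, processor=by_key)[0]

# Processor applied to the query as well: indexing a str with "key" raises.
try:
    rank("new york mets", choices, processor=by_key, also_process_query=True)
    query_processing_failed = False
except TypeError:
    query_processing_failed = True
```

With the processor restricted to choices, the dict whose `key` equals the query ranks first; with the processor also run on the query, the call raises `TypeError` before any scoring happens.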

nol13 · 2016-12-06T21:44:17Z

Looking again, there's probably an optimization in there: run full_process on the query only once if the scorer is going to run it anyway. That would require changing all the scorers to accept a "query_already_processed" param or something, while still being independent of whatever processor is passed in. Maybe not worthwhile?

josegonzalez · 2016-12-06T21:52:38Z

I'm not sure how I feel about this changeset in general. You've removed/changed existing tests, so it's kinda hard to suss out whether your fix has unintended consequences.

nol13 · 2016-12-06T22:36:19Z

Can you at least add just the one new test and make it pass however you see fit? Maybe a new pull request with just that branch?

The existing tests were modified to not depend on the processor, as the processor passed in should have no effect on what those tests are looking for. The scoring functions cannot depend on any behavior of the passed-in processor, whatever it may do, and if they do then they're broken. That is also why the default processor has to be None: with a non-None default, the tests would (correctly) fail once query processing is removed.

Also have adjusted the tests to reflect the default processing behavior of each scorer.

nol13 · 2016-12-06T23:49:37Z

Also note that this issue causes the current tests to pass incorrectly. If you change the query in the testWithProcessor test from

query = "new york mets vs chicago cubs"

to say..

query = "p new york mets vs chicago cubs"

then it will no longer pass, while it should if it had been processed correctly.
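A short sketch of why the test can pass for the wrong reason. The test body is not quoted in this thread, so this assumes its processor is an index-style accessor such as `lambda event: event[0]`; the event record below is illustrative only. Applied to a list, such a processor returns the event title; applied to a string query, it returns just the first character.

```python
# Assumed shape of the test's processor; not quoted from the test file.
processor = lambda event: event[0]

# Illustrative choice record (hypothetical data).
event = ["chicago cubs vs new york mets", "CitiField"]

# On a choice, the processor extracts the title as intended.
processed_choice = processor(event)

# On a string query, the same processor keeps only the first character.
processed_query = processor("new york mets vs chicago cubs")
processed_alt_query = processor("p new york mets vs chicago cubs")
```

Under that assumption, the two queries are reduced to `"n"` and `"p"` respectively, so the ranking depends on a single character rather than on the team names.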

nol13 · 2016-12-07T00:26:28Z

OK, I added test_fuzzywuzzy_pytest.py back in, along with some extra code in process.py so that the warning it tests for gets thrown appropriately. If the scorer will apply processing, the code checks that processing to see whether it will result in an empty string.

DavidCEllis · 2016-12-08T02:26:18Z

One of the reasons the input was being processed was because the default processor was causing so many issues but it seemed (from a previous comment) that they did not want to change it. I moved the defaults into variables at the top to make them more visible and easier to replace because I thought that it might be necessary. I don't really like the WRatio scorer as it gives some inconsistent results depending on string lengths but I didn't feel that I could go and change it or the default processor at the time.

It does make sense not to process the query for the use case you've given. For the use case I had clearing up both query AND choices was more useful (my use was changing characters with diacritical marks to those without). It also followed the existing design somewhat - before my changes if you didn't specify a scorer but did specify a processor your processor would be ignored (WRatio runs full process on query and choice).
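For the diacritics use case, the processor is a plain string-to-string function, which is why running it on both query and choices is harmless there. A minimal sketch of such a processor, using only the standard library; `strip_diacritics` is a hypothetical name and the sample strings are made up for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits each accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# A string-to-string processor can be applied to the query and to every choice.
query = strip_diacritics("Málaga café")
choices = [strip_diacritics(c) for c in ["Malaga Cafe", "Zürich Bar"]]
```

A function like this accepts any string, so it never hits the type problem that a key-extracting processor does.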

Whatever the outcome I would suggest there should be an example added to the readme to make the behaviour clear.

nol13 · 2016-12-08T04:11:55Z

Right on, mostly I use it with token_set_ratio for my use case, basically searching through large lists of long and highly variable model numbers that token_set_ratio seems to be particularly good at compared to anything else I've found. Our processor will send something like model name + model number to the scorer, and then we'll reference the results by guid. Would be happy to write an example or two if it helps.

Always specify both processor and scorer which is probably (hopefully) why I never noticed the issue with processor getting ignored.

nol13 · 2016-12-08T04:58:16Z

Could we have both a choice_processor and a text_processor param? text_processor would be specified as a function of the form string > string that runs on both query and choice, and choice_processor as a function of the form object > string that runs only on the choice.

DavidCEllis · 2016-12-08T07:46:26Z

If they want to do this I see two options you could go for. Either a boolean process_query which decides whether to run the processor on the query or a separate query_processor parameter. I think I prefer the boolean option. I would also avoid changing the name of the current parameter as that's likely to break more things.

nol13 · 2016-12-08T19:21:17Z

Agree on not changing the current parameter, and doing the boolean option if it was added. Would have the additional caveat of having to say if process_query = True, processor must accept string input, but at least the behavior would be clear.
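To make the boolean option concrete, here is a rough sketch under stated assumptions: `extract_one_sketch` is a hypothetical function, not part of the library, and `difflib.SequenceMatcher` is used as a stand-in scorer. The `processor` parameter keeps its current name, and `process_query` decides whether it also runs on the query.

```python
from difflib import SequenceMatcher


def extract_one_sketch(query, choices, processor=None, process_query=False):
    """Hypothetical signature illustrating the proposed boolean flag."""
    if processor is None:
        processor = lambda x: x
    if process_query:
        # Caveat from the thread: the processor must accept string input here.
        query = processor(query)

    def score(choice):
        # Stand-in scorer (assumption): similarity ratio from difflib.
        return SequenceMatcher(None, query, processor(choice)).ratio()

    return max(choices, key=score)


# String-to-string processor: safe to run on both sides.
both_sides = extract_one_sketch(
    "NEW YORK METS",
    ["new york mets", "boston red sox"],
    processor=lambda s: s.lower(),
    process_query=True,
)

# Object-to-string processor: run on choices only (the default in this sketch).
records = [("new york mets", 1), ("boston red sox", 2)]
choices_only = extract_one_sketch("new york mets", records, processor=lambda r: r[0])
```

Defaulting the flag to `False` is a choice made for this sketch only; the thread does not settle which default would be used.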

Still think there should be no default processor unless all of the standalone scoring functions have that same default, but not as big of an issue. Would be very easy to use a processor like lambda x: x[0] though without realizing that it would then no longer do the default cleanup routines.

nol13 · 2016-12-30T20:12:18Z

Went with the bool in my js version, but instead of meaning "run the processor", it's used to run the full_process function exclusively, and the processor is totally separate. Could also do it the way fuse.js does, where you have a keys parameter instead of needing a function to access fields, though that's not something I personally feel like messing with at the moment.

Also added in some optimizations out of personal necessity, fwiw, where you can pass in tokens to the scorers to save from recalculating every time. Adds complexity but our current JavaScript version is substantially faster than our old version with that enabled.

MelomanCool · 2017-09-22T17:37:24Z

I agree that there should be some separate parameter for extracting/parsing text from choices.

In my case, I want to perform a search on a list of objects with the 'name' attribute. Currently, to do so, I pass lambda x: x.name as processor. Because the processor function also processes query, I can't simply pass a string as query anymore — I need to create an object with the 'name' attribute.

Looking at the standard library, there is a key argument in functions like sorted or max, which does exactly what it should: it extracts the key for comparison. In FuzzyWuzzy, on the other hand, the processor argument by default contains a function that cleans up the string. I think this is great, since one can easily replace it with another clean-up function. But then it shouldn't be used for key extraction. There must be another parameter for that.
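A sketch of what that separation could look like, modelled on the standard library's `key` argument. Everything here is hypothetical: `best_match` and its `key` parameter do not exist in FuzzyWuzzy, and `difflib.SequenceMatcher` is only a stand-in scorer.

```python
from collections import namedtuple
from difflib import SequenceMatcher

Item = namedtuple("Item", ["name", "price"])
items = [Item("Blue Widget", 3), Item("Red Gadget", 5)]

# Standard-library precedent: key extracts the value used for comparison.
most_expensive = max(items, key=lambda item: item.price)


def best_match(query, choices, key=None, processor=None):
    """Hypothetical API: key extracts text from a choice, processor cleans text."""
    key = key or (lambda x: x)
    processor = processor or (lambda s: s)
    # The query is already a string, so only the clean-up step applies to it.
    cleaned_query = processor(query)

    def score(choice):
        # Choices go through key extraction first, then the same clean-up.
        return SequenceMatcher(None, cleaned_query, processor(key(choice))).ratio()

    return max(choices, key=score)


match = best_match(
    "blue widget",
    items,
    key=lambda item: item.name,
    processor=str.lower,
)
```

With the two roles split this way, the query can stay a plain string while the choices remain arbitrary objects.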

josegonzalez · 2017-09-22T17:39:10Z

@MelomanCool pull requests welcome.

nolan and others added 5 commits December 4, 2016 14:07

add test to use processor that maps Dict to List

d12d2cc

Remove query processing and set default processor to None, adjust tests to not give false failures if scorer processes query to empty string

2fe1de3

run empty_check_function before perfect match checks

ca6da29

update test param comments

e3101b8

remove some comented out code

b54e0b7

update test case to not unnecessarily use full_process on with ratio or partial_ratio

f990c56

added back in test_fuzzywuzzy_pytest.py and modified warning code in process.py

904ebc8

add back test_fuzzywuzzy_pytest.py to .travis.yml

714401a

nol13 closed this Mar 19, 2017
