Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Remove query processing and default processor #150

Conversation

nol13
Copy link
Contributor

@nol13 nol13 commented Dec 4, 2016

Added test case to check that using a processor of the form lambda x: x['key'] doesn't fail.

Set default_processor to None and do not run processor on query.

Remove test case that checked if string reduced to 0 by processor, as string will no longer be processed.

Adjusted test cases in test_fuzzywuzzy_hypothesis to not use a processor, but not fail if scorer reduces query to empty string. If the user supplies a processor that modifies the choice so that it is no longer an exact match for the query, not finding an exact match would be the expected behavior.

Saw some relevant discussion around issues #77 and #141 etc. If processor doesn't run on the query, which I feel it CAN NOT, then processor must also default to None to avoid unexpected behaviors.

(also have a separate branch that only adds the new test case fwiw)

@nol13
Copy link
Contributor Author

nol13 commented Dec 6, 2016

Looking again there's probably an optimization in there to only run full_process on the query once if the scorer is going to run that anyway, but would need to change all the scorers to accept a "query_already_processed" param or something, and still be independent of whatever processor is passed in, maybe not worthwhile?

@josegonzalez
Copy link
Contributor

I'm not sure how I feel about this changeset in general. You've removed/changed existing tests, so it's kinda hard to suss out whether your fix has unintended consequences.

@nol13
Copy link
Contributor Author

nol13 commented Dec 6, 2016

Can you add just the one new test at least and make it pass that however you see fit? New pull request with just that branch?

The existing tests were modified to not depend on processor as the processor passed in should have no effect on what those tests are looking for. The scoring functions can not depend on any behavior from the passed in processor, whatever it may do, and if they do then they're broken. Also why default processor has to be None, so it won't (correctly) fail the tests once that change is made.

Also have adjusted the tests to reflect the default processing behavior of each scorer.

@nol13
Copy link
Contributor Author

nol13 commented Dec 6, 2016

Also note that this issue causes the current tests to pass incorrectly. If you change the query in the testWithProcessor test from

query = "new york mets vs chicago cubs"

to say..

query = "p new york mets vs chicago cubs"

then it will no longer pass, while it should if it had been processed correctly.

@nol13
Copy link
Contributor Author

nol13 commented Dec 7, 2016

Ok I added back in test_fuzzywuzzy_pytest.py, and also some extra code in process.py so that the warning it tests for will get thrown appropriately. It scorer will apply processing, it will check that processing to see if it will result in an empty string.

@DavidCEllis
Copy link
Contributor

One of the reasons the input was being processed was because the default processor was causing so many issues but it seemed (from a previous comment) that they did not want to change it. I moved the defaults into variables at the top to make them more visible and easier to replace because I thought that it might be necessary. I don't really like the WRatio scorer as it gives some inconsistent results depending on string lengths but I didn't feel that I could go and change it or the default processor at the time.

It does make sense not to process the query for the use case you've given. For the use case I had clearing up both query AND choices was more useful (my use was changing characters with diacritical marks to those without). It also followed the existing design somewhat - before my changes if you didn't specify a scorer but did specify a processor your processor would be ignored (WRatio runs full process on query and choice).

Whatever the outcome I would suggest there should be an example added to the readme to make the behaviour clear.

@nol13
Copy link
Contributor Author

nol13 commented Dec 8, 2016

Right on, mostly I use it with token_set_ratio for my use case, basically searching through large lists of long and highly variable model numbers that token_set_ratio seems to be particularly good at compared to anything else I've found. Our processor will send something like model name + model number to the scorer, and then we'll reference the results by guid. Would be happy to write an example or two if it helps.

Always specify both processor and scorer which is probably (hopefully) why I never noticed the issue with processor getting ignored.

@nol13
Copy link
Contributor Author

nol13 commented Dec 8, 2016

Could we have both a choice_processor and text_processor param? And specify it as text_processor is a function in form of string > string and runs on both, and choice_processor is a function in the form of object > string and runs only on choice?

@DavidCEllis
Copy link
Contributor

If they want to do this I see two options you could go for. Either a boolean process_query which decides whether to run the processor on the query or a separate query_processor parameter. I think I prefer the boolean option. I would also avoid changing the name of the current parameter as that's likely to break more things.

@nol13
Copy link
Contributor Author

nol13 commented Dec 8, 2016

Agree on not changing the current parameter, and doing the boolean option if it was added. Would have the additional caveat of having to say if process_query = True, processor must accept string input, but at least the behavior would be clear.

Still think there should be no default processor unless all of the standalone scoring functions have that same default, but not as big of an issue. Would be very easy to use a processor like lambda x: x[0] though without realizing that it would then no longer do the default cleanup routines.

@nol13
Copy link
Contributor Author

nol13 commented Dec 30, 2016

Went with the bool in my js version, but instead of it meaning to run the processor, its used to run the full_process function exclusively, and the processor is totally separate. Could also do it how fuse.js does where you have a keys parameter instead of needing a function to access fields, though not something I personally feel like messing with at the moment.

Also added in some optimizations out of personal necessity, fwiw, where you can pass in tokens to the scorers to save from recalculating every time. Adds complexity but our current JavaScript version is substantially faster than our old version with that enabled.

@nol13 nol13 closed this Mar 19, 2017
@MelomanCool
Copy link

I agree that there should be some separate parameter for extracting/parsing text from choices.

In my case, I want to perform a search on a list of objects with the 'name' attribute. Currently, to do so, I pass lambda x: x.name as processor. Because the processor function also processes query, I can't simply pass a string as query anymore — I need to create an object with the 'name' attribute.

Looking at standard library, there is a key argument in functions like sorted or max, which does exactly what it should — extracts the key for comparison. In FuzzyWuzzy, on the other hand, the processor argument by default contains a function that clean-ups the string. I think, this is great — one can replace it with another clean-up function with ease. But then it shouldn't be used for key-extracting. There must be another parameter for that.

@josegonzalez
Copy link
Contributor

@MelomanCool pull requests welcome.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants