Strange results that depend on sort and case #141
I'm using fuzzywuzzy 0.12.0 with python-Levenshtein.
@acslater00 thoughts?
@josegonzalez @Sovetnikov v0.12.0 converts everything in the list of choices to lower case (along with some other filtering) but does not do the same to the query, hence the lower-case version of the query matches while the upper-case version does not. With the GitHub version, results 2 and 3 are the same.
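For illustration, here is a small sketch of my own (not the original sample code), assuming the 0.12.0 behaviour described above and using the basic ratio scorer so the case mismatch is visible:

```python
# My own illustration (not the original sample code), assuming the 0.12.0
# behaviour described above: choices go through utils.full_process
# (lower-casing and stripping punctuation) while the query was left untouched.
from fuzzywuzzy import fuzz, process, utils

choices = ["Tat'janin Park", "Central Park"]

print(utils.full_process("Tat'janin Park"))   # roughly "tat janin park"

# In 0.12.0 only the choices were lower-cased, so with scorer=fuzz.ratio a
# lower-case query could score near 100 against its own choice while the
# mixed-case original of the same string scored noticeably lower.
print(process.extract("tat janin park", choices, scorer=fuzz.ratio))
print(process.extract("Tat'janin Park", choices, scorer=fuzz.ratio))
```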
Great, then just one issue:
It looks to me like you're hitting an edge case in the design of WRatio (the default scorer). WRatio compares the lengths of the two strings to decide how it will compare them. I guess the idea is that the longer a string is, the more likely a random correlation will appear.
When this length ratio is over 8 it applies a 0.6 scalar to partial-match results, whereas by default partial results get a 0.9 scalar. The method takes the max of several sub-scores (plain ratio, partial ratio, partial token sort and partial token set); here partial_ratio gives the best result, so that is what gets used. "Tat'janin Park" gives 66.666..., and 66.666 * 0.9 ≈ 60, which at least explains why you're getting this result. I'm not sure what the best way of solving this is while still maintaining the intended behaviour (partial matches are less valuable when comparing long and short strings).
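To make the scaling concrete, here is a paraphrased sketch of the logic described above. It is not the library's actual source; the function name and structure are my own, and the sub-ratio calls re-run their own processing, so results may differ slightly from the real fuzz.WRatio:

```python
# Paraphrased sketch (not fuzzywuzzy's actual source) of the length-ratio
# scaling described above.
from fuzzywuzzy import fuzz, utils

def wratio_sketch(s1, s2):
    p1, p2 = utils.full_process(s1), utils.full_process(s2)
    if not p1 or not p2:
        return 0

    base = fuzz.ratio(p1, p2)

    # Ratio of the longer string's length to the shorter one's.
    len_ratio = float(max(len(p1), len(p2))) / min(len(p1), len(p2))

    unbase_scale = 0.95
    # Similar-length strings: skip partial comparisons entirely.
    if len_ratio < 1.5:
        return int(max(base,
                       fuzz.token_sort_ratio(p1, p2) * unbase_scale,
                       fuzz.token_set_ratio(p1, p2) * unbase_scale))

    # Hard cutoff: when one string is much longer than the other,
    # partial matches are discounted more heavily (0.6 instead of 0.9).
    partial_scale = 0.6 if len_ratio > 8 else 0.9

    partial = fuzz.partial_ratio(p1, p2) * partial_scale
    ptsor = fuzz.partial_token_sort_ratio(p1, p2) * unbase_scale * partial_scale
    ptser = fuzz.partial_token_set_ratio(p1, p2) * unbase_scale * partial_scale

    return int(max(base, partial, ptsor, ptser))
```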
Thanks, it seems these are just unexpected results to me; I will use the partial_ratio scorer instead.
Having just been bitten by this undocumented lower-casing behaviour... may I suggest a flag that allows one to ignore case in both the query and the candidates? For the life of me I couldn't figure out why the same string was scoring less than 100.
Does someone mind adding a PR to the documentation for what's going on here? I'd like to shy away from adding extra flags or post-processing entries if at all possible.
@christian-storm I believe I fixed your issue with PR #140, which is the last commit that has been merged. Can you check this? You may need to install directly from GitHub if PyPI hasn't been updated. @josegonzalez The behaviour of WRatio for @Sovetnikov's original example is more complicated, as explained in my reply. I'm not sure if this is something that can be solved with documentation; the behaviour itself is unusual due to the choice of hard cutoffs.
@josegonzalez The case sensitivity bug goes back to issue #77. The patch in #140 adds processing to the query, which @acslater00 had argued against in that issue, but keeping the behaviour as it was has been the cause of a number of subsequent issues (#77, #139, #129, #141) because it's unexpected default behaviour (using the basic ratio as the scorer, identical values don't match without changing the processor). You could add a boolean process_query to decide whether you also want to process the query, to maintain backwards compatibility, but there's another point to be made here.
If you set a processor but not a scorer, the scorer defaults to WRatio and whatever you selected as the processor is effectively ignored. So if you were trying to strip timestamps (as in the example) and put in a processor to do so but no scorer, assuming it would use the default WRatio, your processor is ignored and replaced by the one within WRatio, which is also run on both the query and the choices.
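To make that interaction concrete, here is a minimal sketch of the situation described above. The strip_timestamp helper and the choices are invented for illustration; only process.extract and the default fuzz.WRatio scorer are from the library:

```python
# Hypothetical illustration of the processor/scorer interaction described
# above. strip_timestamp and the choices are invented, not from fuzzywuzzy.
import re
from fuzzywuzzy import process

def strip_timestamp(s):
    # Remove a leading "HH:MM " prefix before matching.
    return re.sub(r"^\d{2}:\d{2}\s+", "", s)

choices = ["12:30 server restarted", "09:15 backup completed"]

# No scorer is given, so process.extract falls back to fuzz.WRatio, which
# applies its own full_process (lower-casing, stripping punctuation) on both
# the query and each choice, on top of the processor passed here. Depending
# on the fuzzywuzzy version, the processor may or may not also run on the query.
results = process.extract("server restarted", choices, processor=strip_timestamp)
print(results)
```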
If you set a scorer but not a processor, or vice versa, and that doesn't work as expected, that sounds like an application bug. I also don't think raising a warning when that occurs is correct, because it's possible that is what the user wanted. It's not clear to me that there is a "best" default here either. Documentation around these issues, both in the readme and in the docblocks, should make usage a bit clearer.
I resolved my case by making a custom scorer based on WRatio and cutting out the condition that lowers the partial score based on the string-length ratio.
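For readers hitting the same thing, a rough sketch of that kind of custom scorer might look like the following. This is not the poster's actual code, just one way to do what they describe, reusing the WRatio-style combination from the sketch above but dropping the len_ratio > 8 discount:

```python
# A rough sketch (not the poster's actual code) of a WRatio-style scorer with
# the "len_ratio > 8 => 0.6" discount removed, so partial matches are always
# scaled by 0.9 whenever they are used at all.
from fuzzywuzzy import fuzz, utils

def wratio_no_length_discount(s1, s2):
    p1, p2 = utils.full_process(s1), utils.full_process(s2)
    if not p1 or not p2:
        return 0

    base = fuzz.ratio(p1, p2)
    len_ratio = float(max(len(p1), len(p2))) / min(len(p1), len(p2))

    if len_ratio < 1.5:
        # Similar-length strings: stick to the non-partial comparisons.
        return int(max(base,
                       fuzz.token_sort_ratio(p1, p2) * 0.95,
                       fuzz.token_set_ratio(p1, p2) * 0.95))

    # Partial comparisons always get the 0.9 scaling, never 0.6.
    return int(max(base,
                   fuzz.partial_ratio(p1, p2) * 0.9,
                   fuzz.partial_token_sort_ratio(p1, p2) * 0.95 * 0.9,
                   fuzz.partial_token_set_ratio(p1, p2) * 0.95 * 0.9))

# Usage: process.extract(query, choices, scorer=wratio_no_length_discount)
```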
I still feel the behaviour of WRatio is somewhat unpredictable when you're right on the other edge case for len_ratio, which causes some unexpected behaviour, for example when comparing with a compound word.
This WRatio method probably solved the specific problem it was created for, but as a general fuzzy searcher it has some unusual behaviour like this. If more of a word matches, you shouldn't really be able to get a lower score (these results are the same with or without removing the lines you presented). I think fixing this issue would require some redesign of the scorer.
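As an illustration of that edge case, here is my own example (not from the thread) of the len_ratio = 1.5 boundary. The exact scores depend on the fuzzywuzzy version and on whether python-Levenshtein is installed, so treat the numbers in the comments as approximate:

```python
# My own illustration of the len_ratio boundary in WRatio. "footb" and
# "footba" are arbitrary truncations chosen to sit on either side of the
# 1.5 length-ratio cutoff; the scores mentioned below are approximate.
from fuzzywuzzy import fuzz

query = "football"
for choice in ["footb", "footba"]:
    len_ratio = max(len(query), len(choice)) / float(min(len(query), len(choice)))
    print(choice, round(len_ratio, 2), fuzz.WRatio(query, choice))

# "footb" (len_ratio 1.6) goes through the partial-match branch and comes out
# around 90, while "footba" (len_ratio ~1.33) does not and scores around 86 -
# so the choice that shares more of the query can end up with the lower score.
```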
To a new user such as myself, I'd expect a scorer to give the same results whether it is called directly or through process. For example, with source_sent = u'Why am I crazy?' and similarity_tups = process.extractBests(source_sent, target_sent, ...), print(repr(source_sent)) produces u'Why am I crazy?', yet the score isn't 100. I went down a rabbit hole of being convinced there were unprintable characters in my strings. To me a good API would expose what kind of preprocessing you'd like. I could contend you took a step in the preprocessing direction by exposing a processor.
@christian-storm I've submitted some patches (PRs #140 and #142). #142 adds tests to make sure exact matches return 100 when going through process. Returning the same results for extract and ratio without specifying processor=None would require changing the default processor (replacing it with a no-op). I've made this easy with #142 (and doing so doesn't break any tests), but that seems a more significant behaviour change to how process works, although I would recommend it. I think processing as functions allows for more flexibility than a preset list. It may be worth having a process_choices option that runs only on the choices, although it would need to be clear whether this ran before, after, or instead of the main processor option.
I think the disconnect is over what we expect the processor to be. As it is now, the processor is being handled as something that takes a string and outputs a modified string. That is not really that useful (to me) and breaks the original spec (and causes silent failures that return garbage results in the case of lambda x: x[0], which is given as an example in the documentation). The processor argument is needed so that I can give it a list of objects, tell it which value of each object to use for scoring, and get the objects back unmodified, with the associated data that wasn't used for scoring still attached. If I want that function to perform some text manipulation as well, I can do it without any added inconvenience. This means that if I do any manipulation in my processor, I'll have to make sure to run it on my query too, separately, but there should be no expectation for it to run on the query unless I tell it to. If there were some preset options for the basic cleanup functions that get run, that would be great too, but that's a whole separate issue.
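A minimal sketch of that usage pattern (my own example; the record layout is invented): the processor pulls out the field to score on, and process.extract returns the original records with their extra data intact:

```python
# My own example (the record layout is invented) of a processor that selects
# the field to score on while the rest of each record stays attached.
from fuzzywuzzy import process

# (name, id) tuples - only the name should be used for scoring.
choices = [
    ("Tat'janin Park", 101),
    ("Central Park", 102),
]

def extract_name(choice):
    # Some fuzzywuzzy versions run the processor on the query string as well
    # as on each choice, so handle both plain strings and record tuples.
    return choice if isinstance(choice, str) else choice[0]

results = process.extract("tat'janin park", choices, processor=extract_name)

# Each result keeps the full original tuple, so the id is still available.
for record, score in results:
    print(record, score)
```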
I mean, I guess on second look there's no real practical difference between having to wrap the query in a dummy object so that it can be run through the processor vs. having to run the processor separately on the query. It's a change from how it worked before, but I'll close my pull request on it... I made another one to update the relevant test and docstring, though. :)