Strange results that depends on sort and case #141

Open
Sovetnikov opened this issue Oct 14, 2016 · 17 comments

Comments

@Sovetnikov

This is sample code:

from fuzzywuzzy import process
from fuzzywuzzy.fuzz import partial_ratio

data = ["Dom na Prishvina", "Krylatskij",
        "Prigorod.Lesnoe", "Kotel'nicheskie vysotki",
        "Novaja Presnja", "Stolichnyj", "Bukinist",
        "Voznesenskij", "Oranzhvud",
        "Akadem-Palas", "Novoe Tushino",
        "Alekseevskaja roscha", "Marshal Grad",
        "Novomolokovo", "Ljubertsy 2017",
        "Malaja Ordynka 19",
        "Rezidentsija na Vsevolozhskom",
        "Kashintsevo", "Sojuznyj",
        "Michurino-Zapad",
        "Tat'janin Park",
        "Peredelkino Blizhnee",
        "Rozhdestvenskij",
        "Vostochnoe Butovo",
        "Nemchinovka-rezidents",
        "LIFE-Mitinskaja ECOPARK",
        "Nagornaja 7", "TehnoPark",
        "Tarasovskaja, 2",
        "Dom na VDNH", "Polet",
        "VLjublino", "Letchika Babushkina 17",
        "Dacha Shatena", "Versis",
        "AFI Residence Paveletskaya", "Tarasovskaja, 25", ]

print('1 %s' % process.extractBests('AFI', data))
print('2 %s' % process.extractBests('AFI', data, scorer=partial_ratio))
print('3 %s' % process.extractBests('afi', data, scorer=partial_ratio))
print('4 %s' % process.extractBests('afi', sorted(data)))

The output is:

1 [("Tat'janin Park", 60), ('AFI Residence Paveletskaya', 60), ('Krylatskij', 31), ('Dom na Prishvina', 30), ('Prigorod.Lesnoe', 30)]
2 [('Dom na Prishvina', 0), ('Krylatskij', 0), ('Prigorod.Lesnoe', 0), ("Kotel'nicheskie vysotki", 0), ('Novaja Presnja', 0)]
3 [('AFI Residence Paveletskaya', 100), ("Tat'janin Park", 67), ('Dom na Prishvina', 33), ('Krylatskij', 33), ('Prigorod.Lesnoe', 33)]
4 [('AFI Residence Paveletskaya', 60), ("Tat'janin Park", 60), ('Krylatskij', 31), ('Akadem-Palas', 30), ('Alekseevskaja roscha', 30)]
  1. Strange to me that process.extractBests('AFI', data) makes no difference between ("Tat'janin Park", 60) and ('AFI Residence Paveletskaya', 60).
  2. Strange that process.extractBests('AFI', data, scorer=partial_ratio) does not find "AFI Residence Paveletskaya" at all.
  3. Very strange that process.extractBests('afi', data, scorer=partial_ratio) finds "AFI Residence Paveletskaya"!
@Sovetnikov
Author

I'm using fuzzywuzzy 0.12.0 with python-Levenshtein.

@josegonzalez
Contributor

@acslater00 thoughts?

@DavidCEllis
Contributor

@josegonzalez @Sovetnikov
The second point is related to issue #139, so running the latest GitHub version gives different results.

v0.12.0 converts every entry in the list to lower case (along with some other filtering) but does not do the same to the query, hence the lower-case query matches while the upper-case one does not.
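
A minimal sketch of that asymmetry as I understand v0.12.0 (full_process is the default processor the library applies to each choice):

from fuzzywuzzy import fuzz
from fuzzywuzzy.utils import full_process

# Only the choices go through full_process (lower-casing, punctuation
# stripping); the query is handed to the scorer untouched.
choice = full_process('AFI Residence Paveletskaya')  # 'afi residence paveletskaya'
print(fuzz.partial_ratio('AFI', choice))  # 0   - the upper-case query never matches
print(fuzz.partial_ratio('afi', choice))  # 100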

With the GitHub version, results 2 and 3 are the same:

1 [("Tat'janin Park", 60), ('AFI Residence Paveletskaya', 60), ('Krylatskij', 31), ('Dom na Prishvina', 30), ('Prigorod.Lesnoe', 30)]
2 [('AFI Residence Paveletskaya', 100), ("Tat'janin Park", 67), ('Dom na Prishvina', 33), ('Krylatskij', 33), ('Prigorod.Lesnoe', 33)]
3 [('AFI Residence Paveletskaya', 100), ("Tat'janin Park", 67), ('Dom na Prishvina', 33), ('Krylatskij', 33), ('Prigorod.Lesnoe', 33)]
4 [('AFI Residence Paveletskaya', 60), ("Tat'janin Park", 60), ('Krylatskij', 31), ('Akadem-Palas', 30), ('Alekseevskaja roscha', 30)]

@Sovetnikov
Author

Great, then just one issue remains:
process.extractBests('AFI', data) makes no difference between ("Tat'janin Park", 60) and ('AFI Residence Paveletskaya', 60).
This result does not meet any expectation, considering that it is the default behavior.

@DavidCEllis
Contributor

DavidCEllis commented Oct 16, 2016

@Sovetnikov

It looks to me like you're hitting an edge case in the design of WRatio (the default scorer).

WRatio compares the lengths of the two strings to decide how it will compare them. I guess the idea is that the longer a string is, the more likely a random correlation will appear.

In [6]: len("Tat'janin Park") / len('AFI')
Out[6]: 4.666666666666667

In [7]: len('AFI Residence Paveletskaya') / len('AFI')
Out[7]: 8.666666666666666

When this ratio is over 8, a 0.6 scale factor is applied to partial-match results instead of the default 0.9.

The method takes the max over several ratios (plain ratio, token sort, and token set, including their partial forms); here partial_ratio gives the best result, so that is the one used.

"Tat'janin Park" gives 66.6666... -> 66.666 * 0.9 = 60
'AFI Residence Paveletskaya' gives 100 -> 100 * 0.6 = 60

Which at least explains why you're getting this result.
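
To check that arithmetic end to end, here is a rough re-derivation (a sketch only - the real WRatio also folds in token sort/set variants and takes the max of everything):

from fuzzywuzzy import fuzz

query = 'afi'  # the query after default processing
for choice in ["tat'janin park", 'afi residence paveletskaya']:
    len_ratio = max(len(choice), len(query)) / float(min(len(choice), len(query)))
    partial_scale = .6 if len_ratio > 8 else .9
    print(choice, int(round(fuzz.partial_ratio(query, choice) * partial_scale)))
# tat'janin park 60
# afi residence paveletskaya 60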

I'm not sure what the best way of solving this is while still maintaining the intended behaviour (partial matches are less valuable when comparing long and short strings).

@Sovetnikov
Author

Thanks, it seems these results were just unexpected to me; I will use the partial_ratio scorer.
As a suggestion, maybe make this part of WRatio parametric? In my situation it would be ideal to simply ignore the length ratio:

    # if one string is much much shorter than the other
    if len_ratio > 8:
        partial_scale = .6

@christian-storm

Having just been bitten by this undocumented lower-casing behavior... may I suggest a flag that lets one ignore case in both the query and the candidates? For the life of me I couldn't figure out why two identical strings were scoring less than 100.

@josegonzalez
Contributor

Would someone mind adding a PR to the documentation explaining what's going on here? I'd like to shy away from adding extra flags or post-processing entries if at all possible.

@DavidCEllis
Contributor

@christian-storm I believe I fixed your issue with PR #140, which is the last commit that has been merged. Can you check? You may need to install directly from GitHub if PyPI hasn't been updated.

@josegonzalez The behaviour of WRatio in Sovetnikov's original example is more complicated, as explained in my reply above. I'm not sure this is something that can be solved with documentation; the behaviour itself is unusual due to the choice of hard cutoffs.

@DavidCEllis
Contributor

DavidCEllis commented Oct 29, 2016

@josegonzalez The case-sensitivity bug goes back to issue #77.

Patch #140 adds processing to the query, which @acslater00 had argued against in that issue, but keeping the behaviour as it was has caused a number of subsequent issues (#77, #139, #129, #141) because it's unexpected default behaviour (using the basic ratio as the scorer, identical values don't match unless you change the processor).

You could add a boolean process_query to decide whether to also process the query, in order to maintain backwards compatibility, but there's another point to be made here.

If you set a processor but not a scorer, it defaults to WRatio and ignores whatever you selected as the processor. So if you were trying to strip timestamps (as in the example) and passed in a processor to do so but no scorer, assuming it would use the default WRatio, your processor is ignored and replaced by the one within WRatio, which is also run on both the query and the choices.

@josegonzalez
Contributor

If setting a scorer but not a processor - or vice versa - doesn't work as expected, that sounds like an application bug. I also don't think raising a warning when that occurs is correct, because it's possible that's what the user wanted. It's not clear to me that there is a "best" default here either.

Documentation around the issues - both in the readme and the docblocks - should make usage a bit more clear.

@Sovetnikov
Author

I resolved my case by making a custom scorer based on WRatio, cutting out the condition that lowers the partial score based on the string length ratio.
Now my fuzzy results are great and predictable.
This is the code I cut out:

    # if one string is much much shorter than the other
    if len_ratio > 8:
        partial_scale = .6
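
For reference, a custom scorer along those lines might look like this (a sketch modelled on WRatio's logic with the len_ratio > 8 branch removed - not part of the library's API, so verify against your installed version):

from fuzzywuzzy import fuzz, utils

def wratio_no_length_penalty(s1, s2):
    """WRatio-style scorer without the len_ratio > 8 partial-score penalty."""
    p1, p2 = utils.full_process(s1), utils.full_process(s2)
    if not p1 or not p2:
        return 0
    base = fuzz.ratio(p1, p2)
    len_ratio = max(len(p1), len(p2)) / float(min(len(p1), len(p2)))
    if len_ratio < 1.5:
        # Similar lengths: skip partials, as WRatio does.
        return utils.intr(max(base,
                              fuzz.token_sort_ratio(p1, p2) * .95,
                              fuzz.token_set_ratio(p1, p2) * .95))
    # Always keep the default .9 scale on partials, even for long choices.
    return utils.intr(max(base,
                          fuzz.partial_ratio(p1, p2) * .9,
                          fuzz.partial_token_sort_ratio(p1, p2) * .95 * .9,
                          fuzz.partial_token_set_ratio(p1, p2) * .95 * .9))

# Usage: process.extractBests('AFI', data, scorer=wratio_no_length_penalty)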

@DavidCEllis
Contributor

I still feel the behaviour of WRatio is somewhat unpredictable when you're right at the other edge case for len_ratio:

# if strings are similar length, don't use partials
if len_ratio < 1.5:
    try_partial = False

This causes some unexpected behaviour, for example when comparing against a compound word:

In [3]: fuzz.WRatio('upstream', 'up')
Out[3]: 90

In [4]: fuzz.WRatio('upstream', 'stream')
Out[4]: 86
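
Walking through the arithmetic under those cutoffs (my numbers, using the constants quoted above):

from fuzzywuzzy import fuzz

# 'up':     len_ratio = 8/2 = 4.0  -> partials allowed;
#           partial_ratio('upstream', 'up') = 100, scaled by 0.9 -> 90
# 'stream': len_ratio = 8/6 = 1.33 -> below 1.5, partials skipped;
#           plain ratio = 2*6/(8+6) = 85.7 -> 86
print(fuzz.partial_ratio('upstream', 'stream'))  # 100, but never consulted
print(fuzz.ratio('upstream', 'stream'))          # 86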

This WRatio method probably solved the specific problem it was created for, but as a general fuzzy searcher it has some unusual behaviour like this. If more of a word matches, you shouldn't really be able to get a lower score (these results are the same with or without removing the lines you quoted). I think fixing this would require some redesign of the scorer.

@christian-storm

To a new user such as myself, I'd expect a scorer to give the same results whether called directly or through a process call. Furthermore, when it is called with a == b, I'd always expect it to report a 100% match.

For example,

from fuzzywuzzy import fuzz, process

source_sent = u'Why am I crazy?'
target_sent = [source_sent]

similarity_tups = process.extractBests(source_sent, target_sent,
                                       scorer=fuzz.ratio, limit=1)

print(repr(source_sent))
print(repr(target_sent))
print(similarity_tups)
print("fuzz.ratio={}".format(fuzz.ratio(source_sent, target_sent[0])))

produces

u'Why am I crazy?'
[u'Why am I crazy?']
[(u'Why am I crazy?', 83)]
fuzz.ratio=100

I went down a rabbit hole of being convinced there were unprintable characters (hence the repr() call) and other theories until I stumbled upon this issue.
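
A workaround sketch for v0.12.0: run the library's default processor on the query yourself, so both sides are normalized identically before scoring.

from fuzzywuzzy import fuzz, process, utils

source_sent = u'Why am I crazy?'
processed_query = utils.full_process(source_sent)  # u'why am i crazy'
print(process.extractBests(processed_query, [source_sent],
                           scorer=fuzz.ratio, limit=1))
# [(u'Why am I crazy?', 100)]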

To me, a good API would expose what kind of preprocessing you'd like, e.g. preprocessing_pipeline=['alphanum_only', 'lowercase', 'strip_timestamps'], where the processing is done in the given order (there could be a growing catalog of options depending on the needs of the user, plus a choice of whether to apply it to both query and candidates or to only one of them).

I would contend you already took a step in the preprocessing direction by exposing force_ascii and by having StringProcessor.replace_non_letters_non_numbers_with_whitespace, though StringProcessor only has one method at the moment.

@DavidCEllis
Contributor

@christian-storm
I stumbled on the same thing when comparing a list of report titles. I'd also extracted exact matches and noticed a number of them hadn't shown up in the fuzzy search list, and ended up finding the processing issue.

I've submitted some patches (PRs #140 and #142). #142 adds tests to make sure exact matches return 100 when going through process. Returning the same results for extract and ratio without specifying processor=None would require changing the default processor (replacing it with a no-op). I've made this easy with #142 (and doing so doesn't break any tests), but that seems a more significant change to how process behaves, although I would recommend it.

I think processors as functions allow more flexibility than a preset list. It may be worth having a process_choices option that runs only on the choices, although it would need to be clear whether this runs before, after, or instead of the main processor option.

@nol13
Contributor

nol13 commented Dec 8, 2016

I think the disconnect is over what we expect the processor to be. As it stands, the processor is handled as something that takes a string and outputs a modified string. That is not really that useful (to me) and breaks the original spec (it also causes silent failures that return garbage results in the case of lambda x: x[0], which is given as an example in the documentation).

The processor argument is needed so that I can send in a list of objects, tell it which value of each object to use for scoring, and get the objects back unmodified, with their associated data that wasn't used for scoring still attached. If I want that function to perform some text manipulation as well, I can do so without any added inconvenience.

This means that if I do any manipulation in my processor, I'll have to make sure to run it on my query separately, but there should be no expectation of it running on the query unless I tell it to.

If there were some preset options for the basic cleanup functions, that would be great too, but it's a whole separate issue.

@nol13
Contributor

nol13 commented Mar 19, 2017

I mean, on second look I guess there's no real practical difference between having to wrap the query in a dummy object so it can be run through the processor and having to run the processor separately on the query. It's a change from how it worked before, but I'll close my pull request on it. I made another one to update the relevant test and docstring, though. :)
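
For reference, the dummy-object approach looks roughly like this (a sketch with a hypothetical record shape; it assumes a post-#140 build where the processor runs on the query as well as the choices):

from fuzzywuzzy import fuzz, process

records = [
    {'id': 1, 'name': 'AFI Residence Paveletskaya'},
    {'id': 2, 'name': "Tat'janin Park"},
]
# Dummy wrapper so the shared processor applies cleanly to the query too.
query = {'name': 'AFI Residence'}
matches = process.extractBests(query, records,
                               processor=lambda r: r['name'],
                               scorer=fuzz.partial_ratio)
# Each result is (original record, score), with 'id' still attached.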
