The docs state that if method = "soundex", max_dist is automatically set to 0.5, since soundex returns either 0 (match) or 1 (no match).
That is sensible, but a similar default should apply to the other normalized metrics, such as 'jaccard' and 'cosine', whose distances lie between 0 and 1.
Right now, the default max_dist = 2 causes every possible pair to be returned when 'jaccard' or 'cosine' is used as the metric.
library(ggplot2)   # provides the diamonds dataset
library(fuzzyjoin)
library(dplyr)

data(diamonds)

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

print(dim(diamonds)) # 53940x10

# no matches when they are inner-joined:
match1 <- diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))
print(dim(match1)) # 0x11

# with method = "jaccard" and the default max_dist = 2, every pair is returned:
match2 <- diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"), method = "jaccard")
print(dim(match2)) # 323640x12, i.e. all pairs 53940 * 6 = 323640
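A rough way to see why every pair comes back is to look at the distances directly. The sketch below assumes fuzzyjoin delegates to the stringdist package for these metrics and calls stringdistmatrix() on the distinct names from the example; the variable names (approx, cuts, jac_d, cos_d) are only illustrative.

```r
library(stringdist)

# distinct approximate names from the example, and the levels of diamonds$cut
approx <- c("Idea", "Premiums", "Premioom", "VeryGood", "Faiir")
cuts   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# pairwise distances: rows are approximate names, columns are cut levels
jac_d <- stringdistmatrix(approx, cuts, method = "jaccard")
cos_d <- stringdistmatrix(approx, cuts, method = "cosine")

# both metrics are bounded above by 1, so a threshold of 2 filters nothing
print(range(jac_d))
print(range(cos_d))

# number of pairs kept under each threshold
print(sum(jac_d <= 2))   # all 5 * 5 pairs
print(sum(jac_d <= 0.5)) # only the reasonably close pairs
```

Since no distance can exceed 1, any max_dist of 1 or more behaves like a cross join for these metrics.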
The default value of max_dist should be set to 0.5 when method = "jaccard" or method = "cosine".
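Until the default changes, a possible workaround is to pass max_dist explicitly. This is a minimal sketch reusing the data from the example above; match3 is an illustrative name, and 0.5 is just the threshold proposed here, not a value the package currently applies to these metrics.

```r
library(ggplot2)   # provides the diamonds dataset
library(fuzzyjoin)
library(dplyr)

data(diamonds)

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

# set the threshold by hand instead of relying on the default max_dist = 2
match3 <- diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        method = "jaccard", max_dist = 0.5)

# fewer rows than the full 53940 * 6 cross product
print(dim(match3))
```

The same max_dist argument can be passed with method = "cosine".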