Separate/together way of writing and synonyms aren't recognized #31

e-orlov · 2021-11-15T10:33:58Z

Keywords "trinkwasser test", "trinkwassertest" and "analyse trinkwasser" aren't clustered at all.

MaartenGr · 2021-11-15T11:54:20Z

Which version of PolyFuzz are you using? Also, could you create a reproducible example? Since PolyFuzz can use many models, without any code it is difficult to see what is happening in your use case.

e-orlov · 2021-11-15T11:59:22Z

I'm using TF-IDF, as implemented in https://share.streamlit.io/charlywargnier/keyword-clustering-app/main/app.py / https://github.com/searchsolved/search-solved-public-seo/blob/main/Keyword_Clustering_Tool/Keyword_Clustering_Tool_V2.ipynb (code block 12)

Keywords are here: https://docs.google.com/spreadsheets/d/1nkiFNO8JadbaFcL7BvYKCLNPYPB5ILJwk2K__2DOzdc/edit?usp=sharing

Maybe PolyFuzz is not the right tool for this. To put "trinkwasser test" and "trinkwassertest" into the same cluster, the words of each keyword would have to be permuted and the permutations compared to find the minimal Levenshtein distance. But for "trinkwasser test" and "analyse trinkwasser" there would probably need to be a "real" synonym search, maybe even one based on a synonym vocabulary...
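The permutation idea can be sketched with the standard library alone. This is only an illustration, not something PolyFuzz provides: the helper names `levenshtein`, `joined_permutations` and `min_permutation_distance` are made up for this example, and joining the tokens without spaces is an assumption about how the separate/together spellings should be compared.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def joined_permutations(keyword: str) -> set:
    """All orderings of the keyword's words, glued together without spaces."""
    tokens = keyword.lower().split()
    return {"".join(ordering) for ordering in permutations(tokens)}


def min_permutation_distance(a: str, b: str) -> int:
    """Smallest edit distance between any word ordering of a and of b."""
    return min(
        levenshtein(x, y)
        for x in joined_permutations(a)
        for y in joined_permutations(b)
    )


print(min_permutation_distance("trinkwasser test", "trinkwassertest"))
print(min_permutation_distance("trinkwasser test", "analyse trinkwasser"))
```

The first pair collapses to a distance of zero, while the second pair stays apart because "test" and "analyse" share few characters, which is why that case needs synonym knowledge rather than string distance. The number of permutations grows factorially with the number of words, so this only makes sense for short keywords.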

MaartenGr · 2021-11-15T12:30:01Z

Let me start by saying that I cannot give much support for that tool specifically, as I did not create it. Having said that, I did try it out with PolyFuzz directly, and it seems that "trinkwasser test" gets grouped with "trinkwassertest" but not with "analyse trinkwasser". Most likely, under TF-IDF "analyse trinkwasser" is simply not similar enough to the other two. You can try to circumvent this issue by using a different technique than TF-IDF, since TF-IDF tries to mirror Levenshtein distance.
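To get a feel for why the third keyword falls out, here is a rough standard-library sketch of character n-gram cosine similarity. It is a simplification and not PolyFuzz's actual implementation: it uses raw trigram counts without any IDF weighting, and the trigram size and the function names `char_ngrams` and `cosine_similarity` are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, spaces included."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(count * count for count in grams_a.values()))
    norm_b = sqrt(sum(count * count for count in grams_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm_a * norm_b)


keywords = ["trinkwassertest", "analyse trinkwasser"]
for keyword in keywords:
    score = cosine_similarity("trinkwasser test", keyword)
    print(f"trinkwasser test vs {keyword}: {score:.2f}")
```

In this simplified version, "trinkwasser test" and "trinkwassertest" score about 0.82, while "trinkwasser test" and "analyse trinkwasser" score about 0.58, because the extra word adds many trigrams that the other keyword does not contain. Whether a pair ends up in the same group then depends on the similarity threshold that is applied.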

You can implement or use any distance measure in PolyFuzz that you would like. However, if you are looking for semantic similarity and not so much string similarity, then I would advise going for embedding-based methods such as BERT models, sentence-transformers, Hugging Face, or Flair.

You can find more information about that here and here.
