-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate/together way of writing and synonymes aren't recognized #31
Comments
Which version of PolyFuzz are you using? Also, could you create a reproducible example? Since PolyFuzz can use many models, without any code it is difficult to see what is happening in your use case. |
I'm using IF-IDF, implemented under https://share.streamlit.io/charlywargnier/keyword-clustering-app/main/app.py / https://github.com/searchsolved/search-solved-public-seo/blob/main/Keyword_Clustering_Tool/Keyword_Clustering_Tool_V2.ipynb (codeblock 12) Keywords are here: https://docs.google.com/spreadsheets/d/1nkiFNO8JadbaFcL7BvYKCLNPYPB5ILJwk2K__2DOzdc/edit?usp=sharing Maybe PolyFuzz is not a right tool for this. To catch "trinkwasser test" and "trinkwassertest" into the same cluster, keywords must be permutated and then searched for a minimal Levenshteyn between permutations. But for "trinkwasser test" and "analyse trinkwasser" the should be probably any "real" synonyme search, maybe even based on a synonym vocabulary... |
Let me start by saying that I cannot give much support for that tool specifically as I did not create it. Having said that, I did try it out with PolyFuzz directly and it seems that You can implement or use any distance measure in PolyFuzz that you would like. However, if you are looking at semantic similarity and not such much string similarity, then I would advise going for embedding-based methods such as BERT models, sentence-transformers, Hugging Face, or Flair. |
Keywords "trinkwasser test", "trinkwassertest" and "analyse trinkwasser" aren't clustered at all.
The text was updated successfully, but these errors were encountered: