Search Relevance #888
Comments
Useful examples, thanks. Popularity is definitely something we could use here (and I thought we did, actually, but clearly not, or not well enough) |
I was fairly sure popularity figured in the search results as well, FWIW. |
Maybe worth digging into those particular examples to see where we are going wrong? |
I'd swear that this used to give more reasonable ordering, but it really doesn't seem to take popularity into account now. Comparing the arctic fox (Vulpes lagopus, ott=775766) with the fox moth (Macrothylacia rubi, ott=140039): https://www.onezoom.org/popularity/list?key=0&otts=775766,140039
"data": [
[775766, 260797.77, 318],
[140039, 148282.78, 618482]
]
So the arctic fox is far more popular, but you need to scroll about four pages down in the search results to find it. |
Some rough notes on how search ordering works at the moment:
|
Thanks for the analysis! I think that taking popularity into account is going to be necessary here. Otherwise, we'll never get away from 'fox' being a moth, since it's an exact vernacular match. Whereas no actual species of fox is just called 'fox'. Of course, we should fix that moth's vernacular to be 'fox moth', but there are probably many cases of inaccurate vernaculars (e.g. same story for 'butterfly'). |
Fox case
What's curious about the results?
What's good about the results?
|
Good triaging, thanks. I agree with your assessment. Any suggested fixes would be good. |
Yeah it's tricky. Might not be tremendous quick wins here, and probably should have at least some small, manually-chosen test data to get some rough search relevance metrics (e.g. at least Mean Reciprocal Rank) when changing this. One thought would be to keep the match_score approach, where full matches score better than partial, match at the end scores better than match at the start, etc. with a couple of modifications:
and then from there:
Taking a rough swing at that (without any objective metrics), here's what the results look like for some of the cases above: |
Notes about that:
But overall not very impressive results. I think maybe we won't get much further without some form of rating of vernaculars. |
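For the Mean Reciprocal Rank metric suggested above, here is a minimal sketch of how it could be computed over a small hand-picked test set. The function name, the query strings, and the rankings are all hypothetical; only the two OTT ids come from this thread, and the "wanted" result for each query is an assumption made for illustration.

```python
from typing import Dict, Sequence


def mean_reciprocal_rank(
    results: Dict[str, Sequence[int]], expected: Dict[str, int]
) -> float:
    """Average of 1/rank of the wanted hit over all test queries.

    results:  query -> ranked list of OTT ids returned by search
    expected: query -> the single OTT id we hoped to see first
    """
    if not expected:
        return 0.0
    total = 0.0
    for query, wanted in expected.items():
        ranked = results.get(query, [])
        # A query whose wanted hit is missing contributes 0.
        for position, ott in enumerate(ranked, start=1):
            if ott == wanted:
                total += 1.0 / position
                break
    return total / len(expected)


# Hypothetical test data: 775766 = arctic fox, 140039 = fox moth.
example_results = {"fox": [140039, 775766], "arctic fox": [775766]}
example_expected = {"fox": 775766, "arctic fox": 775766}
mrr = mean_reciprocal_rank(example_results, example_expected)
print(mrr)
```

Running the same test set before and after a ranking change would give a rough, comparable number rather than an eyeball judgement.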
There are definitely some suspicious vernaculars floating around:
|
Maybe a side issue, but searching for 'Broad-leaved lavender' doesn't find anything at all. I think there may be a general issue with searching for names that contain a dash. But we should open a separate issue for that. |
I do think that popularity needs to be part of the formula. I have a very simple branch I created a while ago that integrates popularity: In a nutshell, I adjust the search ranking using a popularity factor of 0.5 to 1.5 for each species based on its popularity rank, so that the adjustment is 1 when averaged over all species.
My thinking was to try to avoid bias towards species, but it doesn't really work: half the species still get a bump over the higher taxa, so species will still dominate the top K results for any small K. I don't think applying an average popularity to higher taxa would really fix it either, because the average for any large taxon would be dominated by relatively unpopular species.
Maybe the right approach, as I think James Rosindell may have been alluding to this morning, is to sum the popularity of species for any higher taxa. I'm not sure offhand whether a simple sum would work, or whether we would need a more sophisticated formula that sums the inputs to popularity (I would have to take a closer look at how popularity is calculated, e.g. to avoid double-counting). |
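To make the two ideas in this comment concrete, here is a rough sketch, not the code from the branch. It assumes the second column of the popularity output quoted earlier is a raw popularity score, maps popularity rank linearly onto the 0.5 to 1.5 range, and shows a naive sum over a higher taxon. All function names and the `hypothetical_clade` node are invented for illustration.

```python
from typing import Dict, List


def popularity_factors(
    popularity: Dict[int, float], low: float = 0.5, high: float = 1.5
) -> Dict[int, float]:
    """Map each species' popularity rank linearly onto [low, high].

    The least popular species gets `low`, the most popular gets `high`,
    so the mean factor is (low + high) / 2, i.e. 1 with the defaults.
    Ties are broken arbitrarily by sort order.
    """
    ordered = sorted(popularity, key=lambda ott: popularity[ott])
    n = len(ordered)
    if n == 1:
        return {ordered[0]: (low + high) / 2}
    return {
        ott: low + (high - low) * i / (n - 1) for i, ott in enumerate(ordered)
    }


def summed_popularity(
    children: Dict[str, List], leaf_popularity: Dict[int, float], root
) -> float:
    """Naive sum of species popularity below `root` (no double-count fix)."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        if node in leaf_popularity:
            total += leaf_popularity[node]
        stack.extend(children.get(node, []))
    return total


# Scores taken from the popularity output quoted in this thread.
scores = {775766: 260797.77, 140039: 148282.78}
factors = popularity_factors(scores)

# Hypothetical tree: one made-up higher taxon containing both species.
tree = {"hypothetical_clade": [775766, 140039]}
clade_total = summed_popularity(tree, scores, "hypothetical_clade")
print(factors, clade_total)
```

With only these two species the arctic fox gets the top factor and the fox moth the bottom one; an adjusted ranking would then multiply each match score by its factor. Whether a plain sum is a sensible popularity for a higher taxon is exactly the open question raised above.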
It’s sometimes hard to tell which search result is the one you’re looking for, because very different nodes can contain the common-name term you searched for. I often find myself searching Wikipedia for a species name first and then coming back to OneZoom with that.
Examples of poor relevance:
Solutions may mean improving the ordering of search results (perhaps taking popularity into account), or making it easier to tell whether a result is the one you intended (perhaps by including pictures or major group icons in the search list).