Search Relevance #888

Open
jaredkhan opened this issue Oct 31, 2024 · 16 comments

@jaredkhan
Collaborator

It’s sometimes hard to tell which search result is the one you’re looking for, because very different nodes can contain the common-name term you searched for. I often find myself searching Wikipedia for a species name first and then coming back to OneZoom with that.

Examples of poor relevance:

Solutions may mean improving the ordering of search results (perhaps taking popularity into account), or making it easier to tell whether a result is the one you intended (perhaps including pictures or major group icons in the search list).

@hyanwong
Member

Useful examples, thanks. Popularity is definitely something we could use here (and I thought we did, actually, but clearly not, or not well enough)

@lentinj
Collaborator

lentinj commented Oct 31, 2024

I was fairly sure popularity figured in the search results as well, FWIW.

@hyanwong
Member

Maybe worth digging into those particular examples to see where we are going wrong?

@davidebbo
Collaborator

I'd swear that this used to give more reasonable ordering, but it really doesn't seem to take popularity into account now.

Comparing the arctic fox (Vulpes lagopus, ott=775766) with the fox moth (Macrothylacia rubi, ott=140039):

https://www.onezoom.org/popularity/list?key=0&otts=775766,140039

  "data": [
    [775766, 260797.77, 318],
    [140039, 148282.78, 618482]
  ]

So the arctic fox is far more popular, but you need to scroll like 4 pages down in the search results to find it.

@davidebbo
Collaborator

I just noticed that we incorrectly have 'Fox' as the vernacular name for the 'Fox moth', so maybe being an exact match gives it an edge.

image

But it still doesn't explain why the arctic fox is far below many less popular taxa.

@jaredkhan
Collaborator Author

Some rough notes on how search ordering works at the moment:

  • Popularity is not used for ordering search results in the tree of life explorer, though the search_nodes API does return popularity data
  • The API does have an option to order by popularity, but it's not in use by the frontend
  • The API uses MySQL full_text matching in boolean mode, so no relevance ordering is introduced there
  • The frontend receives a list of matching nodes and leaves and computes 'overall_search_score's for each based on the type of text match found.
    • Vernacular match is preferred over latin match
    • Vernacular match is very strongly preferred over 'Extra vernacular' matches
    • full string matches are preferred over partial matches
    • partial matches at the end of the string are preferred over matches at the start
    • etc.
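
To make the preferences above concrete, here is a minimal sketch in Python. It is not the actual frontend code: the function names, the weights, and the four match buckets are all assumptions chosen only to reproduce the ordering described in the notes (vernacular over latin over extra vernacular, full over partial, end over start).

```python
from typing import Optional, Sequence

# Hypothetical weights; the real frontend values are not given in this thread.
VERNACULAR_WEIGHT = 1.0
LATIN_WEIGHT = 0.7
EXTRA_VERNACULAR_WEIGHT = 0.2


def text_match_score(query: str, name: Optional[str]) -> float:
    """Score how well `query` matches one name (buckets are illustrative)."""
    if not name:
        return 0.0
    q, n = query.strip().lower(), name.strip().lower()
    if not q:
        return 0.0
    if n == q:
        return 1.0  # full string match
    if n.endswith(" " + q):
        return 0.8  # whole word at the end of the name
    if n.startswith(q + " "):
        return 0.6  # whole word at the start of the name
    if q in n:
        return 0.3  # any other partial match
    return 0.0


def sketch_overall_score(
    query: str,
    vernacular: Optional[str],
    latin: Optional[str],
    extra_vernaculars: Sequence[str] = (),
) -> float:
    """Combine per-name scores, strongly discounting extra vernaculars."""
    extra_best = max(
        (text_match_score(query, ev) for ev in extra_vernaculars), default=0.0
    )
    return max(
        VERNACULAR_WEIGHT * text_match_score(query, vernacular),
        LATIN_WEIGHT * text_match_score(query, latin),
        EXTRA_VERNACULAR_WEIGHT * extra_best,
    )
```

With weights like these, a taxon whose vernacular is exactly 'Fox' always beats 'Arctic fox', and an exact match on an extra vernacular (such as 'chicken' for the red junglefowl) still lands far down the list, whatever the popularity.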

@davidebbo
Collaborator

Thanks for the analysis!

I think that taking popularity into account is going to be necessary here. Otherwise, we'll never get away from 'fox' being a moth, since it's an exact vernacular match. Whereas no actual species of fox is just called 'fox'. Of course, we should fix that moth's vernacular to be 'fox moth', but there are probably many cases of inaccurate vernaculars (e.g. same story for 'butterfly').

@jaredkhan
Collaborator Author

jaredkhan commented Dec 7, 2024

Fox case

What's curious about the results?

  • The top result is a moth
    • Because the vernacular is an exact match, there is no competition as far as the existing code is concerned.
    • The popularity of this moth is 85,295
    • The popularity of 'Foxes' is 165,818
  • Flying foxes are also very high on the list here
    • The name 'fox' occurs after a space and at the end of the species name, so they get a pretty good score from match_score
    • Flying foxes in general have a popularity of 130,550

What's good about the results?

  • 'Foxes' (http://localhost:8000/life/@_ozid=885806) is 2nd in this list and is a very relevant result
    • Pluralised exact matches (which this is) are not scored quite as highly as exact matches without modification. There is perhaps an argument that they should be scored just as highly.
image

And here is the list ordered by popularity instead:
image

@jaredkhan
Collaborator Author

Chicken case

What's curious about the results?

  • A plant is the top result
    • The vernacular is correct
    • Pluralised version of the search term is at the end of the vernacular, so scores fairly highly
    • Popularity of this is 46,608
    • Gallus gallus gets 129,833
    • Gamebirds gets 150,625
  • Red Junglefowl is very far down the list
    • 'chicken' is only an extra vernacular for this, so ranks very low despite being an exact match
  • Gamebirds is also very far down the list, for a similar reason: 'chicken-like birds' is only an extra vernacular and 'chicken' only appears at the start of it and not surrounded by spaces

What's good about these results?

  • Hmm...
image

And here is the list ordered by popularity:
image

@jaredkhan
Collaborator Author

jaredkhan commented Dec 7, 2024

Lavender Case

What's curious about these results?

  • Top result is a 'Sea Lavender'
    • Exact match with search term at the end of the vernacular scores pretty highly
    • This particular one has popularity 42,869
    • English lavender has 49,220
    • French lavender has 48,295
    • No sea lavender has greater than 45,000
  • There are several 'Sea Lavenders'
    • From a cursory Google, 'sea lavender' does seem to be a fair vernacular for many of them

What's good about these results?

  • There are many Lavandula species in the top results!
image

This is the list ordered naively by popularity

  • The Black-tailed Waxbill at the top of the list has a much higher popularity: 116,535
image

@hyanwong
Member

hyanwong commented Dec 8, 2024

Good triaging, thanks. I agree with your assessment. Any suggested fixes would be good.

@jaredkhan
Collaborator Author

Yeah, it's tricky. There might not be any big quick wins here, and we should probably have at least some small, manually chosen test data to get rough search relevance metrics (e.g. at least Mean Reciprocal Rank) when changing this.
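
For reference, Mean Reciprocal Rank averages 1/rank of the first correct result over a set of test queries, counting a query as 0 when the expected result is missing. A minimal sketch follows; the function name and the shape of the test data (one expected result name per query) are assumptions, not an existing OneZoom test harness.

```python
from typing import Dict, List


def mean_reciprocal_rank(
    results_by_query: Dict[str, List[str]],
    expected_by_query: Dict[str, str],
) -> float:
    """Average of 1/rank of the expected result; 0 if it is not returned."""
    if not expected_by_query:
        raise ValueError("need at least one test query")
    total = 0.0
    for query, expected in expected_by_query.items():
        ranked = results_by_query.get(query, [])
        for position, item in enumerate(ranked, start=1):
            if item == expected:
                total += 1.0 / position
                break
    return total / len(expected_by_query)
```

A hand-picked set might say that 'fox' should return 'Foxes' and 'chicken' should return 'Red junglefowl'. Which result counts as correct for each query is itself a judgment call that would need agreeing on.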

One thought would be to keep the match_score approach, where full matches score better than partial, match at the end scores better than match at the start, etc. with a couple of modifications:

  • allow punctuation after/before the match (e.g. "Grey wolf (and domestic dog)" should still count as matching 'dog' at the end and score highly for that)
  • when checking that a word is delimited rather than being part of a larger word, punctuation should be allowed as a delimiter (e.g. "Chicken-like" should score as highly for 'chicken' as 'chicken like')
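
A rough sketch of those two modifications, using a regular expression that treats any non-word character (spaces, hyphens, brackets) as a delimiter. The function name and the 'full' / 'end' / 'start' / 'middle' labels are made up for illustration and do not correspond to the existing match_score code.

```python
import re
from typing import Optional

# Lower number = better kind of match (illustrative ordering).
_ORDER = {"full": 0, "end": 1, "start": 2, "middle": 3}


def delimited_match(query: str, name: str) -> Optional[str]:
    """Classify where `query` occurs in `name` as a whole word.

    Punctuation counts as a delimiter, and punctuation before or after
    the match is ignored when deciding whether it is at the start or end.
    """
    query = query.strip()
    if not query:
        return None
    pattern = re.compile(
        r"(?<!\w)" + re.escape(query) + r"(?!\w)", re.IGNORECASE
    )
    best: Optional[str] = None
    for m in pattern.finditer(name):
        # Are there any letters or digits before / after the match?
        before = re.search(r"\w", name[: m.start()]) is not None
        after = re.search(r"\w", name[m.end():]) is not None
        if not before and not after:
            return "full"
        kind = "end" if not after else "start" if not before else "middle"
        if best is None or _ORDER[kind] < _ORDER[best]:
            best = kind
    return best
```

Under this sketch, 'dog' in "Grey wolf (and domestic dog)" counts as a match at the end, and 'chicken' in "Chicken-like birds" counts as a delimited match at the start, while 'chicken' in a hypothetical "chickenpox" does not match at all.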

and then from there:

  • no longer downgrade extra vernaculars so much, if at all
  • no longer downgrade plural matches
  • sort first by the string match score, and then by popularity. That is: stronger string match takes precedence over higher popularity, but popularity is involved.
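
The last point amounts to sorting on a two-part key. A minimal sketch is below; the `Hit` type and the match scores are invented, while the popularity figures are the ones quoted earlier in this thread.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    name: str
    match_score: float  # string match quality, higher is better
    popularity: float   # OneZoom popularity, higher is better


def rank_hits(hits: List[Hit]) -> List[Hit]:
    """Stronger string match first; popularity only breaks ties."""
    return sorted(hits, key=lambda h: (-h.match_score, -h.popularity))


hits = [
    Hit("Fox moth", 1.0, 85295),
    Hit("Flying foxes", 0.8, 130550),
    Hit("Foxes", 1.0, 165818),  # plural no longer downgraded
]
ranked_names = [h.name for h in rank_hits(hits)]
```

Note that popularity only has an effect here when match scores tie exactly, so this approach depends on the string score falling into a small number of discrete levels.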

Taking a rough swing at that (without any objective metrics), here's what the results look like for some of the cases above:

image image image image image image

@jaredkhan
Collaborator Author

Notes about that:

  • Top result for lavender is a crayfish
  • Top result for 'zebra' is horses in general, which isn't ideal; I guess horses are popular.
  • butterfly results are rubbish
  • 2nd result for 'cat' is a fish

But on the bright side:

  • Red junglefowl is then the top result for chicken
  • Grey wolf is then the top result for dog

But overall not very impressive results. I think maybe we won't get much further without some form of rating of vernaculars.

@jaredkhan
Collaborator Author

There are definitely some suspicious vernaculars floating around:

@davidebbo
Collaborator

Maybe a side issue, but searching for 'Broad-leaved lavender' doesn't find anything at all. I think there may be a general issue with searching for names that contain a dash. But we should open a separate issue for that.

@wolfmanstout
Contributor

I do think that popularity needs to be part of the formula. I have a very simple branch I created a while ago that integrates popularity:
https://github.com/wolfmanstout/OZtree/tree/popularity_ranking

In a nutshell, I adjust the search ranking using a popularity factor of 0.5 to 1.5 for each species based on its popularity rank, so that the average adjustment is 1 when averaged over all species. My thinking was to try to avoid bias towards species, but it doesn't really work because half the species do get a bump over the higher taxa, so species will still dominate the top K results for any small K. I don't think it would really fix it to apply an average popularity to taxa either, because any large taxa would be dominated by relatively unpopular species.
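
As an illustration of the idea only (this is not the code on the linked branch), one way to get a factor between 0.5 and 1.5 that averages to 1 is to interpolate linearly on popularity rank. The function name and the linear form are assumptions.

```python
def popularity_factor(rank: int, n_species: int) -> float:
    """Map a popularity rank to a multiplier in [0.5, 1.5].

    rank 0 is the most popular species, n_species - 1 the least popular.
    Linear interpolation is an assumption made for this sketch.
    """
    if n_species < 2:
        return 1.0
    if not 0 <= rank < n_species:
        raise ValueError("rank out of range")
    return 1.5 - rank / (n_species - 1)


# The adjusted score would then be: match_score * popularity_factor(...)
factors = [popularity_factor(r, 5) for r in range(5)]
```

Because the ranks are spread evenly, the factors average exactly 1 across all species, which is the property described above. It also shows the drawback: the more popular half of species get a factor above 1, while a higher taxon with no factor applied stays at 1.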

Maybe the right approach, as I think James Rosindell may have been alluding to this morning, is to sum the popularity of species for any higher taxa. I'm not sure offhand whether a simple sum would work, or whether we would need a more sophisticated formula that sums the inputs to popularity (I would have to take a closer look at how popularity is calculated, e.g. to avoid double-counting).
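
A sketch of the simple-sum variant, assuming the tree is available as a parent-to-children mapping and that popularity is stored per leaf. Both data structures and the toy numbers are hypothetical, and this does nothing about the double-counting concern raised above.

```python
from typing import Dict, Hashable, List


def summed_popularity(
    children: Dict[Hashable, List[Hashable]],
    leaf_popularity: Dict[Hashable, float],
    root: Hashable,
) -> Dict[Hashable, float]:
    """Return, for every node under `root`, the sum of its leaves' popularity.

    Uses an explicit stack instead of recursion, since the tree can be deep.
    """
    totals: Dict[Hashable, float] = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # Leaf (species): take its own popularity.
            totals[node] = leaf_popularity.get(node, 0.0)
        elif expanded:
            # All children are done, so their totals can be summed.
            totals[node] = sum(totals[k] for k in kids)
        else:
            # Revisit this node after its children have been processed.
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return totals


# Toy tree with made-up popularity values.
toy_children = {
    "Canidae": ["Vulpes", "Canis lupus"],
    "Vulpes": ["Vulpes lagopus", "Vulpes vulpes"],
}
toy_popularity = {"Vulpes lagopus": 3.0, "Vulpes vulpes": 2.0, "Canis lupus": 5.0}
toy_totals = summed_popularity(toy_children, toy_popularity, "Canidae")
```

A plain sum strongly favours large groups, so some normalisation might still be needed before mixing these totals with species-level scores.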
