Search refactor #2139

Merged · 15 commits merged into main from search-refactor on May 31, 2022
Conversation

mouse-reeve (Member) commented on May 30, 2022

This makes all remote searches complete within a maximum number of seconds defined by SEARCH_TIMEOUT (default: 8 seconds). It produces much better search results, much more quickly.

Fixes #2051

Remaining work:

  • Filter Inventaire search results to remove results below the min_confidence threshold
  • Re-implement return_first to instead return the best option from all the results
  • Combine the connector formatter, processor, and parser into one function
  • Reconsider priority fields on connectors
  • Unit tests
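
Conceptually, the refactor fires every connector's request at once and caps the whole batch with SEARCH_TIMEOUT. A minimal sketch of that shape, assuming asyncio/aiohttp; `get_search_url` and `process_search_response` here are stand-ins for whatever the connectors actually expose, not the exact signatures in this PR:

```python
import asyncio

import aiohttp

SEARCH_TIMEOUT = 8  # overall cap, in seconds, for all remote searches


async def query_connector(session, connector, query):
    """Run one remote search; returns a list of results, or None on a bad status."""
    url = connector.get_search_url(query)
    async with session.get(url) as response:
        if response.status != 200:
            return None
        return connector.process_search_response(await response.json())


async def search_all(connectors, query):
    """Fire every connector's request at once; anything that hasn't answered
    within SEARCH_TIMEOUT (or that errored out) is dropped."""
    timeout = aiohttp.ClientTimeout(total=SEARCH_TIMEOUT)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [
            asyncio.create_task(query_connector(session, connector, query))
            for connector in connectors
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in responses if r and not isinstance(r, Exception)]
```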

This is the untested first pass at re-arranging remote search to work in
parallel rather than in sequence. It moves a couple of functions around
(raise_not_valid_url, for example, needs to live in connector_manager.py
now to avoid circular imports). It adds a function to Connector objects
that generates the search request (to either the isbn endpoint or the
free text endpoint) based on the query, which was previously done as
part of the search.

I also lowered the timeout to 8 seconds by default.
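
A sketch of that per-connector request builder; the method name and the `search_url`/`isbn_search_url` fields are illustrative, not necessarily the exact names in the codebase:

```python
import re
from urllib.parse import quote_plus


def maybe_isbn(query):
    """Crude check: does the query look like a bare ISBN-10/13?"""
    stripped = re.sub(r"[\s-]", "", query)
    return len(stripped) in (10, 13) and stripped[:-1].isdigit()


class Connector:
    """Illustrative stand-in for a connector with two search endpoints."""

    def __init__(self, search_url, isbn_search_url=None):
        self.search_url = search_url            # e.g. "https://example.net/search?q="
        self.isbn_search_url = isbn_search_url  # e.g. "https://example.net/isbn/"

    def get_search_url(self, query):
        """Pick the isbn endpoint or the free-text endpoint based on the query."""
        if self.isbn_search_url and maybe_isbn(query):
            return f"{self.isbn_search_url}{query}"
        return f"{self.search_url}{quote_plus(query)}"
```
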
Instead of having individual search functions that make individual
requests, the connectors will always be searched asynchronously
together. The process_search_response combines the parse and format
functions, which could probably be merged into one overridable
function.

The current to-do on this is to remove Inventaire search results that
are below the confidence threshold after search, which used to happen
in the `search` function.
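
Something like this post-search filter would cover it, assuming each result object carries a confidence attribute (illustrative, not the final code):

```python
def filter_by_confidence(results, min_confidence):
    """Drop results whose confidence is known and falls below the threshold."""
    return [
        result
        for result in results
        if getattr(result, "confidence", None) is None
        or result.confidence >= min_confidence
    ]
```
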
The database lookup doesn't work during the async process, so this change
loops through the connectors and grabs the formatted urls before sending
them to the async handler.
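
In other words, the database-dependent work happens up front and the async code only sees plain data. A sketch, with `get_search_url` and the async handler as assumed names:

```python
import asyncio


def search(query, connectors, async_handler):
    """Resolve everything that needs the database (the connector list and each
    connector's formatted search URL) synchronously, then hand plain
    (connector, url) pairs to the async fan-out."""
    items = [(connector, connector.get_search_url(query)) for connector in connectors]
    return asyncio.run(async_handler(items))
```
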
This sends out the request tasks all at once and then aggregates the
results, instead of just running them one after another asynchronously.
Adds logging and error handling for some of the numerous ways a request
could fail (the remote site is down, the url is blocked, etc.).
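
A sketch of what that per-request error handling might look like with aiohttp; the helper name and log messages are illustrative:

```python
import asyncio
import logging

import aiohttp

logger = logging.getLogger(__name__)


async def get_results(session, url, connector):
    """One request task; the common failure modes log and return None rather
    than breaking the whole batch."""
    try:
        async with session.get(url, headers={"Accept": "application/json"}) as response:
            if response.status != 200:
                logger.info("Unsuccessful response (%s) from %s", response.status, url)
                return None
            try:
                raw_data = await response.json()
            except aiohttp.ContentTypeError:
                logger.info("Unexpected response format from %s", url)
                return None
            return connector.process_search_response(raw_data)
    except asyncio.TimeoutError:
        logger.info("Connection timed out: %s", url)
    except aiohttp.ClientError as err:
        logger.info("Request failed for %s: %s", url, err)
    return None
```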

I also have the results boxes open by default, which makes it more
legible imo.

The parser was extracting the list of search results from the json
object returned by the search endpoint, and the formatter was converting
an individual json entry into a SearchResult object. This merges
them into one function, because they are never used separately.
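
An illustrative shape for the merged, per-connector overridable step (the payload fields here are made up; each real connector knows its own response format):

```python
class ExampleConnector:
    """Illustrative connector: one overridable method replaces the separate
    parser (list extraction) and formatter (per-entry conversion)."""

    def process_search_response(self, data):
        """Pull the entries out of the raw json payload and convert each one
        into a result dict in a single pass."""
        return [
            {
                "title": entry.get("title"),
                "key": entry.get("key"),
                "confidence": entry.get("_score"),
            }
            for entry in data.get("docs", [])  # payload shape is connector-specific
        ]
```
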
Since we get all the results quickly now, this aggregates all the
results that came back, sorts them by confidence, and returns the
highest-confidence result. The confidences aren't great on free text
search, but conceptually that's how it should work, at least.
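
A sketch of that aggregation step; SearchResult here is a stand-in dataclass, not the real model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """Stand-in result object; only the confidence field matters here."""
    title: str
    confidence: Optional[float] = None


def best_result(result_lists):
    """Flatten every connector's results, sort by confidence, keep the best one."""
    flattened = [result for results in result_lists for result in results]
    flattened.sort(key=lambda result: result.confidence or 0, reverse=True)
    return flattened[0] if flattened else None


# e.g. best_result([[SearchResult("A", 0.9)], [SearchResult("B", 0.99)]]) picks "B"
```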

It may make sense to aggregate the search results in all contexts, but
I'll propose that in a separate PR.

By default, OpenLibrary and Inventaire were prioritized below other
BookWyrm nodes. In practice, people have gotten better search results
from these connectors, hence the change. With the search refactor, this
has much less impact, but it will show these search results higher in
the list.

If the results page shows all the connectors' results integrated, this
field should be removed entirely.

mouse-reeve marked this pull request as ready for review on May 31, 2022, 16:47
mouse-reeve merged commit 355e703 into main on May 31, 2022
mouse-reeve deleted the search-refactor branch on May 31, 2022, 17:22
Merging this pull request may close: Remote search needs to be refactored