-
-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Search refactor #2139
Merged
Merged
Search refactor #2139
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is the untest first pass at re-arranging remote search to work in parallel rather than sequence. It moves a couple functions around (raise_not_valid_url, for example, needs to be in connector_manager.py now to avoid circular imports). It adds a function to Connector objects that generates a search result (either to the isbn endpoint or the free text endpoint) based on the query, which was previously done as part of the search. I also lowered the timeout to 8 seconds by default.
Instead of having individual search functions that make individual requests, the connectors will always be searched asynchronously together. The process_seach_response combines the parse and format functions, which could probably be merged into one over-rideable function. The current to-do on this is to remove Inventaire search results that are below the confidence threshhold after search, which used to happen in the `search` function.
The database lookup doesn't work during the asyn process, so this change loops through the connectors and grabs the formatted urls before sending it to the async handler.
Gotta ask for json
This sends out the request tasks all at once and then aggregates the results, instead of just running them one after another asynchronously.
Adds logging and error handling for some of the numerous ways a request could fail (the remote site is down, the url is blocked, etc). I also have the results boxes open by default, which makes it more legible imo.
The parser was extracting the list of search results from the json object returned by the search endpoint, and the formatter was converting an individual json entry into a SearchResult object. This just merged them into one function, because they are never used separately.
Since we get all the results quickly now, this aggregates all the results that came back and sorts them by confidence, and returns the highest confidence result. The confidences aren't great on free text search, but conceptually that's how it should work at least. It may make sense to aggregate the search results in all contexts, but I'll propose that in a separate PR.
By default, OpenLibrary and Inventaire were prioritzed below other BookWyrm nodes. In practice, people have gotten better search results from these connectors, hence the change. With the search refactor, this has much less impact, but it will show these search results higher in the list. If the results page shows all the connectors' results integrated, this field should be removed entirely.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This will make all remote searches take a maximum seconds defined by
SEARCH_TIMEOUT
(default, 8 seconds). It produces much better search results much more quickly.Fixes #2051
Remaining work:
min_confidence
thresholdreturn_first
to instead return the best option from all the resultspriority
fields on connectors