Search refactor #2139

Merged · 15 commits merged into main from search-refactor on May 31, 2022
Conversation

mouse-reeve (Member) commented on May 30, 2022

This makes all remote searches complete within a maximum number of seconds defined by SEARCH_TIMEOUT (default: 8 seconds). It produces much better search results, much more quickly.

Fixes #2051

Remaining work:

  • Filter Inventaire search results to remove results below the min_confidence threshold
  • Re-implement return_first to instead return the best option from all the results
  • Combine the connector formatter, processor, and parser into one function
  • Reconsider priority fields on connectors
  • Unit tests
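
Conceptually, the refactor fires every connector's request at once and caps the whole batch with SEARCH_TIMEOUT. A minimal sketch of that shape, assuming asyncio/aiohttp; `get_search_url` and `process_search_response` here are stand-ins for whatever the connectors actually expose, not the exact signatures in this PR:

```python
import asyncio

import aiohttp

SEARCH_TIMEOUT = 8  # overall cap, in seconds, for all remote searches


async def query_connector(session, connector, query):
    """Run one remote search; returns a list of results, or None on a bad status."""
    url = connector.get_search_url(query)
    async with session.get(url) as response:
        if response.status != 200:
            return None
        return connector.process_search_response(await response.json())


async def search_all(connectors, query):
    """Fire every connector's request at once; anything that hasn't answered
    within SEARCH_TIMEOUT (or that errored out) is dropped."""
    timeout = aiohttp.ClientTimeout(total=SEARCH_TIMEOUT)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [
            asyncio.create_task(query_connector(session, connector, query))
            for connector in connectors
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in responses if r and not isinstance(r, Exception)]
```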

This is the untested first pass at re-arranging remote search to work in
parallel rather than in sequence. It moves a couple of functions around
(raise_not_valid_url, for example, needs to live in connector_manager.py
now to avoid circular imports). It adds a function to Connector objects
that generates the search request (to either the isbn endpoint or the
free text endpoint) based on the query, which was previously done as
part of the search.

I also lowered the timeout to 8 seconds by default.
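
A sketch of that per-connector request builder; the method name and the `search_url`/`isbn_search_url` fields are illustrative, not necessarily the exact names in the codebase:

```python
import re
from urllib.parse import quote_plus


def maybe_isbn(query):
    """Crude check: does the query look like a bare ISBN-10/13?"""
    stripped = re.sub(r"[\s-]", "", query)
    return len(stripped) in (10, 13) and stripped[:-1].isdigit()


class Connector:
    """Illustrative stand-in for a connector with two search endpoints."""

    def __init__(self, search_url, isbn_search_url=None):
        self.search_url = search_url            # e.g. "https://example.net/search?q="
        self.isbn_search_url = isbn_search_url  # e.g. "https://example.net/isbn/"

    def get_search_url(self, query):
        """Pick the isbn endpoint or the free-text endpoint based on the query."""
        if self.isbn_search_url and maybe_isbn(query):
            return f"{self.isbn_search_url}{query}"
        return f"{self.search_url}{quote_plus(query)}"
```
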
Instead of having individual search functions that make individual
requests, the connectors will always be searched asynchronously
together. The process_search_response combines the parse and format
functions, which could probably be merged into one overridable
function.

The current to-do on this is to remove Inventaire search results that
are below the confidence threshold after search, which used to happen
in the `search` function.
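
Something like this post-search filter would cover it, assuming each result object carries a confidence attribute (illustrative, not the final code):

```python
def filter_by_confidence(results, min_confidence):
    """Drop results whose confidence is known and falls below the threshold."""
    return [
        result
        for result in results
        if getattr(result, "confidence", None) is None
        or result.confidence >= min_confidence
    ]
```
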
The database lookup doesn't work during the async process, so this change
loops through the connectors and grabs the formatted urls before sending
them to the async handler.
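
In other words, the database-dependent work happens up front and the async code only sees plain data. A sketch, with `get_search_url` and the async handler as assumed names:

```python
import asyncio


def search(query, connectors, async_handler):
    """Resolve everything that needs the database (the connector list and each
    connector's formatted search URL) synchronously, then hand plain
    (connector, url) pairs to the async fan-out."""
    items = [(connector, connector.get_search_url(query)) for connector in connectors]
    return asyncio.run(async_handler(items))
```
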
This sends out the request tasks all at once and then aggregates the
results, instead of just running them one after another asynchronously.
Adds logging and error handling for some of the numerous ways a request
could fail (the remote site is down, the url is blocked, etc.).
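
A sketch of what that per-request error handling might look like with aiohttp; the helper name and log messages are illustrative:

```python
import asyncio
import logging

import aiohttp

logger = logging.getLogger(__name__)


async def get_results(session, url, connector):
    """One request task; the common failure modes log and return None rather
    than breaking the whole batch."""
    try:
        async with session.get(url, headers={"Accept": "application/json"}) as response:
            if response.status != 200:
                logger.info("Unsuccessful response (%s) from %s", response.status, url)
                return None
            try:
                raw_data = await response.json()
            except aiohttp.ContentTypeError:
                logger.info("Unexpected response format from %s", url)
                return None
            return connector.process_search_response(raw_data)
    except asyncio.TimeoutError:
        logger.info("Connection timed out: %s", url)
    except aiohttp.ClientError as err:
        logger.info("Request failed for %s: %s", url, err)
    return None
```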

I also have the results boxes open by default, which makes it more
legible imo.

The parser was extracting the list of search results from the json
object returned by the search endpoint, and the formatter was converting
an individual json entry into a SearchResult object. This merges
them into one function, because they are never used separately.
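
An illustrative shape for the merged, per-connector overridable step (the payload fields here are made up; each real connector knows its own response format):

```python
class ExampleConnector:
    """Illustrative connector: one overridable method replaces the separate
    parser (list extraction) and formatter (per-entry conversion)."""

    def process_search_response(self, data):
        """Pull the entries out of the raw json payload and convert each one
        into a result dict in a single pass."""
        return [
            {
                "title": entry.get("title"),
                "key": entry.get("key"),
                "confidence": entry.get("_score"),
            }
            for entry in data.get("docs", [])  # payload shape is connector-specific
        ]
```
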
Since we get all the results quickly now, this aggregates all the
results that came back, sorts them by confidence, and returns the
highest-confidence result. The confidences aren't great on free text
search, but conceptually that's how it should work, at least.
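
A sketch of that aggregation step; SearchResult here is a stand-in dataclass, not the real model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """Stand-in result object; only the confidence field matters here."""
    title: str
    confidence: Optional[float] = None


def best_result(result_lists):
    """Flatten every connector's results, sort by confidence, keep the best one."""
    flattened = [result for results in result_lists for result in results]
    flattened.sort(key=lambda result: result.confidence or 0, reverse=True)
    return flattened[0] if flattened else None


# e.g. best_result([[SearchResult("A", 0.9)], [SearchResult("B", 0.99)]]) picks "B"
```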

It may make sense to aggregate the search results in all contexts, but
I'll propose that in a separate PR.

By default, OpenLibrary and Inventaire were prioritized below other
BookWyrm nodes. In practice, people have gotten better search results
from these connectors, hence the change. With the search refactor, this
has much less impact, but it will show these search results higher in
the list.

If the results page shows all the connectors' results integrated, this
field should be removed entirely.

mouse-reeve marked this pull request as ready for review on May 31, 2022, 16:47
mouse-reeve merged commit 355e703 into main on May 31, 2022
mouse-reeve deleted the search-refactor branch on May 31, 2022, 17:22
Merging this pull request may close: Remote search needs to be refactored