find API: add experimental spell checking #1459

Merged: andersju merged 4 commits into develop from feature/lws-87-spell-check on Aug 2, 2024

@andersju andersju commented Aug 1, 2024

A first stab at adding spell checking, to be used by Libris sök. Uses Elasticsearch's phrase suggester with two generators as described in the ES docs.

Uses _sortKeyByLang.sv as the field to get suggestions from, as it seems like a reasonable choice; with e.g. _all we'd get lots of "bad" suggestions since it contains lots of oft-repeated vocab stuff. (sv vs en shouldn't matter for these purposes.)

Quick testing on the dev data is promising. We might want to tinker with both the suggester configuration/query and the field(s) to target. This is just something to start with.

With /find?q=foobar&_spell=true the spell checking is done in addition to the regular query and returned along with the usual results.

With /find?q=foobar&_spell=only only the spell checking query is performed.

(Possibly we want to keep it separate from /find though? 🤔 ...or have it both in /find, for internal use, and a bibspell-like thing (returning "bibspell-compatible" results would of course be trivial))
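
To make the two modes concrete, here is a minimal Python sketch of how the `_spell` parameter could be interpreted. It is not the code in this PR; `SpellMode` and `parse_spell_mode` are hypothetical names, and treating any other value as "off" is an assumption.

```python
from enum import Enum


class SpellMode(Enum):
    OFF = "off"    # regular search only (assumed default)
    BOTH = "both"  # regular search plus spell checking (_spell=true)
    ONLY = "only"  # only the spell checking query (_spell=only)


def parse_spell_mode(params: dict) -> SpellMode:
    """Map the _spell query parameter to a mode.

    Any value other than "true" or "only" is treated as off (an assumption).
    """
    value = params.get("_spell", "")
    if value == "true":
        return SpellMode.BOTH
    if value == "only":
        return SpellMode.ONLY
    return SpellMode.OFF
```

With this split, a handler would run the regular query unless the mode is `ONLY`, and the suggester query unless the mode is `OFF`.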

~ curl -s "http://localhost:8180/find?q=dynimical%20systems&_spell=true"|jq '._spell'
[
  {
    "text": "dynamical systems",
    "highlighted": "<em>dynamical</em> systems",
    "score": 0.005399387
  }
]
~ curl -s "http://localhost:8180/find?q=dynimical%20systemms&_spell=true"|jq '._spell'
[
  {
    "text": "dynamical systems",
    "highlighted": "<em>dynamical systems</em>",
    "score": 0.005399387
  }
]

curl directly against ES, local dev:

curl --request POST \
  --url https://localhost:9200/libris_local/_search \
  -k \
  -u elastic:elastic \
  --header 'Content-Type: application/json' \
  --data '{
  "suggest": {
    "text": "dynimical system",
    "simple_phrase": {
      "phrase": {
        "field": "_sortKeyByLang.sv.trigram",
        "size": 1,
        "max_errors": 2,
        "direct_generator": [
          {
            "field": "_sortKeyByLang.sv.trigram",
            "suggest_mode": "always"
          },
          {
            "field": "_sortKeyByLang.sv.reverse",
            "suggest_mode": "always",
            "pre_filter": "reverse",
            "post_filter": "reverse"
          }
        ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}'
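
The same request body could be built programmatically. The following Python sketch mirrors the curl payload above; `build_spell_query` and its parameters are hypothetical, and it assumes the index mapping defines the `.trigram` and `.reverse` subfields used in that request.

```python
def build_spell_query(text: str,
                      base_field: str = "_sortKeyByLang.sv",
                      max_errors: int = 2) -> dict:
    """Build a phrase suggester body with two direct generators.

    The subfield names are taken from the curl example; everything else
    about this helper is an illustrative assumption.
    """
    trigram = f"{base_field}.trigram"
    reverse = f"{base_field}.reverse"
    return {
        "suggest": {
            "text": text,
            "simple_phrase": {
                "phrase": {
                    "field": trigram,
                    "size": 1,
                    "max_errors": max_errors,
                    "direct_generator": [
                        # Candidates generated from the trigram subfield.
                        {"field": trigram, "suggest_mode": "always"},
                        # Second generator on the reversed subfield, with
                        # the input and candidates reversed to match.
                        {
                            "field": reverse,
                            "suggest_mode": "always",
                            "pre_filter": "reverse",
                            "post_filter": "reverse",
                        },
                    ],
                    "highlight": {"pre_tag": "<em>", "post_tag": "</em>"},
                }
            },
        }
    }
```

Serializing the result with `json.dumps` gives a body that can be POSTed to the `_search` endpoint in place of the inline JSON.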

https://kbse.atlassian.net/browse/LWS-87

@andersju andersju requested a review from olovy August 1, 2024 10:24

@olovy olovy left a comment

Very nice!

  • Agree that _sortKeyByLang.sv is a reasonable starting point
  • Perhaps _spell=true should be the default. 🤔 Has performance implications though.
  • _spell=only might be a good option for clients to do spell checking "on the side" with a separate request without slowing down the main search query.
  • We probably want to promote _spell (or whatever we decide on) to an official vocab (non-underscore) term. Let's do this later.
  • TODO: Add support in "new style" search API after Feature/rework new search #1455 is merged

Changes requested:

  • I think we should provide a link to the corrected query within the suggestion, so that the client doesn't have to do any URL manipulation. See e.g. facet links.
  • I think we should remap the terms in the result to keep the API free from elasticsearch details.

What about:

curl -s "http://localhost:8180/find?q=flyttning%20och%20peldning&%40type=Instance&_spell=true" | jq '._spell'
[
  {
    "label": "flyttning och <em>pendling</em>",
    "view": { "@id": "/find?q=flyttning%20och%20pendling&%40type=Instance&_spell=true" }
  }
]

?

We might want to use something more specific than label to indicate that it contains some markup.

(I don't think the score is useful?)
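
For illustration, the remapping proposed above could look roughly like this in Python. This is a sketch, not the code in this PR: `remap_suggestions` is a hypothetical name, and it assumes the usual phrase suggester response shape (a list of entries under the suggester name, each with an `options` list) and that the original query parameters are available as a dict.

```python
from urllib.parse import quote, urlencode


def remap_suggestions(es_response: dict,
                      params: dict,
                      suggester: str = "simple_phrase") -> list:
    """Turn phrase suggester options into {label, view} items.

    The score is dropped, and each item links to the same query with
    q replaced by the corrected text.
    """
    items = []
    for entry in es_response.get("suggest", {}).get(suggester, []):
        for option in entry.get("options", []):
            # Keep all other parameters (e.g. @type, _spell) unchanged.
            corrected = dict(params, q=option["text"])
            items.append({
                "label": option.get("highlighted", option["text"]),
                "view": {
                    # quote_via=quote encodes spaces as %20 instead of "+".
                    "@id": "/find?" + urlencode(corrected, quote_via=quote)
                },
            })
    return items
```

Called with the parameters from the example request (`q`, `@type`, `_spell`), this yields the `label` and `view` values shown there, so the client can follow the link without any URL manipulation.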

andersju commented Aug 2, 2024

@olovy Sounds good! See latest commit.

(We probably wouldn't want to actually show suggestions unless there are no search hits, or perhaps very few; otherwise there might often be annoying suggestions even if you spell correctly. This is something for the client to decide, though.)

Score might be useful internally if we want to have a threshold for whether a suggestion should be returned, but we'll see. I agree it's probably not useful in the API result.
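
Both ideas (only showing suggestions when there are few or no hits, and an internal score threshold) can be sketched in a few lines of Python. The function names and threshold parameters below are hypothetical, and nothing in this PR settles on actual values.

```python
def should_show_suggestions(total_hits: int, max_hits: int = 0) -> bool:
    """Client-side check: show suggestions only when the search found
    at most max_hits results (0 means only when there are no hits)."""
    return total_hits <= max_hits


def filter_by_score(options: list, min_score: float) -> list:
    """Server-side check: keep only suggester options whose score
    reaches min_score. Options without a score are dropped."""
    return [o for o in options if o.get("score", 0.0) >= min_score]
```

Keeping the two checks separate matches the split described above: the hit-count decision stays with the client, while the score never needs to appear in the API result.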

@olovy olovy left a comment

LGTM!

See suggestion about using the same naming convention as descriptionHTML

@andersju andersju merged commit 37ed783 into develop Aug 2, 2024
1 check passed
@andersju andersju deleted the feature/lws-87-spell-check branch August 2, 2024 10:32