Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BytesRefIterator to TermInSetQuery #13806

Merged
merged 3 commits into from
Sep 19, 2024

Conversation

cbuescher
Copy link
Contributor

@cbuescher cbuescher commented Sep 18, 2024

Addresses #13804

TermInSetQuery used to have an accessor to its terms that was removed in #12173 to protect leaking internal encoding details. This introduces an accessor to the term data in the query that doesn't expose internal but merely allows iterating over the decoded BytesRef, making inspection of the querys content possible again.

Addresses apache#13804

TermInSetQuery used to have an accessor to its terms that was removed in apache#12173
to protect leaking internal encoding details. This introduces an accessor to the
term data in the query that doesn't expose internal but merely allows iterating
over the decoded BytesRef, making inspection of the querys content possible again.
@cbuescher cbuescher force-pushed the termInSet-terms-iterator branch from 0fe5c78 to 9d60804 Compare September 18, 2024 21:24
@rmuir
Copy link
Member

rmuir commented Sep 18, 2024

good solution. could we consider also fixing the visitor to use this approach (vs passing a RunAutomaton or something awful?)

@rmuir
Copy link
Member

rmuir commented Sep 18, 2024

and the question is not for this PR, just a general one. It seems the only "real user" of consumeTermsMatching is highlighter, and building an automaton for this thing seems to be... both very heavy and awkward? Just for a future issue/followup...

@@ -141,6 +135,11 @@ public long getTermsCount() {
return termData.size();
}

public BytesRefIterator getBytesRefIterator() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add javadocs for the method and @experimental for now?

its possible we could "fix visitor" api in the future where we wouldn't need this public method specific to this query as well (see general thoughts on PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks.

@rmuir
Copy link
Member

rmuir commented Sep 18, 2024

also, it would be good to get an idea of the use-case. The problem is, this query can hold many terms:

  • getting an automaton over them isn't really a use-case, highlighting docs is. We should make this more efficient. This existing visitor api seems way too prescriptive: probably shouldn't be Supplier<ByteRunAutomaton> automaton but should probably just be Predicate<BytesRef> which supports composition with and/or/etc and also supports matching terms with test(BytesRef). This is more general API (even if it doesn't help us with PrefixCodedTerms immediately).
  • for similar reasons, getting an iterator over ALL terms isn't a use-case... its a slow linear-time implementation of whatever you are doing. So it would be great to get an idea of what that is, so that we can get to a good interface eventually.

@javanna
Copy link
Contributor

javanna commented Sep 19, 2024

Thanks for taking a look @rmuir !

I have been digging a bit through history, it seems like it used to be possible to get all the terms via QueryVisitor#consumeTerms, but that changed with 267d70b to build an automaton instead. That effectively removed the ability to get the actual terms from TermInSetQuery via query visitor. Later, getTermData was also removed from TermInSetQuery, as it exposed internal encoding of the terms. I agree that we should look at how to make the API better in the long run.

The main usecase we have is monitor alike, where you have queries stored in the index, and want to run them against incoming documents. In order to pre-filter the queries and reduce the amount of them that we need to run, we extract terms from them upon indexing and put them in a separate field that we later use to apply pre-filtering. I checked in the monitor code and it looks like its query visitor (from QueryAnalyzer) ignores this case, hence TermInSetQuery with more than one term will not allow to perform pre-filtering, which means that more queries will be potential candidates that could be excluded ahead of time if we were able to extract terms from them. I am not sure that the predicate idea would work for this scenario, I don't think we can move away from building a list of terms for this usecase?

With that, I am going to go ahead and merge this PR for now to main, that unblocks us for now, and we can continue the discussion about the long term approach.

@javanna
Copy link
Contributor

javanna commented Sep 19, 2024

@cbuescher could you add an entry to CHANGES.txt, under 9.12 please? I am thinking that this should be backported so it provides a replacement for the deprecated method before it gets removed. I can take care of adding a link to it in the javadocs of the deprecated term in 9.x when I do the backporting.

@cbuescher
Copy link
Contributor Author

@javanna done, thanks.

Copy link
Contributor

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@javanna javanna merged commit e4ac577 into apache:main Sep 19, 2024
3 checks passed
javanna pushed a commit that referenced this pull request Sep 19, 2024
TermInSetQuery used to have an accessor to its terms that was removed in #12173
to protect leaking internal encoding details. This introduces an accessor to the
term data in the query that doesn't expose internals but merely allows iterating
over the decoded BytesRef, making inspection of the querys content possible again.

Closes #13804
@javanna javanna added this to the 9.12.0 milestone Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants