Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup Lucene document threshold at query time #5149

Closed
ate47 opened this issue Oct 9, 2024 · 7 comments
Closed

Setup Lucene document threshold at query time #5149

ate47 opened this issue Oct 9, 2024 · 7 comments
Assignees
Labels
📶 enhancement issue is a new feature or improvement
Milestone

Comments

@ate47
Copy link
Contributor

ate47 commented Oct 9, 2024

Problem description

Inside the LuceneSail, an option exists to tell what is the amount of documents to retrieve. It is set with the property maxDocuments during the sail init and can't be set at runtime. We would need an option to allow this.

Preferred solution

An option would be to add a virtual predicate during query time like search:maxDocuments with a property during init to enable or disable it, maybe allowQueryMaxDocument.

My main concern is to implement it. The maxDocuments values is used inside each index implementation.

// SolrIndex.java
public QueryResponse search(SolrQuery query) throws SolrServerException, IOException {}
// LuceneIndex.java
public synchronized TopDocs search(Query query) throws IOException {}
// ElasticsearchIndex.java
public SearchHits search(SearchRequestBuilder request, QueryBuilder query) {}

So the different methods will need to have a max documents param. I would like an opinion on it before implementing it.

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

No response

Anything else?

ec-doris/kohesio-backend#268

@ate47 ate47 added the 📶 enhancement issue is a new feature or improvement label Oct 9, 2024
@hmottestad
Copy link
Contributor

I'll have to take a look at the code to understand how things are handled at the moment.

@ate47
Copy link
Contributor Author

ate47 commented Oct 24, 2024

Any news on what to do?

@hmottestad
Copy link
Contributor

	public QueryResponse search(SolrQuery query) throws SolrServerException, IOException {
		int nDocs;
		if (maxDocs > 0) {
			nDocs = maxDocs;
		} else {
			long docCount = client.query(query.setRows(0)).getResults().getNumFound();
			nDocs = Math.max((int) Math.min(docCount, Integer.MAX_VALUE), 1);
		}
		return client.query(query.setRows(nDocs));
	}

For this code we can probably allow the number of rows to be set on the query before this method is called. Something like if(query.getRows() != null).... to check if it is already set.

@hmottestad
Copy link
Contributor

Looks like this here is where we extract variables from the SPARQL query:

private static class PatternFilter extends AbstractQueryModelVisitor<RuntimeException> {

So that can be extended to add ROWS. And then follow the code up to see where else we need to extend it so that the field is set correctly on the SolrQuery object.

@hmottestad
Copy link
Contributor

hmottestad commented Oct 25, 2024

We can't do the same with public synchronized TopDocs search(Query query) throws IOException {} since the query object doesn't have any info about the number of rows or equivalent. So there we would probably have to make a new method public synchronized TopDocs search(Query query, int nDocs) throws IOException {}.

Same for public SearchHits search(SearchRequestBuilder request, QueryBuilder query) {}.

ate47 added a commit to ate47/rdf4j that referenced this issue Oct 28, 2024
ate47 added a commit to ate47/rdf4j that referenced this issue Oct 28, 2024
ate47 added a commit to ate47/rdf4j that referenced this issue Oct 29, 2024
@ate47
Copy link
Contributor Author

ate47 commented Nov 4, 2024

@hmottestad do you think you can get a look at how I have implemented it in the PR #5163?

@hmottestad
Copy link
Contributor

hmottestad commented Nov 5, 2024

Thanks for the reminder :)

ate47 added a commit to ate47/rdf4j that referenced this issue Nov 14, 2024
…KEY and MAX_DOCUMENTS_KEY to DEFAULT_NUM_DOCS_KEY
ate47 added a commit to ate47/rdf4j that referenced this issue Nov 14, 2024
…KEY and MAX_DOCUMENTS_KEY to DEFAULT_NUM_DOCS_KEY
hmottestad added a commit that referenced this issue Nov 20, 2024
@hmottestad hmottestad mentioned this issue Nov 20, 2024
5 tasks
hmottestad added a commit that referenced this issue Nov 20, 2024
@hmottestad hmottestad modified the milestones: 3.7.8, 5.1.0 Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📶 enhancement issue is a new feature or improvement
Projects
None yet
Development

No branches or pull requests

2 participants