-
-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(elasticsearch): Remove size parameter from auxiliary queries #4690
Conversation
🔍 Existing Issues For ReviewYour pull request is modifying functions with the following pre-existing issues: 📄 File: cl/lib/elasticsearch_utils.py
Did you find this useful? React with a 👍 or 👎 |
Thanks. This looks fine to me. From a process perspective, I set Alberto as the reviewer, but left it assigned to you. I'm thinking it's his job to review, but you should stay assigned as the one paying attention to it and being on top of it. Feel right? |
Another tweak, sorry. I'm putting both of you as assignees. This way it shows up under the reviewer's name in the By Assignee tab in the backlog. Figuring things out! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Eduardo! This seems about right to me. I was just thinking of a corner case scenario that seems unlikely to happen, but I think it's still possible.
For instance, let's say that a user has an alert like case_name:United States
, and one day we receive more than 10,000 cases that include "United States" in their case_name
(imagine 12,000).
We'll only include at most SCHEDULED_ALERT_HITS_LIMIT
alert cases in the alert email.
However, regarding the auxiliary queries, by default, the maximum number of hits ES returns is 10,000.
So the auxiliary parent query and child query will include up to 10,000 hits. In the scenario above, imagine all the hits have the same score, so it's possible that some of the hits in the alert are not matched within the parent query hits. This would lead to flagging the alert as a cross-object alert, causing the issue detected in this test.
Therefore, I was thinking that an alternative approach to solve this issue is to sort hits for the main query (parent results and also nested documents) and the auxiliary queries by two keys: score
and date_created
(or timestamp
).
In this way, we can ensure that the order of the hits in the main results is similar to those in the auxiliary queries, and we don't miss hits in a scenario like the one described.
Let me know what you think.
I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere? |
Yes, we already have this limit defined by However, the issue here is slightly different. It's related to the the auxiliary queries we perform to classify alerts based on their hits: Docket-only, Cross-object, or RECAPDocument-only so that we can determine whether to display nested hits in the results. Additionally, this classification allows us to mark in their Redis SET if an alert has been triggered by a docket or RD. This ensures the alert is not triggered again by the same document in the future. |
ed53924
to
4a8d2be
Compare
4a8d2be
to
1f1fbc8
Compare
@albertisfu Sorry for all the back-and-forth on this PR. I tried your suggestion, but it didn't quite fix the issue. In this test, we create four dockets, with only one having two recap documents. When I used the sorting rule you suggested ( I checked the explanation (Explain API) and saw that its score is the sum of the score from the child documents search and the score from matching the parent document itself. This explains why it ended up first. However, when we run the auxiliary query, we can't replicate the same order using the same sorting rule. That's because the auxiliary query doesn't use a nested approach, and the scores are calculated differently. So, when we limit the results, we might unintentionally exclude documents that were included in the main query. Do you have any other ideas to avoid removing the limit from auxiliary queries? |
Yeah you're totally right. The scores from the main query will never match those from the auxiliary queries due to nested documents in the main query, so that approach isn’t a solution.
This is indeed complicated, but I’ve got an idea. The described issue will arise only when any of the auxiliary queries return exactly 10,000 results, which is the default maximum number of results returned by ES. If we receive that many results from any of the auxiliary queries, it indicates the possibility of more results being excluded, which could cause problems when filtering the main query's results. In such cases, if one of the auxiliary queries returns 10,000 results, we’d need to repeat that auxiliary query or both (if both returned 10,000 hits) and include an additional filter as follows: If the auxiliary query returning 10,000 results is the docket-only query:
If the auxiliary query returning 10,000 results is the RECAP document-only query:
Once the required auxiliary query (or both) is updated with the additional filters, we'd need to repeat the query. This will ensure that the auxiliary queries return accurate results, as they now consider the Let me know what do you think. Thank you! |
Sounds good! I'll start working on the new code |
e6469ba
to
96d4790
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ERosendo! This looks great. I’ve just left a few comments below, and we’ll be ready to merge it!
@albertisfu Thanks for the review. I've applied your suggestions. |
Shall we assign this back to @albertisfu, for final review, or should we merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ERosendo the changes look good.
Let's merge it
The
test_multiple_alerts_email_hits_limit_per_alert
test failed because our logic mistakenly included nested documents and failed to accurately distinguish docket-only hits when generating email alerts.Upon inspecting the helper function responsible for processing individual alert hits, I noticed that the identification of docket-only hits relied on the results of auxiliary queries.
When comparing the main and auxiliary docket queries, I noticed some inconsistencies. Results were not always returned in the same order, and the auxiliary query occasionally omitted elements included in the main query. The auxiliary query was missing elements because the
SCHEDULED_ALERT_HITS_LIMIT
setting restricted the number of results and elements were ordered differently.This PR addresses the issue by removing the
size
parameter from auxiliary queries. This ensures that these queries always return all matching IDs, regardless of their order, allowing for accurate identification of docket-only hits and cross-object hits.