
fix(elasticsearch): Remove size parameter from auxiliary queries #4690

Merged 7 commits into main on Nov 22, 2024

Conversation

ERosendo
Contributor

@ERosendo ERosendo commented Nov 14, 2024

The test_multiple_alerts_email_hits_limit_per_alert test failed because our logic mistakenly included nested documents and failed to accurately distinguish docket-only hits when generating email alerts.

Upon inspecting the helper function responsible for processing individual alert hits, I noticed that the identification of docket-only hits relied on the results of auxiliary queries.

When comparing the main and auxiliary docket queries, I noticed some inconsistencies: results were not always returned in the same order, and the auxiliary query occasionally omitted elements the main query included. The auxiliary query missed those elements because the SCHEDULED_ALERT_HITS_LIMIT setting capped its result count while the two queries ordered their hits differently.

This PR addresses the issue by removing the size parameter from auxiliary queries. This ensures that these queries always return all matching IDs, regardless of their order, allowing for accurate identification of docket-only hits and cross-object hits.
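To illustrate the failure mode, here is a minimal, hypothetical sketch. `classify_hits` and all IDs are invented for illustration and are not the project's actual helpers; it only shows how a truncated auxiliary result set can flip a hit's classification:

```python
# Hypothetical sketch (not CourtListener's actual helper) of how hit
# classification compares the main query's docket IDs against the ID sets
# returned by the auxiliary queries.

def classify_hits(main_docket_ids, docket_query_ids, rd_query_ids):
    """Label each main-query hit by which auxiliary queries also matched it."""
    labels = {}
    for docket_id in main_docket_ids:
        in_docket = docket_id in docket_query_ids
        in_rd = docket_id in rd_query_ids
        if in_docket and not in_rd:
            labels[docket_id] = "docket-only"
        elif in_rd and not in_docket:
            labels[docket_id] = "rd-only"
        else:
            labels[docket_id] = "cross-object"
    return labels

# With complete auxiliary results, docket 3 is correctly docket-only:
full = classify_hits([1, 2, 3], docket_query_ids={3}, rd_query_ids={1, 2})

# If a size limit truncates the docket-only query and drops docket 3,
# the same hit is misclassified as cross-object:
truncated = classify_hits([1, 2, 3], docket_query_ids=set(), rd_query_ids={1, 2})
```

Removing the size parameter makes the ID sets complete, so the comparison above no longer misfires.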


sentry-io bot commented Nov 14, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/lib/elasticsearch_utils.py

Function: do_es_sweep_alert_query
Unhandled Issue: ApiError: N/A (cl.lib.elasticsearch_utils in do_es...)
Event Count: 7


@ERosendo ERosendo marked this pull request as ready for review November 14, 2024 18:47
@mlissner mlissner requested review from albertisfu and removed request for mlissner November 14, 2024 22:04
@mlissner
Member

Thanks. This looks fine to me. From a process perspective, I set Alberto as the reviewer, but left it assigned to you. I'm thinking it's his job to review, but you should stay assigned as the one paying attention to it and being on top of it. Feel right?

@mlissner
Member

Another tweak, sorry. I'm putting both of you as assignees. This way it shows up under the reviewer's name in the By Assignee tab in the backlog. Figuring things out!

Contributor

@albertisfu albertisfu left a comment


Thanks, Eduardo! This seems about right to me. I was just thinking of a corner case scenario that seems unlikely to happen, but I think it's still possible.

For instance, let's say that a user has an alert like case_name:United States, and one day we receive more than 10,000 cases that include "United States" in their case_name (imagine 12,000).

We'll only include at most SCHEDULED_ALERT_HITS_LIMIT alert cases in the alert email.

However, regarding the auxiliary queries, by default, the maximum number of hits ES returns is 10,000.

So the auxiliary parent query and child query will include up to 10,000 hits. In the scenario above, imagine all the hits have the same score, so it's possible that some of the hits in the alert are not matched within the parent query hits. This would lead to flagging the alert as a cross-object alert, causing the issue detected in this test.

Therefore, I was thinking that an alternative approach to solve this issue is to sort hits for the main query (parent results and also nested documents) and the auxiliary queries by two keys: score and date_created (or timestamp).

In this way, we can ensure that the order of the hits in the main results is similar to those in the auxiliary queries, and we don't miss hits in a scenario like the one described.
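A minimal sketch of that tie-breaking, with illustrative data only; in the ES query body the equivalent sort would be `[{"_score": "desc"}, {"date_created": "desc"}]`:

```python
from datetime import datetime

# Illustrative hits: ids 1 and 2 tie on score, so without a secondary sort
# key their relative order is not guaranteed across queries.
hits = [
    {"id": 1, "score": 2.0, "date_created": datetime(2024, 11, 1)},
    {"id": 2, "score": 2.0, "date_created": datetime(2024, 11, 3)},
    {"id": 3, "score": 5.0, "date_created": datetime(2024, 11, 2)},
]

# Sort by score, then by date_created, both descending: the identical-score
# hits now have a stable, reproducible relative order.
ordered = sorted(hits, key=lambda h: (h["score"], h["date_created"]), reverse=True)
```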

Let me know what you think.

@mlissner
Member

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

@albertisfu
Contributor

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

Yes, we already have this limit defined by SCHEDULED_ALERT_HITS_LIMIT (currently set to 20 by default). I think we just need to document that this is the maximum number of cases an alert can display.

However, the issue here is slightly different. It's related to the auxiliary queries we perform to classify alerts based on their hits (docket-only, cross-object, or RECAPDocument-only) so we can determine whether to display nested hits in the results. This classification also lets us record in the alert's Redis SET whether it has been triggered by a docket or RD, ensuring the alert is not triggered again by the same document in the future.
So the issue here involves the ordering of documents in the results of these auxiliary queries when scores are identical.

@ERosendo ERosendo reopened this Nov 18, 2024
@ERosendo
Contributor Author

@albertisfu Sorry for all the back-and-forth on this PR. I tried your suggestion, but it didn't quite fix the issue.

In this test, we create four dockets, with only one having two recap documents. When I used the sorting rule you suggested (_score descending and timestamp descending), the result with the two child documents always came first.

I checked the explanation (Explain API) and saw that its score is the sum of the score from the child documents search and the score from matching the parent document itself. This explains why it ended up first.
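As a rough illustration of that scoring, assuming the parent query combines child scores with the parent's own match score (for example, a has_child clause with score_mode="sum"; the exact score mode is an assumption here):

```python
# Assumption: a parent's final score is its own match score plus the sum of
# its matching children's scores. Numbers are illustrative only.

def combined_score(parent_score, child_scores):
    return parent_score + sum(child_scores)

# A docket with two matching RECAP documents outranks a docket-only match
# even when every individual clause contributes the same score:
with_children = combined_score(1.0, [1.0, 1.0])
docket_only = combined_score(1.0, [])
```

This is why the docket with two child documents always sorted first in the test, regardless of the secondary sort key.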

However, when we run the auxiliary query, we can't replicate the same order using the same sorting rule. That's because the auxiliary query doesn't use a nested approach, and the scores are calculated differently. So, when we limit the results, we might unintentionally exclude documents that were included in the main query.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

@albertisfu
Contributor

Yeah, you're totally right. The scores from the main query will never match those from the auxiliary queries due to nested documents in the main query, so that approach isn’t a solution.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

This is indeed complicated, but I’ve got an idea.
As I mentioned in a previous comment, I think the problem described is a corner case that might occur rarely, but it can still happen. Since the solution I’m proposing below would require performing an additional ES query, we could limit its application to only those cases where it’s necessary and continue using your solution for the majority of requests.

The described issue will arise only when any of the auxiliary queries return exactly 10,000 results, which is the default maximum number of results returned by ES. If we receive that many results from any of the auxiliary queries, it indicates the possibility of more results being excluded, which could cause problems when filtering the main query's results.

In such cases, if one of the auxiliary queries returns 10,000 results, we’d need to repeat that auxiliary query or both (if both returned 10,000 hits) and include an additional filter as follows:

If the auxiliary query returning 10,000 results is the docket-only query:
Extract all the docket_ids from the main query results and use them as a filter to refine the docket-only query. The filter should look like:

Q("terms", docket_id=docket_ids_list)

If the auxiliary query returning 10,000 results is the RECAP document-only query:
Extract all the RECAPDocument IDs from the main query results. To do this, retrieve all the IDs within child_docs from the main query results and use them as a filter to refine the RECAP document-only query. The filter should look like:

Q("terms", id=rd_ids_list)

Once the required auxiliary query (or both) is updated with the additional filters, we'd need to repeat the query. This will ensure that the auxiliary queries return accurate results, as they now consider the docket_ids or RECAPDocument ids from the main query. So results returned won't miss any document.
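The fallback described above could be sketched like this. It's a hypothetical helper using plain dict query bodies; the actual implementation builds queries with elasticsearch-dsl, and all names here are illustrative:

```python
ES_MAX_HITS = 10_000  # Elasticsearch's default ceiling on returned hits

def refine_aux_query(base_query, aux_hit_count, main_ids, id_field):
    """If an auxiliary query hit the 10,000-result ceiling, wrap it with a
    terms filter on the IDs gathered from the main query's results, so the
    repeated query cannot miss any document the main query returned."""
    if aux_hit_count < ES_MAX_HITS:
        return base_query  # no truncation possible; results are complete
    return {
        "bool": {
            "must": [base_query],
            "filter": [{"terms": {id_field: main_ids}}],
        }
    }

# Docket-only variant: filter on the docket_ids from the main query.
query = {"match": {"case_name": "United States"}}
refined = refine_aux_query(query, 10_000, [101, 102], "docket_id")
```

The RECAP-document variant would pass the child-document IDs and `"id"` as the filter field instead.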

Let me know what you think.

Thank you!

@ERosendo
Contributor Author

Let me know what you think.

Sounds good! I'll start working on the new code.

Contributor

@albertisfu albertisfu left a comment


Thank you, @ERosendo! This looks great. I’ve just left a few comments below, and we’ll be ready to merge it!

cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Nov 22, 2024
@ERosendo
Contributor Author

@albertisfu Thanks for the review. I've applied your suggestions.

@mlissner
Member

Shall we assign this back to @albertisfu, for final review, or should we merge?

Contributor

@albertisfu albertisfu left a comment


Thanks @ERosendo the changes look good.

Let's merge it :shipit:

@albertisfu albertisfu merged commit 40e59c6 into main Nov 22, 2024
15 checks passed
@albertisfu albertisfu deleted the 4624-fix-flaky-es-test branch November 22, 2024 20:26