
fix(elasticsearch): Remove size parameter from auxiliary queries #4690

Merged 7 commits into main on Nov 22, 2024

Conversation

ERosendo
Contributor

@ERosendo ERosendo commented Nov 14, 2024

The test_multiple_alerts_email_hits_limit_per_alert test failed because our logic mistakenly included nested documents and failed to accurately distinguish docket-only hits when generating email alerts.

Upon inspecting the helper function responsible for processing individual alert hits, I noticed that the identification of docket-only hits relied on the results of auxiliary queries.

When comparing the main and auxiliary docket queries, I noticed some inconsistencies: results were not always returned in the same order, and the auxiliary query occasionally omitted elements the main query included. The auxiliary query missed those elements because the SCHEDULED_ALERT_HITS_LIMIT setting capped its result count while the two queries ordered their hits differently.

This PR addresses the issue by removing the size parameter from auxiliary queries. This ensures that these queries always return all matching IDs, regardless of their order, allowing for accurate identification of docket-only hits and cross-object hits.
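To illustrate the failure mode, here is a minimal, hypothetical sketch. `classify_hits` and all IDs are invented for illustration and are not the project's actual helpers; it only shows how a truncated auxiliary result set can flip a hit's classification:

```python
# Hypothetical sketch (not CourtListener's actual helper) of how hit
# classification compares the main query's docket IDs against the ID sets
# returned by the auxiliary queries.

def classify_hits(main_docket_ids, docket_query_ids, rd_query_ids):
    """Label each main-query hit by which auxiliary queries also matched it."""
    labels = {}
    for docket_id in main_docket_ids:
        in_docket = docket_id in docket_query_ids
        in_rd = docket_id in rd_query_ids
        if in_docket and not in_rd:
            labels[docket_id] = "docket-only"
        elif in_rd and not in_docket:
            labels[docket_id] = "rd-only"
        else:
            labels[docket_id] = "cross-object"
    return labels

# With complete auxiliary results, docket 3 is correctly docket-only:
full = classify_hits([1, 2, 3], docket_query_ids={3}, rd_query_ids={1, 2})

# If a size limit truncates the docket-only query and drops docket 3,
# the same hit is misclassified as cross-object:
truncated = classify_hits([1, 2, 3], docket_query_ids=set(), rd_query_ids={1, 2})
```

Removing the size parameter makes the ID sets complete, so the comparison above no longer misfires.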


sentry-io bot commented Nov 14, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/lib/elasticsearch_utils.py

Function: do_es_sweep_alert_query
Unhandled Issue: ApiError: N/A (cl.lib.elasticsearch_utils in do_es...)
Event Count: 7


@ERosendo ERosendo marked this pull request as ready for review November 14, 2024 18:47
@mlissner mlissner requested review from albertisfu and removed request for mlissner November 14, 2024 22:04
@mlissner
Member

Thanks. This looks fine to me. From a process perspective, I set Alberto as the reviewer, but left it assigned to you. I'm thinking it's his job to review, but you should stay assigned as the one paying attention to it and being on top of it. Feel right?

@mlissner
Member

Another tweak, sorry. I'm putting both of you as assignees. This way it shows up under the reviewer's name in the By Assignee tab in the backlog. Figuring things out!

Contributor

@albertisfu albertisfu left a comment


Thanks, Eduardo! This seems about right to me. I was just thinking of a corner case scenario that seems unlikely to happen, but I think it's still possible.

For instance, let's say that a user has an alert like case_name:United States, and one day we receive more than 10,000 cases that include "United States" in their case_name (imagine 12,000).

We'll only include at most SCHEDULED_ALERT_HITS_LIMIT alert cases in the alert email.

However, regarding the auxiliary queries, by default, the maximum number of hits ES returns is 10,000.

So the auxiliary parent query and child query will include up to 10,000 hits. In the scenario above, imagine all the hits have the same score, so it's possible that some of the hits in the alert are not matched within the parent query hits. This would lead to flagging the alert as a cross-object alert, causing the issue detected in this test.

Therefore, I was thinking that an alternative approach to solve this issue is to sort hits for the main query (parent results and also nested documents) and the auxiliary queries by two keys: score and date_created (or timestamp).

In this way, we can ensure that the order of the hits in the main results is similar to those in the auxiliary queries, and we don't miss hits in a scenario like the one described.
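A minimal sketch of that tie-breaking, with illustrative data only; in the ES query body the equivalent sort would be `[{"_score": "desc"}, {"date_created": "desc"}]`:

```python
from datetime import datetime

# Illustrative hits: ids 1 and 2 tie on score, so without a secondary sort
# key their relative order is not guaranteed across queries.
hits = [
    {"id": 1, "score": 2.0, "date_created": datetime(2024, 11, 1)},
    {"id": 2, "score": 2.0, "date_created": datetime(2024, 11, 3)},
    {"id": 3, "score": 5.0, "date_created": datetime(2024, 11, 2)},
]

# Sort by score, then by date_created, both descending: the identical-score
# hits now have a stable, reproducible relative order.
ordered = sorted(hits, key=lambda h: (h["score"], h["date_created"]), reverse=True)
```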

Let me know what you think.

@mlissner
Member

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

@albertisfu
Contributor

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

Yes, we already have this limit defined by SCHEDULED_ALERT_HITS_LIMIT (currently set to 20 by default). I think we just need to document that this is the maximum number of cases an alert can display.

However, the issue here is slightly different. It's related to the auxiliary queries we perform to classify alerts based on their hits (docket-only, cross-object, or RECAPDocument-only) so we can determine whether to display nested hits in the results. This classification also lets us record in the alert's Redis SET whether it has been triggered by a docket or RD, ensuring the alert is not triggered again by the same document in the future.
So the issue here involves the ordering of documents in the results of these auxiliary queries when scores are identical.

@ERosendo ERosendo reopened this Nov 18, 2024
@ERosendo
Contributor Author

@albertisfu Sorry for all the back-and-forth on this PR. I tried your suggestion, but it didn't quite fix the issue.

In this test, we create four dockets, with only one having two recap documents. When I used the sorting rule you suggested (_score descending and timestamp descending), the result with the two child documents always came first.

I checked the explanation (Explain API) and saw that its score is the sum of the score from the child documents search and the score from matching the parent document itself. This explains why it ended up first.
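As a rough illustration of that scoring, assuming the parent query combines child scores with the parent's own match score (for example, a has_child clause with score_mode="sum"; the exact score mode is an assumption here):

```python
# Assumption: a parent's final score is its own match score plus the sum of
# its matching children's scores. Numbers are illustrative only.

def combined_score(parent_score, child_scores):
    return parent_score + sum(child_scores)

# A docket with two matching RECAP documents outranks a docket-only match
# even when every individual clause contributes the same score:
with_children = combined_score(1.0, [1.0, 1.0])
docket_only = combined_score(1.0, [])
```

This is why the docket with two child documents always sorted first in the test, regardless of the secondary sort key.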

However, when we run the auxiliary query, we can't replicate the same order using the same sorting rule. That's because the auxiliary query doesn't use a nested approach, and the scores are calculated differently. So, when we limit the results, we might unintentionally exclude documents that were included in the main query.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

@albertisfu
Contributor

Yeah, you're totally right. The scores from the main query will never match those from the auxiliary queries due to nested documents in the main query, so that approach isn’t a solution.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

This is indeed complicated, but I’ve got an idea.
As I mentioned in a previous comment, I think the problem described is a corner case that might occur rarely, but it can still happen. Since the solution I’m proposing below would require performing an additional ES query, we could limit its application to only those cases where it’s necessary and continue using your solution for the majority of requests.

The described issue will arise only when any of the auxiliary queries return exactly 10,000 results, which is the default maximum number of results returned by ES. If we receive that many results from any of the auxiliary queries, it indicates the possibility of more results being excluded, which could cause problems when filtering the main query's results.

In such cases, if one of the auxiliary queries returns 10,000 results, we’d need to repeat that auxiliary query or both (if both returned 10,000 hits) and include an additional filter as follows:

If the auxiliary query returning 10,000 results is the docket-only query:
Extract all the docket_ids from the main query results and use them as a filter to refine the docket-only query. The filter should look like:

Q("terms", docket_id=docket_ids_list)

If the auxiliary query returning 10,000 results is the RECAP document-only query:
Extract all the RECAPDocument IDs from the main query results. To do this, retrieve all the IDs within child_docs from the main query results and use them as a filter to refine the RECAP document-only query. The filter should look like:

Q("terms", id=rd_ids_list)

Once the required auxiliary query (or both) is updated with the additional filters, we'd need to repeat the query. This will ensure that the auxiliary queries return accurate results, as they now consider the docket_ids or RECAPDocument ids from the main query. So results returned won't miss any document.
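The fallback described above could be sketched like this. It's a hypothetical helper using plain dict query bodies; the actual implementation builds queries with elasticsearch-dsl, and all names here are illustrative:

```python
ES_MAX_HITS = 10_000  # Elasticsearch's default ceiling on returned hits

def refine_aux_query(base_query, aux_hit_count, main_ids, id_field):
    """If an auxiliary query hit the 10,000-result ceiling, wrap it with a
    terms filter on the IDs gathered from the main query's results, so the
    repeated query cannot miss any document the main query returned."""
    if aux_hit_count < ES_MAX_HITS:
        return base_query  # no truncation possible; results are complete
    return {
        "bool": {
            "must": [base_query],
            "filter": [{"terms": {id_field: main_ids}}],
        }
    }

# Docket-only variant: filter on the docket_ids from the main query.
query = {"match": {"case_name": "United States"}}
refined = refine_aux_query(query, 10_000, [101, 102], "docket_id")
```

The RECAP-document variant would pass the child-document IDs and `"id"` as the filter field instead.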

Let me know what you think.

Thank you!

@ERosendo
Contributor Author

Let me know what you think.

Sounds good! I'll start working on the new code.

Contributor

@albertisfu albertisfu left a comment


Thank you, @ERosendo! This looks great. I’ve just left a few comments below, and we’ll be ready to merge it!

cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
cl/lib/elasticsearch_utils.py Outdated Show resolved Hide resolved
@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Nov 22, 2024
@ERosendo
Contributor Author

@albertisfu Thanks for the review. I've applied your suggestions.

@mlissner
Member

Shall we assign this back to @albertisfu, for final review, or should we merge?

Contributor

@albertisfu albertisfu left a comment


Thanks @ERosendo the changes look good.

Let's merge it :shipit:

@albertisfu albertisfu merged commit 40e59c6 into main Nov 22, 2024
15 checks passed
@albertisfu albertisfu deleted the 4624-fix-flaky-es-test branch November 22, 2024 20:26