-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[filebeat][google workspace] - Added canonical sorting over fingerprint keys array to maintain key order #40055
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have doubts that this will fix the issue if it is related to #39859; in that issue we see that there are missing fields in the resulting document; there is no json.id.uniqueQualifier
/event.id
, json.id.applicationName
/event.provider
or json.id.customerId
/organization.id
which are all (in the first incarnation) used in the fingerprinting processor. These do all show up in the event.original
, so we know that they have been received. Looking at the processors' implementations, I don't see any obvious reason why fields would be being lost.
What I do find confusing is that in the documents shown in the issue, we still see the json. …
fields, but these should have been removed here (@rlevytskyi do you have local changes to your configuration that would have stopped the ingest pipeline being run?)
Absolutely no. Events got fetched by Filebeat, then passed to Logstash, then passed to Opensearch. |
@rlevytskyi Thanks. This complicates the issue and might explain the confounding behaviour.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because I don't see this causing any harm, but I still have doubt that it will fix the issue. Please get a second opinion.
@@ -25,15 +25,16 @@ var googleWorkspace = (function () { | |||
"json.id.applicationName", | |||
"json.id.customerId", | |||
]; | |||
var dynKeyArr = []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole section looks like it is working around limitations in the beat fingerprint processor. And it's not entirely overcoming those limitations (hashing arrays and objects). I think the would be reasonable to enhance the fingerprint processor to accommodate this usage (the ES fingerprint processor does this). Or move this _id processing into ingest node.
But I would consider fingerprinting on the JSON string that is contained in message
if the full contents are suitable for a fingerprint (i.e. each time an event is requested the api returns the same JSON).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried atm to change the fingerprint processor behaviour since changing it there would impact all modules utilising the processor and could possibly cause an initial wave of duplicates all across the board since the sorting will impact the fingerprint. I'll bring it up in the team sync once to discuss this and then make the change. For now I'm merging this change(otherwise it will go past FF) and we'll see and observe if this impacts duplicate creation in any way.
Type of change
Proposed commit message
The Google Workspace module has had a long standing issue of generating duplicate entries for events. This is an attempt to address that issue by introducing a canonical order to the fingerprint keys. It is possible that over successive responses, the same event object could possibly be canonically unordered. This is hinted by issues like this where ingested events have unique fingerprints for the same event leading to the duplication of events in elasticsearch.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Disruptive User Impact
User might initially get more duplicates after updating, but this will subside after sometime as the ingestion process self corrects. This is because the sorting will impact the fingerprint of events to an extent in the initial period.
Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs