
RUMM-513 Validating JSON payload before sending to intake #194

Closed

Conversation


@buranmert buranmert commented Jul 27, 2020

What and why?

We observe non-parseable logs sent to intake: in a certain org with a high volume of logs, 0.3-0.5% of all logs are not parsed properly in the dashboard.

Hypothesis #1
The data being read is malformed: every malformed log has its malformed part at the very end, so if we can read the file later, we can get a valid payload.

Hypothesis #2
The write operation corrupts the file, and we need to watch write ops closely so that we can catch and recover from errors.
This requires major refactoring in File.swift and related files.

How?

This commit implements the first hypothesis by validating the payload before uploading it to intake.

How to test the fix?

// `fileURL` points to an existing batch file; setup is elided here
let fileHandle = try FileHandle(forUpdating: fileURL)
let content = #"{"message": "some payload"}"#.data(using: .utf8)!
let numberOfExtraBytes = 4
fileHandle.write(content)
// truncating past the current size puts 4 extra null bytes at the end of the file
try fileHandle.truncate(atOffset: UInt64(content.count + numberOfExtraBytes))
fileHandle.seek(toFileOffset: 0)
let readData = fileHandle.readDataToEndOfFile()
// now JSONSerialization throws a "Garbage at end" error
try JSONSerialization.jsonObject(with: readData, options: [])

There are cases other than null bytes at the end, such as an incomplete log message; this fix should also cover them, as sketched below.
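
For reference, a minimal sketch of the kind of check this fix performs before upload (the function name isValidPayload is hypothetical; the actual implementation lives in the SDK):

import Foundation

// Hypothetical validator: accept a payload for upload only if it parses as JSON.
// Trailing null bytes and truncated log messages both make JSONSerialization
// throw, so a single parse attempt covers both failure modes.
func isValidPayload(_ data: Data) -> Bool {
    return (try? JSONSerialization.jsonObject(with: data, options: [])) != nil
}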

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

@buranmert buranmert self-assigned this Jul 27, 2020
@buranmert buranmert requested a review from ncreated July 27, 2020 12:30
@ncreated ncreated left a comment

This breaks the Tracing feature (and will break RUM): the batchData read for Tracing is not a single valid JSON object, as the Tracing payload is a newline-separated list of JSON objects.
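
For illustration (a made-up batch, not SDK code), whole-payload JSON validation rejects a perfectly valid Tracing batch:

import Foundation

let batch = Data("""
{"span": "a"}
{"span": "b"}
""".utf8)

// Each line is valid JSON, but parsing the batch as a whole fails:
// JSONSerialization reads the first object, then throws "Garbage at end".
let isValid = (try? JSONSerialization.jsonObject(with: batch, options: [])) != nil
// isValid == false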

Also, this adds unnecessary processing time: it's 3x worse when benchmarking a single log read and over 6x worse when benchmarking 100 logs written to the file (where the max is 500).

All four runs measure -[DatadogBenchmarkTests.LoggingStorageBenchmarkTests testReadingLogsFromDisc]:

Scenario         | measured [Time, seconds] average
1 log, no fix    | 0.002
1 log, fix       | 0.007
100 logs, no fix | 0.007
100 logs, fix    | 0.045

This may have a very bad impact on SDK performance and correctness, as we use a single thread for both writes and reads. The thread may get stuck on writes if it's over-busy with reads. We'd have to re-balance read/write time constraints or introduce a separate thread for reads to keep the SDK performant and correct in heavily-logging apps.
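
A sketch of the separate-reads-thread option (queue labels here are made up for illustration; the SDK's actual queue setup differs):

import Foundation

// Moving reads (and their validation cost) to a dedicated serial queue
// would prevent heavy reads from starving writes.
let writeQueue = DispatchQueue(label: "com.example.sdk-writes", qos: .utility)
let readQueue = DispatchQueue(label: "com.example.sdk-reads", qos: .utility)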

We should reproduce the issue first and fix it where it occurs. The 0.3-0.5% impact doesn't justify such a drastic change IMO.

@buranmert buranmert force-pushed the buranmert/RUMM-513-malformed-logs-strikes-back branch from 1c50aea to 15ee54c on July 27, 2020 15:22
@buranmert buranmert commented Jul 27, 2020

I'd add a validator: (Data) -> Bool property to the DataFormat type and validate accordingly, depending on whether the payload is a single JSON object or a bunch of newline-separated JSON objects; this is solvable ✅
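
A sketch of that idea, assuming a simplified DataFormat (the real type in the SDK carries more fields, elided here):

import Foundation

struct DataFormat {
    // ... existing fields (e.g. payload prefix/suffix/separator) elided ...
    let validate: (Data) -> Bool
}

// Logging: the whole batch must parse as a single JSON value.
let logsValidate: (Data) -> Bool = { data in
    (try? JSONSerialization.jsonObject(with: data, options: [])) != nil
}

// Tracing: the batch is newline-separated JSON objects, validated line by line.
let tracesValidate: (Data) -> Bool = { data in
    data.split(separator: UInt8(ascii: "\n")).allSatisfy { line in
        (try? JSONSerialization.jsonObject(with: Data(line), options: [])) != nil
    }
}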

Sure, it's a trade-off: validation adds ~0.04 seconds of CPU time on a background thread every 5 seconds, but in return (hopefully 🤞) 100% of the logs will be parsed successfully.

I can't reproduce the issue at all, so I'm closing the PR.

@buranmert buranmert closed this Jul 27, 2020
@buranmert buranmert deleted the buranmert/RUMM-513-malformed-logs-strikes-back branch August 6, 2020 09:38