
Add S3 SQS Data Event Notification message as metadata to records #3641

Open
rhysxevans opened this issue Nov 12, 2023 · 1 comment
Labels
enhancement New feature or request plugin - source A plugin to receive data from a service or location.

Comments

@rhysxevans

Is your feature request related to a problem? Please describe.
We have a requirement to produce reports showing the time difference between a file being acted upon (generally, an upload) in S3 and the time that we process it in Data Prepper.

This page outlines the data provided in the SQS notification message: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html

From that, my understanding at present is that only `Records.s3.bucket.name` and `Records.s3.object.key` are used and exported to the event.

Ideally we would want the entire S3 data event message to be available as metadata "attached" to the record. This would allow us to extend our requirement to, say: this file was uploaded by this "person/process/thing" at time y, we "received" it at time x, and the event had an original timestamp of time z.

So if this event ended up in OpenSearch, we might have an extra set of data along the lines of:

```
s3: {
  eventTime:  <time_object_created>   (Records.eventTime)
  eventName:  <put_object>            (Records.eventName)
  bucketName: <bucket_name>           (Records.s3.bucket.name)
  object:     <object>                (Records.s3.object.key)
}
internal: {
  original_time: <timestamp_of_log_line>     (origin time written by app)
  s3_time:       <s3.eventTime>              (read from S3 Data Event Notification message)
  dp_time:       <time_ingest_by_dataprepper>  (added by Data Prepper)
  os_time:       <time_ingest_into_opensearch> (via an ingest pipeline or similar)
}
```

With the above we would be able to work out the total latency from event emitted to event ingested into OpenSearch. If this time were prolonged, we would hopefully be able to determine where the latency was introduced. This would help us measure our SLAs/SLOs accurately.
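To make the bookkeeping concrete, here is a minimal sketch of parsing an S3 Event Notification message and computing the S3-to-ingest delta. The field names follow the AWS notification-content-structure page linked above; the message contents and the ingest timestamp are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Illustrative S3 Event Notification message; field names per the AWS
# notification-content-structure docs, values invented for this example.
message = json.dumps({
    "Records": [{
        "eventTime": "2023-11-12T10:00:00.000Z",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "logs/app.log"},
        },
    }]
})

record = json.loads(message)["Records"][0]
# eventTime uses a trailing "Z"; normalize it for fromisoformat().
s3_time = datetime.fromisoformat(record["eventTime"].replace("Z", "+00:00"))

# Hypothetical ingest timestamp; in practice this would be stamped by the
# pipeline when the record is read.
dp_time = datetime(2023, 11, 12, 10, 0, 5, tzinfo=timezone.utc)

latency_seconds = (dp_time - s3_time).total_seconds()
print(record["eventName"], record["s3"]["bucket"]["name"], latency_seconds)
```

With the event message attached as metadata, this kind of delta could be computed per record at any stage of the pipeline.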

Please note I may have mixed up terms, but hopefully I have gotten the gist across of what we are looking for and why.

Describe the solution you'd like
We would like the S3 Data Event message to be attached as "metadata" to the records processed from an S3-sourced file.

Describe alternatives you've considered (Optional)
We would need to build a separate solution to read our final output, search the S3 bucket, get the object time, and then update the record in the final output.

Additional context
Off the back of the very brief discussion in #3626.

@dlvenable
Member

@rhysxevans, thank you for creating this issue and elaborating on your use case. Since you are looking for end-to-end latencies, this could work very well with Data Prepper's new end-to-end latency feature (#3494).

We added a to_origination_metadata configuration to the date processor in #3583. With this, the sinks can report the latency between when the event was created (as defined by the pipeline author) and when the data reached the sink (as determined by Data Prepper).
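For reference, a hypothetical pipeline fragment showing where `to_origination_metadata` would sit. This is a sketch only: the `timestamp` key and the timestamp pattern are assumptions for illustration, not taken from this issue.

```yaml
# Hypothetical Data Prepper processor fragment; the "timestamp" key and
# its pattern are assumed here for illustration.
processor:
  - date:
      match:
        - key: timestamp
          patterns: ["yyyy-MM-dd'T'HH:mm:ss.SSSZ"]
      to_origination_metadata: true
```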

@dlvenable dlvenable added enhancement New feature or request plugin - source A plugin to receive data from a service or location. and removed untriaged labels Nov 13, 2023