Enable pipeline to discard data older than XX #4667

Open
marfago opened this issue Jun 29, 2024 · 4 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@marfago

marfago commented Jun 29, 2024

Is your feature request related to a problem? Please describe.
I have a data pipeline built from an AOSS pipeline and an AOSS collection. This pipeline is a real-time monitor for logs.
We recently had an outage, so the source did not move logs for a few days. When we finally unblocked the pipeline and restarted ingestion, all of those days were moved at once and the AOSS pipeline started ingesting from oldest to newest. This behavior does not work for us: we want a real-time monitor, so we prioritize fresher data over older data.

Describe the solution you'd like
I propose introducing a new behavior where the pipeline can discard data in the queue that is older than XX (days, hours, minutes). This way, users can choose to prioritize fresher data over older data without causing the queue to grow indefinitely. In my case, I could set this option to 1h and ingest only fresh data (at least for some time), forgetting about the past.

For example:
max_retention: 1h
max_retention: 1d
max_retention: 1w
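
To make the intended semantics concrete, here is a minimal Python sketch of how a value such as `1h`, `1d`, or `1w` could be interpreted and applied to an event timestamp. `max_retention` is only the option name proposed in this issue, and `parse_retention` and `is_expired` are hypothetical helpers written for illustration; none of this is existing Data Prepper code.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical unit table: h, d and w come from the examples above,
# m is included because the proposal also mentions minutes.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_retention(value: str) -> timedelta:
    """Turn a string such as '1h' or '2d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhdw])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_expired(event_time: datetime, max_retention: str,
               now: Optional[datetime] = None) -> bool:
    """Return True when the event is older than the retention window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - event_time > parse_retention(max_retention)
```

With `max_retention: 1h`, an event stamped two hours ago would be discarded, while one stamped thirty minutes ago would still be ingested.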

Describe alternatives you've considered (Optional)
I don't have any.

Additional context
Related to #4666

@dlvenable
Member

We do have the drop_events processor which can drop events that meet a certain condition.

I think what we are missing is a Data Prepper expression and/or function for comparing time.

Something like this could work:

drop_when: /my_timestamp < now() - 3d

Where 3d is our standard Data Prepper duration concept.

However, we do not have a now() function, nor the ability to perform comparisons against times. But, both could be added.

@dlvenable dlvenable added enhancement New feature or request good first issue Good for newcomers and removed untriaged labels Jul 2, 2024
@dlvenable
Member

This could be done a little more easily by adding just a now() method for now.

drop_when: /my_timestamp < now() - 3 * 24 * 60 * 60 * 1000
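
For readers checking the arithmetic, the expression above can be sketched in Python. This assumes that `now()` would return the current time in epoch milliseconds and that `my_timestamp` is stored in epoch milliseconds as well, which the trailing `* 1000` suggests but the thread does not state. `should_drop` is a hypothetical name used only for illustration.

```python
import time
from typing import Optional

# 3 days * 24 hours * 60 minutes * 60 seconds * 1000 milliseconds
THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000  # 259_200_000


def should_drop(my_timestamp_ms: int, now_ms: Optional[int] = None) -> bool:
    """Mirror of `/my_timestamp < now() - 3 * 24 * 60 * 60 * 1000`.

    Assumes both values are epoch milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return my_timestamp_ms < now_ms - THREE_DAYS_MS
```

If the source timestamps were in seconds or were ISO 8601 strings instead, they would need converting before a comparison like this makes sense.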

@marfago, are you interested in working on adding the now() method?

@marfago
Author

marfago commented Jul 2, 2024

@dlvenable Thank you for your comment. For the solution that you propose, how is my_timestamp retrieved?

@dlvenable
Member

@marfago, do your events have an existing timestamp field that you could use? my_timestamp would need to be a value from your source events.

Are you using Amazon S3 as a source? If you also need a timestamp, we could include the value of the S3 object header Last-Modified as metadata on the events. Then you'd be able to use that to approximate the time of the event. This could be useful in general as well.
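
As a rough illustration of the Last-Modified idea, the sketch below parses an HTTP-date header value with Python's standard library and compares it against a maximum age. The header string and the helper names `approximate_event_time` and `is_stale` are assumptions made for this example; how Data Prepper would actually expose the value as event metadata is not decided in this thread.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def approximate_event_time(last_modified_header: str) -> datetime:
    """Parse an HTTP-date such as 'Sat, 29 Jun 2024 10:00:00 GMT'."""
    return parsedate_to_datetime(last_modified_header)


def is_stale(last_modified_header: str, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the object was last modified more than max_age ago."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - approximate_event_time(last_modified_header) > max_age
```

Note that Last-Modified reflects when the object was written, not when each log line was produced, so it only approximates the event time.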


2 participants