
New module to stream FileInfoStream from s3 #339

Merged: bbalser merged 10 commits into main from bbalser/incoming-data-poller on Feb 13, 2023
Conversation

@bbalser bbalser (Contributor) commented Feb 6, 2023

This will poll s3 for new files at a short interval, but keep track of each individual file so it can handle files arriving out of order. The service will receive an already decoded stream over a channel whenever a new file is seen in s3.

The benefit of this is that we will no longer have to configure offsets to match the previous service's roll time on its output files.
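
As a rough illustration of that polling-and-tracking behavior, a minimal sketch (every name here, FileInfo, list_files_in_window, the 30s interval, is an assumption, not the module's actual API):

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

// Illustrative stand-ins; none of these names are the module's real API.
#[derive(Clone)]
pub struct FileInfo {
    pub key: String,
}

async fn list_files_in_window() -> Vec<FileInfo> {
    Vec::new() // stand-in for the s3 list call over the lookback window
}

// Poll s3 at a short interval and track every file individually (a
// HashSet here, the files_processed table in the real module), so a
// file that lands in s3 out of timestamp order is still picked up on
// a later pass instead of being skipped by an offset.
async fn poll_loop(sender: mpsc::Sender<FileInfo>) {
    let mut seen: HashSet<String> = HashSet::new();
    loop {
        for file in list_files_in_window().await {
            if seen.insert(file.key.clone()) {
                // the real module decodes first and sends a FileInfoStream
                if sender.send(file).await.is_err() {
                    return; // receiver dropped
                }
            }
        }
        sleep(Duration::from_secs(30)).await;
    }
}
```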

Table definition:

create table files_processed (
	file_name varchar primary key,
	file_type varchar,
	file_timestamp timestamp,
	processed_at timestamp
)
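
For a concrete sense of how this table gets used, two hypothetical queries it could back (an sqlx sketch assuming the chrono feature; not code from the PR):

```rust
// Hypothetical sqlx queries against files_processed; not from the PR.
async fn already_processed(pool: &sqlx::PgPool, file_name: &str) -> sqlx::Result<bool> {
    // dedup check for a file that may have arrived out of order
    sqlx::query_scalar("select exists(select 1 from files_processed where file_name = $1)")
        .bind(file_name)
        .fetch_one(pool)
        .await
}

async fn latest_timestamp(
    pool: &sqlx::PgPool,
) -> sqlx::Result<Option<chrono::NaiveDateTime>> {
    // resume point; only when this returns None does start_after apply
    sqlx::query_scalar("select max(file_timestamp) from files_processed")
        .fetch_one(pool)
        .await
}
```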

Configuration:
start_after --> the starting point when querying s3; only used when no records are found in the db.
max_lookback --> the maximum distance back from now when querying s3.

Having both set doesn't really make sense. Do we need both, or should this be an enum?
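
If it did become an enum, it might look something like this (a sketch only; the PR keeps them as two separate settings):

```rust
use chrono::{DateTime, Duration, Utc};

// Hypothetical enum making the two options mutually exclusive; not
// what the PR actually implements.
pub enum QueryStart {
    /// absolute starting point, used only when files_processed is empty
    StartAfter(DateTime<Utc>),
    /// relative window: never look further back than now - lookback
    MaxLookback(Duration),
}

impl QueryStart {
    pub fn resolve(&self, now: DateTime<Utc>) -> DateTime<Utc> {
        match self {
            QueryStart::StartAfter(ts) => *ts,
            QueryStart::MaxLookback(lookback) => now - *lookback,
        }
    }
}
```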

Another design choice that comes off a little weird, but I couldn't come up with anything better: calling FileInfoStream::into_stream consumes the struct and inserts the record into the db within the provided transaction. It feels odd to insert the db record before reading the stream, but the thought was that it shouldn't matter as long as the transaction is committed after the stream is processed. Open to other ideas.
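
Roughly the usage pattern being described, as a sketch; the exact signatures of FileInfoStream::into_stream and the process helper are assumptions:

```rust
use futures::StreamExt;

// Sketch of the consume-and-record pattern described above; signatures
// are assumed, not copied from the PR.
async fn handle_file(
    infos: FileInfoStream,
    pool: &sqlx::PgPool,
) -> anyhow::Result<()> {
    let mut tx = pool.begin().await?;
    // into_stream consumes the struct and inserts into files_processed
    // inside `tx`, before any item has been read from the stream...
    let mut stream = infos.into_stream(&mut tx).await?;
    while let Some(item) = stream.next().await {
        process(item)?; // hypothetical per-item handler
    }
    // ...so if processing fails before this commit, the insert rolls
    // back with the transaction and the file is retried later.
    tx.commit().await?;
    Ok(())
}
```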

@bbalser bbalser marked this pull request as ready for review February 6, 2023 16:21
@andymck andymck (Contributor) left a comment

lgtm, but I wonder if we need to allow the file receiver (i.e. the entity which receives the file stream contents) to specify a max concurrent file limit, to avoid the receiver being overwhelmed should there be a lot of files, or a lot of msgs in the files being streamed. That would let the receiver control the flow rate.

@bbalser bbalser (Contributor, Author) commented Feb 10, 2023

> lgtm, but I wonder if we need to allow the file receiver (i.e. the entity which receives the file stream contents) to specify a max concurrent file limit, to avoid the receiver being overwhelmed should there be a lot of files, or a lot of msgs in the files being streamed. That would let the receiver control the flow rate.

So allow the receiver to specify the size of the channel? I like that idea.
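
i.e. something like a bounded channel whose capacity the receiver chooses (illustrative sketch, not the PR's code):

```rust
use tokio::sync::mpsc;

// Illustrative only: the receiver picks the capacity, so once that many
// file streams are in flight, sender.send().await blocks and the s3
// poller is backpressured instead of overwhelming the consumer.
pub fn file_stream_channel<T>(
    capacity: usize,
) -> (mpsc::Sender<T>, mpsc::Receiver<T>) {
    mpsc::channel(capacity)
}
```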

@bbalser bbalser force-pushed the bbalser/incoming-data-poller branch from c7ea52a to 9e3a077 on February 10, 2023 19:46
@bbalser bbalser merged commit 61bd696 into main Feb 13, 2023
@bbalser bbalser deleted the bbalser/incoming-data-poller branch February 13, 2023 13:41