
New module to stream FileInfoStream from s3 #339

Merged: bbalser merged 10 commits into main from bbalser/incoming-data-poller on Feb 13, 2023
Conversation

@bbalser bbalser (Contributor) commented Feb 6, 2023

This will poll s3 for new files at a short interval, but keep track of each individual file so it can handle files arriving out of order. The service will receive an already decoded stream over a channel whenever a new file is seen in s3.

The benefit of this is that we will no longer have to configure offsets to match the previous service's roll time on its output files.
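
As a rough illustration of that polling-and-tracking behavior, a minimal sketch (every name here, FileInfo, list_files_in_window, the 30s interval, is an assumption, not the module's actual API):

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

// Illustrative stand-ins; none of these names are the module's real API.
#[derive(Clone)]
pub struct FileInfo {
    pub key: String,
}

async fn list_files_in_window() -> Vec<FileInfo> {
    Vec::new() // stand-in for the s3 list call over the lookback window
}

// Poll s3 at a short interval and track every file individually (a
// HashSet here, the files_processed table in the real module), so a
// file that lands in s3 out of timestamp order is still picked up on
// a later pass instead of being skipped by an offset.
async fn poll_loop(sender: mpsc::Sender<FileInfo>) {
    let mut seen: HashSet<String> = HashSet::new();
    loop {
        for file in list_files_in_window().await {
            if seen.insert(file.key.clone()) {
                // the real module decodes first and sends a FileInfoStream
                if sender.send(file).await.is_err() {
                    return; // receiver dropped
                }
            }
        }
        sleep(Duration::from_secs(30)).await;
    }
}
```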

Table definition:

create table files_processed (
	file_name varchar primary key,
	file_type varchar,
	file_timestamp timestamp,
	processed_at timestamp
)
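
For a concrete sense of how this table gets used, two hypothetical queries it could back (an sqlx sketch assuming the chrono feature; not code from the PR):

```rust
// Hypothetical sqlx queries against files_processed; not from the PR.
async fn already_processed(pool: &sqlx::PgPool, file_name: &str) -> sqlx::Result<bool> {
    // dedup check for a file that may have arrived out of order
    sqlx::query_scalar("select exists(select 1 from files_processed where file_name = $1)")
        .bind(file_name)
        .fetch_one(pool)
        .await
}

async fn latest_timestamp(
    pool: &sqlx::PgPool,
) -> sqlx::Result<Option<chrono::NaiveDateTime>> {
    // resume point; only when this returns None does start_after apply
    sqlx::query_scalar("select max(file_timestamp) from files_processed")
        .fetch_one(pool)
        .await
}
```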

Configuration:
start_after --> the starting point when querying s3; only used when no records are found in the db.
max_lookback --> the maximum distance back from now when querying s3.

Having both set doesn't really make sense. Do we need both, or should this be an enum?
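
If it did become an enum, it might look something like this (a sketch only; the PR keeps them as two separate settings):

```rust
use chrono::{DateTime, Duration, Utc};

// Hypothetical enum making the two options mutually exclusive; not
// what the PR actually implements.
pub enum QueryStart {
    /// absolute starting point, used only when files_processed is empty
    StartAfter(DateTime<Utc>),
    /// relative window: never look further back than now - lookback
    MaxLookback(Duration),
}

impl QueryStart {
    pub fn resolve(&self, now: DateTime<Utc>) -> DateTime<Utc> {
        match self {
            QueryStart::StartAfter(ts) => *ts,
            QueryStart::MaxLookback(lookback) => now - *lookback,
        }
    }
}
```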

Another design choice that comes off a little weird, but I couldn't come up with anything better: calling FileInfoStream::into_stream consumes the struct and inserts the record into the db within the provided transaction. It feels odd to insert the db record before reading the stream, but the thought was that it shouldn't matter as long as the transaction is committed after the stream is processed. Open to other ideas.
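
Roughly the usage pattern being described, as a sketch; the exact signatures of FileInfoStream::into_stream and the process helper are assumptions:

```rust
use futures::StreamExt;

// Sketch of the consume-and-record pattern described above; signatures
// are assumed, not copied from the PR.
async fn handle_file(
    infos: FileInfoStream,
    pool: &sqlx::PgPool,
) -> anyhow::Result<()> {
    let mut tx = pool.begin().await?;
    // into_stream consumes the struct and inserts into files_processed
    // inside `tx`, before any item has been read from the stream...
    let mut stream = infos.into_stream(&mut tx).await?;
    while let Some(item) = stream.next().await {
        process(item)?; // hypothetical per-item handler
    }
    // ...so if processing fails before this commit, the insert rolls
    // back with the transaction and the file is retried later.
    tx.commit().await?;
    Ok(())
}
```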

@bbalser bbalser marked this pull request as ready for review February 6, 2023 16:21
@andymck andymck (Contributor) left a comment

lgtm, but I wonder if we need to allow the file receiver (i.e. the entity which receives the file stream contents) to specify a max concurrent file limit, to avoid the receiver being overwhelmed should there be a lot of files, or a lot of msgs in the files being streamed. That would let the receiver control the flow rate.

@bbalser bbalser (Contributor, Author) commented Feb 10, 2023

> lgtm, but I wonder if we need to allow the file receiver (i.e. the entity which receives the file stream contents) to specify a max concurrent file limit, to avoid the receiver being overwhelmed should there be a lot of files, or a lot of msgs in the files being streamed. That would let the receiver control the flow rate.

So allow the receiver to specify the size of the channel? I like that idea.
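
i.e. something like a bounded channel whose capacity the receiver chooses (illustrative sketch, not the PR's code):

```rust
use tokio::sync::mpsc;

// Illustrative only: the receiver picks the capacity, so once that many
// file streams are in flight, sender.send().await blocks and the s3
// poller is backpressured instead of overwhelming the consumer.
pub fn file_stream_channel<T>(
    capacity: usize,
) -> (mpsc::Sender<T>, mpsc::Receiver<T>) {
    mpsc::channel(capacity)
}
```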

@bbalser bbalser force-pushed the bbalser/incoming-data-poller branch from c7ea52a to 9e3a077 on February 10, 2023 19:46
@bbalser bbalser merged commit 61bd696 into main Feb 13, 2023
@bbalser bbalser deleted the bbalser/incoming-data-poller branch February 13, 2023 13:41