Source: files (s3 and local) #653

Open
wants to merge 24 commits into
base: main

@tim-quix tim-quix commented Nov 26, 2024

Not 100% done yet but here is a fairly refined draft. Just a little more cleanup to do and a tiny bit more work left on Azure, and then documentation assuming we like the approach.

Currently extends FileSource (which itself had some minor adjustments to better accommodate extension).

@tim-quix tim-quix added the connector Issues updating Sinks or Sources label Nov 26, 2024
@tim-quix tim-quix marked this pull request as draft November 26, 2024 01:43
@tim-quix tim-quix changed the title Source: blob store (AWS and Azure) Source: blob stores (AWS and Azure) Nov 26, 2024
docs/connectors/sources/aws-s3-file-source.md (outdated, resolved)
quixstreams/sources/community/file/origins/s3.py (outdated, resolved)
quixstreams/sources/community/file/origins/local.py (outdated, resolved)
quixstreams/sources/community/file/origins/local.py (outdated, resolved)
quixstreams/sources/community/file/origins/s3.py (outdated, resolved)
quixstreams/sources/community/file/origins/base.py (outdated, resolved)

tim-quix commented Nov 27, 2024

TODO:

  • maybe add a simple file download queue?
  • look over Tomas's updated replay timestamp logic and see if changes are needed
  • double-check non-code stuff (adding doc paths, package stuff, etc)

@tim-quix tim-quix changed the title Source: blob stores (AWS and Azure) Source: external files (AWS and Azure) Nov 27, 2024
@tim-quix tim-quix changed the title Source: external files (AWS and Azure) Source: external files (S3 and Azure) Nov 27, 2024
docs/connectors/sources/azure-file-source.md (outdated, resolved)
quixstreams/sources/community/file/origins/azure.py (outdated, resolved)
quixstreams/sources/community/file/origins/base.py (outdated, resolved)
quixstreams/sources/community/file/origins/local.py (outdated, resolved)
Comment on lines +58 to +59
def _get_client(self) -> S3Client:
    return boto_client("s3", **self._credentials)

I think it should be self._get_client too.

BTW, maybe you could replace _get_client with a _client property, plus a good explanation of why it reinstantiates a client every time. Then self._client: Optional[S3Client] = None could be removed from __init__.

Suggested change
- def _get_client(self) -> S3Client:
-     return boto_client("s3", **self._credentials)
+ @property
+ def _client(self) -> S3Client:
+     return boto_client("s3", **self._credentials)

@tim-quix tim-quix Dec 2, 2024

Well, it doesn't need to instantiate one every time... it just has to be instantiated after the source is kicked off as a multiprocessing Process via Source.run() (which is where the pickling error occurs).

The only reason this is a problem at all is that I need to count the partition folders to get the expected partition count, and that happens before the run operation, so that's the thing I have to work around (by not storing the client instance used for that call).

Ideally I'd rather not instantiate a new client on every call... currently I set the client on the first call that I know is made after .run(), and then I can just re-use it as originally desired.
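
For illustration only, here is a minimal, self-contained sketch of that workaround; it is not the code in this PR. FakeClient is a hypothetical stand-in for the boto3 S3 client (it holds a lock purely so that it cannot be pickled), and LazyClientOrigin, get_folder_count, file_collector and can_pickle are assumed names chosen for the example.

```python
import pickle
import threading
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for an S3 client; the lock makes it unpicklable."""

    def __init__(self, **credentials):
        self._credentials = credentials
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def count_folders(self, prefix: str) -> int:
        return 3  # hypothetical: pretend three partition folders exist


class LazyClientOrigin:
    """Keeps only picklable state until the first call made after run()."""

    def __init__(self, **credentials):
        self._credentials = credentials
        self._client: Optional[FakeClient] = None  # not created yet

    def _get_client(self) -> FakeClient:
        return FakeClient(**self._credentials)

    def get_folder_count(self, prefix: str) -> int:
        # Called before run(): use a throwaway client and do not store it,
        # so the object can still be pickled when the process is started.
        return self._get_client().count_folders(prefix)

    def file_collector(self, prefix: str) -> int:
        # Assumed to be the first call made after run(): cache the client here
        # and re-use it on every later call.
        if self._client is None:
            self._client = self._get_client()
        return self._client.count_folders(prefix)


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
```

With this shape, counting folders before run() leaves the origin picklable, while the cached client only comes into existence inside the child process, where pickling is no longer needed.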

@tim-quix tim-quix changed the title Source: external files (S3 and Azure) Source: files (s3 and local) Dec 3, 2024
@tim-quix tim-quix marked this pull request as ready for review December 3, 2024 03:26

tim-quix commented Dec 3, 2024

This is ready to merge, depending on final thoughts on the client stuff for S3.
