Hugging Face Loader: Add lazy load #4799

Merged
merged 2 commits into from
May 17, 2023

@eyurtsev eyurtsev commented May 16, 2023

Add lazy load to HF datasets loader

Unfortunately, there are no tests as far as I can tell. I verified the code manually.
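For readers unfamiliar with the pattern, here is a minimal sketch of what adding a lazy load to a loader can look like. It is not the code from this PR: `Document` and `SketchLoader` are hypothetical stand-ins, and the in-memory dict of splits only mimics the shape of a loaded dataset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document class."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchLoader:
    """Illustrative loader; `dataset` mimics a dict of splits, each a list of rows."""

    def __init__(
        self,
        dataset: Dict[str, List[Dict[str, Any]]],
        page_content_column: str = "text",
    ) -> None:
        self.dataset = dataset
        self.page_content_column = page_content_column

    def lazy_load(self) -> Iterator[Document]:
        """Yield documents one at a time instead of building a list up front."""
        col = self.page_content_column
        yield from (
            Document(
                page_content=row[col],
                # Everything except the content column becomes metadata.
                metadata={k: v for k, v in row.items() if k != col},
            )
            for split in self.dataset
            for row in self.dataset[split]
        )

    def load(self) -> List[Document]:
        """Eager variant, kept for callers that expect a list."""
        return list(self.lazy_load())


loader = SketchLoader(
    {"train": [{"text": "a", "label": 0}, {"text": "b", "label": 1}]}
)
first = next(loader.lazy_load())  # only the first row has been converted so far
docs = loader.load()  # the eager path still returns a full list
```

Keeping `load` as a thin wrapper around `lazy_load` means existing callers see the same list as before, while new callers can iterate without holding every document in memory.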

@eyurtsev eyurtsev requested review from hwchase17 and dev2049 May 16, 2023 18:01
@@ -72,13 +73,15 @@ def load(self) -> List[Document]:
             num_proc=self.num_proc,
         )
 
-        docs = [
+        yield from (

Contributor


This would be even better if we use streaming datasets https://huggingface.co/docs/datasets/stream

Collaborator Author


If I understood correctly, streaming=True doesn't download the data locally, but streaming=False does?

It sounds like we should pass a streaming param through to the dataset, rather than use streaming=True by default. Is that correct?

Contributor


Yeah, I don't think we should change the behavior right now.

@eyurtsev
Collaborator Author

@vowelparrot I'm going to merge as is. My understanding is that making streaming=True the default may change behavior in an undesired way: the data isn't downloaded locally, and having a local copy is often desirable. Instead, we can follow up with a PR that exposes more parameters in the init and passes them to the load_dataset function.
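A rough sketch of what that follow-up could look like, assuming the loader simply forwards a `streaming` flag to `datasets.load_dataset` and leaves the default at `False` so current behavior is unchanged. `StreamingAwareLoader`, the `load_fn` hook, and `fake_load_fn` are hypothetical names used only for illustration; the hook exists so the sketch can be exercised without network access.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def _default_load_fn(**kwargs: Any) -> Any:
    # Imported lazily so the sketch runs even when `datasets` is not installed.
    from datasets import load_dataset

    return load_dataset(**kwargs)


class StreamingAwareLoader:
    """Hypothetical loader that exposes `streaming` instead of hard-coding it."""

    def __init__(
        self,
        path: str,
        name: Optional[str] = None,
        page_content_column: str = "text",
        streaming: bool = False,  # default preserves the download-locally behavior
        load_fn: Callable[..., Any] = _default_load_fn,  # assumption: test hook
    ) -> None:
        self.path = path
        self.name = name
        self.page_content_column = page_content_column
        self.streaming = streaming
        self.load_fn = load_fn

    def lazy_load(self) -> Iterator[Dict[str, Any]]:
        """Yield one record per row, across every split of the dataset."""
        dataset = self.load_fn(
            path=self.path, name=self.name, streaming=self.streaming
        )
        col = self.page_content_column
        for split in dataset.keys():
            for row in dataset[split]:
                yield {
                    "page_content": row[col],
                    "metadata": {k: v for k, v in row.items() if k != col},
                }


# Usage with a stub in place of the real load function.
calls: List[Dict[str, Any]] = []


def fake_load_fn(**kwargs: Any) -> Dict[str, List[Dict[str, Any]]]:
    calls.append(kwargs)  # record what the loader forwarded
    return {"train": [{"text": "hello", "id": 1}]}


loader = StreamingAwareLoader("some/dataset", streaming=True, load_fn=fake_load_fn)
rows = list(loader.lazy_load())
```

Callers who want streaming opt in explicitly; everyone else keeps getting a locally downloaded dataset.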

@eyurtsev eyurtsev merged commit 2d20a11 into master May 17, 2023
@eyurtsev eyurtsev deleted the eugene/streaming_hugging_face branch May 17, 2023 16:04
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
This was referenced Jun 25, 2023