Feat: add support for HuggingFace datasets #462

Merged
merged 43 commits into from
Feb 10, 2025

Collaborator

@deependujha deependujha commented Feb 4, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

  • Index HF datasets and then stream them.
import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

hf_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"

ld.index_parquet_dataset(hf_uri, "hf-cache")

# -------

ds = ld.StreamingDataset(hf_uri, index_path="hf-cache", item_loader=ParquetLoader())

for _ds in ds:
    print(f"{_ds=}")

Or just stream the HF dataset directly 🚀

  • It'll automatically index the HF dataset and cache the index.
import litdata as ld

hf_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"

ds = ld.StreamingDataset(hf_uri)

for _ds in ds:
    print(f"{_ds=}")
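
As a rough illustration of the "index once, then reuse the cache" behaviour described above, here is a minimal sketch using only the standard library. It is not litdata's implementation: `cache_dir_for`, `get_or_build_index`, `fake_build`, the hashed directory layout, and the `index.json` file name are all hypothetical names chosen for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_dir_for(uri: str, root: Path) -> Path:
    """Map a dataset URI to a stable cache directory (hypothetical layout)."""
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
    return root / digest


def get_or_build_index(uri: str, root: Path, build: Callable[[str], dict]) -> dict:
    """Return the cached index for `uri`, building and storing it on first use."""
    index_file = cache_dir_for(uri, root) / "index.json"
    if index_file.exists():
        # Cache hit: reuse the index written by an earlier call.
        return json.loads(index_file.read_text())
    # Cache miss: run the (potentially expensive) indexing step once.
    index = build(uri)
    index_file.parent.mkdir(parents=True, exist_ok=True)
    index_file.write_text(json.dumps(index))
    return index


build_calls = []


def fake_build(uri: str) -> dict:
    """Stand-in for real indexing; records each call so reuse is observable."""
    build_calls.append(uri)
    return {"uri": uri, "chunks": []}


hf_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"
with tempfile.TemporaryDirectory() as tmp:
    first = get_or_build_index(hf_uri, Path(tmp), fake_build)
    second = get_or_build_index(hf_uri, Path(tmp), fake_build)

# The second call is served from the cache, so the builder runs only once.
print(first == second, len(build_calls))  # prints: True 1
```

Keying the cache directory on a hash of the URI is one simple way to let repeated runs against the same dataset skip re-indexing; the real cache layout may differ.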

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha requested a review from tchaton as a code owner February 4, 2025 09:40
@deependujha deependujha marked this pull request as draft February 4, 2025 09:40

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 89.56522% with 24 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (f09eb65) to head (e1a4706).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #462    +/-   ##
====================================
+ Coverage    78%    79%    +1%     
====================================
  Files        37     38     +1     
  Lines      5370   5547   +177     
====================================
+ Hits       4193   4386   +193     
+ Misses     1177   1161    -16     

@deependujha deependujha marked this pull request as ready for review February 4, 2025 20:59
Member

@justusschock justusschock left a comment

could we get some sort of tests for this?

Collaborator

@tchaton tchaton left a comment

It would be good to make a few more tests. This looks a bit light for the quantity of code ;)

Collaborator

@lantiga lantiga left a comment

Looks great @deependujha! just a few comments

@tchaton tchaton merged commit b439281 into Lightning-AI:main Feb 10, 2025
29 checks passed
@deependujha deependujha deleted the feat/add-hf-support branch February 14, 2025 09:07
5 participants