Hugging Face support #17625

Open
4 of 5 tasks
c-peters opened this issue Jul 14, 2024 · 0 comments
Assignees
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium

Comments

@c-peters
Collaborator

c-peters commented Jul 14, 2024

Description

Hugging Face has a platform, Datasets, which hosts ML datasets. Right now we have limited support (in the form of pyarrow or fsspec), which does not allow for lazy evaluation.

Goal:

  • Allow reading from hf://..... URLs in a lazy manner
  • Allow for globbing
pl.scan_parquet("hf://datasets/roneneldan/TinyStories/data/train-*-of-*.parquet")

Pointers:

  • We should be able to convert the hf:// URL to its https:// counterparts, either by implementing our own custom ObjectStore or by expanding the hf:// URL into a list of https:// URLs somewhere. Hugging Face allows listing the files in a directory through the endpoint https://huggingface.co/api/datasets/{reponame}/tree/{revision}/{encoded_path}. It returns paginated JSON, which we could traverse to manually glob all the matching files.
  • Each file can be found under the endpoint `https://huggingface.co/{reponame}/resolve/{revision}/{encoded_path}`
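
A rough sketch of the second option (expanding an hf:// glob into https:// URLs), in Python for illustration only. All function names here are hypothetical and not part of Polars. It assumes each entry of the tree listing is a JSON object with `type` and `path` keys, and that dataset files resolve under a `datasets/` prefix (the template above omits it, and the first example URL below points at a model repo). Fetching the listing and following its pagination are left out; the already-fetched entries are passed in as a list.

```python
from fnmatch import fnmatchcase
from urllib.parse import quote

HF_BASE = "https://huggingface.co"
HF_PREFIX = "hf://datasets/"


def parse_hf_url(url):
    """Split hf://datasets/{owner}/{name}/{path} into (reponame, path)."""
    if not url.startswith(HF_PREFIX):
        raise ValueError(f"unsupported URL: {url!r}")
    parts = url[len(HF_PREFIX):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"missing repo name in {url!r}")
    path = parts[2] if len(parts) > 2 else ""
    return f"{parts[0]}/{parts[1]}", path


def glob_prefix(path):
    """Longest leading directory of `path` without glob characters."""
    keep = []
    for segment in path.split("/")[:-1]:
        if any(c in segment for c in "*?["):
            break
        keep.append(segment)
    return "/".join(keep)


def tree_url(reponame, revision, directory=""):
    """Listing endpoint for one directory of a dataset repo."""
    url = f"{HF_BASE}/api/datasets/{reponame}/tree/{quote(revision, safe='')}"
    return f"{url}/{quote(directory)}" if directory else url


def resolve_url(reponame, revision, path):
    """Download endpoint for one file (the datasets/ prefix is an assumption)."""
    return f"{HF_BASE}/datasets/{reponame}/resolve/{quote(revision, safe='')}/{quote(path)}"


def expand(url, entries, revision="main"):
    """Turn an hf:// glob plus tree entries into matching https:// URLs.

    Note: fnmatch lets `*` match `/` as well, unlike a shell glob.
    """
    reponame, pattern = parse_hf_url(url)
    return [
        resolve_url(reponame, revision, entry["path"])
        for entry in entries
        if entry.get("type") == "file" and fnmatchcase(entry["path"], pattern)
    ]
```

A caller would list `tree_url(reponame, revision, glob_prefix(path))`, collect every page of entries, and hand them to `expand`; the resulting https:// URLs could then be scanned like any other remote Parquet files.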

Example URLs:
https://huggingface.co/roneneldan/TinyStories-1M/resolve/main/.gitattributes
https://huggingface.co/api/datasets/roneneldan/TinyStories/tree/main

In the future, add support for:

  • Private or gated repos with authentication (storage_options?)
  • Auto-converted Parquet files (extra @parquet.. in the URL)
  • Other file types: scan_csv and scan_(nd)json
  • Add write support?
  • Add a chapter in the user-guide (Polars and upstream)
@c-peters c-peters added the enhancement New feature or an improvement of an existing feature label Jul 14, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 14, 2024
@nameexhaustion nameexhaustion added the P-medium Priority: medium label Jul 15, 2024
2 participants