Hugging Face support #17625
Labels
- accepted: Ready for implementation
- enhancement: New feature or an improvement of an existing feature
- P-medium: Priority: medium
Description
Hugging Face has a platform, datasets, which hosts ML datasets. Right now we have limited support (in the form of `pyarrow` or `fsspec`), which does not allow for lazy evaluation.

Goal:
- Read `hf://.....` URLs in a lazy manner

Pointers:
- Map `hf://` URLs to their `https://` counterparts, either by implementing our own custom `ObjectStore` or by converting the `hf://` URL into a list of `https://` URLs somewhere.
- Hugging Face allows listing files in directories using the following endpoint: `https://huggingface.co/api/datasets/{reponame}/tree/{revision}/{encoded_path}`. This returns paginated JSON, which we could traverse to manually glob all the matching files.

Example URLs:
https://huggingface.co/roneneldan/TinyStories-1M/resolve/main/.gitattributes
https://huggingface.co/api/datasets/roneneldan/TinyStories/tree/main
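
To make the pointers above concrete, here is a rough Python sketch of the second option (converting an `hf://` URL into `https://` URLs). It is only an illustration under several assumptions: the `hf://datasets/{owner}/{name}/{path}` layout, the `datasets/.../resolve/...` download URL shape, the `type` and `path` keys of each listing entry, and all helper names (`parse_hf_url`, `tree_url`, `resolve_url`, `glob_files`) are hypothetical and not taken from this issue. It does no network I/O; fetching and following the paginated listing is left out.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Iterator, NamedTuple
from urllib.parse import quote

HF_ENDPOINT = "https://huggingface.co"


class HfDatasetPath(NamedTuple):
    repo: str      # "{owner}/{name}"
    revision: str  # branch, tag, or commit; defaults to "main" below
    path: str      # path or glob inside the repository, "" for the root


def parse_hf_url(url: str, revision: str = "main") -> HfDatasetPath:
    """Parse the ASSUMED layout hf://datasets/{owner}/{name}/{path}."""
    prefix = "hf://datasets/"
    if not url.startswith(prefix):
        raise ValueError(f"not an hf://datasets/ URL: {url!r}")
    parts = url[len(prefix):].strip("/").split("/", 2)
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"missing repository name in {url!r}")
    path = parts[2] if len(parts) == 3 else ""
    return HfDatasetPath(f"{parts[0]}/{parts[1]}", revision, path)


def tree_url(p: HfDatasetPath, directory: str = "") -> str:
    """Build the listing endpoint quoted in this issue for one directory."""
    url = f"{HF_ENDPOINT}/api/datasets/{p.repo}/tree/{quote(p.revision, safe='')}"
    # Assumption: '/' inside the directory path is left unescaped.
    return f"{url}/{quote(directory)}" if directory else url


def resolve_url(p: HfDatasetPath, file_path: str) -> str:
    """Build the ASSUMED https:// download URL for a single file."""
    rev = quote(p.revision, safe="")
    return f"{HF_ENDPOINT}/datasets/{p.repo}/resolve/{rev}/{quote(file_path)}"


def glob_files(pages: Iterable[list[dict]], pattern: str) -> Iterator[str]:
    """Yield file paths matching `pattern` from already-fetched listing pages.

    Each page is a list of entries, ASSUMED to carry "type" and "path" keys.
    Note that fnmatch lets '*' match '/' as well, unlike a strict path glob.
    """
    for page in pages:
        for entry in page:
            if entry.get("type") == "file" and fnmatchcase(entry["path"], pattern):
                yield entry["path"]
```

With this, `tree_url(parse_hf_url("hf://datasets/roneneldan/TinyStories"))` yields the second example URL above. Each path produced by `glob_files` can then be passed to `resolve_url` to get the list of `https://` URLs to scan.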
In the future, add support for:
- [...] (`storage_options`?)
- [...] (`@parquet..` in the URL)
- `scan_csv` and `scan_(nd)json`