load dataset from refs/convert/parquet instead of main #1707
Comments
Hi @Hakimovich99, thanks for reporting. The issue comes from the fact that in the string `hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train`, the `/` characters of the `refs/convert/parquet` revision cannot be distinguished from path separators. You have 2 workarounds for this:
>>> from dask import dataframe as dd
# Option 1: explicit revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", revision="refs/convert/parquet")
Dask DataFrame Structure:
                image    text
npartitions=1
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer
# Option 2: url-encoded revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/default/train/0000.parquet")
Dask DataFrame Structure:
                image    text
npartitions=1
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer
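Side note: with the same URL-encoded revision, pointing dask at the split directory instead of a single shard file should also work (a sketch, not tested here):
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/default/train")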
That being said, maybe we should handle un-encoded `/` in the revision part of such paths directly (EDIT: the catch would be that a repo with a branch named `refs` could then not be distinguished from one using the `refs/convert/parquet` revision).
Maybe the following API convenience endpoint can help: https://huggingface.co/docs/hub/api#get-apidatasetsrepoidparquetconfigsplitnparquet
It redirects to the file in the `refs/convert/parquet` branch. cc @lhoestq
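A minimal sketch of how that endpoint could be checked (the shard index `0.parquet` is an assumption based on the documented route):
>>> import requests
>>> r = requests.get("https://huggingface.co/api/datasets/lambdalabs/pokemon-blip-captions/parquet/default/train/0.parquet", allow_redirects=False)
>>> print(r.status_code, r.headers.get("Location"))  # expected: a redirect pointing into refs/convert/parquet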
I don't think we can make
Hi @Wauplin, thanks for your answers! I have tried both options with a dataset containing jsonl files only that were converted into parquet files by the parquet_converter: jamescalam/llama-2-arxiv-papers-chunked
# Option 1: explicit revision
from dask import dataframe as dd
df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", revision="refs/convert/parquet")
print(df)
I think that the revision argument actually doesn't do anything (defaults to main).
Apparently the `revision` argument is simply ignored by `dd.read_parquet`. Though if we ever want option 1 to possibly work we can do some modifications to `HfFileSystem`. Option 2 is the best option for now.
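For completeness, a minimal sketch of option 2 applied to this dataset (the `default/train` config/split layout is an assumption about what the converter produced here):
from urllib.parse import quote
from dask import dataframe as dd

revision = quote("refs/convert/parquet", safe="")  # -> "refs%2Fconvert%2Fparquet"
df = dd.read_parquet(f"hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@{revision}/default/train")
print(df)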
@lhoestq
I don't remember suggesting this, but I would be fine with this change
Oops, sorry, I tested it too quickly. Yep, indeed, option 1 I suggested was completely wrong. I also tested option 1 with storage_options and it doesn't seem to work with dask:
>>> from dask import dataframe as dd
>>> df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", storage_options={"revision": "refs/convert/parquet"})
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: No files satisfy the `parquet_file_extension` criteria (files must end with ('.parq', '.parquet', '.pq')).
But since option 2 is fine, we can close this issue, right? I have opened a separate issue to handle it.
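A quick way to sanity-check that the converted branch itself is readable is to list it with huggingface_hub's HfFileSystem before pointing dask at it (a sketch, assuming the repo used above):
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()
>>> fs.ls("datasets/jamescalam/llama-2-arxiv-papers-chunked@refs%2Fconvert%2Fparquet", detail=False)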
Instead of the URL-encoded `refs%2Fconvert%2Fparquet`, maybe we could support an alias, something like:
dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@~parquet")
Describe the bug
I would like to read, using dask, parquet files created by the parquet_converter from lambdalabs/pokemon-blip-captions
so based on what's mentioned here:
https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
I tried reading from the main branch in a couple of ways, and both seem to work.
However, when I try to read from the refs/convert/parquet branch instead (see the reproduction below), it seems to fail to find it, which is odd because we're just changing the branch name and the parquet file is still there. So it's either an internal error from the HF Hub, or maybe the branch is not publicly accessible by default.
Reproduction
from dask import dataframe as dd
p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train"
df = dd.read_parquet(p)
print(df)
Logs
No response
System info