load dataset from refs/convert/parquet instead of main #1707

Closed
Hakimovich99 opened this issue Oct 3, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@Hakimovich99 commented Oct 3, 2023

Describe the bug

I would like to use dask to read the parquet files created by the parquet_converter for lambdalabs/pokemon-blip-captions.

Based on what's mentioned here:

https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

I tried this

p = "hf://datasets/lambdalabs/pokemon-blip-captions@main"

and that

p = "hf://datasets/lambdalabs/pokemon-blip-captions@main/data"

and both seem to work

df = dd.read_parquet(p)
print(df)

However, when I try

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet"

or

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train/000.parquet"

it fails to find the files, which is odd because we're only changing the branch name and the parquet file is still there. So it's either an internal error from the HF Hub, or the branch is not publicly accessible by default.

Reproduction

from dask import dataframe as dd

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train"

df = dd.read_parquet(p)
print(df)

Logs

No response

System info

- huggingface_hub version: 0.17.3
- Platform: macOS-13.5.2-arm64-arm-64bit
- Python version: 3.10.9
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /Users/hakimamri/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.0.1
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.4.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.22.3
- pydantic: 1.10.12
- aiohttp: 3.8.5
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /Users/hakimamri/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /Users/hakimamri/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/hakimamri/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
@Hakimovich99 added the bug label Oct 3, 2023
@Wauplin (Contributor) commented Oct 4, 2023

Hi @Hakimovich99, thanks for reporting. The issue comes from the fact that in the string "@refs/convert/parquet/default/train/000.parquet", the split between the revision ("refs/convert/parquet") and the path ("default/train/000.parquet") is ambiguous. We split only on the first /, so in this case the revision is refs and the path is convert/parquet/default/..., which doesn't exist.
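
For illustration, here is a minimal sketch of the naive parsing (hypothetical code, not the actual HfFileSystem implementation) showing why this goes wrong:

path = "lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train/000.parquet"
repo_id, _, rest = path.partition("@")
revision, _, path_in_repo = rest.partition("/")  # splits on the FIRST "/" only
print(revision)      # 'refs'  <- not a valid revision
print(path_in_repo)  # 'convert/parquet/default/train/000.parquet'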

You have 2 workarounds for this:

  1. provide revision="refs/convert/parquet" explicitly as a kwarg (EDIT: for future readers: this doesn't work)
  2. url-encode the revision as refs%2Fconvert%2Fparquet
>>> from dask import dataframe as dd

# Option 1: explicit revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", revision="refs/convert/parquet")
Dask DataFrame Structure:
                image    text
npartitions=1                
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer

# Option 2: url-encoded revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/default/train/0000.parquet")
Dask DataFrame Structure:
                image    text
npartitions=1                
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer

@Wauplin (Contributor) commented Oct 4, 2023

That being said, maybe we should handle "refs/convert/parquet" and "refs/pr/(\d)+" (PRs) as a special case. If we detect them, we could handle them correctly even when not url-encoded. @mariosasko do you remember talking about something like that already? I have it in mind but can't find the conversation where we discussed it.

(EDIT: the downside would be that a repo with a branch named refs and a folder named pr/1 would be problematic - but not a big issue IMO 😄)

@severo (Collaborator) commented Oct 4, 2023

Maybe the following API convenience endpoint can help: https://huggingface.co/docs/hub/api#get-apidatasetsrepoidparquetconfigsplitnparquet

GET /api/datasets/(repo_id)/parquet/(config)/(split)/(n).parquet

It redirects to the file on the refs/convert/parquet branch; see for example https://huggingface.co/api/datasets/glue/parquet/ax/test/0.parquet, which returns the same content as https://huggingface.co/datasets/glue/resolve/refs%2Fconvert%2Fparquet/ax/test/0000.parquet
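
A quick sketch to check the redirect yourself (assuming requests is available; the exact status code may differ):

import requests

# the convenience endpoint redirects to the refs/convert/parquet branch
r = requests.head(
    "https://huggingface.co/api/datasets/glue/parquet/ax/test/0.parquet",
    allow_redirects=False,
)
print(r.status_code)              # e.g. 302
print(r.headers.get("Location"))  # .../resolve/refs%2Fconvert%2Fparquet/ax/test/0000.parquet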

cc @lhoestq

@Wauplin (Contributor) commented Oct 4, 2023

I don't think we can make HfFileSystem work with URLs like /api/datasets/(repo_id)/parquet/(config)/(split)/(n).parquet - at least it's not meant for that. Also, we would need to know the number of splits (with /api/datasets/(repo_id)/parquet?), while dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", revision="refs/convert/parquet") is supposed to read all of them at once. But good to know about the /api alias for the /resolve endpoint - that's a TIL for me :)
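
For completeness, a sketch of listing the converted files directly with HfFileSystem, reusing the url-encoded revision from the workaround above:

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# the url-encoded revision avoids the ambiguous-slash problem discussed above
files = fs.glob(
    "datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/**/*.parquet"
)
print(files)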

@Hakimovich99 (Author)

(quoting @Wauplin's workarounds above)

Hi @Wauplin, thanks for your answers! I have tried both options with a dataset containing only jsonl files, which were converted to parquet by the parquet_converter: jamescalam/llama-2-arxiv-papers-chunked.
Option 2 works well but option 1 doesn't. Here are the code and the error I get:

# Option 1: explicit revision
from dask import dataframe as dd
df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", revision="refs/convert/parquet")
print(df)
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: No files satisfy the `parquet_file_extension` criteria (files must end with ('.parq', '.parquet', '.pq')).

I think that the revision argument actually doesn't do anything (defaults to main).

@lhoestq (Member) commented Oct 4, 2023

Apparently the **kwargs of dd.read_parquet are passed to read_partitions, not to the filesystem?

Though if we ever want option 1 to work, we could modify HfFileSystem: add revision as a storage option (i.e. add it to its __init__).
This would allow passing storage_options={"revision": "refs/convert/parquet"} to dd.read_parquet.
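
Something along these lines (a hypothetical sketch of the suggested change, not the current huggingface_hub API):

from huggingface_hub import HfFileSystem

# Hypothetical sketch: HfFileSystem does NOT currently accept `revision`
# in __init__; this only illustrates the proposed storage option.
class HfFileSystemWithRevision(HfFileSystem):
    def __init__(self, *args, revision=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.revision = revision  # would be the default revision when resolving paths

# with such a change, dask could forward the option through storage_options:
# dd.read_parquet("hf://datasets/...", storage_options={"revision": "refs/convert/parquet"})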

Option 2 is the best option for now

@mariosasko (Contributor) commented Oct 4, 2023

@lhoestq I thought storage_options={"revision": "refs/convert/parquet"} would already work, assuming dask is aligned with pandas, because pandas propagates these parameters to fsspec.open. But this does not work in pandas either - I thought their logic was a bit more advanced than it is.

dask has open_file_options for this purpose. However, these options are then passed to the fsspec.parquet.open_parquet_file function, which calls fs.size without any options, leading to a failure in our case. So special handling of "refs/convert/parquet" seems to be the best solution.
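
In other words, per the explanation above, even something like this sketch would not help:

from dask import dataframe as dd

# sketch of the failing path: open_file_options are forwarded to
# fsspec.parquet.open_parquet_file, but fs.size() is called without them,
# so the revision never reaches the filesystem
df = dd.read_parquet(
    "hf://datasets/jamescalam/llama-2-arxiv-papers-chunked",
    open_file_options={"revision": "refs/convert/parquet"},  # does not work
)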

@Wauplin

maybe we should handle "refs/convert/parquet" and "refs/pr/(\d)+" (PRs) as a special case

I don't remember suggesting this, but I would be fine with this change

@Wauplin (Contributor) commented Oct 4, 2023

Option 2 works well but option 1 doesn't.

Apparently the **kwargs for dd.read_parquet seem to be passed to read_partitions, not to the filesystem ?

Oops, sorry, I tested it too quickly. Yep, the option 1 I suggested was indeed completely wrong. I also tested option 1 with storage_options and it doesn't seem to work with dask:

>>> from dask import dataframe as dd
>>> df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", storage_options={"revision": "refs/convert/parquet"})
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: No files satisfy the `parquet_file_extension` criteria (files must end with ('.parq', '.parquet', '.pq')).
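
For future readers, a small sketch showing how the url-encoded path from option 2 can be built with the standard library:

from urllib.parse import quote
from dask import dataframe as dd

# build the url-encoded revision of option 2 programmatically
revision = quote("refs/convert/parquet", safe="")  # 'refs%2Fconvert%2Fparquet'
df = dd.read_parquet(f"hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@{revision}")
print(df)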

But since option 2 is fine, we can close this issue, right? I have opened a separate issue to handle refs/convert/parquet correctly.

@lhoestq (Member) commented Oct 5, 2023

Instead of refs/convert/parquet, which is ambiguous because of the slashes, you can use its alias ~parquet:

from dask import dataframe as dd
dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@~parquet")
