load dataset from refs/convert/parquet instead of main #1707

Closed
Hakimovich99 opened this issue Oct 3, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@Hakimovich99 commented Oct 3, 2023

Describe the bug

I would like to use dask to read the parquet files created by the parquet_converter for lambdalabs/pokemon-blip-captions.

Based on what's mentioned here:

https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

I tried this

p = "hf://datasets/lambdalabs/pokemon-blip-captions@main"

and that

p = "hf://datasets/lambdalabs/pokemon-blip-captions@main/data"

and both seem to work

df = dd.read_parquet(p)
print(df)

However, when I try

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet"

or

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train/000.parquet"

it fails to find the files, which is odd because we're only changing the branch name and the parquet file is still there. So it's either an internal error from the HF Hub, or the branch is not publicly accessible by default.

Reproduction

from dask import dataframe as dd

p = "hf://datasets/lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train"

df = dd.read_parquet(p)
print(df)

Logs

No response

System info

- huggingface_hub version: 0.17.3
- Platform: macOS-13.5.2-arm64-arm-64bit
- Python version: 3.10.9
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /Users/hakimamri/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.0.1
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.4.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.22.3
- pydantic: 1.10.12
- aiohttp: 3.8.5
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /Users/hakimamri/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /Users/hakimamri/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/hakimamri/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
@Hakimovich99 added the bug label Oct 3, 2023
@Wauplin (Contributor) commented Oct 4, 2023

Hi @Hakimovich99, thanks for reporting. The issue comes from the fact that in the string "@refs/convert/parquet/default/train/000.parquet", the split between the revision ("refs/convert/parquet") and the path ("default/train/000.parquet") is ambiguous. We split only on the first /, so in this case the revision is refs and the path is convert/parquet/default/..., which doesn't exist.
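
For illustration, here is a minimal sketch of the naive parsing (hypothetical code, not the actual HfFileSystem implementation) showing why this goes wrong:

path = "lambdalabs/pokemon-blip-captions@refs/convert/parquet/default/train/000.parquet"
repo_id, _, rest = path.partition("@")
revision, _, path_in_repo = rest.partition("/")  # splits on the FIRST "/" only
print(revision)      # 'refs'  <- not a valid revision
print(path_in_repo)  # 'convert/parquet/default/train/000.parquet'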

You have 2 workarounds for this:

  1. provide revision="refs/convert/parquet" explicitly as a kwarg (EDIT: for future readers: this doesn't work)
  2. url-encode the revision as refs%2Fconvert%2Fparquet
>>> from dask import dataframe as dd

# Option 1: explicit revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", revision="refs/convert/parquet")
Dask DataFrame Structure:
                image    text
npartitions=1                
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer

# Option 2: url-encoded revision
>>> dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/default/train/0000.parquet")
Dask DataFrame Structure:
                image    text
npartitions=1                
               object  object
                  ...     ...
Dask Name: read-parquet, 1 graph layer

@Wauplin (Contributor) commented Oct 4, 2023

That being said, maybe we should handle "refs/convert/parquet" and "refs/pr/(\d)+" (PRs) as a special case. If we detect them, we could handle them correctly even when not url-encoded. @mariosasko do you remember talking about something like that already? I have it in mind but can't find the conversation where we discussed it.

(EDIT: the downside would be that a repo with a branch named refs and a folder named pr/1 would be problematic - but not a big issue IMO 😄)

@severo (Collaborator) commented Oct 4, 2023

Maybe the following API convenience endpoint can help: https://huggingface.co/docs/hub/api#get-apidatasetsrepoidparquetconfigsplitnparquet

GET /api/datasets/(repo_id)/parquet/(config)/(split)/(n).parquet

It redirects to the file on the refs/convert/parquet branch; see for example https://huggingface.co/api/datasets/glue/parquet/ax/test/0.parquet, which returns the same content as https://huggingface.co/datasets/glue/resolve/refs%2Fconvert%2Fparquet/ax/test/0000.parquet
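
A quick sketch to check the redirect yourself (assuming requests is available; the exact status code may differ):

import requests

# the convenience endpoint redirects to the refs/convert/parquet branch
r = requests.head(
    "https://huggingface.co/api/datasets/glue/parquet/ax/test/0.parquet",
    allow_redirects=False,
)
print(r.status_code)              # e.g. 302
print(r.headers.get("Location"))  # .../resolve/refs%2Fconvert%2Fparquet/ax/test/0000.parquet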

cc @lhoestq

@Wauplin (Contributor) commented Oct 4, 2023

I don't think we can make HfFileSystem work with URLs like /api/datasets/(repo_id)/parquet/(config)/(split)/(n).parquet - at least it's not meant for that. Also, we would need to know the number of splits (with /api/datasets/(repo_id)/parquet?), while dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", revision="refs/convert/parquet") is supposed to read all of them at once. But good to know about the /api alias for the /resolve endpoint - that's a TIL for me :)
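
For completeness, a sketch of listing the converted files directly with HfFileSystem, reusing the url-encoded revision from the workaround above:

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# the url-encoded revision avoids the ambiguous-slash problem discussed above
files = fs.glob(
    "datasets/lambdalabs/pokemon-blip-captions@refs%2Fconvert%2Fparquet/**/*.parquet"
)
print(files)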

@Hakimovich99 (Author)

(quoting @Wauplin's workarounds above)

Hi @Wauplin, thanks for your answers! I have tried both options with a dataset containing only jsonl files, which were converted to parquet by the parquet_converter: jamescalam/llama-2-arxiv-papers-chunked.
Option 2 works well but option 1 doesn't. Here are the code and the error I get:

# Option 1: explicit revision
from dask import dataframe as dd
df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", revision="refs/convert/parquet")
print(df)
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: No files satisfy the `parquet_file_extension` criteria (files must end with ('.parq', '.parquet', '.pq')).

I think that the revision argument actually doesn't do anything (defaults to main).

@lhoestq (Member) commented Oct 4, 2023

Apparently the **kwargs of dd.read_parquet are passed to read_partitions, not to the filesystem?

Though if we ever want option 1 to work, we could modify HfFileSystem: add revision as a storage option (i.e. add it to its __init__).
This would allow passing storage_options={"revision": "refs/convert/parquet"} to dd.read_parquet.
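
Something along these lines (a hypothetical sketch of the suggested change, not the current huggingface_hub API):

from huggingface_hub import HfFileSystem

# Hypothetical sketch: HfFileSystem does NOT currently accept `revision`
# in __init__; this only illustrates the proposed storage option.
class HfFileSystemWithRevision(HfFileSystem):
    def __init__(self, *args, revision=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.revision = revision  # would be the default revision when resolving paths

# with such a change, dask could forward the option through storage_options:
# dd.read_parquet("hf://datasets/...", storage_options={"revision": "refs/convert/parquet"})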

Option 2 is the best option for now

@mariosasko (Contributor) commented Oct 4, 2023

@lhoestq I thought storage_options={"revision": "refs/convert/parquet"} would already work, assuming dask is aligned with pandas, because pandas propagates these parameters to fsspec.open. But this does not work in pandas either - I thought their logic was a bit more advanced than it is.

dask has open_file_options for this purpose. However, these options are then passed to the fsspec.parquet.open_parquet_file function, which calls fs.size without any options, leading to a failure in our case. So special handling of "refs/convert/parquet" seems to be the best solution.
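
In other words, per the explanation above, even something like this sketch would not help:

from dask import dataframe as dd

# sketch of the failing path: open_file_options are forwarded to
# fsspec.parquet.open_parquet_file, but fs.size() is called without them,
# so the revision never reaches the filesystem
df = dd.read_parquet(
    "hf://datasets/jamescalam/llama-2-arxiv-papers-chunked",
    open_file_options={"revision": "refs/convert/parquet"},  # does not work
)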

@Wauplin

maybe we should handle "refs/convert/parquet" and "refs/pr/(\d)+" (PRs) as a special case

I don't remember suggesting this, but I would be fine with this change

@Wauplin (Contributor) commented Oct 4, 2023

Option 2 works well but option 1 doesn't.

Apparently the **kwargs for dd.read_parquet seem to be passed to read_partitions, not to the filesystem ?

Oops, sorry, I tested it too quickly. Yep, the option 1 I suggested was indeed completely wrong. I also tested option 1 with storage_options and it doesn't seem to work with dask:

>>> from dask import dataframe as dd
>>> df = dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked", storage_options={"revision": "refs/convert/parquet"})
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: No files satisfy the `parquet_file_extension` criteria (files must end with ('.parq', '.parquet', '.pq')).
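
For future readers, a small sketch showing how the url-encoded path from option 2 can be built with the standard library:

from urllib.parse import quote
from dask import dataframe as dd

# build the url-encoded revision of option 2 programmatically
revision = quote("refs/convert/parquet", safe="")  # 'refs%2Fconvert%2Fparquet'
df = dd.read_parquet(f"hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@{revision}")
print(df)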

But since option 2 is fine, we can close this issue, right? I have opened a separate issue to handle refs/convert/parquet correctly.

@lhoestq (Member) commented Oct 5, 2023

Instead of refs/convert/parquet, which is ambiguous because of the slashes, you can use its alias ~parquet:

from dask import dataframe as dd
dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@~parquet")
