datasets doesn't support # in data paths #5099
Comments
datasets/src/datasets/utils/file_utils.py Lines 109 to 111 in 7feeb56
For example, we should have:
from datasets.utils.file_utils import hf_hub_url
url = hf_hub_url("loubnabnl/bigcode_csharp", "data/c#/data_0003.jsonl")
print(url)
# Currently returns
# https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/main/data/c#/data_0003.jsonl
# while it should be
# https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/main/data/c%23/data_0003.jsonl |
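A small illustration of why the unescaped `#` matters, using only the standard library: in a URL, `#` starts the fragment, so everything after it is no longer part of the path. This is a sketch of the general URL behavior, not of what `datasets` does internally. The variable names are made up for the example.

```python
from urllib.parse import urlsplit

base = "https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/main/"

# URL as currently returned, with a raw "#" in the path.
raw_parts = urlsplit(base + "data/c#/data_0003.jsonl")
# URL as it should be, with "#" percent-encoded as "%23".
escaped_parts = urlsplit(base + "data/c%23/data_0003.jsonl")

# With a raw "#", the path stops at "c" and the rest becomes the fragment.
print(raw_parts.path)
print(raw_parts.fragment)

# With "%23", the whole file path stays in the path component.
print(escaped_parts.path)
print(escaped_parts.fragment)
```

Since the fragment is not part of the path, the raw URL points at `data/c` rather than at the file inside `c#`.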
I'll work on this :) |
@loubnabnl The dataset you linked in the description of the bug does not work and returns a 404. Where can I find the dataset to reproduce the bug? |
I think you can create a dataset repository on the Hub with a dummy file containing a |
Ah sorry, it was private. I just made it public. I can also help with this if needed |
@lhoestq Should I also URL-encode the repo_id and revision parameters? I'm not sure what the valid characters are there. Personally, I would be cautious and only URL-encode the path parameter. |
These are possible solutions (assuming
|
repo_id can only contain alphanumeric characters and _- so it doesn't need to be encoded. However I agree it's a good idea to also apply |
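To make the discussion concrete, here is a minimal sketch of the kind of encoding being talked about. It is not the actual `hf_hub_url` implementation: `build_hub_url` is a hypothetical helper, the URL template is inferred from the URLs quoted in this issue, and encoding the revision as well as the path is an assumption of the sketch.

```python
from urllib.parse import quote

# Assumed template, inferred from the URLs quoted in this issue.
HUB_URL_TEMPLATE = "https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"


def build_hub_url(repo_id: str, path: str, revision: str = "main") -> str:
    """Hypothetical helper: build a resolve URL with percent-encoded parts."""
    return HUB_URL_TEMPLATE.format(
        # repo_id is left as-is, following the comment above about its allowed characters.
        repo_id=repo_id,
        # safe="" also encodes "/" so the revision stays a single URL segment (assumption).
        revision=quote(revision, safe=""),
        # quote() keeps "/" by default, so directory separators survive while "#" becomes "%23".
        path=quote(path),
    )


url = build_hub_url("loubnabnl/bigcode_csharp", "data/c#/data_0003.jsonl")
print(url)
```

Using `quote` on the path keeps `/` intact by default, so only characters such as `#` that would otherwise change how the URL is parsed get escaped.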
Should be fixed by #5099 - we'll do a release later today |
Describe the bug
Dataset files with the # symbol in their paths aren't read correctly.
Steps to reproduce the bug
The data in the folder c# of this dataset can't be loaded, while the folder c_sharp with the same data is loaded properly.
Environment info
datasets version: 2.5.2
cc @lhoestq