Use hfh hf_hub_url function #5196
Thanks !
src/datasets/utils/hub.py
Outdated
with temporary_assignment(
    huggingface_hub.file_download,
    "HUGGINGFACE_CO_URL_TEMPLATE",
    config.HUB_DATASETS_URL.replace("{path}", "{filename}"),
):
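For readers unfamiliar with the pattern in the snippet above, here is a minimal sketch of what a context manager like temporary_assignment might do. The implementation and the fake_file_download namespace are assumptions for illustration only; they are not the actual datasets helper or the real huggingface_hub.file_download module.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_assignment(obj, attr, value):
    """Set obj.attr to value inside the block, then restore the original."""
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Restore the original value even if the body raises.
        setattr(obj, attr, original)


# Hypothetical stand-in for the huggingface_hub.file_download module.
fake_file_download = SimpleNamespace(
    HUGGINGFACE_CO_URL_TEMPLATE="https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"
)
# Template written in the style of datasets' config.HUB_DATASETS_URL.
hub_datasets_url = "https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"

with temporary_assignment(
    fake_file_download,
    "HUGGINGFACE_CO_URL_TEMPLATE",
    hub_datasets_url.replace("{path}", "{filename}"),
):
    inside = fake_file_download.HUGGINGFACE_CO_URL_TEMPLATE

# Outside the block the original template is back in place.
after = fake_file_download.HUGGINGFACE_CO_URL_TEMPLATE
```

The override is visible only inside the with block, which is why the patch has to wrap every call that builds a URL.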
cc @Wauplin is this how you'd expect datasets to overwrite HUGGINGFACE_CO_URL_TEMPLATE?
Wow, what is the goal here? Is it just to avoid passing repo_type="dataset"?
From what I understand, HUB_DATASETS_URL from datasets and HUGGINGFACE_CO_URL_TEMPLATE from huggingface_hub are very similar, aren't they?
HUGGINGFACE_CO_URL_TEMPLATE = ENDPOINT + "/{repo_id}/resolve/{revision}/{filename}"
HUB_DATASETS_URL = HF_ENDPOINT + "/datasets/{repo_id}/resolve/{revision}/{path}"
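As a quick check of how close the two templates are, the sketch below formats both with plain string formatting. The repo id and file path are made-up values, and prefixing the repo id with "datasets/" is only an illustration of the difference between the templates; it does not show how hf_hub_url handles repo_type internally.

```python
ENDPOINT = "https://huggingface.co"

HUGGINGFACE_CO_URL_TEMPLATE = ENDPOINT + "/{repo_id}/resolve/{revision}/{filename}"
HUB_DATASETS_URL = ENDPOINT + "/datasets/{repo_id}/resolve/{revision}/{path}"

# "user/my_dataset" and "data/train.csv" are made-up example values.
datasets_url = HUB_DATASETS_URL.format(
    repo_id="user/my_dataset", revision="main", path="data/train.csv"
)

# The generic template yields the same URL once the repo id carries
# the "datasets/" prefix and the file goes in the {filename} slot.
hfh_url = HUGGINGFACE_CO_URL_TEMPLATE.format(
    repo_id="datasets/user/my_dataset", revision="main", filename="data/train.csv"
)

same = datasets_url == hfh_url
```

The only differences are the "/datasets" path prefix and the placeholder name ({path} versus {filename}).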
I think the goal is to overwrite the hfh one with the datasets one, but idk if we should
I would really avoid that, as the "/{repo_id}/resolve/{revision}/{filename}" part is not specific to datasets.
If it is a matter of customizing the endpoint, both are defined as HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co"), so it doesn't add anything compared to the existing hfh behavior.
The thing is that datasets users might have overridden the variable config.HUB_DATASETS_URL.
If we don't pass this to hfh, this is a breaking change and we should warn datasets users that their setting config.HUB_DATASETS_URL is ignored.
Indeed I struggled to get this working:
- in datasets we import the module config each time we need to use a variable: we can then modify one of its variables afterwards and the change is visible wherever the config module is used; we support dynamic update of the config
- in hfh only the config constant is imported wherever it is used, and this creates a "copy" of it: modifying the variable value in huggingface_hub.constants afterwards has no effect on the constants imported in huggingface_hub.file_download: they were already "copied" at import time
Regarding the ENDPOINT, we also set it dynamically in our CI, using fixtures. However, it is useless to modify it dynamically in hfh: once the repo URL is defined from the "old" ENDPOINT, assigning a new value to the ENDPOINT has no effect on the repo URL: this is set only once at import time.
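The difference between the two import styles described above can be reproduced with a small, self-contained sketch. The fake_constants module and its URL_TEMPLATE attribute are hypothetical stand-ins; they are not the real huggingface_hub.constants or datasets.config.

```python
import sys
import types

# Hypothetical stand-in for a constants module.
constants = types.ModuleType("fake_constants")
constants.URL_TEMPLATE = "https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"
sys.modules["fake_constants"] = constants

# hfh style: the name is bound once, when the import statement runs.
from fake_constants import URL_TEMPLATE

# datasets style: keep the module and look the attribute up on each use.
import fake_constants


def url_from_copied_name(repo_id, revision, filename):
    return URL_TEMPLATE.format(repo_id=repo_id, revision=revision, filename=filename)


def url_from_module_attribute(repo_id, revision, filename):
    return fake_constants.URL_TEMPLATE.format(
        repo_id=repo_id, revision=revision, filename=filename
    )


# Later (for example in a CI fixture) the constant is reassigned on the module.
fake_constants.URL_TEMPLATE = "https://hub-ci.huggingface.co/{repo_id}/resolve/{revision}/{filename}"

copied = url_from_copied_name("user/repo", "main", "file.txt")
dynamic = url_from_module_attribute("user/repo", "main", "file.txt")
```

Here copied still points at the production endpoint while dynamic follows the reassignment, which is why a patch has to target the module that holds the copied name.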
Also, while tweaking this, I discovered our CI has some tests that request the production ENDPOINT: so the CI ENDPOINT cannot be set globally for all tests: https://github.com/huggingface/datasets/actions/runs/3386458817/jobs/5625907924
TokenizersDumpTest.test_hash_tokenizer
> raise HTTPError(http_error_msg, response=self)
E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://hub-ci.huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json
I imagine you need this, @albertvillanova : huggingface/huggingface_hub#1082
Thank you @albertvillanova for the broader context. How long have you had this HUB_DATASETS_URL env variable? It might be worth deprecating it if it's only used to overwrite HUGGINGFACE_CO_URL_TEMPLATE (if it has a larger scope, maybe it's good to keep it).
I imagine you need this, @albertvillanova : huggingface/huggingface_hub#1082
Agree that this should be the clean solution to do it :)
in hfh only the config constant is imported wherever it is used and this creates a "copy" of it:
Totally get your point here. Might be worth opening an issue in hfh to fix that once and for all :)
Totally get your point here. Might be worth opening an issue in hfh to fix that once and for all :)
@Wauplin that fix would make testing much easier... 🤗
discussing if we want to overwrite hfh constants
@lhoestq I think we should first agree if If so, I then would suggest to initiate a deprecation cycle.
After a discussion with the rest of the datasets team, we agreed we can introduce the breaking change of ignoring Additionally, we also ignore See explanation in this PR description: #5196 (comment)
Looks good now, thanks :)
monkeypatch.setattr(
    "huggingface_hub.file_download.HUGGINGFACE_CO_URL_TEMPLATE", CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE
)
@Wauplin, please just note that in our CI we currently need to do the patching in huggingface_hub.file_download because doing it in huggingface_hub.constants has no effect: at import time, the constant HUGGINGFACE_CO_URL_TEMPLATE was copied from the submodule constants to file_download.
Understood, thanks for the reminder. I created an issue on huggingface_hub to keep track of this.
huggingface/huggingface_hub#1172
I'm trying to upgrade datasets to 2.7.0 in https://github.com/huggingface/datasets-server, and the tests fail due to this change. I think it's a breaking change (that was not listed in https://github.com/huggingface/datasets/releases/tag/2.7.0) since code that previously worked (by setting I'm not sure what is the correct way to set up the tests; besides setting the env var "HF_ENDPOINT" before launching the tests (which, I think, is not a good way to do it: the tests should not depend on the environment).
OK, I re-read this thread, and #5196 (comment) explicitly states that
I think the current workaround of setting an env variable before launching the tests is "not so bad" when considering the fact that env variables are evaluated at import time in
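The point about environment variables being evaluated at import time can be illustrated with a short sketch. The read_endpoint_at_import function is a hypothetical stand-in for a module-level assignment such as HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co"); it is not code from either library.

```python
import os


def read_endpoint_at_import():
    # Mimics a module-level constant that reads the environment exactly once.
    return os.environ.get("HF_ENDPOINT", "https://huggingface.co")


os.environ.pop("HF_ENDPOINT", None)  # start from a clean environment

# "Import time": the constant is computed from the environment as it is now.
HF_ENDPOINT = read_endpoint_at_import()

# Setting the variable afterwards does not change the stored constant.
os.environ["HF_ENDPOINT"] = "https://hub-ci.huggingface.co"
late = HF_ENDPOINT

# Setting it before the read (i.e. before launching the tests) does take effect.
early = read_endpoint_at_import()

os.environ.pop("HF_ENDPOINT", None)  # clean up
```

This is why the variable has to be set before the test process imports the library, unless the already-imported constants are patched directly.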
You can use fixtures in your tests:
import pytest

CI_HUB_ENDPOINT = "https://hub-ci.huggingface.co"
CI_HUB_DATASETS_URL = CI_HUB_ENDPOINT + "/datasets/{repo_id}/resolve/{revision}/{path}"
CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE = CI_HUB_ENDPOINT + "/{repo_id}/resolve/{revision}/{filename}"

@pytest.fixture
def ci_hfh_hf_hub_url(monkeypatch):
    # Patch the copy held by file_download; patching huggingface_hub.constants has no effect.
    monkeypatch.setattr(
        "huggingface_hub.file_download.HUGGINGFACE_CO_URL_TEMPLATE", CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE
    )

@pytest.fixture
def ci_hub_config(monkeypatch):
    # datasets reads these from the config module on each use, so patching the module is enough.
    monkeypatch.setattr("datasets.config.HF_ENDPOINT", CI_HUB_ENDPOINT)
    monkeypatch.setattr("datasets.config.HUB_DATASETS_URL", CI_HUB_DATASETS_URL)
and use And when
OK. In fact, in datasets-server we set I understand that for now, the only way to fix this is to set up
Thanks, used in huggingface/dataset-viewer#644.
Small refactoring to use the hf_hub_url function from huggingface_hub.
This PR also creates the hub module that will contain all huggingface_hub functionalities relevant to datasets.
This is a necessary stage before implementing the use of the hfh caching system (which uses its hf_hub_url under the hood).
EDIT:
Finally, we use our config.HUB_DATASETS_URL when using hfh.hf_hub_url
There is a breaking change: the hfh hf_hub_url function uses the hfh HUGGINGFACE_CO_URL_TEMPLATE URL template, different from the datasets config.HUB_DATASETS_URL, and the hfh DEFAULT_REVISION, instead of the datasets config.HUB_DEFAULT_VERSION