Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test for valid YAML files #308

Merged
merged 6 commits into from
Oct 10, 2022
Merged

Test for valid YAML files #308

merged 6 commits into from
Oct 10, 2022

Conversation

mathemakitten
Copy link
Contributor

@mathemakitten mathemakitten commented Oct 4, 2022

Closes #296, which hopefully results in fewer broken Spaces. Nothing fancy about this implementation and it's pretty specific to Hub metric card formats but works just fine for what we need, let me know if you think it should do anything else!

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Oct 4, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lvwerra lvwerra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mathemakitten!

My main suggestion: What do you think about moving the validation code to something like src/evaluate/utils/file_utils.py. This could be just a function that takes a string as input and raises an error if it's not valid yaml. Then we could test if that function works with a few examples as well as testing all the readmes. What do you think?

readme_filepaths.extend(glob.glob(glob_path))
for readme_file in readme_filepaths:
with open(readme_file) as f_yaml:
x = yaml.safe_load_all(f_yaml)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not necessary to split off the YAML part of the README?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README is loaded in delimited by ---, so calling next() on the generator automatically gets the YAML (the first section)

tests/test_hub.py Outdated Show resolved Hide resolved
mathemakitten and others added 2 commits October 5, 2022 07:44
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
@mathemakitten
Copy link
Contributor Author

@lvwerra The bit which actually does the validation is next(x), which gets the first element from the loaded YAML (fine because for our metric cards the YAML is always at the top of the README, so it's the first element). If it isn't valid YAML then calling next() on the generator actually fails.

If we wanted to parse out the first bit of the file and pass it around to validate it'd look like this:

import re
REGEX_YAML_BLOCK = re.compile(r"---[\n\r]+([\S\s]*?)[\n\r]+---[\n\r]")
with open('metrics/perplexity/README.md', 'r') as f:
    x = f.read()
    match = REGEX_YAML_BLOCK.search(x)
y = yaml.safe_load_all(match.group())
assert type(y) == dict

However, this isn't any cleaner and also doesn't really generalize, since it depends on loading from the file in a specific way. We could have a general function which wraps yaml.safe_load_all for some piece of text, but I don't actually think that helps with code reuse much. Thoughts?

@lvwerra
Copy link
Member

lvwerra commented Oct 7, 2022

Fair enough, let's leave it like that then!

To solve the issue on windows, maybe you need to load with utf-8?

Copy link
Member

@lvwerra lvwerra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks @mathemakitten 🚀

@mathemakitten mathemakitten merged commit 4835cbe into main Oct 10, 2022
@mathemakitten mathemakitten deleted the hn-yaml-check branch October 10, 2022 08:31
NimaBoscarino pushed a commit that referenced this pull request Oct 17, 2022
* yaml check fast

* Update tests/test_hub.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix merge

* Adding encoding utf8

* code quality

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yaml verification for module READMEs
3 participants