Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RONEC v2 #3184

Merged
merged 4 commits into from
Nov 2, 2021
Merged

RONEC v2 #3184

merged 4 commits into from
Nov 2, 2021

Conversation

dumitrescustefan
Copy link
Contributor

Hi, as we've recently finished with the new RONEC (Romanian Named Entity Corpus), we'd like to update the dataset here as well. It's actually essential as links to V1 are no longer valid.

In reality we'd like to replace completely v1, as v2 is a full re-annotation of v1 with additional data (up to 2x size vs v1).

I've run the make style and all the dummy and real data test, and they passed.

I hope it's okay to merge the new RONEC v2 in the datasets.

Thanks!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks ! I just added a comment about keeping a way to load V1, let me know what you think

Comment on lines 61 to 63
BUILDER_CONFIGS = [
RONECConfig(name="ronec", version=VERSION, description="RONEC dataset"),
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to still let users access the V1 for reproducibility of all the work before RONEC v2

Do you think you could add a configuration that allows users to load RONEC V1 ? You can use the new links to the V1 data files

@dumitrescustefan
Copy link
Contributor Author

@lhoestq Thanks for the review. I totally understand what you are saying. Normally, I would definitely agree with you, but in this particular case, the quality of v1 is poor, and the dataset itself is small (at the time we created v1 it was the only RO NER dataset, and its size was limited by the available resources).

This is why we worked to build a larger one, with much better inter-annotator agreement. Fact is, models trained on v1 will be of very low quality and I would not recommend to anybody to use/do that. That's why I'd strongly suggest we replace v1 with v2, and kindof make v1 vanish :)

What do you think? If you insist on having v1 accessible, I'll add the required code. Thanks!

@lhoestq
Copy link
Member

lhoestq commented Nov 2, 2021

Ok I see ! I think it's fine then, no need to re-add V1

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for updating the dataset !
I think we can merge now :)

@lhoestq lhoestq merged commit 425f45b into huggingface:master Nov 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants