Add SubjQA dataset #2302
I'm not sure why the Windows test fails, but looking at the logs it looks like some caching issue on one of the metrics ... maybe re-run and 🤞 ?
Awesome thank you !
Could you also try to reduce the size of the dummy data zip files ? Currently they're quite big (50KB+ each). To do so feel free to take a look inside the books dummy data zip file and remove all the csv files from the other domains, and only keep the files from the books domain. Thanks :)
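One way the trimming described above could be scripted is sketched below. It is only an illustration: the function name `prune_dummy_zip` and the archive layout (one directory per domain, e.g. `books/` and `movies/`) are assumptions, not something taken from this PR.

```python
import io
import zipfile
from pathlib import PurePosixPath


def prune_dummy_zip(src, dst, domain):
    """Copy the zip at `src` to `dst`, dropping CSV files outside `domain`.

    Assumes (hypothetically) that each domain's files live under a
    directory named after the domain somewhere in the archive path.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            path = PurePosixPath(info.filename)
            is_foreign_csv = path.suffix == ".csv" and domain not in path.parts
            if not is_foreign_csv:
                # Non-CSV entries and CSVs of the kept domain are copied as-is.
                zout.writestr(info, zin.read(info.filename))


# In-memory demo with a made-up layout.
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as z:
    z.writestr("dummy_data/books/splits/train.csv", "question,answer\n")
    z.writestr("dummy_data/movies/splits/train.csv", "question,answer\n")

dst = io.BytesIO()
prune_dummy_zip(src, dst, "books")
kept = zipfile.ZipFile(dst).namelist()
print(kept)
```

For real files, `src` and `dst` can be file paths instead of `BytesIO` objects.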
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Hi @lewtun, thanks for adding this dataset! If the dataset is going to be referenced heavily, I think it's worth spending some time to make the dataset card really great :) To start, the information that is currently in the Here's a link to the relevant section of the guide, let me know if you have any questions!
great idea @yjernite! i've added some extra information / moved things as you suggest and will wrap up the rest tomorrow :)
Thanks !
I added my final comments:
LGTM thank you !
Merging since the CI error is fixed on master
Hello datasetters 🙂!
Here's an interesting dataset about extractive question-answering on subjective product / restaurant reviews. It's quite challenging for models fine-tuned on SQuAD and provides a nice example of domain adaptation (i.e. fine-tuning a SQuAD model on this domain gives better performance).
I found a bug in the start/end indices that I've proposed a fix for here: megagonlabs/SubjQA#2
Unfortunately, the dataset creators are unresponsive, so for now I am using my fork as the source. Will update the URL if/when the creators respond.
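For readers who want to sanity-check answer spans in an extractive QA dataset, here is a minimal sketch. It assumes SQuAD-style fields (`context`, `answers["text"]`, `answers["answer_start"]`); the field names, the helper `find_misaligned_answers`, and the toy examples are illustrative assumptions and do not reproduce the actual bug or the fix proposed upstream.

```python
def find_misaligned_answers(examples):
    """Return (id, answer text, extracted span) for each answer whose
    start index does not point at the answer text in the context."""
    mismatches = []
    for ex in examples:
        context = ex["context"]
        answers = ex["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            # Slice the context at the claimed position and compare.
            span = context[start:start + len(text)]
            if span != text:
                mismatches.append((ex["id"], text, span))
    return mismatches


# Toy examples (made up for illustration).
examples = [
    {
        "id": "ok",
        "context": "The food was great and cheap.",
        "answers": {"text": ["great"], "answer_start": [13]},
    },
    {
        "id": "off-by-one",
        "context": "The food was great and cheap.",
        "answers": {"text": ["great"], "answer_start": [14]},
    },
]

mismatches = find_misaligned_answers(examples)
print(mismatches)
```

An empty result means every start index lines up with its answer text; any entries show which examples need attention.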