
Add SubjQA dataset #2302

Merged: 22 commits, May 10, 2021


lewtun (Member) commented May 2, 2021

Hello datasetters 🙂!

Here's an interesting dataset about extractive question-answering on subjective product / restaurant reviews. It's quite challenging for models fine-tuned on SQuAD and provides a nice example of domain adaptation (i.e. fine-tuning a SQuAD model on this domain gives better performance).

I found a bug in the start/end indices that I've proposed a fix for here: megagonlabs/SubjQA#2

Unfortunately, the dataset creators are unresponsive, so for now I am using my fork as the source. Will update the URL if/when the creators respond.
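The thread does not spell out what was wrong with the start/end indices, so the following is only a generic sketch of the kind of sanity check that catches misaligned spans in extractive QA data. It assumes SQuAD-style character offsets; the function names `check_answer_span` and `realign_answer_start` are hypothetical and are not taken from the SubjQA repository or the proposed fix.

```python
def check_answer_span(context, answer_text, answer_start):
    """Return True if answer_text sits at character offset answer_start."""
    if answer_start < 0:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text


def realign_answer_start(context, answer_text, answer_start):
    """Return a corrected start offset, or -1 if the answer is not in the context.

    A correct offset is returned unchanged. Otherwise the occurrence of the
    answer closest to the stated offset is chosen (an assumed heuristic).
    """
    if check_answer_span(context, answer_text, answer_start):
        return answer_start
    # Collect every occurrence of the answer text in the context.
    candidates = []
    pos = context.find(answer_text)
    while pos != -1:
        candidates.append(pos)
        pos = context.find(answer_text, pos + 1)
    if not candidates:
        return -1
    return min(candidates, key=lambda p: abs(p - answer_start))
```

Running a check like this over every example before training flags rows whose offsets do not reproduce the answer text, which is usually the first symptom of an indexing bug.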

lewtun (Member, Author) commented May 2, 2021

I'm not sure why the Windows test fails, but judging from the logs it looks like a caching issue with one of the metrics... maybe re-run and 🤞?

lhoestq (Member) left a comment

Awesome thank you !

Could you also try to reduce the size of the dummy data zip files? Currently they're quite big (50KB+ each). To do so, feel free to take a look inside the books dummy data zip file, remove all the CSV files from the other domains, and keep only the files from the books domain. Thanks :)
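One way to do this trimming without editing the archives by hand is sketched below. It is a minimal example under stated assumptions: the helper name `keep_only_domain` is hypothetical, and it assumes the domain name (for example `books`) appears as a directory component of each entry's path inside the zip, which the thread does not confirm.

```python
import zipfile


def keep_only_domain(src_zip, dst_zip, domain):
    """Copy src_zip to dst_zip, keeping only files under a `domain` directory.

    src_zip and dst_zip may be paths or file-like objects.
    Returns the list of entry names that were kept.
    """
    kept = []
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue
            # Assumed layout: the domain is one of the path components.
            if domain in info.filename.split("/"):
                dst.writestr(info, src.read(info.filename))
                kept.append(info.filename)
    return kept
```

Writing to a new archive rather than modifying the original in place keeps the untrimmed file around until the result has been inspected.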

datasets/subjqa/README.md (outdated, resolved)
datasets/subjqa/README.md (outdated, resolved)
datasets/subjqa/README.md (outdated, resolved)
datasets/subjqa/subjqa.py (outdated, resolved)

yjernite (Member) commented May 5, 2021

Hi @lewtun, thanks for adding this dataset!

If the dataset is going to be referenced heavily, I think it's worth spending some time to make the dataset card really great :) To start, the information that is currently in the Data collection paragraph should probably be organized in the Dataset Creation section.

Here's a link to the relevant section of the guide, let me know if you have any questions!

lewtun (Member, Author) commented May 5, 2021

> If the dataset is going to be referenced heavily, I think it's worth spending some time to make the dataset card really great :) To start, the information that is currently in the Data collection paragraph should probably be organized in the Dataset Creation section.

great idea @yjernite! i've added some extra information / moved things as you suggest and will wrap up the rest tomorrow :)

lewtun (Member, Author) commented May 6, 2021

hi @yjernite and @lhoestq, i've fleshed out the dataset card and think this is now ready for another round of review!

lhoestq (Member) left a comment

Thanks !
I added my final comments:

datasets/subjqa/README.md (outdated, resolved)
datasets/subjqa/README.md (resolved)
datasets/subjqa/subjqa.py (outdated, resolved)
lewtun and others added 5 commits May 7, 2021 15:53

lhoestq (Member) left a comment

LGTM thank you !

Merging since the CI error is fixed on master

lhoestq merged commit e9d2d39 into huggingface:master on May 10, 2021
3 participants