Add SubjQA dataset #2302

lewtun · 2021-05-02T14:51:20Z

Hello datasetters 🙂!

Here's an interesting dataset for extractive question answering on subjective product and restaurant reviews. It's quite challenging for models fine-tuned on SQuAD and provides a nice example of domain adaptation (i.e. further fine-tuning a SQuAD model on this domain gives better performance).

I found a bug in the start/end indices that I've proposed a fix for here: megagonlabs/SubjQA#2

Unfortunately, the dataset creators are unresponsive, so for now I am using my fork as the source. Will update the URL if/when the creators respond.
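For readers unfamiliar with this kind of bug, here is a minimal sketch of how misaligned start/end indices can be detected in an extractive QA dataset. It is not the fix proposed in megagonlabs/SubjQA#2; the field names (`context`, `answer_text`, `answer_start`, `answer_end`) and the sample rows are hypothetical and chosen only for illustration.

```python
def find_misaligned_answers(examples):
    """Return the ids of examples whose character span does not reproduce the answer text."""
    bad_ids = []
    for ex in examples:
        start, end = ex["answer_start"], ex["answer_end"]
        # For a well-formed example, slicing the context must give back the answer.
        if ex["context"][start:end] != ex["answer_text"]:
            bad_ids.append(ex["id"])
    return bad_ids


# Hypothetical rows: "a" is aligned, "b" is shifted by one character.
examples = [
    {"id": "a", "context": "The battery life is great.",
     "answer_text": "battery life", "answer_start": 4, "answer_end": 16},
    {"id": "b", "context": "The battery life is great.",
     "answer_text": "battery life", "answer_start": 5, "answer_end": 17},
]

misaligned = find_misaligned_answers(examples)
print(misaligned)
```

A check like this is cheap to run over a whole split and makes off-by-one errors visible before any model is trained on the data.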

lewtun · 2021-05-02T16:27:39Z

I'm not sure why the Windows test fails, but judging from the logs it looks like a caching issue in one of the metrics ... maybe re-run and 🤞?

lhoestq

Awesome thank you !

Could you also try to reduce the size of the dummy data zip files? Currently they're quite big (50KB+ each). To do so, feel free to take a look inside the books dummy data zip file, remove all the CSV files from the other domains, and keep only the files from the books domain. Thanks :)
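The trimming described above can also be scripted. The following is a rough sketch using only the standard library; the archive layout (`dummy_data/<domain>/train.csv`) and the non-books domain names are assumptions made for the demo, not the actual contents of the dummy data files in this PR.

```python
import io
import zipfile


def keep_only_domain(zip_bytes, domain):
    """Return a new zip (as bytes) holding only the files that sit under a `domain` directory."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            # Keep a file only if one of its path components is the domain name.
            if domain in info.filename.split("/"):
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()


def build_demo_zip():
    """Build a small in-memory archive with a hypothetical multi-domain layout."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        for name in ("books", "electronics", "movies"):
            zf.writestr(f"dummy_data/{name}/train.csv", "question,review\n")
    return out.getvalue()


trimmed = keep_only_domain(build_demo_zip(), "books")
trimmed_names = zipfile.ZipFile(io.BytesIO(trimmed)).namelist()
print(trimmed_names)
```

Working in memory keeps the sketch self-contained; for real files, the same function can be fed the bytes read from the zip on disk and the result written back.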

datasets/subjqa/README.md

datasets/subjqa/subjqa.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

yjernite · 2021-05-05T16:56:04Z

Hi @lewtun, thanks for adding this dataset!

If the dataset is going to be referenced heavily, I think it's worth spending some time to make the dataset card really great :) To start, the information that is currently in the Data collection paragraph should probably be organized in the Dataset Creation section.

Here's a link to the relevant section of the guide, let me know if you have any questions!

lewtun · 2021-05-05T19:55:18Z

> If the dataset is going to be referenced heavily, I think it's worth spending some time to make the dataset card really great :) To start, the information that is currently in the Data collection paragraph should probably be organized in the Dataset Creation section.

great idea @yjernite! i've added some extra information / moved things as you suggest and will wrap up the rest tomorrow :)

lewtun · 2021-05-06T07:42:17Z

hi @yjernite and @lhoestq, i've fleshed out the dataset card and think this is now ready for another round of review!

lhoestq

Thanks !
I added my final comments:

datasets/subjqa/README.md

datasets/subjqa/subjqa.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

lhoestq

LGTM thank you !

Merging since the CI error is fixed on master

lewtun added 5 commits May 2, 2021 16:27

Add SubjQA dataset

7d011b2

Remove unused code

cf6a90d

Add README

f777a2a

Add dataset infos and dummy data

aed78b2

Fix style

74f0cb3

lhoestq reviewed May 3, 2021

datasets/subjqa/README.md (outdated, resolved)

datasets/subjqa/README.md (outdated, resolved)

datasets/subjqa/README.md (outdated, resolved)

datasets/subjqa/subjqa.py (outdated, resolved)

lewtun and others added 9 commits May 3, 2021 17:59

Update datasets/subjqa/README.md

c47c3f4

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update datasets/subjqa/README.md

98e4b11

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Refactor SubjQA to produce SQuAD schema

f7bdc6b

Simplify extraction of answer metadata

20a02f6

Remove redundant feature

e0ae717

Update dataset infos

87d4d3b

Trim down size of dummy data

39b1baa

Update README

6a65727

Fix field description in README

51efbbc

Add dataset statistics and collection info to the README

01e4d7e
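As context for the "Refactor SubjQA to produce SQuAD schema" commit above: the SQuAD layout used by the `datasets` library stores each example with `id`, `title`, `context`, `question`, and an `answers` dict of parallel `text` and `answer_start` lists. The sketch below shows one way a raw row could be mapped to that layout. The input field names (`review`, `domain`, `spans`) are hypothetical and are not taken from the loading script in this PR.

```python
def to_squad_example(row):
    """Map a raw row with character spans to a SQuAD-style example dict."""
    context = row["review"]
    spans = row["spans"]  # list of (start, end) character offsets; empty if unanswerable
    return {
        "id": row["id"],
        "title": row["domain"],
        "context": context,
        "question": row["question"],
        "answers": {
            # Parallel lists: the i-th text starts at the i-th answer_start.
            "text": [context[start:end] for start, end in spans],
            "answer_start": [start for start, _ in spans],
        },
    }


# Hypothetical raw row, for illustration only.
raw_row = {
    "id": "q1",
    "domain": "electronics",
    "question": "How is the battery?",
    "review": "The battery life is great.",
    "spans": [(4, 16)],
}

squad_example = to_squad_example(raw_row)
print(squad_example["answers"])
```

Emitting this schema means existing SQuAD tooling (tokenization, metrics) can typically be reused without dataset-specific glue code.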

lewtun added 2 commits May 6, 2021 09:25

Add info on dataset creation

3f9664f

Add annotation process and social impact to README

8391c91

lhoestq reviewed May 7, 2021

datasets/subjqa/README.md (outdated, resolved)

datasets/subjqa/README.md (resolved)

datasets/subjqa/subjqa.py (outdated, resolved)

lewtun and others added 5 commits May 7, 2021 15:53

Update datasets/subjqa/README.md

4a1a4f3

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update datasets/subjqa/README.md

b1419cc

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Replace eval with ast.literal_eval for safety!

da0a15d

Merge remote-tracking branch 'origin/add-subjqa' into add-subjqa

6acc5fc

fix missing extended| prefix in source datasets tags

9265193
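The "Replace eval with ast.literal_eval for safety!" commit above reflects a general Python practice: `eval` executes arbitrary code, while `ast.literal_eval` only accepts Python literals. Here is a small sketch of the difference, assuming the strings being parsed are stringified tuples such as `"(4, 16)"`. The thread does not show the actual values, and `parse_span` is a hypothetical helper.

```python
import ast


def parse_span(raw):
    """Parse a string such as "(4, 16)" into a (start, end) tuple of ints."""
    # literal_eval only accepts literals; function calls and names raise ValueError.
    value = ast.literal_eval(raw)
    is_pair = isinstance(value, tuple) and len(value) == 2
    if not (is_pair and all(isinstance(v, int) for v in value)):
        raise ValueError(f"expected a (start, end) pair of ints, got {raw!r}")
    return value


print(parse_span("(4, 16)"))

try:
    # With eval() this string would run code; literal_eval rejects it instead.
    parse_span("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

The extra type check is a design choice for the sketch: `literal_eval` would also accept lists, dicts, or strings, so validating the shape keeps malformed cells from slipping through silently.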

lhoestq approved these changes May 10, 2021

lhoestq merged commit e9d2d39 into huggingface:master May 10, 2021
