Added metadata and correct splits for swda. #1749

gmihaila · 2021-01-18T18:36:32Z

Switchboard Dialog Act Corpus

I made some changes following @bhavitvyamalik's recommendation in #1678:

Contains all metadata.
Used official implementation from the /swda repo.
Added the official train and test splits used in Stolcke et al. (2000) and the validation split used in Probabilistic-RNN-DA-Classifier.

lhoestq

Nice, thank you!

I left a few comments.

Also, the dummy_data.zip file looks quite big (1MB). Could you try to reduce its size, please?
To do so, take a look inside the zip file: you should see many unused csv files in the swda.zip directory. You can remove all of them except the ones referenced by the test/dev/train txt files. For example, if 3994 is listed in train_split.txt, then you should keep the sw_1319_3994.utt.csv file.
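
The pruning rule described above could be sketched roughly as follows. This is only an illustration, not code from the PR: the helper names `parse_split_ids` and `select_needed` are hypothetical, the file-name pattern is inferred from the single example `sw_1319_3994.utt.csv`, and it assumes each split txt file lists one conversation id per line.

```python
import re
from pathlib import PurePosixPath

# Assumed file-name pattern, inferred from the example sw_1319_3994.utt.csv.
UTT_CSV = re.compile(r"sw_\d+_(\d+)\.utt\.csv$")


def parse_split_ids(text):
    """Return the set of conversation ids in a split file (assumed one id per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def select_needed(csv_paths, split_ids):
    """Keep only the csv paths whose trailing conversation id appears in split_ids."""
    kept = []
    for path in csv_paths:
        match = UTT_CSV.search(PurePosixPath(path).name)
        if match and match.group(1) in split_ids:
            kept.append(path)
    return kept


# Hypothetical contents of train_split.txt and of the dummy-data archive.
ids = parse_split_ids("3994\n2151\n")
paths = ["swda/sw00utt/sw_1319_3994.utt.csv", "swda/sw00utt/sw_0001_4325.utt.csv"]
needed = select_needed(paths, ids)
```

Everything not returned by `select_needed` would be a candidate for removal from the dummy data archive.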

datasets/swda/swda.py

datasets/swda/README.md

gmihaila · 2021-01-28T04:58:03Z

I will push updates tomorrow.

gmihaila · 2021-01-29T17:20:29Z

@lhoestq thank you for your comments! I went ahead and fixed the code 😃. Please let me know if I missed anything.

lhoestq

Thanks for the changes!

All looks good now :)

lhoestq · 2021-01-29T18:03:09Z

datasets/swda/README.md

+* 'pos':                 (str) The POS tagged version of the utterance, from PtbBasename+.pos
+* 'topic_description':   (str) The topic that is being discussed.
+* 'trees':               (str) The tree(s) containing this utterance (separated by ||| in the file). Use `[Tree.fromstring(t)
+                                 for t in row_value.split("|||")]` to convert to (list of nltk.tree.Tree).
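
As a rough sketch of the conversion mentioned in the 'trees' description, the snippet below splits the raw string on `|||` and parses each piece with `nltk.tree.Tree.fromstring`. The helper names `split_trees` and `parse_trees` are hypothetical, and stripping whitespace and dropping empty pieces is an added assumption, not something the README line states. `nltk` is a third-party package and is imported lazily so that the splitting step works without it.

```python
def split_trees(row_value):
    """Split the raw 'trees' string on the ||| separator, dropping empty pieces."""
    return [piece.strip() for piece in row_value.split("|||") if piece.strip()]


def parse_trees(row_value):
    """Convert the raw 'trees' string to a list of nltk.tree.Tree objects."""
    # Lazy import: nltk is only required when trees are actually parsed.
    from nltk.tree import Tree
    return [Tree.fromstring(piece) for piece in split_trees(row_value)]


# Hypothetical 'trees' value holding two bracketed trees.
pieces = split_trees("(S (NP I)) ||| (S (VP go))")
```

With `nltk` installed, `parse_trees` applied to the same value should give one `Tree` per piece.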


lhoestq reviewed Jan 19, 2021


gmihaila added 3 commits January 29, 2021 07:24

Added metadata and correct splits for swda.

0b480f5

Fixes so it doesn't use code to do dummy --auto_generate.

54adbb5

Added fixes and topic_description of utterance.

8a55f03

lhoestq approved these changes Jan 29, 2021


Update datasets/swda/swda.py

8e5dc09

lhoestq merged commit 18d3357 into huggingface:master Jan 29, 2021

gmihaila deleted the fix_swda branch January 29, 2021 19:35
