Added metadata and correct splits for swda. #1749

Merged
lhoestq merged 4 commits into huggingface:master
Jan 29, 2021

gmihaila
Contributor

Switchboard Dialog Act Corpus

I made some changes following @bhavitvyamalik's recommendation in #1678:

Member

@lhoestq left a comment

Nice thank you !

I left a few comments

Also, it looks like the dummy_data.zip file is quite big (1MB). Could you try to reduce its size, please?
To do so, take a look inside the zip file: you should see many unused csv files in the swda.zip directory. You can remove all of them except the ones listed in the test/dev/train txt files. For example, if 3994 appears in train_split.txt, then you should keep the sw_1319_3994.utt.csv file.
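
The pruning step described above could be sketched roughly as follows. This is only an illustration, not the script used in this PR: it assumes the dummy data has already been unzipped, that each split file holds one bare conversation id per line (as in the 3994 example), and that every csv name ends in `_<id>.utt.csv`. The function names `conversation_id`, `read_split_ids`, and `prune_unused_csvs` are hypothetical.

```python
from pathlib import Path


def conversation_id(csv_name):
    """Return the trailing id of a name such as 'sw_1319_3994.utt.csv'."""
    stem = csv_name.split(".", 1)[0]   # 'sw_1319_3994'
    return stem.rsplit("_", 1)[-1]     # '3994'


def read_split_ids(split_files):
    """Collect every non-empty line from the given split files (assumed: one id per line)."""
    ids = set()
    for path in split_files:
        for line in Path(path).read_text().splitlines():
            if line.strip():
                ids.add(line.strip())
    return ids


def prune_unused_csvs(data_dir, split_files, dry_run=True):
    """List csv files whose id is in no split file; delete them when dry_run is False."""
    keep = read_split_ids(split_files)
    unused = []
    for csv_path in sorted(Path(data_dir).rglob("*.utt.csv")):
        if conversation_id(csv_path.name) not in keep:
            unused.append(csv_path)
            if not dry_run:
                csv_path.unlink()
    return unused
```

Running it first with `dry_run=True` only returns the files that would be deleted, which makes it easy to check the list before re-zipping the directory.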

@gmihaila
Contributor Author

I will push updates tomorrow.

@gmihaila
Contributor Author

@lhoestq thank you for your comments! I went ahead and fixed the code 😃. Please let me know if I missed anything.

Member

@lhoestq left a comment

Thanks for the changes !

Looks all good now :)

* 'pos': (str) The POS tagged version of the utterance, from PtbBasename+.pos
* 'topic_description': (str) The topic that is being discussed.
* 'trees': (str) The tree(s) containing this utterance (separated by ||| in the file). Use `[Tree.fromstring(t) for t in row_value.split("|||")]` to convert to (list of nltk.tree.Tree).
Member

Nice !
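
For reference, the conversion described for the `trees` field in the README excerpt could be wrapped like this. It is a small sketch rather than code from the PR: `split_trees` and `to_nltk_trees` are hypothetical helper names, and the second one assumes the third-party `nltk` package is installed.

```python
from typing import List


def split_trees(row_value: str) -> List[str]:
    """Split the raw 'trees' string on the '|||' separator, dropping empty pieces."""
    return [t.strip() for t in row_value.split("|||") if t.strip()]


def to_nltk_trees(row_value: str):
    """Parse each bracketed tree string into an nltk.tree.Tree (requires nltk)."""
    from nltk.tree import Tree  # imported lazily so split_trees works without nltk

    return [Tree.fromstring(t) for t in split_trees(row_value)]
```

Dropping empty pieces means a row with an empty `trees` value yields an empty list instead of an attempt to parse an empty string.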

@lhoestq merged commit 18d3357 into huggingface:master Jan 29, 2021
@gmihaila deleted the fix_swda branch January 29, 2021 19:35