Switchboard Dialog Act Corpus added under datasets/swda #1678

Merged
merged 5 commits into from
Jan 5, 2021

Conversation

gmihaila
Contributor

@gmihaila gmihaila commented Jan 3, 2021

Switchboard Dialog Act Corpus

Intro:
The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2,
with turn/utterance-level dialog-act tags. The tags summarize syntactic, semantic, and pragmatic information
about the associated turn. The SwDA project was undertaken at UC Boulder in the late 1990s.

Details:
homepage
repo

I believe this is an important dataset to have, since no dialogue-act dataset has been added yet.

I didn't find a template for pull requests. I hope this information is enough.

For any support please contact me.

Member

@lhoestq lhoestq left a comment

Really cool thank you !

I left a few comments

After changing the feature type to ClassLabel, you'll need to regenerate the dataset_infos.json file:

datasets-cli test ./datasets/swda --save_infos --all_configs --ignore_verifications
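As a rough, library-free sketch of what a ClassLabel feature does (it stores string tags as integer ids and maps between the two), something like the following may help. The tag list is an assumed illustrative subset of the SwDA tags, and `encode_tag` / `decode_tag` are hypothetical helper names, not part of the `datasets` API.

```python
# Illustrative subset of dialog-act tags (assumption: not the full SwDA tag set).
ACT_TAG_NAMES = ["sd", "b", "sv", "aa", "%"]

# String tag -> integer id, in list order. This is the kind of mapping a
# ClassLabel feature, e.g. datasets.ClassLabel(names=ACT_TAG_NAMES), maintains.
STR2INT = {name: idx for idx, name in enumerate(ACT_TAG_NAMES)}


def encode_tag(tag: str) -> int:
    """Return the integer id for a dialog-act tag; raise on unknown tags."""
    try:
        return STR2INT[tag]
    except KeyError:
        raise ValueError(f"Unknown dialog-act tag: {tag!r}") from None


def decode_tag(idx: int) -> str:
    """Return the string tag for an integer id; raise on out-of-range ids."""
    if not 0 <= idx < len(ACT_TAG_NAMES):
        raise ValueError(f"Label id out of range: {idx}")
    return ACT_TAG_NAMES[idx]
```

Because the ids depend on the order of the names, changing the feature type changes what is stored, which is presumably why the dataset_infos.json file has to be regenerated afterwards.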

Review threads (resolved, outdated): datasets/swda/README.md (5), datasets/swda/swda.py (1)
gmihaila and others added 3 commits January 4, 2021 10:56
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@gmihaila
Contributor Author

gmihaila commented Jan 4, 2021

@lhoestq Thank you for your detailed comments! I fixed everything you suggested.

Please let me know if I'm missing anything else.

Member

@lhoestq lhoestq left a comment

Thanks !

Review threads (resolved): datasets/swda/README.md (2, one outdated)
@lhoestq lhoestq merged commit 5ae870c into huggingface:master Jan 5, 2021
@lhoestq
Member

lhoestq commented Jan 5, 2021

It looks like the Transcript and Utterance objects are missing. Maybe we can mention it in the README? Or just add them? @gmihaila @bhavitvyamalik

@gmihaila gmihaila deleted the swda branch January 5, 2021 15:46
@bhavitvyamalik
Contributor

Hi @lhoestq,
I'm working on adding the full dataset.

@gmihaila
Contributor Author

gmihaila commented Jan 5, 2021

> It looks like the Transcript and Utterance objects are missing. Maybe we can mention it in the README? Or just add them? @gmihaila @bhavitvyamalik

@lhoestq Any info on how to add them?

@bhavitvyamalik
Contributor

@gmihaila, instead of using the current repo you should look into this. You can use the CSV files uploaded in this repo (swda.zip) to access other fields and include them in this dataset. It also has one dependency, swda.py, which you can download separately and include in your dataset's folder so it can be imported when reading the CSV files.

Almost all the attributes of the Transcript and Utterance objects are of type str, int, or list. As for the trees attribute of the Utterance object, you can simply keep it as a string, and the user can later convert it to an nltk.Tree object if needed.
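A minimal sketch of that idea, using only the standard library rather than the swda.py helper: integer fields are cast, everything else (including the tree) stays a string. The column names, the `read_utterances` helper, and the sample row are assumptions made up for illustration; the real CSV files may use different headers.

```python
import csv
import io

# Columns assumed to hold integers (hypothetical header names).
INT_FIELDS = ("conversation_no", "transcript_index", "utterance_index")


def read_utterances(csv_file):
    """Read utterance rows from an open CSV file into a list of dicts."""
    rows = []
    for row in csv.DictReader(csv_file):
        for name in INT_FIELDS:
            row[name] = int(row[name])
        # "trees" is left as a plain string; a user could later convert it
        # with nltk.Tree.fromstring(row["trees"]) if nltk is installed.
        rows.append(row)
    return rows


# Made-up sample data, not taken from the corpus.
SAMPLE = io.StringIO(
    "conversation_no,transcript_index,utterance_index,act_tag,caller,text,trees\n"
    '1,0,1,sd,A,"Hello, there.","(S (INTJ (UH Hello)))"\n'
)
utterances = read_utterances(SAMPLE)
```

Keeping the tree as a string avoids making nltk a dependency of the loading script, at the cost of leaving the parsing step to the user.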

@gmihaila
Contributor Author

gmihaila commented Jan 6, 2021

@bhavitvyamalik Thank you for the clarification!

I didn't use that because it doesn't have the splits. I think combining it with what I used would help.

Let me know if I can help! I can make those changes if you don't have the time.

@bhavitvyamalik
Contributor

bhavitvyamalik commented Jan 7, 2021

I'm a bit busy for the next 2 weeks, so I'll only be able to complete it by the end of January. Maybe you can start on it and I'll help you?
Also, I looked into the official train/val/test splits, and not all the files are in the repo I used, so I think we'll either have to skip them or put everything into just train.
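The two options could be sketched roughly as below. This is only one reading of the suggestion; `build_splits`, its parameters, and the file names in the usage note are hypothetical.

```python
def build_splits(available_files, official_splits, single_train=False):
    """Assign files to splits under one of the two options discussed.

    available_files: file names actually present in the repo.
    official_splits: dict mapping split name -> list of file names.
    single_train:    if True, ignore the official lists and put every
                     available file into "train".
    """
    if single_train:
        return {"train": sorted(available_files)}
    available = set(available_files)
    # Keep the official assignment, skipping files missing from the repo.
    # Available files that appear in no official list are dropped here.
    return {
        split: [name for name in names if name in available]
        for split, names in official_splits.items()
    }
```

For example, with `official_splits = {"train": ["a.csv", "b.csv"], "validation": ["c.csv"], "test": ["d.csv"]}` and only `a.csv` and `c.csv` available, the first option yields an empty test split, which may be a reason to prefer the single train split.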

@gmihaila
Contributor Author

gmihaila commented Jan 8, 2021

Yes, I can start working on it and ask you to do a code review.

Yes, not all files are there. I'll try to find papers that have the correct and full splits; if not, I'll do as you suggested.

Thank you again for your help @bhavitvyamalik !
