Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Silicone #1712

Closed
wants to merge 42 commits into from
Closed

Silicone #1712

wants to merge 42 commits into from

Conversation

eusip
Copy link
Contributor

@eusip eusip commented Jan 8, 2021

My collaborators and I within the Affective Computing team at Telecom Paris would like to push our spoken dialogue dataset for publication.

@eusip
Copy link
Contributor Author

eusip commented Jan 11, 2021

When should we expect to see our dataset appear in the search dropdown at huggingface.co?

@julien-c
Copy link
Member

Hi @eusip,

When should we expect to see our dataset appear in the search dropdown at huggingface.co?

when this PR is merged.

@eusip
Copy link
Contributor Author

eusip commented Jan 11, 2021

Thanks!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you so much !

I left a few comments about the tags and the dataset cards.

Could you also add the dummy_data.zip files ? They're used to test the dataset script and make sure it keeps working in the long run.
You can find more info on how to add them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata

Feel free to ping me if you have any question about the dataset card of the dummy data :)

source_datasets:
- original
task_categories:
- sequence-modeling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- sequence-modeling
- sequence-modeling
- text-classification
- text-scoring

Some missing categories

  • text-classification is for the task_ids sentiment-classification and topic-classification
  • text-scoring is for the task_ids semantic-similarity-scoring and sentiment-scoring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I will make the necessary changes :-)

Comment on lines 19 to 24
- dialogue-modeling
- language-modeling
- sentiment-classification
- topic-classification
- semantic-similarity-scoring
- sentiment-scoring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to specify the task_ids depending on the dataset ?
For example

task_ids:
  iemocap:
  - dialogue-modeling
  - text-classification-other-emotion-classification
  maptask:
  - dialogue-modeling
  - text-classification-other-dialogue-act-classification
etc.

Note that I'm using the prefix text-classification-other- because emotion-classification and dialogue-act-classification are not (yet ?) part of our tagging taxonomy


### Data Instances

#### DailyDialog Act Corpus (Dialogue Act)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#### DailyDialog Act Corpus (Dialogue Act)
#### DailyDialog Act Corpus (Dialogue Act)
For the `dyda_da` configuration one example from the dataset is:

Can you add the configuration name this way ?
Same for the other datasets

```
{
'Utterance': the taxi drivers are on strike again .,
'Dialogue_Act': inform,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Dialogue_Act': inform,
'Dialogue_Act': 2, # inform

Here we expect to see the example as it is returned by the dataset library.
Since the Dialogue_Act field is a ClassLabel then it returns the label id as an integer. Here 2 corresponds to the "inform" label name. The correspondence is defined by the label classes list. Here it is ["commissive", "directive", "inform", "question"]

Same for the other datasets

#### DailyDialog Act Corpus (Dialogue Act)
```
{
'Utterance': the taxi drivers are on strike again .,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Utterance': the taxi drivers are on strike again .,
'Utterance': "the taxi drivers are on strike again .",

Can you place strings in quote as in python ?
Same for the other datasets

lhoestq and others added 23 commits January 21, 2021 01:20
* fix windows path scheme in cached_path

* add test
* make column order deterministic in transmit_format

* add test
* mirror_scientific_papers_for_faster_download

* upload

* make dummy data lighter

* delete more
…ocs (huggingface#1705)

* Add cache management doc

* Add a link to the cache management section

* Fix link to cache management section

* Fix table with values for download_mode param

* Rephrase, fix typos and add code examples

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…face#1715)

* add Korean intonation-aided intention identification dataset

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Update datasetcard

* Update Curiosity DatasetCard

* Update Curiosity Dialogs DatasetCard Missing Entries
* Add MNIST dataset

* Update datasets/mnist/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Adding adversarialQA dataset

* Added YAML tags

* reduce dummy_data.zip files sizes

Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
* new version of Ted Talks IWSLT (WIT3)

* updated for configs

* new version of Ted Talks IWSLT (WIT3)

* updated for dummy_data

* updated for dummy_data

* Apply suggestions from code review

* Update datasets/ted_talks_iwslt/README.md

Co-authored-by: Ubuntu <ubuntu@ip-172-31-69-18.ec2.internal>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…mplification. (huggingface#1732)

* Added TurkCorpus, an evaluation dataset for sentence simplification that focuses on lexical paraphrasing.

* Corrected the dataset name in the config file

* Rectified formatting issues in the dataset file

* Retrigger checks

* Added YAML tags, updated README with data instances and reduced size of dummy data
* update link to be github links

* format code
* fix empty token bugs for thainer

* fix empty token bug for lst20

Co-authored-by: charin <charin@central.tech>
* COMET MT Metric

* added COMET install command

* fixed trailing whitespace

* fixed line length

* fixed line length

* Fixed typos in the docstrings

* Fix pip install message

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* added example of usage in KWARGS_DESCRIPTION

Co-authored-by: Ricardo Rei <ricardorei@Ricardos-MBP-2.Home>
Co-authored-by: Ricardo Rei <ricardorei@ip-192-168-1-126.ec2.internal>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Update add new model template

* Update ADD_NEW_DATASET.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Conda

* All tags

* x.x.x
jbragg and others added 17 commits January 21, 2021 01:20
* Add missing "brief" entries to reuters

* Add missing spaces

* Update infos

* Fix handling BRIEF and UNPROC text types

* Make style pass

* Add text type

* style

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* check if tag regex is valid

* regex in quotes
* Add Hatexplain - the first benchmark hate speech dataset covering multiple aspects of the issue

* Add changes suggested in review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Citation information Added

* Update datasets/hatexplain/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* add Stuctured Argument Extraction for Korean dataset

* Update datasets/kor_sae/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@yjernite @lhoestq

I only fixed the languages: tag but others would need fixing too
* minor

* add prepare module test

* fix windows path scheme check

* cached_path raises requests error if no internet

* look for cached modules if there's no internet

* wip tests

* add warning message

* update tests

* style

* remove test modules if already exist

* style

* add init_dynamic_modules function for testing purposes

* fix importlib cache

* move csv, json, text and pandas to inside the package

* add packaged datasets handling in prepare_module

* update tests

* minor fix

* add missing __init__.py

* fix test

* style

* fix test

* fix tests

* show last modification date in the warning
* disable caching and fingerprinting + allow unpickable transforms

* add tests

* fix tests on windows

* ignore some kwargs for fingerprint + better decorator name

* more ignore kwargs

* style

* add set_caching_enabled and set_fingerprinting_enabled to main __init__

* typo

* remove enable/disable fingerprinting

* use temp dir when caching is disabled

* update tests

* show warning only once

* fix global caching boolean

* docs

* move code block after the implications of disabling caching

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* docs

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* first commit for id_liputan6

* fixes with black, isort and flake8

* removed unnecessary dl_manager.download_and_extract() since the dataset
should be downloaded manually

* updated info to download the dataset manually

* added string detokenizers

* added another regex for the string detokeniyer/cleaner, and fixed an existing one

* Update datasets/id_liputan6/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/id_liputan6/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/id_liputan6/id_liputan6.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* added info about the canonical and xtreme variants, and an example of dataset instance in README.md.
added a directory test.

* cosmetics change for the citation information

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* fix comet citations

* fixed citation1 wrong variable name

Co-authored-by: Ricardo Rei <ricardorei@Ricardos-MacBook-Pro-2.local>
* Updated README for the Social Bias Frames dataset

* Update datasets/social_bias_frames/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/social_bias_frames/README.md

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@eusip eusip closed this Jan 21, 2021
@eusip eusip deleted the silicone branch January 21, 2021 00:26
@eusip eusip restored the silicone branch January 21, 2021 00:29
@eusip eusip deleted the silicone branch January 21, 2021 00:30
@eusip eusip restored the silicone branch January 21, 2021 00:32
@eusip
Copy link
Contributor Author

eusip commented Jan 21, 2021

I've implemented all the changes requested by @lhoestq but I made the mistake of trying to change the remote branch name.

Hopefully the changes are seen on your end as both branches silicone and main should be up-to-date.

@eusip eusip reopened this Jan 21, 2021
@lhoestq
Copy link
Member

lhoestq commented Jan 21, 2021

It looks like the PR includes changes about many other files than the ones for Silicone (+30,000 line changes)

Maybe you can try to create another branch and another PR ?

@eusip
Copy link
Contributor Author

eusip commented Jan 21, 2021

It looks like the PR includes changes about many other files than the ones for Silicone (+30,000 line changes)

Maybe you can try to create another branch and another PR ?

Sure. I will make a new pull request.

@eusip eusip closed this Jan 21, 2021
@eusip eusip deleted the silicone branch January 21, 2021 14:12
@eusip eusip mentioned this pull request Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.