-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Silicone #1712
Silicone #1712
Conversation
When should we expect to see our dataset appear in the search dropdown at huggingface.co? |
Hi @eusip,
when this PR is merged. |
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thank you so much !
I left a few comments about the tags and the dataset cards.
Could you also add the dummy_data.zip files ? They're used to test the dataset script and make sure it keeps working in the long run.
You can find more info on how to add them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata
Feel free to ping me if you have any question about the dataset card of the dummy data :)
source_datasets: | ||
- original | ||
task_categories: | ||
- sequence-modeling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- sequence-modeling | |
- sequence-modeling | |
- text-classification | |
- text-scoring |
Some missing categories
text-classification
is for the task_idssentiment-classification
andtopic-classification
text-scoring
is for the task_idssemantic-similarity-scoring
andsentiment-scoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I will make the necessary changes :-)
silicone/README.md
Outdated
- dialogue-modeling | ||
- language-modeling | ||
- sentiment-classification | ||
- topic-classification | ||
- semantic-similarity-scoring | ||
- sentiment-scoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to specify the task_ids depending on the dataset ?
For example
task_ids:
iemocap:
- dialogue-modeling
- text-classification-other-emotion-classification
maptask:
- dialogue-modeling
- text-classification-other-dialogue-act-classification
etc.
Note that I'm using the prefix text-classification-other-
because emotion-classification
and dialogue-act-classification
are not (yet ?) part of our tagging taxonomy
|
||
### Data Instances | ||
|
||
#### DailyDialog Act Corpus (Dialogue Act) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#### DailyDialog Act Corpus (Dialogue Act) | |
#### DailyDialog Act Corpus (Dialogue Act) | |
For the `dyda_da` configuration one example from the dataset is: | |
Can you add the configuration name this way ?
Same for the other datasets
silicone/README.md
Outdated
``` | ||
{ | ||
'Utterance': the taxi drivers are on strike again ., | ||
'Dialogue_Act': inform, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Dialogue_Act': inform, | |
'Dialogue_Act': 2, # inform |
Here we expect to see the example as it is returned by the dataset library.
Since the Dialogue_Act
field is a ClassLabel then it returns the label id as an integer. Here 2
corresponds to the "inform"
label name. The correspondence is defined by the label classes list. Here it is ["commissive", "directive", "inform", "question"]
Same for the other datasets
silicone/README.md
Outdated
#### DailyDialog Act Corpus (Dialogue Act) | ||
``` | ||
{ | ||
'Utterance': the taxi drivers are on strike again ., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Utterance': the taxi drivers are on strike again ., | |
'Utterance': "the taxi drivers are on strike again .", |
Can you place strings in quote as in python ?
Same for the other datasets
* fix windows path scheme in cached_path * add test
* make column order deterministic in transmit_format * add test
* mirror_scientific_papers_for_faster_download * upload * make dummy data lighter * delete more
…ocs (huggingface#1705) * Add cache management doc * Add a link to the cache management section * Fix link to cache management section * Fix table with values for download_mode param * Rephrase, fix typos and add code examples * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…face#1715) * add Korean intonation-aided intention identification dataset * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…simplification task. (huggingface#1722)
* Update datasetcard * Update Curiosity DatasetCard * Update Curiosity Dialogs DatasetCard Missing Entries
* Add MNIST dataset * Update datasets/mnist/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Adding adversarialQA dataset * Added YAML tags * reduce dummy_data.zip files sizes Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
* new version of Ted Talks IWSLT (WIT3) * updated for configs * new version of Ted Talks IWSLT (WIT3) * updated for dummy_data * updated for dummy_data * Apply suggestions from code review * Update datasets/ted_talks_iwslt/README.md Co-authored-by: Ubuntu <ubuntu@ip-172-31-69-18.ec2.internal> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
…mplification. (huggingface#1732) * Added TurkCorpus, an evaluation dataset for sentence simplification that focuses on lexical paraphrasing. * Corrected the dataset name in the config file * Rectified formatting issues in the dataset file * Retrigger checks * Added YAML tags, updated README with data instances and reduced size of dummy data
* update link to be github links * format code
* fix empty token bugs for thainer * fix empty token bug for lst20 Co-authored-by: charin <charin@central.tech>
* COMET MT Metric * added COMET install command * fixed trailing whitespace * fixed line length * fixed line length * Fixed typos in the docstrings * Fix pip install message Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * added example of usage in KWARGS_DESCRIPTION Co-authored-by: Ricardo Rei <ricardorei@Ricardos-MBP-2.Home> Co-authored-by: Ricardo Rei <ricardorei@ip-192-168-1-126.ec2.internal> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Update add new model template * Update ADD_NEW_DATASET.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Conda * All tags * x.x.x
* Add missing "brief" entries to reuters * Add missing spaces * Update infos * Fix handling BRIEF and UNPROC text types * Make style pass * Add text type * style Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* check if tag regex is valid * regex in quotes
* Add Hatexplain - the first benchmark hate speech dataset covering multiple aspects of the issue * Add changes suggested in review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Citation information Added * Update datasets/hatexplain/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* add Stuctured Argument Extraction for Korean dataset * Update datasets/kor_sae/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* minor * add prepare module test * fix windows path scheme check * cached_path raises requests error if no internet * look for cached modules if there's no internet * wip tests * add warning message * update tests * style * remove test modules if already exist * style * add init_dynamic_modules function for testing purposes * fix importlib cache * move csv, json, text and pandas to inside the package * add packaged datasets handling in prepare_module * update tests * minor fix * add missing __init__.py * fix test * style * fix test * fix tests * show last modification date in the warning
* disable caching and fingerprinting + allow unpickable transforms * add tests * fix tests on windows * ignore some kwargs for fingerprint + better decorator name * more ignore kwargs * style * add set_caching_enabled and set_fingerprinting_enabled to main __init__ * typo * remove enable/disable fingerprinting * use temp dir when caching is disabled * update tests * show warning only once * fix global caching boolean * docs * move code block after the implications of disabling caching * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * docs Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* first commit for id_liputan6 * fixes with black, isort and flake8 * removed unnecessary dl_manager.download_and_extract() since the dataset should be downloaded manually * updated info to download the dataset manually * added string detokenizers * added another regex for the string detokeniyer/cleaner, and fixed an existing one * Update datasets/id_liputan6/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update datasets/id_liputan6/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update datasets/id_liputan6/id_liputan6.py Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * added info about the canonical and xtreme variants, and an example of dataset instance in README.md. added a directory test. * cosmetics change for the citation information Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* fix comet citations * fixed citation1 wrong variable name Co-authored-by: Ricardo Rei <ricardorei@Ricardos-MacBook-Pro-2.local>
* Updated README for the Social Bias Frames dataset * Update datasets/social_bias_frames/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update datasets/social_bias_frames/README.md Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
I've implemented all the changes requested by @lhoestq but I made the mistake of trying to change the remote branch name. Hopefully the changes are seen on your end as both branches |
It looks like the PR includes changes about many other files than the ones for Silicone (+30,000 line changes) Maybe you can try to create another branch and another PR ? |
Sure. I will make a new pull request. |
My collaborators and I within the Affective Computing team at Telecom Paris would like to push our spoken dialogue dataset for publication.