Silicone #1712

eusip · 2021-01-08T18:24:18Z

My collaborators and I within the Affective Computing team at Telecom Paris would like to push our spoken dialogue dataset for publication.

eusip · 2021-01-11T10:02:04Z

When should we expect to see our dataset appear in the search dropdown at huggingface.co?

julien-c · 2021-01-11T10:06:08Z

Hi @eusip,

> When should we expect to see our dataset appear in the search dropdown at huggingface.co?

when this PR is merged.

eusip · 2021-01-11T10:09:41Z

Thanks!

lhoestq

Awesome thank you so much !

I left a few comments about the tags and the dataset cards.

Could you also add the dummy_data.zip files ? They're used to test the dataset script and make sure it keeps working in the long run.
You can find more info on how to add them here: https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md#automatically-add-code-metadata

Feel free to ping me if you have any questions about the dataset card or the dummy data :)

lhoestq · 2021-01-12T13:48:36Z

silicone/README.md

+source_datasets:
+- original
+task_categories:
+- sequence-modeling


Suggested change

-- sequence-modeling
+- sequence-modeling
+- text-classification
+- text-scoring

Some missing categories

text-classification is for the task_ids sentiment-classification and topic-classification

text-scoring is for the task_ids semantic-similarity-scoring and sentiment-scoring

eusip · 2021-01-12

Thanks for the feedback! I will make the necessary changes :-)

lhoestq · 2021-01-12T13:57:26Z

silicone/README.md

+- dialogue-modeling
+- language-modeling
+- sentiment-classification
+- topic-classification
+- semantic-similarity-scoring
+- sentiment-scoring


Would it be possible to specify the task_ids depending on the dataset ?
For example:

task_ids:
  iemocap:
  - dialogue-modeling
  - text-classification-other-emotion-classification
  maptask:
  - dialogue-modeling
  - text-classification-other-dialogue-act-classification
etc.

Note that I'm using the prefix text-classification-other- because emotion-classification and dialogue-act-classification are not (yet ?) part of our tagging taxonomy
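The prefixing convention described in this comment can be sketched in Python. This is only an illustration, not the tagging code used by the library: `KNOWN_TEXT_CLASSIFICATION_IDS` and `to_task_id` are hypothetical names, and the set contains only the two text-classification task_ids mentioned in this review.

```python
# Hypothetical helper; these names are not part of the datasets library.
# Assumption: only the two text-classification task_ids named in this review
# are treated as "known" to the taxonomy.
KNOWN_TEXT_CLASSIFICATION_IDS = {"sentiment-classification", "topic-classification"}
OTHER_PREFIX = "text-classification-other-"


def to_task_id(task: str) -> str:
    """Return the task as-is if it is a known id, else add the 'other' prefix."""
    if task in KNOWN_TEXT_CLASSIFICATION_IDS:
        return task
    return OTHER_PREFIX + task


# Per-configuration task_ids, mirroring the iemocap / maptask example.
task_ids = {
    "iemocap": ["dialogue-modeling", to_task_id("emotion-classification")],
    "maptask": ["dialogue-modeling", to_task_id("dialogue-act-classification")],
}

for config_name, ids in task_ids.items():
    print(config_name, ids)
```

Keeping the mapping keyed by configuration name makes it clear which tags apply to which sub-dataset.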

lhoestq · 2021-01-12T13:59:28Z

silicone/README.md

+
+### Data Instances
+
+#### DailyDialog Act Corpus (Dialogue Act)


Suggested change

-#### DailyDialog Act Corpus (Dialogue Act)
+#### DailyDialog Act Corpus (Dialogue Act)
+For the `dyda_da` configuration one example from the dataset is:

Can you add the configuration name this way ?
Same for the other datasets

lhoestq · 2021-01-12T14:01:06Z

silicone/README.md

+```
+{
+  'Utterance': the taxi drivers are on strike again .,
+  'Dialogue_Act': inform,


Suggested change

-  'Dialogue_Act': inform,
+  'Dialogue_Act': 2, # inform

Here we expect to see the example as it is returned by the `datasets` library.
Since the Dialogue_Act field is a ClassLabel, it returns the label id as an integer. Here 2 corresponds to the "inform" label name. The correspondence is defined by the list of label classes, which here is ["commissive", "directive", "inform", "question"].

Same for the other datasets
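The id-to-name correspondence described here can be mimicked with a plain list, as a rough sketch. In the `datasets` library the `ClassLabel` feature offers `str2int` and `int2str` methods for this; the two functions below are stand-ins written for illustration and do not use the library.

```python
# Label classes as listed in the review comment for the Dialogue_Act field.
label_names = ["commissive", "directive", "inform", "question"]


def str2int(name: str) -> int:
    """Stand-in for ClassLabel.str2int: the id is the position in the list."""
    return label_names.index(name)


def int2str(label_id: int) -> str:
    """Stand-in for ClassLabel.int2str: look the name up by position."""
    return label_names[label_id]


# The example as it would appear with the label stored as an integer id.
example = {
    "Utterance": "the taxi drivers are on strike again .",
    "Dialogue_Act": str2int("inform"),
}

print(example["Dialogue_Act"], int2str(example["Dialogue_Act"]))
```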

lhoestq · 2021-01-12T14:02:38Z

silicone/README.md

+#### DailyDialog Act Corpus (Dialogue Act)
+```
+{
+  'Utterance': the taxi drivers are on strike again .,


Suggested change

-  'Utterance': the taxi drivers are on strike again .,
+  'Utterance': "the taxi drivers are on strike again .",

Can you put strings in quotes, as in Python?
Same for the other datasets
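One way to get Python-style quoting without typing it by hand is to print the `repr` of the example dictionary. The snippet below is a small sketch using the example values from this review, not output copied from the actual dataset.

```python
# Example values taken from this review thread; Dialogue_Act uses the integer id.
example = {
    "Utterance": "the taxi drivers are on strike again .",
    "Dialogue_Act": 2,
}

# repr() renders strings with quotes and leaves integers bare.
rendered = repr(example)
print(rendered)
# {'Utterance': 'the taxi drivers are on strike again .', 'Dialogue_Act': 2}
```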


eusip · 2021-01-21T00:45:12Z

I've implemented all the changes requested by @lhoestq but I made the mistake of trying to change the remote branch name.

Hopefully the changes are seen on your end as both branches silicone and main should be up-to-date.

lhoestq · 2021-01-21T10:01:13Z

It looks like the PR includes changes about many other files than the ones for Silicone (+30,000 line changes)

Maybe you can try to create another branch and another PR ?

eusip · 2021-01-21T10:31:10Z

> It looks like the PR includes changes about many other files than the ones for Silicone (+30,000 line changes)
>
> Maybe you can try to create another branch and another PR ?

Sure. I will make a new pull request.

eusip added 2 commits January 8, 2021 19:18

initial commit

6953766

Merge remote-tracking branch 'upstream/master' into silicone

c1ca045

aballigier approved these changes Jan 9, 2021


lhoestq reviewed Jan 12, 2021


lhoestq and others added 23 commits January 21, 2021 01:20

Fix windows path scheme in cached path (huggingface#1711)

297a97b

* fix windows path scheme in cached_path * add test

Fix column list comparison in transmit format (huggingface#1719)

12fa68b

* make column order deterministic in transmit_format * add test

[Scientific papers] Mirror datasets zip (huggingface#1721)

58a3541

* mirror_scientific_papers_for_faster_download * upload * make dummy data lighter * delete more

Update README.md

1fc2f58

Update XSUM Factuality DatasetCard (huggingface#1704)

625b65e

add Korean intonation-aided intention identification dataset (hugging…

9ef80a2

…face#1715) * add Korean intonation-aided intention identification dataset * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Added unfiltered versions of the Wiki-Auto training data for the GEM …

cd626ef

…simplification task. (huggingface#1722)

Update Curiosity dialogs DatasetCard (huggingface#1700)

7ba7e79

* Update datasetcard * Update Curiosity DatasetCard * Update Curiosity Dialogs DatasetCard Missing Entries

Add MNIST dataset (huggingface#1730)

0fbcc20

* Add MNIST dataset * Update datasets/mnist/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

fix cache_file_name docstring to make it explicit that it is a path

b510b4d

Release: 1.2.1

89dc790

Adding adversarialQA dataset (huggingface#1714)

e46ad30

* Adding adversarialQA dataset * Added YAML tags * reduce dummy_data.zip files sizes Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>

wiki_auto: Fix invalid yaml

6f029ff

update link in TLC to be github links (huggingface#1737)

34d34a7

* update link to be github links * format code

adjust BrWaC dataset features name (huggingface#1736)

c58c0e1

Fix empty token bug for thainer and lst20 (huggingface#1734)

d4aebcc

* fix empty token bugs for thainer * fix empty token bug for lst20 Co-authored-by: charin <charin@central.tech>

Update add new dataset template (huggingface#1735)

afce7b1

* Update add new model template * Update ADD_NEW_DATASET.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Conda support (huggingface#1738)

5d90079

* Conda * All tags * x.x.x

update saving and loading methods for faiss index so to accept path l…

ab062b5

…ike objects (huggingface#1663)

jbragg and others added 17 commits January 21, 2021 01:20

Fix release conda worflow (huggingface#1746)

8cf5b39

* check if tag regex is valid * regex in quotes

Added generated READMEs for 161 datasets that were missing one. (hugg…

00f5361

…ingface#1707)

remove conflicting saudinewsnet card

2fac6f2

fix conflict in saudinewsnet dataset card

c34ce5b

Fix typo in README.md of cnn_dailymail (huggingface#1750)

eca38ae

add Stuctured Argument Extraction for Korean dataset (huggingface#1748)

1391c66

* add Stuctured Argument Extraction for Korean dataset * Update datasets/kor_sae/README.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

metadata: snli seems to be using an old schema

553f2a9

@yjernite @lhoestq I only fixed the languages: tag but others would need fixing too

update snli tags schema @julien-c @yjernite

ab5b4bb

fix comet citations (huggingface#1753)

2a28193

* fix comet citations * fixed citation1 wrong variable name Co-authored-by: Ricardo Rei <ricardorei@Ricardos-MacBook-Pro-2.local>

Update README.md

841e797

Update README and add dummy data

324121a

eusip closed this Jan 21, 2021

eusip deleted the silicone branch January 21, 2021 00:26

eusip restored the silicone branch January 21, 2021 00:29

eusip deleted the silicone branch January 21, 2021 00:30

eusip restored the silicone branch January 21, 2021 00:32

eusip reopened this Jan 21, 2021

eusip closed this Jan 21, 2021

eusip deleted the silicone branch January 21, 2021 14:12

eusip mentioned this pull request Jan 21, 2021

Add SILICONE benchmark #1761

Merged
