
Ro sent #1529

Closed
wants to merge 6 commits

Conversation

@iliemihai (Contributor) commented Dec 13, 2020

Movie reviews dataset for the Romanian language.

@SBrandeis (Contributor)

Hi @iliemihai, it looks like this PR holds changes from your previous PR #1493.
Would you mind removing them from the branch, please?

@SBrandeis (Contributor) left a comment

Very cool, thanks a lot!

I have a few comments regarding the script 👇

You will need to re-generate the dataset_infos.json file after making changes to the features declaration.

Also, the dataset card (the README.md file) is missing. You can find a template and a guide on how to complete it under the templates directory at the root of the library. You may run the datasets-tagging app to generate the YAML tags.

Feel free to reach out if you have any questions!

    {
        "id": datasets.Value("string"),
        "sentence": datasets.Sequence(datasets.Value("string")),
        "label": datasets.Sequence(datasets.Value("int32")),
Contributor

Looks like this could be a ClassLabel feature

Contributor Author

I was thinking about "label": datasets.Sequence(datasets.Value("int32")), as being 0 or 1 for positive or negative sentiment

Member

ClassLabel is useful for adding names to integer (often binary) values.
Here you can use, for example:

    "label": datasets.Sequence(datasets.ClassLabel(names=["negative", "positive"]))

where negative and positive are meaningful labels for sentiment analysis.

In practice the values that are stored will still be 0 and 1, but we'll know their string representation as well.

Comment on lines +65 to +66

    "sentence": datasets.Sequence(datasets.Value("string")),
    "label": datasets.Sequence(datasets.Value("int32")),
Contributor

Why are these two features Sequences? It looks like each id has one sentence and its associated label.

Contributor Author

Ah yes, it should not be a Sequence: "label": datasets.Sequence(datasets.Value("int32")) -> "label": datasets.Value("int32") :D

Member

Or actually:

    "label": datasets.ClassLabel(names=["negative", "positive"])

;)

Comment on lines +90 to +94

    urls_to_download_train = _URL + _TRAINING_FILE
    urls_to_download_test = _URL + _TEST_FILE

    train_path = dl_manager.download(urls_to_download_train)
    test_path = dl_manager.download(urls_to_download_test)
Contributor

I'd recommend using this syntax, like you did in your previous PR:

Suggested change:

    - urls_to_download_train = _URL + _TRAINING_FILE
    - urls_to_download_test = _URL + _TEST_FILE
    - train_path = dl_manager.download(urls_to_download_train)
    - test_path = dl_manager.download(urls_to_download_test)
    + urls = {
    +     "train": _URL + _TRAINING_FILE,
    +     "test": _URL + _TEST_FILE,
    + }
    + paths = dl_manager.download(urls)

As it enables parallelism in the download-and-extract task.
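A pure-Python sketch of the suggested pattern (the base URL and file names below are hypothetical placeholders; the real values live in the dataset script). Passing a single dict to dl_manager.download() lets the download manager fetch all files in one call and return a dict with the same keys mapped to local paths:

```python
# Hypothetical constants mirroring the script's pattern.
_URL = "https://example.com/ro-sent/"
_TRAINING_FILE = "train.csv"
_TEST_FILE = "test.csv"

# One dict in -> one dict of local paths out (keys are preserved),
# and the downloads can run in parallel.
urls = {
    "train": _URL + _TRAINING_FILE,
    "test": _URL + _TEST_FILE,
}
print(urls["train"])  # -> https://example.com/ro-sent/train.csv
```

In the real script this dict would be the single argument to dl_manager.download(), replacing the two separate calls.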


    train_path = dl_manager.download(urls_to_download_train)
    test_path = dl_manager.download(urls_to_download_test)
    print("FISIERE LUATE", train_path, test_path)
Contributor

Don't forget to remove debugging assets :)

Suggested change:

    - print("FISIERE LUATE", train_path, test_path)


    next(data, None)
    for row_id, row in enumerate(data):
        print("ROW", row)
Contributor

Suggested change:

    - print("ROW", row)

Comment on lines +124 to +125

    "sentence": [txt],
    "label": [lbl],
Contributor

Provided you change the features declaration, there's no need to encapsulate this in a list:

Suggested change:

    - "sentence": [txt],
    - "label": [lbl],
    + "sentence": txt,
    + "label": lbl,


    logging.info("⏳ Generating examples from = %s", filepath)
    with open(filepath, encoding="utf-8") as f:
        # data = pd.read_csv(filepath)
Contributor

Suggested change:

    - # data = pd.read_csv(filepath)

Comment on lines +116 to +119

    data = csv.reader(f, delimiter=",", quotechar='"')

    next(data, None)
    for row_id, row in enumerate(data):
Contributor

You might want to have a look at the csv.DictReader object.

It takes care of the header line for you.
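A quick stdlib-only sketch of that suggestion: csv.DictReader consumes the header row itself and yields dicts keyed by column name, so the manual `next(reader, None)` disappears (the column names here are assumptions mirroring the discussion above):

```python
import csv
import io

sample = 'id,sentence,label\n0,"Un film excelent!",1\n'

# No next(reader, None) needed: the first line becomes the field names.
reader = csv.DictReader(io.StringIO(sample), delimiter=",", quotechar='"')
rows = list(reader)
print(rows[0]["sentence"], rows[0]["label"])
```

Note that DictReader still yields strings, so a label would need an explicit int() (or a ClassLabel encoding) downstream.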

    # If there's a common (input, target) tuple from the features,
    # specify them here. They'll be used if as_supervised=True in
    # builder.as_dataset.
    supervised_keys=None,
Contributor

I think it makes sense to have supervised keys here:

Suggested change:

    - supervised_keys=None,
    + supervised_keys=("sentence", "label"),

@iliemihai (Contributor Author)

@SBrandeis I am sorry. Yes, I will remove them. Thank you :D

@lhoestq (Member) left a comment

Looking good so far :)
Please don't forget to add the dataset card with the YAML tags (more info here).

@gchhablani (Contributor)

Hi @lhoestq @SBrandeis @iliemihai

Is this still in progress or can I take over this one?

Thanks,
Gunjan

@gchhablani (Contributor)

Hi,
While trying to add this dataset, I found some potential issues.
The homepage mentioned is https://github.com/katakonst/sentiment-analysis-tensorflow/tree/master/datasets/ro/, where the dataset is different from the one at the URL used in the script: https://raw.githubusercontent.com/dumitrescustefan/Romanian-Transformers/examples/examples/sentiment_analysis/ro/train.csv. It is unclear which dataset is "correct". I checked the total number of examples (train + test) in both places and they do not match.

@lhoestq (Member) commented Mar 8, 2021

We should use the data from dumitrescustefan and set the homepage to his repo IMO, since he's the first author of the dataset's paper.

@gchhablani (Contributor)

Hi @lhoestq,

Cool, I'll get working on it.

Thanks

@gchhablani gchhablani mentioned this pull request Mar 9, 2021
@gchhablani (Contributor)

Hi @lhoestq,

This PR can be closed.

@lhoestq (Member) commented Mar 19, 2021

Closing in favor of #2011.
Thanks again for adding it!

@lhoestq lhoestq closed this Mar 19, 2021