
Ro sent #1529

Closed
wants to merge 6 commits

Conversation

@iliemihai (Contributor) commented Dec 13, 2020

Movie reviews dataset for the Romanian language.

@SBrandeis (Contributor)

Hi @iliemihai, it looks like this PR holds changes from your previous PR #1493.
Would you mind removing them from the branch, please?

@SBrandeis (Contributor) left a comment

Very cool, thanks a lot!

I have a few comments regarding the script 👇

You will need to re-generate the dataset_infos.json file after making changes to the features declaration.

Also, the dataset card (the README.md file) is missing. You can find a template and a guide on how to complete it under the templates directory at the root of the library. You may run the datasets-tagging app to generate the YAML tags.

Feel free to reach out if you have any questions!

    {
        "id": datasets.Value("string"),
        "sentence": datasets.Sequence(datasets.Value("string")),
        "label": datasets.Sequence(datasets.Value("int32")),
Contributor

Looks like this could be a ClassLabel feature

Contributor Author

I was thinking about "label": datasets.Sequence(datasets.Value("int32")), as being 0 or 1 for positive or negative sentiment

Member

ClassLabel is useful for adding names to integer (often binary) values.
Here you can use, for example:

    "label": datasets.Sequence(datasets.ClassLabel(names=["negative", "positive"]))

where negative and positive are meaningful labels for sentiment analysis.

In practice the values that are stored will still be 0 and 1, but we'll know their string representation as well.

Comment on lines +65 to +66

    "sentence": datasets.Sequence(datasets.Value("string")),
    "label": datasets.Sequence(datasets.Value("int32")),
Contributor

Why are these two features Sequences? It looks like each id has one sentence and its associated label.

Contributor Author

Ah yes, it should not be a Sequence: "label": datasets.Sequence(datasets.Value("int32")) -> "label": datasets.Value("int32") :D

Member

Or actually:

    "label": datasets.ClassLabel(names=["negative", "positive"])

;)

Comment on lines +90 to +94

    urls_to_download_train = _URL + _TRAINING_FILE
    urls_to_download_test = _URL + _TEST_FILE

    train_path = dl_manager.download(urls_to_download_train)
    test_path = dl_manager.download(urls_to_download_test)
Contributor

I'd recommend using this syntax, like you did in your previous PR:

Suggested change:

    - urls_to_download_train = _URL + _TRAINING_FILE
    - urls_to_download_test = _URL + _TEST_FILE
    - train_path = dl_manager.download(urls_to_download_train)
    - test_path = dl_manager.download(urls_to_download_test)
    + urls = {
    +     "train": _URL + _TRAINING_FILE,
    +     "test": _URL + _TEST_FILE,
    + }
    + paths = dl_manager.download(urls)

As it enables parallelism in the download-and-extract task.
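A pure-Python sketch of the suggested pattern (the base URL and file names below are hypothetical placeholders; the real values live in the dataset script). Passing a single dict to dl_manager.download() lets the download manager fetch all files in one call and return a dict with the same keys mapped to local paths:

```python
# Hypothetical constants mirroring the script's pattern.
_URL = "https://example.com/ro-sent/"
_TRAINING_FILE = "train.csv"
_TEST_FILE = "test.csv"

# One dict in -> one dict of local paths out (keys are preserved),
# and the downloads can run in parallel.
urls = {
    "train": _URL + _TRAINING_FILE,
    "test": _URL + _TEST_FILE,
}
print(urls["train"])  # -> https://example.com/ro-sent/train.csv
```

In the real script this dict would be the single argument to dl_manager.download(), replacing the two separate calls.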


    train_path = dl_manager.download(urls_to_download_train)
    test_path = dl_manager.download(urls_to_download_test)
    print("FISIERE LUATE", train_path, test_path)
Contributor

Don't forget to remove debugging assets :)

Suggested change:

    - print("FISIERE LUATE", train_path, test_path)


    next(data, None)
    for row_id, row in enumerate(data):
        print("ROW", row)
Contributor

Suggested change:

    - print("ROW", row)

Comment on lines +124 to +125

    "sentence": [txt],
    "label": [lbl],
Contributor

Provided you change the features declaration, there's no need to encapsulate this in a list:

Suggested change:

    - "sentence": [txt],
    - "label": [lbl],
    + "sentence": txt,
    + "label": lbl,


    logging.info("⏳ Generating examples from = %s", filepath)
    with open(filepath, encoding="utf-8") as f:
        # data = pd.read_csv(filepath)
Contributor

Suggested change:

    - # data = pd.read_csv(filepath)

Comment on lines +116 to +119

    data = csv.reader(f, delimiter=",", quotechar='"')

    next(data, None)
    for row_id, row in enumerate(data):
Contributor

You might want to have a look at the csv.DictReader object.

It takes care of the header line for you.
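A quick stdlib-only sketch of that suggestion: csv.DictReader consumes the header row itself and yields dicts keyed by column name, so the manual `next(reader, None)` disappears (the column names here are assumptions mirroring the discussion above):

```python
import csv
import io

sample = 'id,sentence,label\n0,"Un film excelent!",1\n'

# No next(reader, None) needed: the first line becomes the field names.
reader = csv.DictReader(io.StringIO(sample), delimiter=",", quotechar='"')
rows = list(reader)
print(rows[0]["sentence"], rows[0]["label"])
```

Note that DictReader still yields strings, so a label would need an explicit int() (or a ClassLabel encoding) downstream.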

    # If there's a common (input, target) tuple from the features,
    # specify them here. They'll be used if as_supervised=True in
    # builder.as_dataset.
    supervised_keys=None,
Contributor

I think it makes sense to have supervised keys here:

Suggested change:

    - supervised_keys=None,
    + supervised_keys=("sentence", "label"),

@iliemihai (Contributor Author)

@SBrandeis I am sorry. Yes, I will remove them. Thank you :D

@lhoestq (Member) left a comment

Looking good so far :)
Please don't forget to add the dataset card with the YAML tags (more info here).

@gchhablani (Contributor)

Hi @lhoestq @SBrandeis @iliemihai

Is this still in progress or can I take over this one?

Thanks,
Gunjan

@gchhablani (Contributor)

Hi,
While trying to add this dataset, I found some potential issues.
The homepage mentioned is https://github.com/katakonst/sentiment-analysis-tensorflow/tree/master/datasets/ro/, where the dataset is different from the one at the URL used in the script: https://raw.githubusercontent.com/dumitrescustefan/Romanian-Transformers/examples/examples/sentiment_analysis/ro/train.csv. It is unclear which dataset is "correct". I checked the total number of examples (train + test) in both places and they do not match.

@lhoestq (Member) commented Mar 8, 2021

We should use the data from dumitrescustefan and set the homepage to his repo IMO, since he's the first author of the dataset's paper.

@gchhablani (Contributor)

Hi @lhoestq,

Cool, I'll get working on it.

Thanks

@gchhablani gchhablani mentioned this pull request Mar 9, 2021
@gchhablani (Contributor)

Hi @lhoestq,

This PR can be closed.

@lhoestq (Member) commented Mar 19, 2021

Closing in favor of #2011.
Thanks again for adding it!

@lhoestq lhoestq closed this Mar 19, 2021