-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ro sent #1529
Ro sent #1529
Conversation
Hi @iliemihai, it looks like this PR holds changes from your previous PR #1493 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool, thanks a lot !
I have a few comments regarding the script 👇
You will need to re-generate the dataset_infos.json
file after making changes to the fetaures declaration.
Otherwise, the dataset card is missing (the README.md
file). You can find a template and a guid on how to complete it under the templates
directory at the root of the library. You may run the datasets-tagging
app to generate the YAML tags.
Feel free to reach out if you have any questions !
{ | ||
"id": datasets.Value("string"), | ||
"sentence": datasets.Sequence(datasets.Value("string")), | ||
"label": datasets.Sequence(datasets.Value("int32")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this could be a ClassLabel
feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about "label": datasets.Sequence(datasets.Value("int32")), as being 0 or 1 for positive or negative sentiment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClassLabel is useful to add names to integer (often binary) values.
Here you can use for example
"label": datasets.Sequence(datasets.ClassLabel(names=["negative", "positive"]))
where negative
and positive
are meaningful labels for sentiment analysis.
In practice the values that are stored will still be 0 and 1, but we'll know their string representation as well.
"sentence": datasets.Sequence(datasets.Value("string")), | ||
"label": datasets.Sequence(datasets.Value("int32")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these two features Sequences ? It looks like each id has 1 sentence and its associated label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, it should not be sequence : "label": datasets.Sequence(datasets.Value("int32")),
-> "label": datasets.Value("int32"),
:D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or actually
"label": datasets.ClassLabel(names=["negative", "positive"])
;)
urls_to_download_train = _URL + _TRAINING_FILE | ||
urls_to_download_test = _URL + _TEST_FILE | ||
|
||
train_path = dl_manager.download(urls_to_download_train) | ||
test_path = dl_manager.download(urls_to_download_test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend using this syntax, like you did in your previous PR:
urls_to_download_train = _URL + _TRAINING_FILE | |
urls_to_download_test = _URL + _TEST_FILE | |
train_path = dl_manager.download(urls_to_download_train) | |
test_path = dl_manager.download(urls_to_download_test) | |
urls = { | |
"train": _URL + _TRAINING_FILE, | |
"test": _URL + _TEST_FILE, | |
} | |
paths = dl_manager.download(urls) |
As it enables parrallelism in the download-extract task
|
||
train_path = dl_manager.download(urls_to_download_train) | ||
test_path = dl_manager.download(urls_to_download_test) | ||
print("FISIERE LUATE", train_path, test_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to remove debugging assets :)
print("FISIERE LUATE", train_path, test_path) |
|
||
next(data, None) | ||
for row_id, row in enumerate(data): | ||
print("ROW", row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print("ROW", row) |
"sentence": [txt], | ||
"label": [lbl], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provided you change the features declaration, there's no need to encapsulate this in a list:
"sentence": [txt], | |
"label": [lbl], | |
"sentence": txt, | |
"label": lbl, |
|
||
logging.info("⏳ Generating examples from = %s", filepath) | ||
with open(filepath, encoding="utf-8") as f: | ||
# data = pd.read_csv(filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# data = pd.read_csv(filepath) |
data = csv.reader(f, delimiter=",", quotechar='"') | ||
|
||
next(data, None) | ||
for row_id, row in enumerate(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to have a look at the csv.DictReader object
It takes cares of the header line for you
# If there's a common (input, target) tuple from the features, | ||
# specify them here. They'll be used if as_supervised=True in | ||
# builder.as_dataset. | ||
supervised_keys=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sens to have supervised keys here:
supervised_keys=None, | |
supervised_keys=("sentence", "label"), |
@SBrandeis I am sorry. Yes I will remove them. Thank you :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far :)
Please don't forget to add the dataset cards with the YAML tags (more infos here)
Hi @lhoestq @SBrandeis @iliemihai Is this still in progress or can I take over this one? Thanks, |
Hi, |
We should use the data from dumitrescustefan and set the homepage to his repo IMO, since he's first author of the paper of the dataset. |
Hi @lhoestq, Cool, I'll get working on it. Thanks |
Hi @lhoestq, This PR can be closed. |
Closing in favor of #2011 |
Movies reviews dataset for Romanian language.