
Add benchmark made of multiple text datasets #354

Merged: 38 commits into dev on Aug 3, 2023
Conversation

@610v4nn1 (Contributor) commented Jul 25, 2023

Add a new data module loading a collection of five public text datasets, referred to as domains.

The dataset will be added to the benchmark in a follow-up PR.
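The core idea of such a module is that each public dataset plays the role of one "domain" in the benchmark. A minimal sketch of the domain-indexing logic is below; the dataset identifiers and the `domain_for_chunk` helper are illustrative, not Renate's actual API or the five datasets this PR bundles.

```python
from typing import List

# Illustrative domain (dataset) identifiers; the five public datasets
# actually used by this PR may differ.
DOMAINS: List[str] = [
    "ag_news",
    "dbpedia_14",
    "yahoo_answers_topics",
    "yelp_review_full",
    "amazon_polarity",
]


def domain_for_chunk(chunk_id: int) -> str:
    """Map a benchmark chunk index to the dataset treated as that domain."""
    if not 0 <= chunk_id < len(DOMAINS):
        raise ValueError(f"chunk_id must be in [0, {len(DOMAINS)})")
    return DOMAINS[chunk_id]


print(domain_for_chunk(0))  # ag_news
```

A domain-incremental scenario would then iterate over these chunk indices, loading one dataset per training stage.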

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@610v4nn1 610v4nn1 changed the base branch from main to dev July 25, 2023 12:55
@610v4nn1 610v4nn1 requested a review from lballes July 27, 2023 10:13
@610v4nn1 610v4nn1 assigned 610v4nn1 and lballes and unassigned 610v4nn1 Jul 27, 2023
@610v4nn1 610v4nn1 marked this pull request as ready for review July 27, 2023 10:17
@wistuba (Contributor) commented Jul 27, 2023

This code is based on #312, right? It has exactly the same test failure, and I honestly don't know why.

@610v4nn1 (Contributor, Author):

Locally, when the test fails, it is sufficient to re-run it and it will pass.

@github-actions bot commented Jul 31, 2023

Coverage report

The coverage rate went from 85.68% to 84.99% ⬇️

26.08% of new lines are covered.

Diff coverage details

src/renate/defaults.py

100% of new lines are covered (99.02% of the complete file).

src/renate/benchmark/datasets/nlp_datasets.py

22.72% of new lines are covered (61.16% of the complete file).
Missing lines: 225, 227, 228, 229, 231, 232, 233, 235, 236, 238, 239, 243, 248, 249, 254, 256, 257, 262, 263, 268, 269, 273, 275, 276, 278, 280, 286, 288, 298, 300, 302, 303, 304, 305

@wistuba (Contributor) left a comment


I accidentally left some more comments but I came here to point out the relationship to #357. We should merge #357 first and then make small modifications here:

  • The data module extends DomainIncrementalDataModule

On a general note, we should add the data module to experiment_config.py and add the dataset to the documentation. Its usage will be similar to DomainNet's, relying on the DomainIncrementalScenario.

Resolved review threads:

  • src/renate/defaults.py
  • test/renate/benchmark/datasets/test_multi_data_nlp.py (4 threads)
  • src/renate/benchmark/datasets/nlp_datasets.py (5 threads)
@wistuba wistuba assigned wistuba and unassigned lballes Aug 3, 2023
@610v4nn1 610v4nn1 changed the base branch from dev to main August 3, 2023 12:10
@610v4nn1 610v4nn1 changed the base branch from main to dev August 3, 2023 12:10
@610v4nn1 610v4nn1 requested a review from wistuba August 3, 2023 12:12
Resolved review threads on src/renate/benchmark/datasets/nlp_datasets.py (2 threads)

def get_split(split_name):
dataset = load_dataset(self.data_id, split=split_name, cache_dir=self._data_path)
new_features = dataset.features.copy()
Contributor comment: why is a copy needed here?

Resolved review thread on src/renate/benchmark/datasets/nlp_datasets.py
Comment on lines +220 to +221
train_size: int = defaults.SMALL_TRAIN_SET_SIZE,
test_size: int = defaults.SMALL_TEST_SET_SIZE,
Contributor comment:

what is the intuition of selecting a subset for this specific dataset?

Author's reply:

I set a relatively small value by default because I expect it to be closer to actual usage than the maximum dataset size.

Comment on lines +107 to +108
SMALL_TRAIN_SET_SIZE = 1000
SMALL_TEST_SET_SIZE = 1000
Contributor comment:

The names still imply that they are used generally. I was thinking of something more along the lines of MULTI_TEXT_TRAIN_SET_SIZE.

Author's reply:

I would prefer not to have per-dataset default training/test set sizes.

@610v4nn1 610v4nn1 merged commit 54c37ce into dev Aug 3, 2023
18 checks passed
@610v4nn1 610v4nn1 deleted the gz-multitext branch August 3, 2023 16:49
3 participants