
Add benchmark made of multiple text datasets #354

Merged: 38 commits into dev on Aug 3, 2023
Conversation

@610v4nn1 (Contributor) commented Jul 25, 2023

Add a new data module loading a collection of five public text datasets, referred to as domains.

The dataset will be added to the benchmark in a follow-up PR.
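The core idea of such a module is that each public dataset plays the role of one "domain" in the benchmark. A minimal sketch of the domain-indexing logic is below; the dataset identifiers and the `domain_for_chunk` helper are illustrative, not Renate's actual API or the five datasets this PR bundles.

```python
from typing import List

# Illustrative domain (dataset) identifiers; the five public datasets
# actually used by this PR may differ.
DOMAINS: List[str] = [
    "ag_news",
    "dbpedia_14",
    "yahoo_answers_topics",
    "yelp_review_full",
    "amazon_polarity",
]


def domain_for_chunk(chunk_id: int) -> str:
    """Map a benchmark chunk index to the dataset treated as that domain."""
    if not 0 <= chunk_id < len(DOMAINS):
        raise ValueError(f"chunk_id must be in [0, {len(DOMAINS)})")
    return DOMAINS[chunk_id]


print(domain_for_chunk(0))  # ag_news
```

A domain-incremental scenario would then iterate over these chunk indices, loading one dataset per training stage.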

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@610v4nn1 610v4nn1 changed the base branch from main to dev July 25, 2023 12:55
@610v4nn1 610v4nn1 requested a review from lballes July 27, 2023 10:13
@610v4nn1 610v4nn1 assigned 610v4nn1 and lballes and unassigned 610v4nn1 Jul 27, 2023
@610v4nn1 610v4nn1 marked this pull request as ready for review July 27, 2023 10:17
@wistuba (Contributor) commented Jul 27, 2023

This code is based on #312, right? It has exactly the same test failure, and I honestly don't know why.

@610v4nn1 (Contributor, Author):

Locally, when the test fails, it is sufficient to re-run it and it will pass.

@github-actions bot commented Jul 31, 2023

Coverage report

The coverage rate went from 85.68% to 84.99% ⬇️

26.08% of new lines are covered.

Diff coverage details

src/renate/defaults.py

100% of new lines are covered (99.02% of the complete file).

src/renate/benchmark/datasets/nlp_datasets.py

22.72% of new lines are covered (61.16% of the complete file).
Missing lines: 225, 227, 228, 229, 231, 232, 233, 235, 236, 238, 239, 243, 248, 249, 254, 256, 257, 262, 263, 268, 269, 273, 275, 276, 278, 280, 286, 288, 298, 300, 302, 303, 304, 305

@wistuba (Contributor) left a comment


I accidentally left some more comments but I came here to point out the relationship to #357. We should merge #357 first and then make small modifications here:

  • The data module extends DomainIncrementalDataModule

On a general note, we should add the data module to experiment_config.py and add the dataset to the documentation. Its usage will be similar to DomainNet's, relying on the DomainIncrementalScenario.

Resolved review threads:

  • src/renate/defaults.py
  • test/renate/benchmark/datasets/test_multi_data_nlp.py (4 threads)
  • src/renate/benchmark/datasets/nlp_datasets.py (5 threads)
@wistuba wistuba assigned wistuba and unassigned lballes Aug 3, 2023
@610v4nn1 610v4nn1 changed the base branch from dev to main August 3, 2023 12:10
@610v4nn1 610v4nn1 changed the base branch from main to dev August 3, 2023 12:10
@610v4nn1 610v4nn1 requested a review from wistuba August 3, 2023 12:12
Resolved review threads on src/renate/benchmark/datasets/nlp_datasets.py (2 threads)

def get_split(split_name):
dataset = load_dataset(self.data_id, split=split_name, cache_dir=self._data_path)
new_features = dataset.features.copy()
Contributor comment: why is a copy needed here?

Resolved review thread on src/renate/benchmark/datasets/nlp_datasets.py
Comment on lines +220 to +221
train_size: int = defaults.SMALL_TRAIN_SET_SIZE,
test_size: int = defaults.SMALL_TEST_SET_SIZE,
Contributor comment:

what is the intuition of selecting a subset for this specific dataset?

Author's reply:

I set a relatively small value by default because I expect it to be closer to actual usage than the maximum dataset size.

Comment on lines +107 to +108
SMALL_TRAIN_SET_SIZE = 1000
SMALL_TEST_SET_SIZE = 1000
Contributor comment:

The names still imply that they are used generally. I was thinking of something more along the lines of MULTI_TEXT_TRAIN_SET_SIZE.

Author's reply:

I would prefer not to have per-dataset default training/test set sizes.

@610v4nn1 610v4nn1 merged commit 54c37ce into dev Aug 3, 2023
18 checks passed
@610v4nn1 610v4nn1 deleted the gz-multitext branch August 3, 2023 16:49
3 participants