
refactor: inherit information from training set for valid/test sets #446

Merged
merged 26 commits into main from 444_InheritGraphDatasetInfo_joyceljy
Jul 20, 2023

Conversation


@joyceljy joyceljy commented Jun 13, 2023

dataset.py & test_dataset.py

  1. Add a function named _check_inherited_param. For validation and test datasets (i.e., when train is set to False), _check_inherited_param is run and the following information is inherited from the training set, with a warning message:
    target, task, node_features, edge_features, features_dict, target_transform and classes.
  2. Change the order of the parameters of the GraphDataset & GridDataset objects to the following:
    hdf5_path, subset, train, dataset_train, node_features, edge_features, features_transform, clustering_method, target, target_transform, target_filter, task, classes, tqdm, root, check_integrity
  3. Change the GraphDataset & GridDataset parameter order in the unit tests, and remove the parameters that will be inherited from the training set.
  4. Add unit tests test_inherit_info_training_graphdataset & test_inherit_info_training_griddataset to check whether the above parameters are inherited.
  5. Add a unit test test_incompatible_dataset_train_type to check the compatibility of assigning different types of data objects for the training, validation, and testing datasets.

trainer.py & test_trainer.py

  1. Modify the function _check_dataset_equivalence: remove the for loop that checks the dataset parameters (this information should now be inherited directly).
  2. Add a function _check_dataset_value to check the settings of the valid/test datasets.
  3. Modify the unit tests test_dataset_equivalence_no_pretrained & test_dataset_equivalence_pretrained to match the new trainer.py.
  4. Remove the unit test test_grid_graph_incompatible; this test is related to the dataset and should be done in test_dataset.py.

Tutorials/README/other files

Update the description of the GraphDataset & GridDataset objects' parameter order, and remove the parameters that will be inherited from the training set.
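The intended behavior described above can be sketched with a self-contained toy class; the class and parameter names follow the PR, but the implementation is a simplified assumption, not the real deeprankcore code:

```python
# Toy mimicking the inheritance behavior described above: for valid/test
# sets (train=False), the listed parameters come from dataset_train.
class ToyGraphDataset:
    _inherited = ("target", "task", "node_features", "edge_features")

    def __init__(self, hdf5_path, train=True, dataset_train=None,
                 target=None, task=None, node_features=None, edge_features=None):
        self.hdf5_path = hdf5_path
        self.train = train
        self.dataset_train = dataset_train
        if train:
            self.target, self.task = target, task
            self.node_features, self.edge_features = node_features, edge_features
        else:
            # valid/test set: inherit from the training set, ignoring local values
            for name in self._inherited:
                setattr(self, name, getattr(dataset_train, name))

train_ds = ToyGraphDataset("train.hdf5", train=True, target="binary", task="classif")
valid_ds = ToyGraphDataset("valid.hdf5", train=False, dataset_train=train_ds)
print(valid_ds.target, valid_ds.task)  # → binary classif
```

Note that the validation set repeats none of the training parameters; it only names its own data file and points at the training set.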

@joyceljy joyceljy self-assigned this Jun 13, 2023
@joyceljy joyceljy requested review from DaniBodor and gcroci2 June 13, 2023 13:27
self.features_transform = None

# set features_transform to the same value as in dataset_train
self.features_transform = dataset_train.features_transform

@gcroci2 gcroci2 Jun 16, 2023


The if statement should stay, since it may be that dataset_train.features_transform is None. In that case, we want self.features_transform to be None as well. In these 4 lines just keep the old code, which was already doing what we want to address here for the other attributes.


@gcroci2 gcroci2 left a comment


Nice :) thanks!

In the log warning about the attributes already set in the train dataset, I'd add also the value of the attribute itself, like

_log.warning("edge_features parameter, if set, will be ignored." + "\n" +
                        "edge_features will remain the same as the one used in training phase: %s.", edge_features)

And I would also add a test to verify that even when we set different attributes in training and validation sets for example, in the validation set they are overwritten with the ones of the training dataset.

@joyceljy joyceljy requested a review from gcroci2 June 21, 2023 12:32
@DaniBodor DaniBodor removed their request for review June 26, 2023 13:52

@DaniBodor DaniBodor left a comment


It would be nice to update the docstring to reflect that the parameters should only be set for training dataset. For that matter, I think that the dataset_train argument should precede the ones that are inherited and that is where the docstring can then state which parameters should be ignored when set to False.

Also I'd consider making dataset_train a required parameter, but not sure about that.

Comment on lines 672 to 673
_log.warning("task parameter, if set, will be ignored." + "\n" +
"task will remain the same as the one used in training phase: %s.", self.task)
Collaborator


I feel that we only need to add these warnings (or maybe even throw errors?) in case a user sets the parameters manually and they differ from the inherited ones.
If that doesn't happen, I don't see any reason why this needs to go into the log or warnings or anywhere. It is the default expected behavior of the software.
One way to do this is to loop through the different inherited settings, check for each whether it matches the user-set one, and deal with it accordingly.

Also, I personally very much prefer f-strings over %-formatting (plus I think it's the recommended way now).
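Combining the two suggestions in this comment, a warn-only-on-mismatch loop using f-strings might look like the following; the helper name and its signature are assumptions for illustration, only the parameter names come from the PR:

```python
import logging
from types import SimpleNamespace

_log = logging.getLogger(__name__)

INHERITABLE = ("target", "task", "node_features", "edge_features",
               "features_transform", "target_transform", "classes")

def warn_on_mismatch(dataset_train, user_kwargs):
    """Warn (with an f-string) only when a user-set value differs from the
    one inherited from the training set; always return the inherited values."""
    inherited_values = {}
    for name in INHERITABLE:
        inherited = getattr(dataset_train, name, None)
        supplied = user_kwargs.get(name)
        if supplied is not None and supplied != inherited:
            _log.warning(f"{name} parameter, if set, will be ignored.\n"
                         f"{name} will remain the same as the one used in "
                         f"the training phase: {inherited}.")
        inherited_values[name] = inherited
    return inherited_values

# Hypothetical training-set attributes, for illustration only.
train = SimpleNamespace(target="binary", task="classif", node_features=None,
                        edge_features=["distance"], features_transform=None,
                        target_transform=False, classes=[0, 1])
# Only "task" differs from the training set, so only one warning fires.
result = warn_on_mismatch(train, {"task": "regress"})
```

Matching values and unset parameters pass silently, which addresses the point that default expected behavior should not clutter the log.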

Collaborator Author


I don't think I fully understand what you mentioned about dataset_train.
Do you mean to move it before the parameters that need to be inherited in the definition?
Besides, I modified the docstring to state the parameters that will be ignored and inherited.

@DaniBodor DaniBodor changed the title fix: Inherit information from training set for valid/test sets refactor: Inherit information from training set for valid/test sets Jun 28, 2023
@DaniBodor

PS: I changed the type of this PR, because it's not actually a bug fix, just a nicer way of dealing with things :)

Comment on lines 584 to 590

dataset_train (class:`GraphDataset`, optional): If train is True, assign here the training set,
the following parameters target, task, node_features, edge_features and features_transform will be set
as the same value in dataset_train.

Defaults to None.

Collaborator


I think I didn't explain well in my previous comment.

Right now, it's still a bit confusing, especially because the train parameter is set later.

I think it makes sense to first set train to True or False and in the description of that argument state: if True then set these arguments and if False set the other arguments. Then in the description of the other arguments state that they will be ignored if the train setting is False/True.

Collaborator Author


Thanks for the explanation :)
True, the order of the definitions was quite confusing.
I adjusted the order of the parameters and the description in the docstring a bit.
Now it should be more readable.


@DaniBodor DaniBodor left a comment


Yes, much clearer. Thanks.
Now only the train argument in the actual initialization should match the location in the docstring (so line 574 also needs to move above line 562). Then I think it's all done 👍

@gcroci2 gcroci2 changed the title refactor: Inherit information from training set for valid/test sets refactor: inherit information from training set for valid/test sets Jul 4, 2023
@gcroci2

gcroci2 commented Jul 11, 2023

Apart from inline comments, shouldn't def len(self) in dataset.py be def __len__(self)?

This is also not in the scope of this PR; let's try not to make it even bigger and with more scopes than it already has ;) (even if it would be a minor change).

About the len method, DeeprankDataset inherits from torch_geometric.data.dataset.Dataset and overrides the Dataset's len method.

What usually happens under the hood is that the built-in len calls the magic method __len__, but I am not sure if this applies only to Python's built-in len (which is built into Python, it's not a PyTorch thing) or to PyTorch's len method as well; it would be interesting to know (I couldn't find a satisfying answer googling it a bit).

For the scope of the dataset module, len is the correct one to use.
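The delegation being discussed can be shown with a minimal sketch; this is an assumed, simplified version of the PyTorch Geometric pattern, not its actual source, where the base class defines the magic __len__ once and subclasses only override a plain len() method:

```python
class BaseDataset:
    def len(self) -> int:          # overridden by subclasses
        raise NotImplementedError

    def __len__(self) -> int:      # what the built-in len() actually calls
        return self.len()

class ToyDataset(BaseDataset):
    def __init__(self, items):
        self.items = items

    def len(self) -> int:          # the override, as DeeprankDataset does
        return len(self.items)

print(len(ToyDataset([1, 2, 3])))  # → 3
```

So calling the built-in len() on a dataset dispatches to __len__, which in this pattern forwards to the overridable len() method, which is why overriding len is the correct move for the dataset module.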


joyceljy commented Jul 11, 2023

I found out at last that whether the dataset_train parameter is a Graph/GridDataset has already been checked in the dataset.py file (see lines 490 & 725). So _check_dataset_equivalence will only check the following things:

  1. If the train dataset is not set (this happens when a pretrained_model is provided), Trainer should have a test dataset assigned.
  2. If the train dataset is set, check whether its type is a Graph/GridDataset.
  3. Check that the train parameter in the val/test datasets is set to False.
  4. Check that the dataset_train parameter is equivalent to the train dataset passed to Trainer.

The test test_dataset_equivalence_pretrained in test_trainer.py checks the first point above, while the test test_dataset_equivalence_no_pretrained, also in test_trainer.py, checks points 2-4.

Besides, I removed the test test_grid_graph_incompatible from test_trainer.py since the checking of the dataset type is actually done in dataset.py. Instead, a new test named test_incompatible_dataset_train_type was added to test_dataset.py.

@joyceljy joyceljy requested a review from gcroci2 July 11, 2023 16:11

@DaniBodor DaniBodor left a comment


As per some of my inline comments, it's a bit weird to first set train and dataset_train in the Dataset classes, but then have no flexibility on this when setting up the Trainer. I think it would be ideal if the inheritance happened only when the Trainer is instantiated, and until then the Dataset objects were ignorant of training status.

Also, note that the parameters target_transform, tqdm, and root in both Dataset types are not Optional, i.e. they must be set and cannot be None.
I realize this is just outside the scope of this PR, but it is closely related, very easy to solve, and would likely lead to merge conflicts if changed elsewhere, so I think it makes sense to change it here.

Comment on lines 265 to 267
if dataset_val.dataset_train != dataset_train:
raise ValueError("valid dataset has different dataset_train parameter compared to the one given in Trainer.\n" +
"Make sure to assign equivalent dataset_train in Trainer")
Collaborator


It seems a bit redundant to assign dataset_train twice: once when creating the Dataset objects and once when creating the Trainer.
Would it be an idea to drop the dataset_train argument in the Dataset classes and move the _inherit_params function to Trainer? Or is this too much work after having already set it up differently now?

Collaborator


I still have to review it, but it shouldn't be assigned twice; the second time, in the Trainer, should be a check only.

# Check train parameter in test is set as False.
if dataset_test.train is not False:
raise ValueError(f"""test dataset has train parameter {dataset_test.train}
Make sure to set it as False""")
Collaborator


Similar comment to the one above. If it is not allowed to be set to anything other than False, why ask for it to be set? Potentially we can drop both arguments in the Dataset classes and deal with them in the Trainer.
This would also allow users to assign Dataset objects that are neutral in terms of train/valid/test type, and then do parallel experiments where each is used for each type once.

Also, it makes the flow less error-prone (from the perspective of a user setting things up incorrectly and trying to figure out what's wrong).


@gcroci2 gcroci2 Jul 12, 2023


The issue would be about standardization, which is the reason why train and dataset_train were created in the first place. In the beginning, they were meant for retaining and using the means and stds from the training set when creating testing/validation sets. I think that such a distinction should happen when the sets are defined, which is when we instantiate the different GraphDataset objects.

I still have to finish the new review of this PR, but I think the least redundant solution could be the following (maybe already in place): create the sets and use the train/dataset_train parameters in the DeeprankDataset children, and then in the Trainer only verify that what is supposed to be train, valid, and test has the right settings to be so. Otherwise, signal to the user that the dataset configurations are wrong.

Collaborator


I see the point, and I think you are right. I also think that this is the way it is set up now, so my comments can be disregarded.

@DaniBodor

what a monster PR this has turned out to be 🤦 👹

@gcroci2

gcroci2 commented Jul 18, 2023

I left minor comments around, we will discuss this tomorrow but from my side I think it's fine as it is :)

@joyceljy joyceljy requested a review from gcroci2 July 19, 2023 12:56
@joyceljy

I also changed the description of this PR to give an overview of what it has solved!


@gcroci2 gcroci2 left a comment


Great job :-) and finally over! Please address the two very minor comments that I left, then feel free to merge it and close the issue!


@DaniBodor DaniBodor left a comment


I won't review the newest changes anymore and assume that @gcroci2's final review suffices. I will accept now, so that you have permission to merge.
Great job tackling this PR!!

One final comment, super unimportant but very quick to solve: I noticed that you (accidentally?) sometimes added trailing whitespace to some lines. I recently discovered a way to make VS Code automatically trim trailing whitespace (in File > Preferences > Settings, search for and tick "Trim Trailing Whitespace").
Also, you can auto-replace all trailing whitespace with a couple of clicks as described here

@joyceljy joyceljy merged commit b152a4b into main Jul 20, 2023
@joyceljy joyceljy deleted the 444_InheritGraphDatasetInfo_joyceljy branch July 20, 2023 15:46
Development

Successfully merging this pull request may close these issues.

Inherit information from training data when train = False in GraphDataset class
3 participants