
refactor: inherit information from training set for valid/test sets #446

Merged
merged 26 commits into main from 444_InheritGraphDatasetInfo_joyceljy
Jul 20, 2023

Conversation


@joyceljy joyceljy commented Jun 13, 2023

dataset.py & test_dataset.py

  1. Add a function named _check_inherited_param. For validation and test datasets (i.e., when train is set to False), _check_inherited_param is run and the following information is inherited from the training set, with a warning message:
    target, task, node_features, edge_features, features_dict, target_transform and classes.
  2. Change the order of the parameters of the GraphDataset & GridDataset objects to the following:
    hdf5_path, subset, train, dataset_train, node_features, edge_features, features_transform, clustering_method, target, target_transform, target_filter, task, classes, tqdm, root, check_integrity
  3. Change the GraphDataset & GridDataset parameter order in the unit tests, and remove the parameters that will be inherited from the training set.
  4. Add unit tests test_inherit_info_training_graphdataset & test_inherit_info_training_griddataset to check whether the above parameters are inherited.
  5. Add a unit test test_incompatible_dataset_train_type to check the compatibility of assigning different types of data objects for the training, validation, and testing datasets.

trainer.py & test_trainer.py

  1. Modify the function _check_dataset_equivalence: remove the for loop that checks the dataset parameters (this information should now be inherited directly).
  2. Add a function _check_dataset_value to check the settings of the valid/test datasets.
  3. Modify the unit tests test_dataset_equivalence_no_pretrained & test_dataset_equivalence_pretrained to match the new trainer.py.
  4. Remove the unit test test_grid_graph_incompatible; this test is related to the dataset and should be done in test_dataset.py.

Tutorials/README/other files

Update the description of the GraphDataset & GridDataset objects' parameter order, and remove the parameters that will be inherited from the training set.
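The intended behavior described above can be sketched with a self-contained toy class; the class and parameter names follow the PR, but the implementation is a simplified assumption, not the real deeprankcore code:

```python
# Toy mimicking the inheritance behavior described above: for valid/test
# sets (train=False), the listed parameters come from dataset_train.
class ToyGraphDataset:
    _inherited = ("target", "task", "node_features", "edge_features")

    def __init__(self, hdf5_path, train=True, dataset_train=None,
                 target=None, task=None, node_features=None, edge_features=None):
        self.hdf5_path = hdf5_path
        self.train = train
        self.dataset_train = dataset_train
        if train:
            self.target, self.task = target, task
            self.node_features, self.edge_features = node_features, edge_features
        else:
            # valid/test set: inherit from the training set, ignoring local values
            for name in self._inherited:
                setattr(self, name, getattr(dataset_train, name))

train_ds = ToyGraphDataset("train.hdf5", train=True, target="binary", task="classif")
valid_ds = ToyGraphDataset("valid.hdf5", train=False, dataset_train=train_ds)
print(valid_ds.target, valid_ds.task)  # → binary classif
```

Note that the validation set repeats none of the training parameters; it only names its own data file and points at the training set.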

@joyceljy joyceljy self-assigned this Jun 13, 2023
@joyceljy joyceljy requested review from DaniBodor and gcroci2 June 13, 2023 13:27
self.features_transform = None

# set features_transform to the same value as in dataset_train
self.features_transform = dataset_train.features_transform

@gcroci2 gcroci2 Jun 16, 2023


The if statement should stay, since it may be that dataset_train.features_transform is None. In that case, we want self.features_transform to be None as well. In these 4 lines just keep the old code, which was already doing what we want to address here for the other attributes.


@gcroci2 gcroci2 left a comment


Nice :) thanks!

In the log warning about the attributes already set in the train dataset, I'd add also the value of the attribute itself, like

_log.warning("edge_features parameter, if set, will be ignored." + "\n" +
                        "edge_features will remain the same as the one used in training phase: %s.", edge_features)

And I would also add a test to verify that even when we set different attributes in training and validation sets for example, in the validation set they are overwritten with the ones of the training dataset.

@joyceljy joyceljy requested a review from gcroci2 June 21, 2023 12:32
@DaniBodor DaniBodor removed their request for review June 26, 2023 13:52

@DaniBodor DaniBodor left a comment


It would be nice to update the docstring to reflect that the parameters should only be set for training dataset. For that matter, I think that the dataset_train argument should precede the ones that are inherited and that is where the docstring can then state which parameters should be ignored when set to False.

Also I'd consider making dataset_train a required parameter, but not sure about that.

Comment on lines 672 to 673
_log.warning("task parameter, if set, will be ignored." + "\n" +
"task will remain the same as the one used in training phase: %s.", self.task)
Collaborator


I feel that we only need to add these warnings (or maybe even throw errors?) in case a user sets the parameters manually and they differ from the inherited ones.
If that doesn't happen, I don't see any reason why this needs to go into the log or warnings or anywhere. It is the default expected behavior of the software.
One way to do this is to loop through the different inherited settings, check for each whether it matches the user-set one, and deal with it accordingly.

Also, I personally very much prefer f-strings over %-formatting (plus I think it's the recommended way now).
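Combining the two suggestions in this comment, a warn-only-on-mismatch loop using f-strings might look like the following; the helper name and its signature are assumptions for illustration, only the parameter names come from the PR:

```python
import logging
from types import SimpleNamespace

_log = logging.getLogger(__name__)

INHERITABLE = ("target", "task", "node_features", "edge_features",
               "features_transform", "target_transform", "classes")

def warn_on_mismatch(dataset_train, user_kwargs):
    """Warn (with an f-string) only when a user-set value differs from the
    one inherited from the training set; always return the inherited values."""
    inherited_values = {}
    for name in INHERITABLE:
        inherited = getattr(dataset_train, name, None)
        supplied = user_kwargs.get(name)
        if supplied is not None and supplied != inherited:
            _log.warning(f"{name} parameter, if set, will be ignored.\n"
                         f"{name} will remain the same as the one used in "
                         f"the training phase: {inherited}.")
        inherited_values[name] = inherited
    return inherited_values

# Hypothetical training-set attributes, for illustration only.
train = SimpleNamespace(target="binary", task="classif", node_features=None,
                        edge_features=["distance"], features_transform=None,
                        target_transform=False, classes=[0, 1])
# Only "task" differs from the training set, so only one warning fires.
result = warn_on_mismatch(train, {"task": "regress"})
```

Matching values and unset parameters pass silently, which addresses the point that default expected behavior should not clutter the log.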

Collaborator Author


I don't think I fully understand what you mentioned about dataset_train.
Do you mean to move it before the parameters that need to be inherited in the definition?
Besides, I modified the docstring to state the parameters that will be ignored and inherited.

@DaniBodor DaniBodor changed the title fix: Inherit information from training set for valid/test sets refactor: Inherit information from training set for valid/test sets Jun 28, 2023
@DaniBodor

PS: I changed the type of this PR, because it's not actually a bug fix, just a nicer way of dealing with things :)

Comment on lines 584 to 590

dataset_train (class:`GraphDataset`, optional): If train is True, assign here the training set,
the following parameters target, task, node_features, edge_features and features_transform will be set
as the same value in dataset_train.

Defaults to None.

Collaborator


I think I didn't explain well in my previous comment.

Right now, it's still a bit confusing, especially because the train parameter is set later.

I think it makes sense to first set train to True or False and in the description of that argument state: if True then set these arguments and if False set the other arguments. Then in the description of the other arguments state that they will be ignored if the train setting is False/True.

Collaborator Author


Thanks for the explanation :)
True, the order of the definitions was quite confusing.
I adjusted the order of the parameters and the description in the docstring a bit.
Now it should be more readable.


@DaniBodor DaniBodor left a comment


Yes, much clearer. Thanks.
Now only the train argument in the actual initialization should match the location in the docstring (so line 574 also needs to move above line 562). Then I think it's all done 👍

@gcroci2 gcroci2 changed the title refactor: Inherit information from training set for valid/test sets refactor: inherit information from training set for valid/test sets Jul 4, 2023
@gcroci2

gcroci2 commented Jul 11, 2023

Apart from inline comments, shouldn't def len(self) in dataset.py be def __len__(self)?

This is also not in the scope of this PR; let's try not to make it even bigger and with more scopes than it already has ;) (even if it would be a minor change).

About the len method, DeeprankDataset inherits from torch_geometric.data.dataset.Dataset and overrides the Dataset's len method.

What usually happens under the hood is that the built-in len calls the magic method __len__, but I am not sure if this applies only to Python's built-in len (which is built into Python, it's not a PyTorch thing) or to PyTorch's len method as well; it would be interesting to know (I couldn't find a satisfying answer googling it a bit).

For the scope of the dataset module, len is the correct one to use.
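The delegation being discussed can be shown with a minimal sketch; this is an assumed, simplified version of the PyTorch Geometric pattern, not its actual source, where the base class defines the magic __len__ once and subclasses only override a plain len() method:

```python
class BaseDataset:
    def len(self) -> int:          # overridden by subclasses
        raise NotImplementedError

    def __len__(self) -> int:      # what the built-in len() actually calls
        return self.len()

class ToyDataset(BaseDataset):
    def __init__(self, items):
        self.items = items

    def len(self) -> int:          # the override, as DeeprankDataset does
        return len(self.items)

print(len(ToyDataset([1, 2, 3])))  # → 3
```

So calling the built-in len() on a dataset dispatches to __len__, which in this pattern forwards to the overridable len() method, which is why overriding len is the correct move for the dataset module.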


joyceljy commented Jul 11, 2023

I found out at last that whether the dataset_train parameter is a Graph/GridDataset has already been checked in the dataset.py file (see lines 490 & 725). So _check_dataset_equivalence will only check the following things:

  1. If the train dataset is not set (this happens when a pretrained_model is provided), Trainer should have a test dataset assigned.
  2. If the train dataset is set, check whether its type is a Graph/GridDataset.
  3. Check that the train parameter in the val/test datasets is set to False.
  4. Check that the dataset_train parameter is equivalent to the train dataset passed to Trainer.

The test test_dataset_equivalence_pretrained in test_trainer.py checks the first point above, while the test test_dataset_equivalence_no_pretrained, also in test_trainer.py, checks points 2-4.

Besides, I removed the test test_grid_graph_incompatible from test_trainer.py since the checking of the dataset type is actually done in dataset.py. Instead, a new test named test_incompatible_dataset_train_type was added to test_dataset.py.

@joyceljy joyceljy requested a review from gcroci2 July 11, 2023 16:11

@DaniBodor DaniBodor left a comment


As per some of my inline comments, it's a bit weird to first set train and dataset_train in the Dataset classes, but then have no flexibility on this when setting up the Trainer. I think it would be ideal if the inheritance happened only when the Trainer is instantiated, and until then the Dataset objects were ignorant of training status.

Also, note that the parameters target_transform, tqdm, and root in both Dataset types are not Optional, i.e. they must be set and cannot be None.
I realize this is just outside the scope of this PR, but it is closely related, very easy to solve, and would likely lead to merge conflicts if changed elsewhere, so I think it makes sense to change it here.

Comment on lines 265 to 267
if dataset_val.dataset_train != dataset_train:
raise ValueError("valid dataset has different dataset_train parameter compared to the one given in Trainer.\n" +
"Make sure to assign equivalent dataset_train in Trainer")
Collaborator


It seems a bit redundant to assign dataset_train twice: once when creating the Dataset objects and once when creating the Trainer.
Would it be an idea to drop the dataset_train argument in the Dataset classes and move the _inherit_params function to Trainer? Or is this too much work after having already set it up differently now?

Collaborator


I still have to review it, but it shouldn't be assigned twice; the second time, in the Trainer, should be a check only.

# Check train parameter in test is set as False.
if dataset_test.train is not False:
raise ValueError(f"""test dataset has train parameter {dataset_test.train}
Make sure to set it as False""")
Collaborator


Similar comment to the one above. If it is not allowed to be set to anything other than False, why ask for it to be set? Potentially we can drop both arguments in the Dataset classes and deal with them in the Trainer.
This would also allow users to assign Dataset objects that are neutral in terms of train/valid/test type, and then do parallel experiments where each is used for each type once.

Also, it makes the flow less error-prone (from the perspective of a user setting things up incorrectly and trying to figure out what's wrong).


@gcroci2 gcroci2 Jul 12, 2023


The issue would be about standardization, which is the reason why train and dataset_train were created in the first place. In the beginning, they were meant for retaining and using the means and stds from the training set when creating testing/validation sets. I think that such a distinction should happen when the sets are defined, which is when we instantiate the different GraphDataset objects.

I still have to finish the new review of this PR, but I think the least redundant solution could be the following (maybe already in place): create the sets and use the train/dataset_train parameters in the DeeprankDataset children, and then in the Trainer only verify that what is supposed to be train, valid, and test has the right settings to be so. Otherwise, signal to the user that the dataset configurations are wrong.

Collaborator


I see the point, and I think you are right. I also think that this is the way it is set up now, so my comments can be disregarded.

@DaniBodor

what a monster PR this has turned out to be 🤦 👹

@gcroci2

gcroci2 commented Jul 18, 2023

I left minor comments around, we will discuss this tomorrow but from my side I think it's fine as it is :)

@joyceljy joyceljy requested a review from gcroci2 July 19, 2023 12:56
@joyceljy

I also changed the description of this PR to give an overview of what it has solved!


@gcroci2 gcroci2 left a comment


Great job :-) and finally over! Please address the two very minor comments that I left, then feel free to merge it and close the issue!


@DaniBodor DaniBodor left a comment


I won't review the newest changes anymore and assume that @gcroci2's final review suffices. I will accept now, so that you have permission to merge.
Great job tackling this PR!!

One final comment, super unimportant but very quick to solve: I noticed that you (accidentally?) sometimes added trailing whitespace to some lines. I recently discovered a way to make VS Code automatically trim trailing whitespace (in File > Preferences > Settings, search for and tick "Trim Trailing Whitespace").
Also, you can auto-replace all trailing whitespace with a couple of clicks as described here

@joyceljy joyceljy merged commit b152a4b into main Jul 20, 2023
@joyceljy joyceljy deleted the 444_InheritGraphDatasetInfo_joyceljy branch July 20, 2023 15:46
Development

Successfully merging this pull request may close these issues.

Inherit information from training data when train = False in GraphDataset class
3 participants