refactor: inherit information from training set for valid/test sets #446
Conversation
deeprankcore/dataset.py
Outdated
```python
self.features_transform = None

# set features_transform as the same in dataset_train
self.features_transform = dataset_train.features_transform
```
The if statement should stay, since it may be that `dataset_train.features_transform` is `None`; in that case, we want `self.features_transform` to be `None` as well. In these 4 lines, just keep the old code, which was already doing what we want to address here for the other attributes.
Nice :) thanks!
In the log warning about the attributes already set in the train dataset, I'd also add the value of the attribute itself, like:

```python
_log.warning("edge_features parameter, if set, will be ignored." + "\n" +
             "edge_features will remain the same as the one used in training phase: %s.", edge_features)
```

And I would also add a test to verify that even when we set different attributes in the training and validation sets, for example, in the validation set they are overwritten with the ones of the training dataset.
It would be nice to update the docstring to reflect that these parameters should only be set for the training dataset. For that matter, I think that the `dataset_train` argument should precede the ones that are inherited, and that is where the docstring can then state which parameters will be ignored when `train` is set to False.
Also, I'd consider making `dataset_train` a required parameter, but I'm not sure about that.
deeprankcore/dataset.py
Outdated
```python
_log.warning("task parameter, if set, will be ignored." + "\n" +
             "task will remain the same as the one used in training phase: %s.", self.task)
```
I feel that we only need to add these warnings (or maybe even throw errors?) in case a user set the parameters manually and they differ from the inherited ones.
If that doesn't happen, I don't see any reason why this needs to go into the log or warnings or anywhere; it is the default expected behavior of the software.
One way to do this is to loop through the different inherited settings, check for each whether it matches the user-set one, and deal with it accordingly.
Also, I personally very much prefer f-strings over %-formatting (plus I think it's the recommended way now).
I don't think I fully understand what you mentioned about `dataset_train`. Do you mean to move its position in the definition before the parameters that need to be inherited?
Besides, I modified the docstring to state the parameters that will be ignored and inherited.
PS: I changed the type of this PR, because it's not actually a bug fix, just a nicer way of dealing with things :)
deeprankcore/dataset.py
Outdated
```
dataset_train (class:`GraphDataset`, optional): If train is True, assign here the training set,
    the following parameters target, task, node_features, edge_features and features_transform will be set
    as the same value in dataset_train.

    Defaults to None.
```
I think I didn't explain it well in my previous comment.
Right now, it's still a bit confusing, especially because the `train` parameter is set later.
I think it makes sense to first set `train` to True or False, and in the description of that argument state: if `True`, then set these arguments, and if `False`, set the other arguments. Then, in the description of the other arguments, state that they will be ignored if the `train` setting is False/True.
Thanks for the explanation :)
True, the order of the definition was quite confusing. I adjusted the order of the parameters and the description in the docstring a bit; now it should be more readable.
Yes, much clearer. Thanks.
Now only the `train` argument in the actual initialization should match its location in the docstring (so line 574 also needs to move above line 562). Then I think it's all done 👍
This is also not in the scope of this PR; let's try not to make it even bigger, with more scopes than it already has ;) (even if it would be a minor change). About the `len` method: `DeeprankDataset` inherits from [...]. What usually happens under the hood is that [...]. For the scope of the dataset module, [...]
I found out at last that whether the type of [...]. The test [...]. Besides, I removed the test [...].
As per some of my inline comments, it's a bit weird to first set `train` and `dataset_train` in the `Dataset` classes, but then have no flexibility on this when setting up the `Trainer`. I think it would be ideal if the inheritance happened only when the `Trainer` is instantiated, and until then the `Dataset` objects were ignorant of training status.
Also, note that the parameters `target_transform`, `tqdm` and `root` in both `Dataset` types are non-`Optional`, and must be set (i.e., they cannot be None).
I realize this is just outside of the scope of this PR, but it is closely related, very easy to solve, and would likely lead to merge conflicts if changed elsewhere, so I think it makes sense to change it here.
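The typing distinction being pointed at is the difference between these two kinds of signatures (illustrative function names only, not the actual dataset API):

```python
from typing import Optional, get_type_hints


def configure(tqdm: bool = True, root: str = "./") -> dict:
    """Non-Optional hints: the arguments may be omitted, but None is not a valid value."""
    return {"tqdm": tqdm, "root": root}


def configure_opt(tqdm: Optional[bool] = None, root: Optional[str] = None) -> dict:
    """Optional hints: None is an accepted 'not set, inherit it later' value."""
    return {"tqdm": tqdm, "root": root}


hints = get_type_hints(configure)
```

With the first form, static checkers flag `configure(tqdm=None)` as an error; the second form explicitly advertises None as a sentinel.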
deeprankcore/trainer.py
Outdated
```python
if dataset_val.dataset_train != dataset_train:
    raise ValueError("valid dataset has different dataset_train parameter compared to the one given in Trainer.\n" +
                     "Make sure to assign equivalent dataset_train in Trainer")
```
It seems a bit redundant to assign the dataset_train twice: once when creating the `Dataset` objects and once when creating the `Trainer`.
Would it be an idea to drop the `dataset_train` argument in the `Dataset` classes and then move the `_inherit_params` function to `Trainer`? Or is this too much work after having already set it up differently now?
I still have to review it, but it shouldn't be assigned twice; the second time, in the Trainer, it should be a check only.
deeprankcore/trainer.py
Outdated
```python
# Check train parameter in test is set as False.
if dataset_test.train is not False:
    raise ValueError(f"""test dataset has train parameter {dataset_test.train}
        Make sure to set it as False""")
```
Similar comment to the above. If it is not allowed to be set to anything other than False, why ask for it to be set at all? Potentially we can drop both arguments in the `Dataset` classes and deal with them in `Trainer`.
This would also allow users to assign `Dataset` objects that are neutral in terms of train/valid/test type, and then do parallel experiments where each is used for each type once.
It also makes the flow less error-prone (from the perspective of a user setting things up incorrectly and trying to figure out what's wrong).
The issue is about standardization, which is the reason `train` and `dataset_train` were created in the first place: they were originally meant for retaining and reusing the means and standard deviations of the training set when creating testing/validation sets. I think that such a distinction should happen when the sets are defined, which is when we instantiate the different GraphDataset objects.
I still have to finish the new review of this PR, but I think the least redundant solution could be the following (maybe already in place): create the sets and use the train/dataset_train parameters in the DeeprankDataset children, and then, in the Trainer, only verify that what is supposed to be train, valid, and test has the right settings to be so. Otherwise, signal to the user that the dataset configurations are wrong.
I see the point, and I think you are right. I also think that is the way it is set up now, so my comments can be disregarded.
what a monster PR this has turned out to be 🤦 👹
I left minor comments around; we will discuss this tomorrow, but from my side I think it's fine as it is :)
I also changed the description of this PR to give an overview of what this PR has solved!
Great job :-) and finally over! Please address the two very minor comments that I left, then feel free to merge it and close the issue!
I won't review the newest changes anymore and assume that @gcroci2's final review suffices. So I will accept now, so that you have permission to merge.
Great job tackling this PR!!
One final comment I have, super unimportant but very quick to solve: I noticed that you (accidentally?) sometimes added trailing whitespace to some of the lines. I recently discovered a way to make VS Code automatically trim trailing whitespace (in File > Preferences > Settings, search for and tick "Trim trailing whitespace").
Also, you can auto-replace all trailing whitespace with a couple of clicks as described here.
**`dataset.py` & `test_dataset.py`**
- Add `_check_inherited_param`. For validation and test datasets (while `train` is set as `False`), `_check_inherited_param` is performed and the following information will be inherited from the training set, with a warning message: `target`, `task`, `node_features`, `edge_features`, `features_dict`, `target_transform` and `classes`.
- Change the `GraphDataset` & `GridDataset` object parameters to the following: `hdf5_path, subset, train, dataset_train, node_features, edge_features, features_transform, clustering_method, target, target_transform, target_filter, task, classes, tqdm, root, check_integrity`.
- Change the `GraphDataset` & `GridDataset` parameters order, and remove the parameters that will be inherited from the training set.
- Add `test_inherit_info_training_griddataset` & `test_inherit_info_training_griddataset` for checking whether the above parameters are inherited.
- Add `test_incompatible_dataset_train_type` for checking the compatibility of assigning different types of data objects for the training, validation, and testing datasets.

**`trainer.py` & `test_trainer.py`**
- In `_check_dataset_equivalence`, remove the for loop checking the parameters of the dataset (because this information should have been inherited directly).
- Add `_check_dataset_value` for checking the settings for the valid/test dataset.
- Change `test_dataset_equivalence_no_pretrained` & `test_dataset_equivalence_pretrained` to match the new `trainer.py`.
- Remove `test_grid_graph_incompatible`; this test is related to the dataset and should be done in `test_dataset.py`.

**Tutorials/README/other files**
- Change the description of the `GraphDataset` & `GridDataset` objects' parameters order, and remove the parameters that will be inherited from the training set.