Enable logger connector re-design #7891
Conversation
Hello @carmocca! Thanks for updating this PR.
Comment last updated at 2021-06-09 12:31:58 UTC
Codecov Report
@@           Coverage Diff            @@
##           master   #7891    +/-   ##
========================================
- Coverage      91%     88%      -3%
========================================
  Files         204     204
  Lines       13669   13667       -2
========================================
- Hits        12445   12009     -436
- Misses       1224    1658     +434
Amazing job!
pytorch_lightning/core/lightning.py (Outdated)

    f" of {list(self._metric_attributes.values())}"
)

value = apply_to_collection(value, numbers.Number, self.__to_float)
Should we preserve the logged type?
So not converting to a float tensor but just wrapping it in a tensor?
We can, but I don't think this matters as ResultMetric.update will convert it to float anyway.
edit: changed to __to_tensor
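For context, a minimal sketch of what that conversion step does, using Lightning's apply_to_collection utility; the _to_tensor helper here is just an illustrative stand-in for the private __to_tensor method, not the actual implementation:

import numbers

import torch
from pytorch_lightning.utilities.apply_func import apply_to_collection

def _to_tensor(value: numbers.Number) -> torch.Tensor:
    # Wrap plain Python numbers in tensors instead of converting them to float;
    # existing tensors pass through untouched since they are not numbers.Number.
    return torch.tensor(value)

logged = {"acc": 0.91, "count": 7}
converted = apply_to_collection(logged, numbers.Number, _to_tensor)
print(converted)  # {'acc': tensor(0.9100), 'count': tensor(7)}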
@@ -126,6 +142,7 @@ def setup(self, max_batches: List[Union[int, float]], dataloaders: List[DataLoad
        self.num_dataloaders = self._get_num_dataloaders(dataloaders)

    def on_evaluation_epoch_start(self, *args: Any, **kwargs: Any) -> None:
        self.trainer.logger_connector.on_epoch_start()
Suggested change:
-        self.trainer.logger_connector.on_epoch_start()
+        # update ResultCollection.
+        self.trainer.logger_connector.on_epoch_start()
This doesn't update the ResultCollection, just sets a flag.
I don't think that comment is useful.
    def on_evaluation_batch_start(self, batch: Any, batch_idx: int, dataloader_idx: int) -> None:
        self.trainer.logger_connector.on_batch_start()
        # FIXME(@carmocca): missing hook?
        # self.trainer.call_hook('on_batch_start')
It's missing on purpose. I thought we decided not to run on_batch_start/end across train/val/predict; it's only for training.
cc @ananthsub
Where did we discuss this? 😂
It's weird then, because we do run on_epoch_{start,end} for them.
This is unrelated to the PR though; we can remove the FIXME and address it again in #7738.
I initially suggested it in this issue a long time ago: #1440
Then later it came up in a Slack message thread. @ananthsub had an argument against it, which I don't remember, so we decided not to do it.
And also because it would be impossible to make backward compatible: the docs say it runs for training.
LGTM, small comment
    @property
    def active_loop(self) -> Optional[Union[TrainLoop, EvaluationLoop]]:
        if self.training:
            return self.train_loop
        elif self.sanity_checking or self.evaluating:
            return self.evaluation_loop
why does this need to be exposed as a property? doesn't this leak an implementation detail? what if someone accesses the active_loop and then modifies properties on it?
So we want to access the current ResultCollection object from the logger connector, and each Loop has its own ResultCollection. So we need this to get the current running loop.
I guess we can make this property protected to discourage external modifications.
cc: @awaelchli
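For illustration only, a hypothetical sketch of how the logger connector could use such a (now protected) property to reach the active loop's ResultCollection; the names _active_loop and _results here are assumptions, not the actual implementation:

from typing import Optional

class LoggerConnectorSketch:
    """Hypothetical: resolves the ResultCollection of whichever loop is currently running."""

    def __init__(self, trainer) -> None:
        self.trainer = trainer

    @property
    def _results(self) -> Optional["ResultCollection"]:
        loop = self.trainer._active_loop  # assumed protected property, as suggested above
        return loop._results if loop is not None else None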
do you want me to directly do it now in the new loops? should be no problem.
Did it already with ab28850
so this was not about the results in loops?
Could someone explain what the batch_size parameter is for?
    batch_size: Current batch_size. This will be directly inferred from the loaded batch,
        but some data structures might need to explicitly provide it.
Here are the docs for batch size, is this what you mean?
Yeah, but what I'm missing is the why, e.g. what is that parameter used for (inferred or otherwise)?
To compute the correct average when we ask self.log to average the metric on epoch end.
It has to be weighted by the batch size because often the last batch does not have the same size as the others: the dataset is not guaranteed to be divisible by the batch size, and drop_last in the PyTorch DataLoader is False by default.
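A quick sketch with made-up numbers of why the weighting matters:

# Per-batch means logged in *_step, with a smaller final batch (drop_last=False).
batch_means = [0.5, 0.4, 0.7]
batch_sizes = [32, 32, 8]

weighted = sum(m * n for m, n in zip(batch_means, batch_sizes)) / sum(batch_sizes)
naive = sum(batch_means) / len(batch_means)

print(weighted)  # ~0.478, the true mean over all 72 samples
print(naive)     # ~0.533, biased towards the small last batch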
In the ResultMetric in result.py you will find the line:
self.cumulated_batch_size += batch_size
and the cumulated_batch_size is then used in the compute() method.
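Roughly, that mechanism looks like the following sketch; this is a simplified stand-in, not the actual ResultMetric code:

import torch

class WeightedMeanSketch:
    """Accumulates a batch-size-weighted mean, mirroring the cumulated_batch_size idea."""

    def __init__(self) -> None:
        self.value = torch.tensor(0.0)
        self.cumulated_batch_size = torch.tensor(0.0)

    def update(self, value: torch.Tensor, batch_size: int) -> None:
        self.value += value * batch_size
        self.cumulated_batch_size += batch_size

    def compute(self) -> torch.Tensor:
        return self.value / self.cumulated_batch_size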
ok, that's what I thought it was for, but I couldn't find where in the code it's actually doing that.
So does this mean that my _step should log a scalar which is the mean of the current batch, and PL will correctly average it (including across DDP processes) by multiplying with the batch size, summing, then dividing by the dataset size?
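For reference, a minimal sketch of the logging pattern that question describes; the module itself is made up, only the self.log call matters:

import torch
import pytorch_lightning as pl

class LitSketch(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)  # mean over the current batch
        # on_epoch=True reduces these scalars to a batch-size-weighted mean at epoch end;
        # sync_dist=True additionally syncs the reduction across DDP processes.
        self.log("val_loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return loss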
    result.track_batch_size(len(split_batch))

    # track metrics without grads for epoch reduction
    training_step_output_for_epoch_end = copy(result)
the removal of this line could possibly be the cause of #8613
What does this PR do?
- Integrate the logger connector re-design with the loops
- Fix tests
- Remove legacy tests

Part of #7631
pseudo-benchmark:

MASTER
- ParityModuleMNIST
  - time: 0:01:14.02
  - profiler (training_step): 19.160 seconds
- HeavyLoggingBoringModel
  - memory: 18.3 MiB
  - time: 0:00:08.62
  - profiler (training_step): 3.298 seconds

Logging PoC
- ParityModuleMNIST
  - time: 0:01:12.17
  - profiler (training_step): 20.911 seconds
- HeavyLoggingBoringModel
  - memory: 70.3 KiB
  - time: 0:00:06.86
  - profiler (training_step): 7.130 seconds
Recap: training_step takes longer as self.log does more now.

Before submitting
PR review