
Enable logger connector re-design #7891

Merged: carmocca merged 22 commits into master from refactor/use-new-logger-connector on Jun 9, 2021

Conversation

@carmocca (Contributor) commented Jun 9, 2021

What does this PR do?

  • Integrate the logger connector re-design with the loops
  • Fix tests
  • Remove legacy tests

Part of #7631

pseudo-benchmark (MASTER vs. Logging PoC):

                                 MASTER        Logging PoC
ParityModuleMNIST
  time                           0:01:14.02    0:01:12.17
  profiler (training_step)       19.160 s      20.911 s

HeavyLoggingBoringModel
  memory                         18.3 MiB      70.3 KiB
  time                           0:00:08.62    0:00:06.86
  profiler (training_step)       3.298 s       7.130 s

Code
import gc
import io
import pstats
import tracemalloc
from datetime import datetime

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.profiler import AdvancedProfiler
from tests import PATH_DATASETS
from tests.helpers import BoringModel, RandomDataset
from tests.helpers.datasets import MNIST


def collect_stats():
    # Print the three largest allocation sites tracked by tracemalloc.
    gc.collect()
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:3]:
        print(stat)


class HeavyLoggingBoringModel(BoringModel):
    def __init__(self, memory=False):
        super().__init__()
        self.memory = memory

    def on_fit_start(self):
        if self.memory:
            tracemalloc.start(10)

    def on_fit_end(self):
        if self.memory:
            tracemalloc.stop()

    def training_step(self, batch, batch_idx):
        if self.memory and batch_idx % 50 == 49:
            collect_stats()

        loss = super().training_step(batch, batch_idx)["loss"]

        output_dict = {f"loss_{i}": loss for i in range(200)}
        self.log_dict(output_dict, on_step=True, on_epoch=True, prog_bar=True)

        return loss

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 500))


class ParityModuleMNIST(LightningModule):
    def __init__(self, memory=False):
        super().__init__()
        self.memory = memory
        self.c_d1 = nn.Linear(in_features=28 * 28, out_features=128)
        self.c_d1_bn = nn.BatchNorm1d(128)
        self.c_d1_drop = nn.Dropout(0.3)
        self.c_d2 = nn.Linear(in_features=128, out_features=10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.c_d1(x)
        x = torch.tanh(x)
        x = self.c_d1_bn(x)
        x = self.c_d1_drop(x)
        x = self.c_d2(x)
        return x

    def on_fit_start(self):
        if self.memory:
            tracemalloc.start(10)

    def on_fit_end(self):
        if self.memory:
            tracemalloc.stop()

    def training_step(self, batch, batch_idx):
        if self.memory and batch_idx % 50 == 49:
            collect_stats()

        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        return DataLoader(MNIST(root=PATH_DATASETS, train=True, download=True), batch_size=2)


class MyAdvancedProfiler(AdvancedProfiler):
    # Restrict the profiler summary to the training_step action only.
    def summary(self) -> str:
        recorded_stats = {}
        for action_name, pr in self.profiled_actions.items():
            if action_name != "training_step":
                continue
            s = io.StringIO()
            ps = pstats.Stats(pr, stream=s).strip_dirs().sort_stats("tottime")
            ps.print_stats(20)
            recorded_stats[action_name] = s.getvalue()
        return self._stats_to_str(recorded_stats)


def run(model_cls, test):
    print(f"{'=' * 30}\n {test} - {model_cls.__name__}\n{'=' * 30}")

    if test == "memory":
        trainer = Trainer(max_epochs=1, logger=False, progress_bar_refresh_rate=50, weights_summary=None)
        model = model_cls(memory=True)
        trainer.fit(model)

    elif test == "time":
        trainer = Trainer(max_epochs=1, logger=False, progress_bar_refresh_rate=50, weights_summary=None)
        model = model_cls()
        start = datetime.now()
        trainer.fit(model)
        end = datetime.now()
        print("Time: ", end - start)

    elif test == "profiler":
        trainer = Trainer(
            max_epochs=1,
            logger=False,
            profiler=MyAdvancedProfiler(),
            progress_bar_refresh_rate=50,
            weights_summary=None,
        )
        model = model_cls()
        trainer.fit(model)


if __name__ == "__main__":
    for model_cls in (ParityModuleMNIST, HeavyLoggingBoringModel):
        for test in ("memory", "time", "profiler"):
            if model_cls is ParityModuleMNIST and test == "memory":
                continue
            run(model_cls, test)

Recap:

  • Negligible speed difference when using a real (medium-large+) model
  • training_step itself takes longer because self.log does more work now
  • Memory accumulation is fixed (18.3 MiB → 70.3 KiB for HeavyLoggingBoringModel)
  • Overall runtime decreases thanks to faster aggregation

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca carmocca added the feature Is an improvement or enhancement label Jun 9, 2021
@carmocca carmocca added this to the v1.4 milestone Jun 9, 2021
@carmocca carmocca self-assigned this Jun 9, 2021
@pep8speaks commented Jun 9, 2021

Hello @carmocca! Thanks for updating this PR.

Line 212:17: W503 line break before binary operator

Comment last updated at 2021-06-09 12:31:58 UTC

@codecov bot commented Jun 9, 2021

Codecov Report

Merging #7891 (d53fe03) into master (6fee926) will decrease coverage by 3%.
The diff coverage is 98%.

@@           Coverage Diff           @@
##           master   #7891    +/-   ##
=======================================
- Coverage      91%     88%    -3%     
=======================================
  Files         204     204            
  Lines       13669   13667     -2     
=======================================
- Hits        12445   12009   -436     
- Misses       1224    1658   +434     

@tchaton (Contributor) left a comment

Amazing job!

pytorch_lightning/core/lightning.py (outdated, resolved)
f" of {list(self._metric_attributes.values())}"
)

value = apply_to_collection(value, numbers.Number, self.__to_float)
Contributor:

Should we preserve the logged type?

@carmocca (Contributor Author) Jun 9, 2021:
So not converting to float tensor but just wrapping it in tensor?

We can, but I don't think this matters, as ResultMetric.update will convert it to float anyway.

edit: changed to __to_tensor
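
For context, here is a minimal sketch of what wrapping logged numbers into tensors can look like, using Lightning's apply_to_collection utility (import path as of Lightning 1.x). The converter below is a made-up stand-in for the private helper discussed above, not the actual implementation:

import numbers

import torch

from pytorch_lightning.utilities.apply_func import apply_to_collection


def to_tensor(value: numbers.Number) -> torch.Tensor:
    # Wrap the raw Python number in a tensor without forcing a float cast,
    # so e.g. ints stay ints until the metric itself converts them.
    return torch.tensor(value)


logged = {"acc": 0.91, "num_samples": 32}
logged = apply_to_collection(logged, numbers.Number, to_tensor)
print(logged)  # {'acc': tensor(0.9100), 'num_samples': tensor(32)}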

pytorch_lightning/core/lightning.py (outdated, resolved)
pytorch_lightning/trainer/evaluation_loop.py (resolved)
@@ -126,6 +142,7 @@ def setup(self, max_batches: List[Union[int, float]], dataloaders: List[DataLoad
        self.num_dataloaders = self._get_num_dataloaders(dataloaders)

    def on_evaluation_epoch_start(self, *args: Any, **kwargs: Any) -> None:
        self.trainer.logger_connector.on_epoch_start()
Contributor:

Suggested change:
- self.trainer.logger_connector.on_epoch_start()
+ # update ResultCollection.
+ self.trainer.logger_connector.on_epoch_start()

@carmocca (Contributor Author):

This doesn't update the ResultCollection; it just sets a flag.

I don't think that comment is useful.

pytorch_lightning/trainer/evaluation_loop.py (resolved)
pytorch_lightning/trainer/trainer.py (resolved)
pytorch_lightning/trainer/training_loop.py (outdated, resolved)
pytorch_lightning/trainer/training_loop.py (outdated, resolved)
pytorch_lightning/trainer/training_loop.py (resolved)
@mergify mergify bot added the has conflicts label Jun 9, 2021
pytorch_lightning/core/lightning.py (outdated, resolved)
def on_evaluation_batch_start(self, batch: Any, batch_idx: int, dataloader_idx: int) -> None:
    self.trainer.logger_connector.on_batch_start()
    # FIXME(@carmocca): missing hook?
    # self.trainer.call_hook('on_batch_start')
Contributor:

It's missing on purpose. I thought we decided not to run on_batch_start/end across all of train/val/predict; it runs only for training.

cc @ananthsub

@carmocca (Contributor Author):

Where did we discuss this? 😂
It's weird then because we do run on_epoch_{start,end} for them.

This is unrelated to this PR though; I can remove the FIXME and address it again in #7738.

@awaelchli (Contributor) Jun 9, 2021:

I initially suggested it in this issue a long time ago: #1440
It later came up again in a Slack thread. @ananthsub had an argument against it, which I don't remember, so we decided not to do it.
It would also be impossible to make backward compatible: the docs say the hook runs for training.

pytorch_lightning/trainer/evaluation_loop.py (outdated, resolved)
@ethanwharris (Member) left a comment

LGTM, small comment

pytorch_lightning/core/lightning.py (outdated, resolved)
pytorch_lightning/core/lightning.py (outdated, resolved)
Comment on lines 486 to 492
@property
def active_loop(self) -> Optional[Union[TrainLoop, EvaluationLoop]]:
    if self.training:
        return self.train_loop
    elif self.sanity_checking or self.evaluating:
        return self.evaluation_loop

Contributor:

Why does this need to be exposed as a property? Doesn't this leak an implementation detail? What if someone accesses the active_loop and then modifies properties on it?

@carmocca (Contributor Author):

We want to access the current ResultCollection object from the logger connector, and each Loop has its own ResultCollection, so we need this to get the currently running loop.

I guess we can make this property protected to discourage external modifications.

cc: @awaelchli
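
For illustration, here is a minimal, self-contained sketch of the pattern being discussed. These are toy classes, not the actual Trainer/LoggerConnector code, and the protected name _active_loop is the variant suggested above:

from typing import Optional


class _Loop:
    def __init__(self) -> None:
        self.results = {}  # stand-in for the loop's own ResultCollection


class Trainer:
    def __init__(self) -> None:
        self.training = False
        self.sanity_checking = False
        self.evaluating = False
        self.train_loop = _Loop()
        self.evaluation_loop = _Loop()

    @property
    def _active_loop(self) -> Optional[_Loop]:
        # Protected: external code is discouraged from grabbing and mutating it.
        if self.training:
            return self.train_loop
        if self.sanity_checking or self.evaluating:
            return self.evaluation_loop
        return None


class LoggerConnector:
    def __init__(self, trainer: Trainer) -> None:
        self.trainer = trainer

    @property
    def results(self) -> Optional[dict]:
        # The connector always reads metrics from whichever loop is running.
        loop = self.trainer._active_loop
        return loop.results if loop is not None else None


trainer = Trainer()
trainer.training = True
print(LoggerConnector(trainer).results)  # {}  (the train loop's collection)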

Contributor:

Do you want me to do it directly in the new loops now? It should be no problem.

@carmocca (Contributor Author):

Did it already with ab28850

Contributor:

so this was not about the results in loops?

pytorch_lightning/trainer/training_loop.py (outdated, resolved)
@mergify mergify bot removed the has conflicts label Jun 9, 2021
@awaelchli awaelchli mentioned this pull request Jun 9, 2021
CHANGELOG.md (outdated, resolved)
@awaelchli awaelchli added the logging Related to the `LoggerConnector` and `log()` label Jun 9, 2021
@carmocca carmocca enabled auto-merge (squash) June 9, 2021 12:11
@mergify mergify bot added the has conflicts label Jun 9, 2021
@mergify mergify bot removed the has conflicts label Jun 9, 2021
@carmocca carmocca merged commit ec4f885 into master Jun 9, 2021
@carmocca carmocca deleted the refactor/use-new-logger-connector branch June 9, 2021 14:24
@Queuecumber (Contributor):

Could someone explain what the batch_size parameter does? I don't see it being used in the code anywhere and the docs don't explain it.

Comment on lines +309 to +310
batch_size: Current batch_size. This will be directly inferred from the loaded batch,
but some data structures might need to explicitly provide it.
Contributor:

Here are the docs for batch size, is this what you mean?

Contributor:

Yeah, but what I'm missing is the why: what is that parameter used for (inferred or otherwise)?

@awaelchli (Contributor) Jun 10, 2021:

It's used to compute the correct average when we ask self.log to average the metric at epoch end.

It has to be weighted by the batch size because the last batch often does not have the same size as the others:
the dataset size is not guaranteed to be divisible by the batch size, and drop_last in the PyTorch DataLoader is False by default.

Contributor:

In the ResultMetric in result.py you will find the line:

self.cumulated_batch_size += batch_size
and the cumulated_batch_size is then used in the compute() method.
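
For illustration, here is a minimal, self-contained sketch of the weighting idea (a toy class, not Lightning's actual ResultMetric):

import torch


class WeightedMean:
    def __init__(self) -> None:
        self.value = torch.tensor(0.0)
        self.cumulated_batch_size = torch.tensor(0.0)

    def update(self, batch_mean: torch.Tensor, batch_size: int) -> None:
        # Weight each logged per-batch mean by the number of samples in the batch.
        self.value += batch_mean * batch_size
        self.cumulated_batch_size += batch_size

    def compute(self) -> torch.Tensor:
        # Divide by the total sample count so a smaller last batch does not skew the average.
        return self.value / self.cumulated_batch_size


metric = WeightedMean()
metric.update(torch.tensor(0.5), batch_size=32)
metric.update(torch.tensor(0.7), batch_size=8)  # smaller final batch
print(metric.compute())  # tensor(0.5400), not the unweighted mean of 0.6
# Under DDP, the per-process weighted sums and cumulated batch sizes would
# additionally be summed across processes before the final division.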

Contributor:

OK, that's what I thought it was for, but I couldn't find where in the code it actually does that.

So does this mean that my _step should log a scalar which is the mean of the current batch and PL will correctly average (including across DDP processes) by multiplying with the batch size, summing, then dividing by the dataset size?

@carmocca (Contributor Author):

result.track_batch_size(len(split_batch))

# track metrics without grads for epoch reduction
training_step_output_for_epoch_end = copy(result)
Contributor:

The removal of this line could possibly be the cause of #8613.

@mergify mergify bot added the ready PRs ready to be merged label Jul 29, 2021
Labels: feature (Is an improvement or enhancement), logging (Related to the `LoggerConnector` and `log()`), ready (PRs ready to be merged)
Projects: none yet
7 participants