Added accumulation of loggers' metrics for the same steps #1278
Conversation
Hello @alexeykarnachev! Thanks for updating this PR.
Comment last updated at 2020-04-07 16:06:12 UTC
Codecov Report
@@ Coverage Diff @@
## master #1278 +/- ##
======================================
- Coverage 92% 91% -0%
======================================
Files 63 64 +1
Lines 3332 3369 +37
======================================
+ Hits 3059 3079 +20
- Misses 273 290 +17
Oh, I did a merge instead of a rebase, sorry.
it is not very easy to follow the intended usage, since the new functions are not used anywhere...
This pull request is now in conflict... :(
I like the way of adding abstraction with agg_and_log_metrics, but I feel that the logger usage is too rigid...
cc: @PyTorchLightning/core-contributors @justusschock ^^
This pull request is now in conflict... :(
@justusschock I think it would be nice to help @alexeykarnachev grow some aggregation here which may later also be useful for metrics, rather than going back and forth... do you have any suggestion/idea?
@Borda @alexeykarnachev back when I wrote my own framework I solved it like this (of course, replace the numpy function with something appropriate / torch functions). Then each logger was given a dict of key → reduce_specifier like this:
and we had a fallback function (mean) for non-specified keys. For reference, this was our whole logger implementation (in some parts not the most beautiful code :D ):
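Since the code this comment originally referred to is not shown here, a minimal sketch of the described approach (a per-key reduce dict with a mean fallback; the function and parameter names are hypothetical, not the actual framework code):

```python
import numpy as np

def reduce_metrics(history, reduce_fns, default_fn=np.mean):
    """Reduce a list of per-step metric dicts with per-key reduce functions.

    ``reduce_fns`` maps a metric name to a callable like ``np.mean`` or
    ``max``; keys without an entry fall back to ``default_fn`` (mean here,
    matching the fallback described above).
    """
    merged = {}
    for metrics in history:
        for key, value in metrics.items():
            merged.setdefault(key, []).append(value)
    return {key: reduce_fns.get(key, default_fn)(values)
            for key, values in merged.items()}
```

For example, `reduce_metrics([{'loss': 1.0}, {'loss': 3.0}], {'loss': max})` picks the larger loss, while an unspecified key like `acc` would be averaged.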
@Borda @alexeykarnachev @justusschock can we get this merged soon so we can add it to 0.7.2?
not yet, we have some API disagreement...
planning to add all discussed fixes tomorrow |
Now, there are two arguments in the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykarnachev I made some updates, could you pls check it?
@alexeykarnachev we're preparing the 0.7.2 release for tomorrow. would be good to get this in there :)
Pls add a note to the changelog (extra line) about the refactoring / renaming of TensorRunningAccum
@alexeykarnachev GREAT job and sorry for overwriting some of your work...
>>> d2 = {'a': 1.1, 'b': 2.2, 'v': 1}
>>> d3 = {'a': 1.1, 'v': 2.3}
>>> dflt_func = min
>>> agg_funcs = {'a': np.mean, 'v': max}
won't numpy functions slow things down because everything needs to go to the cpu? we don't want to move things to the CPU for the user, ever haha. Every cpu call slows training down a ton
so we should rather have it as a Tensor...
@alexeykarnachev we may use the running accum, or make another one for full accumulation that extends every N steps and copies the existing values...
https://discuss.pytorch.org/t/dynamically-extend-the-tensor/39553/2
or https://discuss.pytorch.org/t/appending-to-a-tensor/2665/9
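The linked threads cover the usual append patterns; a minimal sketch of the "extend every N steps" idea, using numpy for a self-contained example (a torch version would preallocate with torch.empty and copy the same way; GrowingAccum is a hypothetical name, not Lightning's TensorRunningAccum):

```python
import numpy as np

class GrowingAccum:
    """Append-only accumulator that preallocates a buffer and doubles it
    when full, so most appends avoid any reallocation or copy."""

    def __init__(self, capacity=8):
        self._buf = np.empty(capacity)
        self._len = 0

    def append(self, value):
        if self._len == self._buf.shape[0]:
            # out of room: allocate a buffer twice as large and copy over
            new_buf = np.empty(self._buf.shape[0] * 2)
            new_buf[:self._len] = self._buf
            self._buf = new_buf
        self._buf[self._len] = value
        self._len += 1

    def values(self):
        # view of only the filled portion of the buffer
        return self._buf[:self._len]
```

This amortizes the cost of growth: appending n values triggers only O(log n) reallocations instead of one per append.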
so the structure will change from a list of dicts to a dict of tensors, but I'm not sure it will be much faster... it will also allow us to use aggregations implemented for Torch
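A sketch of that structure change (numpy standing in for torch tensors; dicts_to_arrays is an illustrative helper, not actual Lightning code):

```python
import numpy as np

def dicts_to_arrays(metric_dicts):
    """Convert a list of per-step metric dicts into a dict of arrays,
    so aggregations can run vectorized instead of per-entry in Python."""
    keys = {k for d in metric_dicts for k in d}
    # keys missing from a given step's dict are simply skipped
    return {k: np.array([d[k] for d in metric_dicts if k in d]) for k in keys}
```

With this layout an aggregation like `out['loss'].mean()` is a single vectorized call per key.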
Hmmm. I thought that at this point all values are already on the cpu.
Here we call the log procedure:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L74
And a few lines above this call, we transform the metrics to scalars:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f7622ebfca45abe7d8d34f2ee2070d6856e24646/pytorch_lightning/trainer/logging.py#L64
So, do we really need tensors here? Metrics which come to the LightningLoggerBase are already item()-ed.
@Borda, the TrainerLoggingMixin forces all metrics to be scalars, and it seems not very convenient to transform them back to tensors. Besides, as I see it, metrics may not be tensors at all. For example, I can pass a scalar-valued metric even from training_step, like so:
def training_step(self, batch, batch_idx):
    loss, logits, _ = self.forward(batch)
    # the learning rate read from the optimizer is a plain Python float
    lr = self.trainer.optimizers[0].param_groups[0]['lr']
    log = {'Loss/train': loss, 'Learning-Rate': lr}
    return {'loss': loss, 'log': log}
Here, the lr is a scalar: it is on the cpu and it is not a tensor.
What do you think about this?
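A logger receiving such mixed inputs could normalize them defensively; this to_scalar helper is purely illustrative, not part of Lightning:

```python
def to_scalar(value):
    """Best-effort conversion of a metric value to a plain Python float.

    Tensors (or anything exposing ``.item()``, e.g. numpy scalars) are
    converted; plain numbers pass through via ``float()``.
    """
    if hasattr(value, "item"):  # torch.Tensor / numpy scalar
        return float(value.item())
    return float(value)
```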
agree, let's get this done and think about speedup later... :]
ok perfect.
this might actually be the main cause of the minimal speed discrepancy between lightning and pure pytorch.
…ghtning-AI#1278)" This reverts commit ddbf7de.
…AI#1278)
* `add_argparse_args` method fixed (argument types added)
* autopep8 fixes
* --gpus=0 removed from test (for ci tests)
* Update pytorch_lightning/trainer/trainer.py Co-Authored-By: Joe Davison <joe@huggingface.co>
* test_with_accumulate_grad_batches added
* agg_and_log_metrics logic added to the base logger class
* small format fix
* agg metrics strategies removed (not to complicate stuff)
* agg metrics: handle zero step
* autopep8
* changelog upd
* flake fix
* metrics aggregators factored out, metrics_agg.py added + tests
* metrics agg default value added
* Update pytorch_lightning/loggers/metrics_agg.py Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* metrics aggregators factored out, metrics_agg.py added + tests
* metrics agg default value added
* Update pytorch_lightning/loggers/metrics_agg.py Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* remove .item which causes sync issues (Lightning-AI#1254)
* remove .item which causes sync issues
* fixed gradient acc sched
* fixed gradient acc sched
* test_metrics_agg.py removed (all tested in docstrings), agg metrics refactored
* test_metrics_agg.py removed (all tested in docstrings), agg metrics refactored
* autopep8
* loggers base.py types fixed
* test
* test
* metrics aggregation for loggers: each key now has a specific function (or default one)
* metrics aggregation for loggers: each key now has a specific function (or default one)
* docstrings upd
* manual typehints removed from docstrings
* batch_size decreased for test `test_with_accumulate_grad_batches`
* extend running accum
* refactor
* fix tests
* fix tests
* allowed_types generator scoped
* trainer.py distutils was imported twice, fixed
* TensorRunningAccum refactored
* TensorRunningAccum added to change log (Changed)
* change log pull link added

Co-authored-by: Joe Davison <joe@huggingface.co>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
Before submitting
What does this PR do?
Fixes #1173.
This PR adds a new method, LightningLoggerBase.agg_and_log_metrics. This method has the same signature as the abstract LightningLoggerBase.log_metrics method. The new method aggregates metrics that share the same step value and then delegates the logging procedure to log_metrics. From the client perspective nothing has changed: the abstract log_metrics is still in place and still needs to be implemented in the child loggers, but the Trainer now calls agg_and_log_metrics.
Did you have fun?
Make sure you had fun coding 🙃
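The aggregation behavior described above, buffering metrics that share a step and flushing once the step changes, can be sketched roughly as follows (an illustrative outline with a mean-per-key default, not the merged implementation; the class and attribute names are made up):

```python
import numpy as np

class AggregatingLogger:
    """Buffers metrics reported for the same step and logs a single
    aggregated dict (mean per key) once a new step arrives."""

    def __init__(self):
        self._step = None
        self._pending = []
        self.logged = []  # stands in for a concrete log_metrics backend

    def log_metrics(self, metrics, step):
        # in a real logger this would write to TensorBoard, a CSV, etc.
        self.logged.append((step, metrics))

    def agg_and_log_metrics(self, metrics, step):
        # same signature as log_metrics; a new step flushes the old buffer
        if step != self._step and self._pending:
            self._flush()
        self._step = step
        self._pending.append(metrics)

    def _flush(self):
        keys = {k for d in self._pending for k in d}
        agg = {k: float(np.mean([d[k] for d in self._pending if k in d]))
               for k in keys}
        self.log_metrics(agg, self._step)
        self._pending = []

    def finalize(self):
        # flush whatever is buffered for the last step
        if self._pending:
            self._flush()
```

Two reports for step 0 followed by one for step 1 would produce a single aggregated entry for step 0, with step 1 emitted on finalize.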