Add internal counters to send only the latest datapoints to studio #788

AlexandreKempf · 2024-02-15T09:58:29Z

This PR follows up on a bug detected during this feature development.

Using the step value to decide which data to send to Studio can lead to incorrect behavior because the step is poorly defined in some loggers (e.g. the PyTorch Lightning logger). Because the step is poorly defined in Lightning, we used a hack to ensure that the log_metrics calls made by the Lightning trainer were handled correctly. But calling log_metrics from outside the Lightning trainer (from a separate thread, for instance) leads to data not being sent to Studio, or to duplicated data.

This PR introduces a counter for each metric that increments when Studio receives the data points. Instead of using the step property as a proxy for which data has been sent to Studio, we now count the points directly. This way, when we send data points to Studio, we send only the points it hasn't received yet.
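
To make the idea concrete, here is a minimal sketch of counter-based delta sending. It is not the code in src/dvclive/studio.py: the `DeltaSender` class and its method names are hypothetical, and the shape of `logs` (metric name mapped to a list of points) is an assumption made for illustration.

```python
from collections import defaultdict


class DeltaSender:
    """Hypothetical helper for illustration, not DVCLive's actual class."""

    def __init__(self):
        # metric name -> number of points the receiver has confirmed
        self.num_points_sent = defaultdict(int)

    def unsent(self, logs):
        """Return, per metric, only the points past that metric's counter."""
        return {
            name: points[self.num_points_sent[name]:]
            for name, points in logs.items()
            if len(points) > self.num_points_sent[name]
        }

    def mark_sent(self, delta):
        """Advance the counters once the receiver has confirmed the delta."""
        for name, points in delta.items():
            self.num_points_sent[name] += len(points)


logs = {"test": [{"step": 0, "test": 0.5}, {"step": 1, "test": 0.5}]}
sender = DeltaSender()

first = sender.unsent(logs)
sender.mark_sent(first)

# A point logged at an already-seen step is still sent, and only once,
# because the counter looks at position in the list and not at the step.
logs["test"].append({"step": 1, "test": 0.7})
second = sender.unsent(logs)
sender.mark_sent(second)
third = sender.unsent(logs)
```

Because the counter only moves in `mark_sent`, a failed post in this sketch would leave the points eligible for the next attempt.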

The test added in this PR fails on the main branch. There, logs[test_metric] is

[
    {'step': '0', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'}, 
    {'step': '2', 'test': '0.5'}, 
    {'step': '3', 'test': '0.5'}
]

which is expected, but test_calls is

[
    {'step': '0', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'},
    {'step': '1', 'test': '0.5'}, 
    {'step': '1', 'test': '0.5'},
    {'step': '2', 'test': '0.5'}, 
    {'step': '3', 'test': '0.5'},
    {'step': '3', 'test': '0.5'}
]

With this PR, logs[test_metric] and test_calls are the same and contain no duplicated elements.

❗ I have followed the Contributing to DVCLive guide.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

codecov-commenter · 2024-02-15T10:04:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9eb04c2) 95.53% compared to head (bcf6c04) 95.55%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
+ Coverage   95.53%   95.55%   +0.02%     
==========================================
  Files          55       55              
  Lines        3559     3580      +21     
  Branches      319      319              
==========================================
+ Hits         3400     3421      +21     
  Misses        111      111              
  Partials       48       48


src/dvclive/studio.py

dberenbaum

LGTM! Do you think there are any additional tests that would be helpful to add?

AlexandreKempf · 2024-02-16T14:20:50Z

> LGTM! Do you think there are any additional tests that would be helpful to add?

Sure! Over the last few years I picked up the bad habit of not writing tests. Writing them needs to become a habit once again. Sorry for that, I'll add them :)

tests/test_post_to_studio.py

shcheklein

We need to have a proper test. From the brief description, I don't quite understand in which situations we have a bug, so it would be helpful to have a test that makes it obvious.

AlexandreKempf · 2024-02-19T12:03:14Z

@shcheklein
I updated the description to explain the problem a bit better.
I added a new test for the Lightning framework, since that is where the problem showed up.
The new test uses a thread and a sleep function, but I tried to apply what you suggested in the CPU monitoring PR about time.sleep in tests. I also used @pytest.mark.timeout(3) so that the test fails after 3 seconds instead of hanging forever.
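
For reference, the "fail instead of hang" idea can also be sketched with the standard library alone. This is not the test in tests/frameworks/test_lightning.py: `run_with_timeout` and `log_from_other_thread` are hypothetical names, and the list `received` merely stands in for whatever a second logger would record.

```python
import threading


def run_with_timeout(target, timeout):
    """Run target in a daemon thread; return True if it finished in time."""
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)  # returns after `timeout` seconds at the latest
    return not worker.is_alive()


received = []
lock = threading.Lock()


def log_from_other_thread():
    # Stand-in for a second logger calling log_metrics outside the trainer.
    for step in range(4):
        with lock:
            received.append({"step": step, "test": 0.5})


finished = run_with_timeout(log_from_other_thread, timeout=3)
```

A test built this way would assert on `finished` first, so a stuck worker shows up as a failure rather than a hung test run.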

Let me know what you think :)

tests/frameworks/test_lightning.py

shcheklein

Thanks for updating the description, I think I understand it better now.

Just a few small questions to clarify re the test. Otherwise I think it should be fine (though I'm not an expert in the details re the data points management)

shcheklein · 2024-02-20T03:16:51Z

@daavoo it would be great if you can take a look :)

shcheklein · 2024-02-21T17:37:38Z

Thanks for the update, @AlexandreKempf! Can we add a test, or change this test a bit, to have more datapoints with a different cadence of updates? Correct me if I'm wrong, folks, but we keep a counter per metric path, right? Some metrics can be updated a few times before the next step (and even before a sync), others only once. In all cases we need to make sure that on sync we send the "delta" properly.
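
A rough sketch of the scenario I have in mind, with two metric paths updated at different cadences between syncs. The `take_delta` helper and the "fast" / "slow" paths are made up for illustration and are not DVCLive's API; the sketch also assumes every sync succeeds, so the counters are advanced immediately.

```python
def take_delta(logs, counters):
    """Return unsent points per metric path and advance the counters.

    Hypothetical helper: assumes the post to the receiver always succeeds.
    """
    delta = {}
    for path, points in logs.items():
        sent = counters.get(path, 0)
        if len(points) > sent:
            delta[path] = points[sent:]
            counters[path] = len(points)
    return delta


logs = {"fast": [], "slow": []}
counters = {}

# Before the first sync: "fast" is updated three times, "slow" once.
logs["fast"] += [0.1, 0.2, 0.3]
logs["slow"] += [1.0]
first_sync = take_delta(logs, counters)

# Before the second sync: only "fast" receives a new point.
logs["fast"] += [0.4]
second_sync = take_delta(logs, counters)
```

The point of such a test is that each sync carries exactly the new points for each path, and a path with no new points is left out of the payload.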

dberenbaum · 2024-02-27T17:36:55Z

@AlexandreKempf Are you ready to merge this one?

add counters to know what datapoints were sent to studio

74618f8

AlexandreKempf requested a review from dberenbaum February 15, 2024 09:58

AlexandreKempf mentioned this pull request Feb 15, 2024

Monitors CPU, RAM, and disk usage #773

Closed

2 tasks

dberenbaum reviewed Feb 16, 2024

src/dvclive/studio.py Outdated

clean code and remove _num_points_read_from_file

ee8997d

dberenbaum approved these changes Feb 16, 2024

AlexandreKempf added 2 commits February 16, 2024 15:24

Merge branch 'main' into send-to-studio-counter

2978544

add unit tests about _num_points_sent_to_studio behavior

0eba875

shcheklein reviewed Feb 18, 2024

tests/test_post_to_studio.py Outdated

shcheklein requested changes Feb 18, 2024

test if lightning logger sent data to studio when using a separate lo…

c0c6199

…gger (another thread)

AlexandreKempf force-pushed the send-to-studio-counter branch from 13415e1 to c0c6199 Compare February 19, 2024 11:47

AlexandreKempf requested a review from shcheklein February 19, 2024 12:07

shcheklein reviewed Feb 20, 2024

tests/frameworks/test_lightning.py Outdated

shcheklein reviewed Feb 20, 2024

tests/frameworks/test_lightning.py Outdated

shcheklein reviewed Feb 20, 2024

AlexandreKempf mentioned this pull request Feb 20, 2024

monitor GPU ressources #785

Merged

2 tasks

dberenbaum mentioned this pull request Feb 20, 2024

Test studio counter #790

Merged

Dave Berenbaum and others added 2 commits February 21, 2024 08:11

Test studio counter (#790)

8b4841e

* add test for repeated step in studio
* drop outdated lightning test comments

added context to the unit test

90c32e5

AlexandreKempf added 2 commits February 22, 2024 09:38

add async metrics logs in the studio unit tests

3ab7557

Merge branch 'main' into send-to-studio-counter

bcf6c04

shcheklein approved these changes Feb 22, 2024

AlexandreKempf merged commit bc0a5b5 into main Feb 28, 2024
14 checks passed

AlexandreKempf deleted the send-to-studio-counter branch February 28, 2024 17:59

dberenbaum mentioned this pull request Apr 29, 2024

Drop outdated flakey lightning test #816

Merged
