Grid Analyzer! #81

tkornuta-ibm · 2018-11-09T22:09:45Z

Finally, a working mip-grid-analyzer is here!

This required solving several issues and introducing many changes across the whole experimental pipeline, starting with:

rethinking what we really want to show in the final csv file (fixes Rethink operation of grid-analyzer #66)
investigating the content of statistics and rethinking the desired behavior of online/offline trainers (fixes Investigate the content of statistics, rethink the desired behavior of online/offline trainers #64)
storing the termination cause (fixes Store termination cause in model checkpoint #67) along with training and validation statistics (fixes Add exporting both training and validation statistics to model #74) in the model checkpoint
fixing a formatting issue in grid-analyzer (fixes Fix formatting issue in grid-analyzer #77)

I have also introduced fixes to all grid-workers-cpu that make it possible to distribute computations across several CPU processes on macOS (fixes grid_*_cpu not working on MAC/OSX #52). WARNING! I have not tested this on Ubuntu CPUs or on GPUs; it should work, but needs testing. I am creating new issues for that: Test grid-* image classification pipeline on Ubuntu CPUs #82 and Test grid-* image classification pipeline on GPUs #83.

In order to test the whole pipeline, I have also prepared a config for grid_trainer with the LeNet5 and SimpleCNN models trained on MNIST (fixes #51).

The use case is as follows:

run grid-trainer on the configuration file (2 models, 1 problem, 5 repetitions)
'mip-grid-trainer-cpu --c configs/vision/grid_trainer_mnist.yaml'
run grid-tester on the newly created directory containing 10 different models
'mip-grid-tester-cpu --e --expdir ./experiments_20181109_131153'
run grid-analyzer to aggregate all the results in a single csv file
'mip-grid-analyzer --e --expdir ./experiments_20181109_131153'
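
The three steps above could be chained from a small driver script. The following is only a minimal sketch, not part of this PR: the helper names `trainer_command`, `post_training_commands` and `run` are hypothetical, and the executables and flags are copied verbatim from the commands quoted above. Since the trainer creates the timestamped experiment directory itself, the sketch assumes the directory name is passed in by hand after step 1.

```python
import subprocess
from typing import List


def trainer_command(config: str) -> List[str]:
    """Step 1: command line for the grid trainer (flags as quoted in the PR)."""
    return ["mip-grid-trainer-cpu", "--c", config]


def post_training_commands(expdir: str) -> List[List[str]]:
    """Steps 2 and 3: tester, then analyzer, both pointed at the same directory."""
    return [
        ["mip-grid-tester-cpu", "--e", "--expdir", expdir],
        ["mip-grid-analyzer", "--e", "--expdir", expdir],
    ]


def run(command: List[str]) -> None:
    """Run one step and stop the pipeline if it fails (hypothetical helper)."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    run(trainer_command("configs/vision/grid_trainer_mnist.yaml"))
    # Assumption: the directory created by the trainer is filled in manually.
    for command in post_training_commands("./experiments_20181109_131153"):
        run(command)
```

Using `check=True` means a failing trainer stops the script before the tester and analyzer are started on an incomplete directory.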

Attaching the resulting file:

20181109_134802_grid_analysis.xlsx

(What I have described above will in fact become part of "reproducible research: MNIST Image Classification with LeNet-5", issue #12.)

Besides, I've made plenty of minor changes, like:

added Ctrl+C handling to grid workers (fixes Add Ctrl+C handling to grid workers #63 )
changed --o to --e (fixes Change --o to --e #65 )
cleaned up some of the arguments of grid workers (partially addresses Clean up the flags of the different workers #25)
added an option to select the basic trainer from both the command line and the config of grid_trainer_* (fixes Add basic trainer selection to config of grid_trainer_* #62)

tkornuta-ibm · 2018-11-09T22:39:45Z

This pull request introduces 3 alerts when merging 1ae12d3 into d8c20f1 - view on LGTM.com

new alerts:

3 for Unused import

Comment posted by LGTM.com

vmarois · 2018-11-10T02:03:16Z

It looks good! I haven't had the time to test it, so I will do so on Monday.

This PR adds a dependency on psutil. Is there no way around it?

tkornuta-ibm · 2018-11-10T02:07:57Z

psutil

Sadly not; please review the solutions that I analyzed when fixing issue #52.
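
The analysis from issue #52 is not reproduced in this conversation, so the following is only an illustration of the kind of portability gap that can motivate such a dependency; it is not the code from this PR. It assumes the workers need the number of usable CPUs: `os.sched_getaffinity` exists on Linux but not on macOS, while `psutil.cpu_count` is available on both. The function name `available_cpu_count` is hypothetical.

```python
import os


def available_cpu_count() -> int:
    """Return the number of CPUs this process may use (always at least 1)."""
    # Linux exposes the CPU affinity mask of the process; macOS does not.
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    try:
        # Assumption: psutil is installed, as the dependency added here suggests.
        import psutil
        count = psutil.cpu_count(logical=True)
    except ImportError:
        # Fallback for this sketch only, so that it runs without psutil.
        count = os.cpu_count()
    # Both calls may return None when the count cannot be determined.
    return max(1, count or 1)


if __name__ == "__main__":
    print(available_cpu_count())
```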

vmarois

It looks good, thanks!
I have done some doc polishing and set a few basic methods of GridAnalyzer as static ones.
I have tested the pipeline on Ubuntu CPUs (fixes #82 ) and GPUs (fixes #83 ), and it worked flawlessly.

Nonetheless, I have some remarks about the resulting csv file:

The train_status value does not always match the actual one reported in trainer.log. Critically, some runs are marked as "Not converged" when they actually did converge. I think this happens because we do not always save the model: the model may be saved mid-training, with a training status that then differs from the final one.
I find the key name train_start to be somewhat confusing when compared to model_timestamp (I thought at the beginning that we were not reporting the correct timestamp). Could we clarify this?
We should report the aggregated statistics for the test, in place of the data point for the last episode. I think this is more valuable to the user, and shouldn't be hard to implement (since we have only 1 data point for the test, and the corresponding csv file exists in all cases)
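
To illustrate the last remark, here is a minimal sketch contrasting the data point of the last episode with statistics aggregated over all episodes. It is written under stated assumptions: the CSV layout and the column names `episode` and `loss` are hypothetical, and the function names are not taken from GridAnalyzer.

```python
import csv
import io
import statistics
from typing import Dict, List


def read_column(csv_text: str, column: str) -> List[float]:
    """Parse one numeric column from CSV text (hypothetical layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def last_point(csv_text: str, column: str) -> float:
    """What a report based on the last episode only would show."""
    return read_column(csv_text, column)[-1]


def aggregate(csv_text: str, column: str) -> Dict[str, float]:
    """Aggregated view over all episodes: mean, population std, min, max."""
    values = read_column(csv_text, column)
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = "episode,loss\n0,1.0\n1,2.0\n2,3.0\n"
    print(last_point(sample, "loss"))
    print(aggregate(sample, "loss"))
```

A single last-episode value can be an outlier, whereas the aggregated values describe the whole test run, which is the point made in the remark above.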

Otherwise, this is an important PR, and this is great work! 👍

tkornuta-ibm added 25 commits November 5, 2018 18:59

First fixes of number of available CPUs, grid_trainer and tester work…

21d9342

…ing_ analyser is failing

Cleaned up grid testers, now relying onf the fact whether best_model.…

f2332ae

…pt exist

hints -> hint

0b8105d

Work on mip-grid-analyzer, working up to the point when statistics ar…

818e9c0

…e pulled from files

Removed spannig many processes, commented file content analysis (for …

af5fba4

…now)

Merge branch 'develop' into fix/grid-analyzer

c2eabc2

Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…

a24b4af

…analyzer

Merge branch 'feat/trainers_save_status' into fix/grid-analyzer

c372dd1

analyzer - processing data from checkpoint and csv file

b7eb77e

timestamp

1411276

episode limit - 1000

7657d50

Merge branch 'feat/trainers_save_status' into fix/grid-analyzer

f1ad3c4

Comment

21e0759

Reading training and validation from checkpoint

32bf7a0

Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…

16f10c8

…analyzer

Polished experiment confirmation

8d6cb59

Standardization in vision mnist configs

d36ac12

Refined confirmation, added handling of termination with ctrl-c - exc…

3cad6ca

…eption

Micro-cleanup in commandline arguments, changed output to expdir - fixes

0626484

#65

Removed partial validation aggregation from offline trainer

632f668

Added export to checkpoint method to stats objects that fixes problem…

32a45d6

… with wrongly formated values in grid-analyzer

gid analyzer working

c58fc86

gid analyzer working

3c3ecec

Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…

ad5d187

…analyzer

Analyzer cleanup + grid-training-mnist config

1ae12d3

tkornuta-ibm assigned vmarois Nov 9, 2018

tkornuta-ibm requested review from sesevgen, vincentalbouy, tsjayram and vmarois November 9, 2018 22:09

tkornuta-ibm and others added 7 commits November 9, 2018 15:38

Removed empty exception handling around user input()

28ea631

Removed unused imports

dc672a8

Model cleanup

b8e96b4

Missing try-except in analyzer

e2ee0cc

Added option to indicate trainer from grid training configuration fil…

2e7cfb6

…e, changed trainer command line argument handling - fixes #62)

Lots of small clean up / polishing.

840f571

Merge branch 'fix/grid-analyzer' of github.com:IBM/mi-prometheus into…

ad67e90

… fix/grid-analyzer

Fixed grid analyzer comments

f6dc6af

tkornuta-ibm and others added 3 commits November 9, 2018 18:08

Merge branch 'fix/grid-analyzer' of github.com:IBM/mi-prometheus into…

de98f88

… fix/grid-analyzer

input at the end of configuration

1e6165f

Made some methods static + polishing.

5b41532

vmarois suggested changes Nov 12, 2018

tkornuta-ibm added 4 commits November 12, 2018 14:00

Fixed few bugs raised by Vincent: 1) saving proper status when conver…

0ef9ee4

…ged in online trainer 2) added saving model after last epoch in offline trainer 3) importing aggregated test statistics in grid-analyzer

Added psutil to config and readme

d00fa16

if added to doc_build

a9cf3d2

Standardizes statuses accross trainers, updates statuses in checkpoin…

292213f

…ts at the end of training (when epoch/episode limit is hit), modifies model save. Fixes #85

vmarois approved these changes Nov 13, 2018

vmarois merged commit 9d0a1e2 into develop Nov 13, 2018

vmarois deleted the fix/grid-analyzer branch November 13, 2018 00:14

tkornuta-ibm restored the fix/grid-analyzer branch November 13, 2018 00:23

This was referenced Nov 13, 2018

Test grid-* image classification pipeline on GPUs #83

Closed

Release 0.3 #46

Merged
