Grid Analyzer! #81

Merged
merged 40 commits into from
Nov 13, 2018
Commits
40 commits
21d9342
First fixes of number of available CPUs, grid_trainer and tester work…
tkornuta-ibm Nov 6, 2018
f2332ae
Cleaned up grid testers, now relying on the fact whether best_model.…
tkornuta-ibm Nov 7, 2018
0b8105d
hints -> hint
tkornuta-ibm Nov 7, 2018
818e9c0
Work on mip-grid-analyzer, working up to the point when statistics ar…
tkornuta-ibm Nov 7, 2018
af5fba4
Removed spawning many processes, commented file content analysis (for …
tkornuta-ibm Nov 7, 2018
c2eabc2
Merge branch 'develop' into fix/grid-analyzer
tkornuta-ibm Nov 7, 2018
a24b4af
Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…
tkornuta-ibm Nov 7, 2018
c372dd1
Merge branch 'feat/trainers_save_status' into fix/grid-analyzer
tkornuta-ibm Nov 8, 2018
b7eb77e
analyzer - processing data from checkpoint and csv file
tkornuta-ibm Nov 8, 2018
1411276
timestamp
tkornuta-ibm Nov 8, 2018
7657d50
episode limit - 1000
tkornuta-ibm Nov 8, 2018
f1ad3c4
Merge branch 'feat/trainers_save_status' into fix/grid-analyzer
tkornuta-ibm Nov 8, 2018
21e0759
Comment
tkornuta-ibm Nov 9, 2018
32bf7a0
Reading training and validation from checkpoint
tkornuta-ibm Nov 9, 2018
16f10c8
Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…
tkornuta-ibm Nov 9, 2018
8d6cb59
Polished experiment confirmation
tkornuta-ibm Nov 9, 2018
d36ac12
Standardization in vision mnist configs
tkornuta-ibm Nov 9, 2018
3cad6ca
Refined confirmation, added handling of termination with ctrl-c - exc…
tkornuta-ibm Nov 9, 2018
0626484
Micro-cleanup in commandline arguments, changed output to expdir - fi…
tkornuta-ibm Nov 9, 2018
632f668
Removed partial validation aggregation from offline trainer
tkornuta-ibm Nov 9, 2018
32a45d6
Added export to checkpoint method to stats objects that fixes problem…
tkornuta-ibm Nov 9, 2018
c58fc86
grid analyzer working
tkornuta-ibm Nov 9, 2018
3c3ecec
grid analyzer working
tkornuta-ibm Nov 9, 2018
ad5d187
Merge branch 'develop' of github.com:IBM/mi-prometheus into fix/grid-…
tkornuta-ibm Nov 9, 2018
1ae12d3
Analyzer cleanup + grid-training-mnist config
tkornuta-ibm Nov 9, 2018
28ea631
Removed empty exception handling around user input()
tkornuta-ibm Nov 9, 2018
dc672a8
Removed unused imports
tkornuta-ibm Nov 9, 2018
b8e96b4
Model cleanup
tkornuta-ibm Nov 9, 2018
e2ee0cc
Missing try-except in analyzer
tkornuta-ibm Nov 9, 2018
2e7cfb6
Added option to indicate trainer from grid training configuration fil…
tkornuta-ibm Nov 10, 2018
840f571
Lots of small clean up / polishing.
vmarois Nov 10, 2018
ad67e90
Merge branch 'fix/grid-analyzer' of github.com:IBM/mi-prometheus into…
vmarois Nov 10, 2018
f6dc6af
Fixed grid analyzer comments
tkornuta-ibm Nov 10, 2018
de98f88
Merge branch 'fix/grid-analyzer' of github.com:IBM/mi-prometheus into…
tkornuta-ibm Nov 10, 2018
1e6165f
input at the end of configuration
tkornuta-ibm Nov 10, 2018
5b41532
Made some methods static + polishing.
vmarois Nov 12, 2018
0ef9ee4
Fixed a few bugs raised by Vincent: 1) saving proper status when conver…
tkornuta-ibm Nov 12, 2018
d00fa16
Added psutil to config and readme
tkornuta-ibm Nov 12, 2018
a9cf3d2
if added to doc_build
tkornuta-ibm Nov 12, 2018
292213f
Standardizes statuses across trainers, updates statuses in checkpoin…
tkornuta-ibm Nov 13, 2018
1 change: 1 addition & 0 deletions README.md
@@ -55,6 +55,7 @@ The dependencies of MI-prometheus are:
 * torchtext
 * tensorboardx
 * matplotlib
+* psutil (enables grid-* to span child processes on MacOS and Ubuntu)
 * PyYAML
 * tqdm
 * nltk
15 changes: 0 additions & 15 deletions configs/example_trainer_gpu.yaml

This file was deleted.

6 changes: 3 additions & 3 deletions configs/vision/alexnet_mnist.yaml
@@ -17,23 +17,23 @@ training:
         lr: 0.01
     # settings parameters
     terminal_conditions:
-        loss_stop: 1.0e-5
+        loss_stop: 1.0e-3
         episode_limit: 50000
         epochs_limit: 10

 # Problem parameters:
 validation:
     problem:
         name: *name
-        batch_size: 64
+        batch_size: *b
         use_train_data: True # True because we are splitting the training set to: validation and training
         resize: [224, 224]

 # Problem parameters:
 testing:
     problem:
         name: *name
-        batch_size: 64
+        batch_size: *b
         use_train_data: False
         resize: [224, 224]

43 changes: 43 additions & 0 deletions configs/vision/grid_trainer_mnist.yaml
@@ -0,0 +1,43 @@
+grid_tasks:
+  -
+    default_configs: configs/vision/lenet5_mnist.yaml
+  -
+    default_configs: configs/vision/simplecnn_mnist.yaml
+
+# Set exactly the same experiment conditions for the 2 tasks.
+grid_overwrite:
+    training:
+        problem:
+            batch_size: &b 1000
+        sampler:
+            name: SubsetRandomSampler
+            indices: [0, 55000]
+        # Set the same optimizer parameters.
+        optimizer:
+            name: Adam
+            lr: 0.01
+        # Set the same terminal conditions.
+        terminal_conditions:
+            loss_stop: 4.0e-2
+            episode_limit: 10000
+            epoch_limit: 10
+
+    # Problem parameters:
+    validation:
+        problem:
+            batch_size: *b
+        sampler:
+            name: SubsetRandomSampler
+            indices: [55000, 60000]
+
+    testing:
+        problem:
+            batch_size: *b
+
+grid_settings:
+    # Set number of repetitions of each experiments.
+    experiment_repetitions: 5
+    # Set number of concurrent running experiments.
+    max_concurrent_runs: 4
+    # Set trainer.
+    trainer: mip-online-trainer
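
One way to read this grid configuration: each entry in grid_tasks is loaded, grid_overwrite is merged on top of it, and the result is run experiment_repetitions times. The Python sketch below illustrates that reading only; it is not the MI-Prometheus implementation. The helpers deep_merge and expand_grid and the load_defaults callback are hypothetical names, and plain dicts with abbreviated contents stand in for the parsed YAML files.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` merged in, key by key (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default.
            merged[key] = copy.deepcopy(value)
    return merged


def expand_grid(grid_config, load_defaults):
    """List (config_path, repetition, merged_config) for every planned run (hypothetical helper)."""
    overwrite = grid_config.get("grid_overwrite", {})
    settings = grid_config.get("grid_settings", {})
    repetitions = settings.get("experiment_repetitions", 1)
    runs = []
    for task in grid_config["grid_tasks"]:
        path = task["default_configs"]
        defaults = load_defaults(path)
        for repetition in range(repetitions):
            runs.append((path, repetition, deep_merge(defaults, overwrite)))
    return runs


# Abbreviated, already-parsed stand-ins for the two default config files.
DEFAULTS = {
    "configs/vision/lenet5_mnist.yaml": {
        "training": {
            "problem": {"name": "MNIST", "batch_size": 64},
            "terminal_conditions": {"loss_stop": 1.0e-2, "episode_limit": 10000, "epoch_limit": 10},
        },
        "validation": {"problem": {"name": "MNIST", "batch_size": 64}},
        "testing": {"problem": {"name": "MNIST", "batch_size": 64}},
    },
    "configs/vision/simplecnn_mnist.yaml": {
        "training": {
            "problem": {"name": "MNIST", "batch_size": 64},
            "terminal_conditions": {"loss_stop": 1.0e-3, "episode_limit": 1000, "epoch_limit": 1},
        },
        "validation": {"problem": {"name": "MNIST", "batch_size": 64}},
        "testing": {"problem": {"name": "MNIST", "batch_size": 64}},
    },
}

# Abbreviated stand-in for the parsed grid configuration.
GRID = {
    "grid_tasks": [
        {"default_configs": "configs/vision/lenet5_mnist.yaml"},
        {"default_configs": "configs/vision/simplecnn_mnist.yaml"},
    ],
    "grid_overwrite": {
        "training": {
            "problem": {"batch_size": 1000},
            "terminal_conditions": {"loss_stop": 4.0e-2, "episode_limit": 10000, "epoch_limit": 10},
        },
        "validation": {"problem": {"batch_size": 1000}},
        "testing": {"problem": {"batch_size": 1000}},
    },
    "grid_settings": {"experiment_repetitions": 5, "max_concurrent_runs": 4},
}

runs = expand_grid(GRID, DEFAULTS.__getitem__)
print(len(runs))  # 2 tasks x 5 repetitions -> 10
```

Note that the overwrite sets batch_size explicitly in training, validation and testing. A YAML alias such as *b is resolved when its own file is parsed, so if the merge happens after parsing, as in this sketch, changing only the training value would not propagate to the other sections.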
24 changes: 12 additions & 12 deletions configs/vision/lenet5_mnist.yaml
@@ -2,9 +2,9 @@
 training:
     problem:
         name: &name MNIST
-        batch_size: 64
+        batch_size: &b 64
         use_train_data: True
-        mnist_folder: &folder '~/data/mnist'
+        data_folder: &folder '~/data/mnist'
         resize: [32, 32]
     # Use sampler that operates on a subset.
     sampler:
@@ -15,19 +15,19 @@ training:
         name: Adam
         lr: 0.01
     # settings parameters
-    #terminal_condition:
-    # loss_stop: 1.0e-5
-    # episode_limit: 10000
-    # epoch_limit: 10
+    terminal_conditions:
+        loss_stop: 1.0e-2
+        episode_limit: 10000
+        epoch_limit: 10

 # Validation parameters:
 validation:
-    partial_validation_interval: 100
+    #partial_validation_interval: 100
     problem:
         name: *name
-        batch_size: 64
-        use_train_data: True
-        mnist_folder: *folder
+        batch_size: *b
+        use_train_data: True # True because we are splitting the training set to: validation and training
+        data_folder: *folder
         resize: [32, 32]
     # Use sampler that operates on a subset.
     sampler:
@@ -38,9 +38,9 @@ validation:
 testing:
     problem:
         name: *name
-        batch_size: 10000
+        batch_size: *b
         use_train_data: False
-        mnist_folder: *folder
+        data_folder: *folder
         resize: [32, 32]

 # Model parameters:
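
The terminal_conditions section re-enabled above carries three stopping criteria. The sketch below shows one plausible way such criteria could be evaluated. It assumes training stops when the loss falls below loss_stop or when either counter reaches its limit. TerminalConditions and stop_reason are hypothetical names; the actual trainers may check these in a different order or against a different loss (training vs. validation).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TerminalConditions:
    """Mirrors the three keys of the terminal_conditions section (hypothetical container)."""
    loss_stop: float
    episode_limit: int
    epoch_limit: int


def stop_reason(cond: TerminalConditions, loss: float, episode: int, epoch: int) -> Optional[str]:
    """Return the name of the first condition that is met, or None to keep training."""
    if loss < cond.loss_stop:
        return "loss_stop"
    if episode >= cond.episode_limit:
        return "episode_limit"
    if epoch >= cond.epoch_limit:
        return "epoch_limit"
    return None


# Values taken from the lenet5_mnist.yaml hunk above.
lenet5 = TerminalConditions(loss_stop=1.0e-2, episode_limit=10000, epoch_limit=10)

print(stop_reason(lenet5, loss=0.5, episode=120, epoch=0))     # None: keep training
print(stop_reason(lenet5, loss=0.005, episode=120, epoch=0))   # loss_stop
print(stop_reason(lenet5, loss=0.5, episode=10000, epoch=3))   # episode_limit
```

Returning the name of the triggered condition, rather than a bare boolean, makes it straightforward to record why a run ended, which is the kind of status an analyzer could later aggregate.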
14 changes: 7 additions & 7 deletions configs/vision/simplecnn_mnist.yaml
@@ -5,7 +5,7 @@ training:
     problem:
         name: &name MNIST
         batch_size: &b 64
-        mnist_folder: &folder '~/data/mnist'
+        data_folder: &folder '~/data/mnist'
         use_train_data: True
         resize: [32, 32]
     sampler:
@@ -18,16 +18,16 @@ training:
         lr: 0.01
     # settings parameters
     terminal_conditions:
-        loss_stop: 1.0e-5
-        episode_limit: 20000
+        loss_stop: 1.0e-3
+        episode_limit: 1000
         epoch_limit: 1

 # Problem parameters:
 validation:
     problem:
         name: *name
-        batch_size: 5000
-        mnist_folder: *folder
+        batch_size: *b
+        data_folder: *folder
         use_train_data: True # True because we are splitting the training set to: validation and training
         resize: [32, 32]
     sampler:
@@ -43,8 +43,8 @@ testing:
     #seed_torch: 2452
     problem:
         name: *name
-        batch_size: 10000
-        mnist_folder: *folder
+        batch_size: *b
+        data_folder: *folder
         use_train_data: False
         resize: [32, 32]

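
The use_train_data comment above, together with the sampler indices [0, 55000] and [55000, 60000] in grid_trainer_mnist.yaml, describes splitting the 60000 MNIST training images into a training part and a validation part. The sketch below assumes each pair is expanded into a half-open index range; that interpretation and the helper names subset_indices and sample_order are assumptions. PyTorch's SubsetRandomSampler draws from a given list of indices without replacement, which the pure-Python shuffle here only imitates.

```python
import random


def subset_indices(bounds):
    """Expand [start, stop] into the indices start .. stop - 1 (assumed interpretation)."""
    start, stop = bounds
    return range(start, stop)


def sample_order(bounds, seed=None):
    """Return the subset's indices in random order, each exactly once."""
    rng = random.Random(seed)
    indices = list(subset_indices(bounds))
    rng.shuffle(indices)
    return indices


train = subset_indices([0, 55000])
valid = subset_indices([55000, 60000])

print(len(train), len(valid))         # 55000 5000
print(set(train).isdisjoint(valid))   # True: no image is used for both
```

Because both samplers index into the same underlying training set, validation also needs use_train_data: True, while the test section keeps use_train_data: False and uses the separate test images.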
5 changes: 4 additions & 1 deletion doc_build.sh
@@ -8,4 +8,7 @@ sphinx-build -b html source build
 make html

 # open web browser(s) to master table of content
-firefox build/index.html
+if which firefox
+then
+    firefox build/index.html
+fi
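
The guard added to doc_build.sh skips the browser launch when firefox is not installed, so the script no longer fails on such machines. For comparison, the same check can be expressed in Python with the standard library. This is only an illustrative sketch: open_docs and its parameters are hypothetical and not part of the repository, and the which and run arguments exist so the function can be exercised without launching anything.

```python
import shutil
import subprocess


def open_docs(index_path="build/index.html", browser="firefox",
              which=shutil.which, run=subprocess.run):
    """Open the built docs only if `browser` is found on PATH; return True if launched."""
    executable = which(browser)
    if executable is None:
        # Browser not installed: skip silently, as the shell guard does.
        return False
    run([executable, index_path])
    return True


# Dry run with stand-ins for the lookup and the launcher.
launched = []
print(open_docs(which=lambda name: None, run=launched.append))                    # False
print(open_docs(which=lambda name: "/usr/bin/" + name, run=launched.append))      # True
print(launched)  # [['/usr/bin/firefox', 'build/index.html']]
```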