Fixes training/validation overlap #143, #233, #253, and #259 #282

yalaudah · 2020-04-22T19:50:49Z

This PR fixes the following issues: #143, #233, #253, and partially solves issue #259.

This solution is not ideal: it limits training and validation to a single direction (inline or crossline). If the training results are significantly affected, we can add code to train and validate in both directions.
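To illustrate why a single direction avoids the overlap: parallel sections of a 3D volume share no voxels, whereas every inline section crosses every crossline section along one trace. Below is a minimal sketch of a single-direction split, not the code in this PR. The function name `split_sections`, the `val_fraction` parameter, and the `i_`/`x_` section labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def split_sections(
    n_inlines: int,
    n_crosslines: int,
    direction: str = "inline",
    val_fraction: float = 0.2,
) -> Tuple[List[str], List[str]]:
    """Split sections along ONE direction into contiguous train/val blocks.

    Hypothetical helper for illustration only; the names and labels used
    here are assumptions, not the repository's API.
    """
    if direction not in ("inline", "crossline"):
        raise ValueError("direction must be 'inline' or 'crossline'")
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    n_sections = n_inlines if direction == "inline" else n_crosslines
    if n_sections < 2:
        raise ValueError("need at least two sections to split")

    prefix = "i" if direction == "inline" else "x"

    # Keep at least one section on each side of the split.
    n_val = min(n_sections - 1, max(1, int(round(n_sections * val_fraction))))
    n_train = n_sections - n_val

    # Sections in the same direction are parallel, so the two contiguous
    # blocks below cannot share any voxel of the volume.
    train = [f"{prefix}_{k}" for k in range(n_train)]
    val = [f"{prefix}_{k}" for k in range(n_train, n_sections)]
    return train, val


if __name__ == "__main__":
    train, val = split_sections(n_inlines=10, n_crosslines=8, direction="inline")
    print(len(train), len(val), val)
```

With this kind of split the validation block is contiguous and sits at the end of the chosen axis; mixing inline and crossline sections in the same split would reintroduce shared traces.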

Remaining:

* Add unit tests
* Test model performance with and without these changes.

* correctness branch setup (#251)
* created correctness branch, trimmed experiments to Dutch F3 only
* trivial change to re-trigger build
* dummy PR to re-trigger malfunctioning builds
* reducing scope further (#258)
* created correctness branch, trimmed experiments to Dutch F3 only
* trivial change to re-trigger build
* dummy PR to re-trigger malfunctioning builds
* reducing scope of the correctness branch further
* added branch triggers
* 214 Ignite 0.3.0 upgrade (#261)
* upgraded to Ignite 0.3.0 and fixed upgrade compatibility
* added seeds and modified notebook for ignite 0.3.0
* updated code and tests to work with ignite 0.3.0
* made code consistent with Ignite 0.3.0 as much as possible
* fixed iterator epoch_length bug by subsetting validation set
* applied same fix to the notebook
* bugfix in distributed train.py
* increased distributed tests to 2 batches - hoping for one batch per GPU
* resolved rebase conflict
* added seeds and modified notebook for ignite 0.3.0
* updated code and tests to work with ignite 0.3.0
* made code consistent with Ignite 0.3.0 as much as possible
* fixed iterator epoch_length bug by subsetting validation set
* applied same fix to the notebook
* bugfix in distributed train.py
* increased distributed tests to 2 batches - hoping for one batch per GPU
* update docker readme (#262)

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* tagged all TODOs with issues on github (and created issues) (#278)
* created correctness branch, trimmed experiments to Dutch F3 only
* trivial change to re-trigger build
* dummy PR to re-trigger malfunctioning builds
* resolved merge conflict
* flagged all non-contrib TODO with github issues
* resolved rebase conflict
* resolved merge conflict
* cleaned up archaic voxel code
* Refactoring train.py, removing OpenCV, adding training results to Tensorboard, bug fixes (#264)

  I think moving forward, we'll use smaller PRs. But here are the changes in this one:

  Fixes issue #236, which involves rewriting a big portion of train.py such that:

  * All the Tensorboard event handlers are organized in tensorboard_handlers.py and only called in train.py to log training and validation results in Tensorboard.
  * The code logs the same results for training and validation. It also adds the class IoU score.
  * All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) are lambda functions now.
  * The code is organized into more meaningful "chunks", e.g. all the optimizer-related code should be together if possible; the same goes for logging, configuration, loaders, Tensorboard, etc.

  In addition:

  * Fixed a visualization bug where the seismic images were not normalized correctly. This solves Issue #217.
  * Fixed a visualization bug where the predictions were not masked where the input image was padded. This improves the ability to visually inspect and evaluate the results. This solves Issue #230.
  * Fixes a potential issue where Tensorboard can crash when a large training batch size is used. Now the number of images visualized in Tensorboard from every batch has an upper limit.
  * Completely removed OpenCV as a dependency from the DeepSeismic repo. It was only used in a small part of the code where it wasn't really necessary, and OpenCV is a huge library.
  * Fixes Issue #218 where the epoch number for the images in Tensorboard was always logged as 1 (therefore not allowing us to see the epoch number of the different results in Tensorboard).
  * Removes the HorovodLRScheduler class since it's no longer used.
  * Removes toolz.take from Debug mode, and uses PyTorch's native Subset() dataset class.
  * Changes default patch size for the HRNet model to 256.
  * In addition to several other minor changes.

Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com>
Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net>
Co-authored-by: Max Kaznady <maxkaz@microsoft.com>

* Fixes training/validation overlap #143, #233, #253, and #259 (#282)
* Correctness single GPU switch (#290)
* resolved rebase conflict
* resolved merge conflict
* resolved rebase conflict
* resolved merge conflict
* reverted multi-GPU builds to run on single GPU
* 249r3 (#283)
* resolved rebase conflict
* resolved merge conflict
* resolved rebase conflict
* resolved merge conflict
* wrote the bulk of checkerboard example
* finished checkerboard generator
* resolved merge conflict
* resolved rebase conflict
* got binary dataset to run
* finished first implementation mockup - commit before rebase
* made sure rebase went well manually
* added new files
* resolved PR comments and made tests work
* fixed build error
* fixed build VM errors
* more fixes to get the test to pass
* fixed n_classes issue in data.py
* fixed notebook as well
* cleared notebook run cell
* trivial commit to restart builds
* addressed PR comments
* moved notebook tests to main build pipeline
* fixed checkerboard label precision
* relaxed performance tests for now
* resolved merge conflict
* resolved merge conflict
* fixed build error
* resolved merge conflicts
* fixed another merge mistake
* enabling development on docker (#291)
* 289: correctness metrics and tighter tests (#293)
* resolved rebase conflict
* resolved merge conflict
* resolved rebase conflict
* resolved merge conflict
* wrote the bulk of checkerboard example
* finished checkerboard generator
* resolved merge conflict
* resolved rebase conflict
* got binary dataset to run
* finished first implementation mockup - commit before rebase
* made sure rebase went well manually
* added new files
* resolved PR comments and made tests work
* fixed build error
* fixed build VM errors
* more fixes to get the test to pass
* fixed n_classes issue in data.py
* fixed notebook as well
* cleared notebook run cell
* trivial commit to restart builds
* addressed PR comments
* moved notebook tests to main build pipeline
* fixed checkerboard label precision
* relaxed performance tests for now
* resolved merge conflict
* resolved merge conflict
* fixed build error
* resolved merge conflicts
* fixed another merge mistake
* resolved rebase conflict
* resolved rebase 2
* resolved merge conflict
* resolved merge conflict
* adding new logging
* added better logging - cleaner - debugged metrics on checkerboard dataset
* resolved rebase conflict
* resolved merge conflict
* resolved merge conflict
* resolved merge conflict
* resolved rebase 2
* resolved merge conflict
* updated notebook with the changes
* addressed PR comments
* addressed another PR comment
* uniform colormap and correctness tests (#295)
* correctness code good for PR review
* addressed PR comments
* added data dumps to the code
* all dumps work properly now
* fixed build error, added binary dataset
* done - now need to test
* finished dev build script
* updates to tests to run on local machine as well we build
* updated gradient direction in gen_checkerboard
* increased Dutch F3 timeout

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com>
Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net>

yalaudah and others added 30 commits February 20, 2020 19:57

Merge pull request #4 from microsoft/staging

b10010b

update staging branch

Merge pull request #5 from microsoft/staging

b26c0d0

remove duplicate code for validation (#208)

Merge pull request #6 from microsoft/staging

b6e184b

Update fork

format with black

70ae8e9

updates to the HRnet notebook

1d85b20

updates to penobscot notebook

89008be

set normalize=True as default in create_image_writer

b1c1a29

remove HorovodLRScheduler

2659039

minor cleanup in logging_handlers

cc1c75d

tb_handlers update

9d36a03

major changes to train.py

417c721

Checkpointing

020c52f

Minor update to configs (TRAIN.DEPTH)

1f646d2

Remove OpenCV completely :)

942be67

Minor formatting, renaming, and cleanup.

21d9917

final cleanup

c1e4daa

Merge branch 'staging' of https://github.com/microsoft/seismic-deeplearning into staging

5f4bc2c

resetting some params in the configs

759b4e0

minor

a17cd95

Merge branch 'update-visualization' of git://github.com/yalaudah/seismic-deeplearning into yalaudah-update-visualization

b8eb264

Merge branch 'yalaudah-update-visualization' into staging

c56643d

update notebook seeds

238b834

Merge branch 'staging' of https://github.com/yalaudah/seismic-deeplearning into staging

920acfa

bug fix

b9be4e5

adding TRAIN_VAL_SPLIT_DIRECTION to default.py

952491d

updates to split_section_train_val

f4b339b

updates to _get_aline_range

968b946

black formatting

d2a078e

updates to SplitTrainValCLI

baeb71c

black formatting

50ad90e

This was linked to issues Apr 22, 2020

enable switch of crossline or inline training and scoring - easy fix for training set leakage #253

Closed

prepare_dutchf3.py vertical sample locations drops patches at the bottom of the volume - leads to worse results in that region #259

Closed

yalaudah added 4 commits April 22, 2020 20:50

fixes to test

f7abb30

fixes to tests

7a9bd8d

test fixes

bf8cc96

test fixes

c1efe7c

yalaudah changed the title ~~Fixes training/validation overlap #143, #233, #253~~ Fixes training/validation overlap #143, #233, #253, and #259 Apr 23, 2020

yalaudah requested review from maxkazmsft and sharatsc April 23, 2020 01:00

yalaudah added 5 commits April 23, 2020 14:06

fixes to tests

966f92d

test fixes

b47581d

code enhancements

223e575

fixes to test

5bad1d4

bug fix

af07cb8

yalaudah marked this pull request as ready for review April 25, 2020 03:40

maxkazmsft reviewed Apr 27, 2020

yalaudah added 9 commits April 27, 2020 14:59

added "both" option

3d815e9

bug fix

a62780b

bug fix

1083e31

bug fixes

397a1f8

bug fixes

53685f4

bug fix

d5ee776

bug fixes

00d9e62

bug fix

a17bea8

bug fix

0cac087

maxkazmsft approved these changes Apr 28, 2020

yalaudah merged commit be94d9a into microsoft:correctness Apr 28, 2020
