All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
For each Pull Request, the affected code parts should be briefly described and added here in the "Upcoming" section. Once a release is done, the "Upcoming" section becomes the release changelog, and a new empty "Upcoming" should be created.
- (#465) Adding ability to run segmentation inference module on test data with partial ground truth files. (Also 522.)
- (#502) More flags for fine control of when to run inference.
- (#492) Adding capability for regression tests for test jobs that run in AzureML.
- (#509) Run inference on registered models (single and ensemble) using the parameter `model_id`.
- (#554) Added a parameter `pretraining_dataset_id` to `NIH_COVID_BYOL` to specify the name of the SSL training dataset.
- (#559) Adding the accompanying code for the "Active label cleaning: Improving dataset quality under resource constraints" paper. The code can be found in the InnerEye-DataSelection subfolder. It provides tools for training noise robust models, running label cleaning simulation and loading our label cleaning benchmark datasets.
- (#531) Updated PL to 1.3.8, torchmetrics and pl-bolts and changed relevant metrics and SSL code API.
- (#533) Better defaults for inference on ensemble children.
- (#536) Inference will not run on the validation set by default; this can be turned on via the `--inference_on_val_set` flag.
- (#548) Many Azure-related functions have been moved out of the toolbox, into the separate hi-ml Python package.
- (#502) Renamed command line option 'perform_training_set_inference' to 'inference_on_train_set'. Replaced command line option 'perform_validation_and_test_set_inference' with the pair of options 'inference_on_val_set' and 'inference_on_test_set'.
- (#496) All plots are now saved as PNG, rather than JPG.
- (#497) Reducing the size of the code snapshot that gets uploaded to AzureML, by skipping all test folders.
- (#509) Parameter `extra_downloaded_run_id` has been renamed to `pretraining_run_checkpoints`.
- (#526) Updated Covid config to use a multiclass formulation. Moved functions `create_metric_computers` and `compute_and_log_metrics` from `ScalarLightning` to `ScalarModelBase`.
- (#554) Updated report in CovidModel. Set parameters in the config to run inference on both the validation and test sets by default.
- (#537) Print warning if inference is disabled but comparison requested.
- (#546) Environment and hello_world_model documentation updated
- (#525) Enable --store_dataset_sample
- (#495) Fix model comparison.
- (#547) The parameter pl_find_unused_parameters was no longer used to initialize the DDP Plugin.
- (#482) Check bool parameter is either true or false.
- (#475) Bug in AML SDK meant that we could not train any large models anymore because data loaders ran out of memory.
- (#472) Correct model path for moving ensemble models.
- (#494) Fix an issue where multi-node jobs for LightningContainer models can get stuck at test set inference.
- (#498) Workaround for the problem that downloading multiple large checkpoints can time out.
- (#515) Workaround for occasional issues with dataset mounting and running matplotlib on some machines. Re-instantiated a disabled test.
- (#509) Fix issue where model checkpoints were not loaded in inference-only runs when using lightning containers.
- (#553) Fix incomplete test data module setup in Lightning inference.
- (#557) Fix issue where learning rate was not set correctly in the SimCLR module
- (#558) Fix issue with the CovidModel config where model weights from a finetuning run were incompatible with the model architecture created for non-finetuning runs.
- (#542) Removed Windows test leg from build pipeline.
- (#509) Parameters `local_weights_path` and `weights_url` can no longer be used to initialize a training run, only inference runs.
- (#526) Removed `get_posthoc_label_transform` in class `ScalarModelBase`. Instead, functions `get_loss_function` and `compute_and_log_metrics` in `ScalarModelBase` can be implemented to compute the loss and metrics in a task-specific manner.
- (#554) Removed cryptography from list of invalid packages in `test_invalid_python_packages` as it is already present as a dependency in our conda environment.
- (#483) Allow cross validation with 'bring your own' Lightning models (without ensemble building).
- (#489) Remove portal query for outliers.
- (#488) Better handling of missing seriesId in segmentation cross validation reports.
- (#454) Checking that labels are mutually exclusive.
- (#447) Added a sanity check to ensure there are no missing channels, nor missing files. If missing channels in the csv file or filenames associated with channels are incorrect, pipeline exits with error report before running training or inference.
- (#446) Guarding `save_outlier` so that it works when institution id and series id columns are missing.
- (#441) Add script to move models from one AzureML workspace to another: `python InnerEye/Scripts/move_model.py`
- (#417) Added a generic way of adding PyTorch Lightning models to the toolbox. It is now possible to train almost any Lightning model with the InnerEye toolbox in AzureML, with only minimum code changes required. See the MD documentation for details.
- (#430) Update conversion to 1.0.1 InnerEye-DICOM-RT to add: manufacturer, SoftwareVersions, Interpreter and ROIInterpretedTypes.
- (#385) Add the ability to train a model on multiple nodes in AzureML. Example: Add `--num_nodes=2` to the commandline arguments to train on 2 nodes.
- (#366) and (#407) add new parameters `use_dicom` and `result_zip_dicom_name` to the `score.py` script. If `use_dicom==True` then the input file should be a zip of a DICOM series. This will be unzipped and converted to Nifti format before processing. The result will then be converted to a DICOM-RT file, zipped and stored as `result_zip_dicom_name`.
- (#416) Add a github action that checks if `CHANGELOG.md` has been modified.
- (#412) Dataset files can now have arbitrary names, and are no longer restricted to be called `dataset.csv`, via the config field `dataset_csv`. This allows having a single set of image files in a folder, but multiple datasets derived from it.
- (#391) Support for multilabel classification tasks. Multilabel models can be trained by adding the parameter `class_names` to the config for classification models. `class_names` should contain the name of each label class in the dataset, and the order of names should match the order of class label indices in `dataset.csv`. `dataset.csv` supports multiple labels (indices corresponding to `class_names`) per subject in the label column. Multiple labels should be encoded as a string with labels separated by a `|`, for example "0|2|4". Note that this PR does not add support for multiclass models, where the labels are mutually exclusive.
- (#425) The number of layers in a Unet is no longer fixed at 4, but can be set via the config field `num_downsampling_paths`. A lower number of layers may be useful for decreasing memory requirements, or for working with smaller images. (The minimum image size in any dimension when using a network of n layers is 2**n.)
- (#426) Flake8, mypy, and testing the HelloWorld model is now happening in a Github action, no longer in Azure Pipelines.
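To illustrate the multilabel label format from (#391), here is a minimal sketch of how a `|`-separated label string could be turned into a multi-hot vector. This is not the toolbox's own parser: the helper `parse_multilabel`, the class names, and the treatment of an empty string as "no positive labels" are assumptions made for illustration only.

```python
from typing import List


def parse_multilabel(label: str, num_classes: int) -> List[int]:
    """Convert a string such as "0|2|4" into a multi-hot vector of length num_classes."""
    multi_hot = [0] * num_classes
    if label.strip() == "":
        # Assumption for this sketch: an empty string means no positive labels.
        return multi_hot
    for token in label.split("|"):
        index = int(token)
        if not 0 <= index < num_classes:
            raise ValueError(f"Label index {index} is outside [0, {num_classes})")
        multi_hot[index] = 1
    return multi_hot


# Hypothetical class names; their order must match the label indices in the dataset file.
class_names = ["class0", "class1", "class2", "class3", "class4"]
encoded = parse_multilabel("0|2|4", num_classes=len(class_names))
print(encoded)  # [1, 0, 1, 0, 1]
```

Because the labels are not mutually exclusive, several entries of the vector can be 1 at the same time, which is what distinguishes this multilabel setting from a multiclass one.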
- (#405) Cross-validation runs for classification models now also generate a report notebook summarising the metrics from the individual splits. Also includes minor formatting improvements for standard classification reports.
- (#438) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- (#439) Enable automatic job recovery from last recovery checkpoint in case of job pre-emption on AML. Give the possibility to the user to keep more than one recovery checkpoint.
- (#442) Enable defining custom scalar losses (`ScalarLoss.CustomClassification` and `CustomRegression`), prediction targets (`ScalarModelBase.target_names`), and reporting (`ModelConfigBase.generate_custom_report()`) in scalar configs, providing more flexibility for defining model configs with custom behaviour while leveraging the existing InnerEye workflows.
- (#444) Added setup scripts and documentation to work with the FastMRI challenge datasets.
- (#444) Git-related information is now printed to the console for easier diagnostics.
- (#445) Adding test coverage for the `HelloContainer` model with multiple GPUs
- (#450) Adds the metric "Accuracy at threshold 0.5" to the classification report (`classification_crossval_report.ipynb`).
- (#451) Write a file `model_outputs.csv` with columns `subject`, `prediction_target`, `label`, `model_output` and `cross_validation_split_index`. This file is not written out for sequence models.
- (#440) Added support for training of self-supervised models (BYOL and SimCLR) based on the bring-your-own-model framework. Providing example configurations for training of SSL models on CIFAR10/100 datasets as well as for chest-x-ray datasets such as NIH Chest-Xray or RSNA Pneumonia Detection Challenge datasets. See SSL doc for more details.
- (#455) All models trained on AzureML are registered. The codepath previously allowed only segmentation models (subclasses of `SegmentationModelBase`) to be registered. Models are registered after a training run or if the `only_register_model` flag is set. Models may be legacy InnerEye config-based models or may be defined using the LightningContainer class. Additionally, the `TrainHelloWorldAndHelloContainer` job in the PR build has been split into two jobs, `TrainHelloWorld` and `TrainHelloContainer`. A pytest marker `after_training_hello_container` has been added to run tests after training is finished in the `TrainHelloContainer` job.
- (#456) Adding configs to train Covid detection models.
- (#463) Add arguments `dirs_recursive` and `dirs_non_recursive` to `mypy_runner.py` to let users specify a list of directories to run mypy on.
- (#385) Starting an AzureML run now uses the `ScriptRunConfig` object, rather than the deprecated `Estimator` object.
- (#385) When registering a model, the name of the Python execution environment is added as a tag. This tag is read when running inference, and the execution environment is re-used.
- (#411) Upgraded to PyTorch 1.8.0, PyTorch-Lightning 1.1.8 and AzureML SDK 1.23.0
- (#432) Upgraded to PyTorch-Lightning 1.2.7. Add end-to-end test for classification cross-validation. WARNING: upgrading the PL version causes multi-node training to hang.
- (#437) Upgrade to PyTorch-Lightning 1.2.8.
- (#439) Recovery checkpoints are now named `recovery_epoch=x.ckpt` instead of `recovery.ckpt` or `recovery-v0.ckpt`.
- (#451) Change the signature for function `generate_custom_report` in `ModelConfigBase` to take only the path to the reports folder and a `ModelProcessing` object.
- (#444) The method `before_training_on_rank_zero` of the `LightningContainer` class has been renamed to `before_training_on_global_rank_zero`. The order in which the hooks are called has been changed.
- (#458) Simplifying and generalizing the way we handle data augmentations for classification models. The pipelining logic is now taken care of by an ImageTransformPipeline class that takes as input a list of transforms to chain together. This pipeline takes care of applying transforms on 3D or 2D images. The user can choose to apply the same transformation for all channels (RGB example) or to apply a different transformation for each channel (if each channel represents a different modality / time point for example). The pipeline can now work directly with out-of-the-box torchvision transforms (as long as they support [..., C, H, W] inputs). This allows us to get rid of nearly all of our custom augmentation functions. The conversion from a pipeline of image transformations to ScalarItemAugmentation is now taken care of under the hood; the user does not need to call this wrapper for each config class. In models derived from ScalarModelConfig, to change which augmentations are applied to the image inputs (resp. segmentation inputs), users can override `get_image_transform` (resp. `get_segmentation_transform`). These two functions replace the old `get_image_sample_transforms` method. See `docs/building_models.md` for more information on augmentations.
- (#422) Documentation - clarified `setting_up_aml.md` datastore creation instructions and fixed small typos in `hello_world_model.md`
- (#432) Fixed cross-validation for classification models. Fixed multi-gpu metrics aggregation. Add end-to-end test for classification cross-validation. Add fix to bug in ddp setting when running multi-node with 1 gpu per node.
- (#435) If parameter `model` in `AzureConfig` is not set, display an error message and terminate the run.
- (#437) Fixed multi-node DDP bug in PL v1.2.8. Re-add end-to-end test for multi-node.
- (#445) Fixed a bug when running inference for container models on machines with >1 GPU
- (#439) Deprecated `start_epoch` config argument.
- (#450) Delete unused `classification_report.ipynb`.
- (#455) Removed the AzureRunner conda environment. The full InnerEye conda environment is needed to submit a training job to AzureML.
- (#458) Getting rid of all the unused code for RandAugment & Co. The user has now instead complete freedom to specify the set of augmentations to use.
- (#468) Removed the `KneeSinglecoil` example model
- (#323) There are new model configuration fields (and hence, commandline options), in particular for controlling PyTorch Lightning (PL) training:
  - `max_num_gpus` controls how many GPUs are used at most for training (default: all GPUs, value -1).
  - `pl_num_sanity_val_steps` controls the PL trainer flag `num_sanity_val_steps`
  - `pl_deterministic` controls the PL trainer flags `benchmark` and `deterministic`
  - `generate_report` controls if a HTML report will be written (default: True)
  - `recovery_checkpoint_save_interval` determines how often a checkpoint for training recovery is saved.
- (#336) New extensions of `SegmentationModelBase`: `HeadAndNeckBase` and `ProstateBase`. Use these classes to build your own Head&Neck or Prostate models, by just providing a list of foreground classes.
- (#363) Grouped dataset splits and k-fold cross-validation. This allows, for example, training on datasets with multiple images per subject without leaking data from the same subject across train/test/validation sets or cross-validation folds. To use this functionality, simply provide the name of the CSV grouping column (`group_column`) when creating the `DatasetSplits` object in your model config's `get_model_train_test_dataset_splits()` method. See the `InnerEye.ML.utils.split_dataset.DatasetSplits` class for details.
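The idea behind grouped splits in (#363) can be sketched in plain Python as follows. This is an illustration of the general technique only, not the `DatasetSplits` implementation: the function `grouped_split`, the column names, and the train/test/val ordering of the proportions are all hypothetical. The split is made over group identifiers rather than over rows, so every row of a group ends up in the same set.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def grouped_split(rows: Sequence[Row],
                  group_column: str,
                  proportions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
                  seed: int = 0) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into (train, test, val) so that no group is shared between sets."""
    # Collect all rows that belong to the same group.
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row)

    # Shuffle the group identifiers (not the rows) reproducibly.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # The proportions apply to the number of groups, not the number of rows.
    n_train = round(proportions[0] * len(group_ids))
    n_test = round(proportions[1] * len(group_ids))
    train_ids = group_ids[:n_train]
    test_ids = group_ids[n_train:n_train + n_test]
    val_ids = group_ids[n_train + n_test:]

    def collect(ids: List[str]) -> List[Row]:
        return [row for group_id in ids for row in groups[group_id]]

    return collect(train_ids), collect(test_ids), collect(val_ids)


# Hypothetical dataset: 5 patients with 2 images each.
rows = [{"subject": str(i), "patient": f"p{i // 2}"} for i in range(10)]
train, test, val = grouped_split(rows, group_column="patient")
train_groups = {row["patient"] for row in train}
test_groups = {row["patient"] for row in test}
val_groups = {row["patient"] for row in val}
print(train_groups.isdisjoint(test_groups), train_groups.isdisjoint(val_groups))
```

Splitting at the row level instead would allow two images of the same patient to land in different sets, which is the leakage the changelog entry refers to.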
- (#323) The codebase has undergone a massive refactoring, to use PyTorch Lightning as the foundation for all training. As a consequence of that:
  - Training is now using Distributed Data Parallel with synchronized `batchnorm`. The number of GPUs to use can be controlled by a new commandline argument `max_num_gpus`.
  - Several classes, like `ModelTrainingSteps*`, have been removed completely.
  - The final model is now always the one that is written at the end of all training epochs.
  - The old code that provided options to run full image inference at multiple epochs (i.e., multiple checkpoints) has been removed, alongside the respective commandline options `save_start_epoch`, `save_step_epochs`, `epochs_to_test`, `test_diff_epochs`, `test_step_epochs`, `test_start_epoch`
  - The commandline option `register_model_only_for_epoch` is now called `only_register_model`, and is boolean.
  - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would previously be called Train_Dice/bladder, now it is train/Dice/bladder.
  - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version of the code.
- The arguments of the `score.py` script changed: `data_root` -> `data_folder`, it no longer assumes a fixed `data` subfolder. `project_root` -> `model_root`, `test_image_channels` -> `image_files`.
- By default, the visualization of patch sampling for segmentation models will run on only 1 image (down from 5). This is because patch sampling is expensive to compute, taking 1min per large CT scan.
- (#336) Renamed `HeadAndNeckBase` to `HeadAndNeckPaper`, and `ProstateBase` to `ProstatePaper`.
- (#427) Move dicom loading function from SimpleITK to pydicom. Loading time improved by 30x.
- When registering a model, it now has a consistent folder structure, described here. This folder structure is present irrespective of using InnerEye as a submodule or not. In particular, exactly 1 Conda environment will be contained in the model.
- The commandline options to control which checkpoint is saved, and which is used for inference, have been removed: `save_start_epoch`, `save_step_epochs`, `epochs_to_test`, `test_diff_epochs`, `test_step_epochs`, `test_start_epoch`
- Removed blobxfer completely. When downloading a dataset from Azure, we now use AzureML dataset downloading tools. Please remove the following fields from your settings.yml file: 'datasets_storage_account' and 'datasets_container'.
- Removed `ProstatePaperBase`.
- Removed ability to perform sub-fold cross validation. The parameters `number_of_cross_validation_splits_per_fold` and `cross_validation_sub_fold_split_index` have been removed from ScalarModelBase.
- This is the baseline release.