Documentation restructurization and fixes #93

Merged
merged 14 commits into from
Nov 15, 2018
2 changes: 1 addition & 1 deletion doc_build.sh
@@ -5,7 +5,7 @@ rm -rf build

# create html pages
sphinx-build -b html source build
make html
#make html

# open web browser(s) to master table of content
if which firefox
5 files renamed without changes.
6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -150,7 +150,7 @@ def __getattr__(cls, name):
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
#html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names to template names.
# The default sidebars (for documents that don't match any pattern) are
@@ -216,8 +216,8 @@ def __getattr__(cls, name):
'torchvision': ('https://pytorch.org/docs/stable/', None),
'python': ('https://docs.python.org/3', None),
'yaml': ('https://yaml.readthedocs.io/en/latest/', None),
'numpy': ('https://numpy.readthedocs.io/en/latest/', None)
}
'numpy': ('https://numpy.readthedocs.io/en/latest/', None),
'matplotlib': ('https://matplotlib.org/', None)}

# -- Options for Texinfo output ----------------------------------------------

15 changes: 9 additions & 6 deletions docs/source/index.rst
@@ -17,6 +17,13 @@ MI Prometheus is an open source Python library, built using PyTorch, that enable

notes/*

.. toctree::
:glob:
:maxdepth: 1
:caption: MI-Prometheus Primer

mip_primer/*

.. toctree::
:glob:
:maxdepth: 1
@@ -26,15 +33,11 @@ MI Prometheus is an open source Python library, built using PyTorch, that enable


.. toctree::
:glob:
:maxdepth: 1
:caption: Package Reference

workers
grid_workers
helpers
models
problems
utils
api_reference/*


Indices and tables
@@ -1,5 +1,5 @@
MI-Prometheus Explained
================================
=======================
`@author: Tomasz Kornuta & Vincent Marois`

This page dives deep into MI-Prometheus and its inner workings.
@@ -16,8 +16,8 @@ When training a model, people write programs which typically follow a similar pa
- Updating the model parameters using an optimizer.


During each iteration, the program also needs to collect some statistics (such as the
training / validation loss & accuracy) and save the weights of the resulting model into a file.
During each iteration, the program can also collect some statistics (such as the
training / validation loss & accuracy) and (optionally) save the weights of the resulting model into a file.


.. figure:: ../img/core_concepts.png
:scale: 50 %
:alt: The 5 core concepts of Mi-Prometheus
:align: center

The 5 core concepts of Mi-Prometheus. Dotted elements indicate optional inputs/outputs/dataflows.


This typical workflow led us to the formalization of the core concepts of the framework:
@@ -29,30 +37,35 @@ This typical workflow led us to the formalization of the core concepts of the fr
- **Experiment**: a single run (training & validation or test) of a given Model on a given Problem, using a specific Worker and Configuration file(s).


.. figure:: ../img/core_concepts.png
:scale: 50 %
:alt: The 5 core concepts of Mi-Prometheus
:align: center
Aside from the Workers, MI-Prometheus currently offers two types of specialized applications, namely:

The 5 core concepts of Mi-Prometheus. Dotted elements indicate optional inputs/outputs/dataflows.
- **Grid Worker**: a specialized application automating the spawning of a number (a grid) of experiments.
- **Helper**: an application that is useful for running experiments, but is independent of (external to) the Workers.

The general idea is that Grid Workers are useful for reproducible research, e.g. when one has to train a set of independent models on a set of problems and
compare the results.
In such a case the user can use Helpers e.g. to download the required datasets (in advance, before training) and/or preprocess them in a specific way
(e.g. extract features from all images in a dataset once, with a pretrained CNN model), which reduces the overall time of all experiments.

Architecture
---------------

From an architectural point of view, MI-Prometheus can be seen as four stacked layers of interconnected modules.

- The lowest layer is formed by the external libraries that MI-Prometheus relies on, primarily PyTorch, NumPy and CUDA. Additionally, our basic workers rely on TensorBoardX, enabling the export of collected statistics, models and their parameters (weights, gradients) to TensorBoard. Optionally, some models and problems might depend on other external libraries. For instance, the framework currently incorporates problems and models from PyTorch’s wrapper to the TorchVision package.
- The second layer includes all the utilities that we have developed internally, such as the Parameter Registry (a singleton offering access to the registry of parameters), the Application State (another singleton representing the current state of the application, e.g. whether the computations should be done on GPUs or not), and factories used by the workers for instantiating the problem and model classes (indicated by the configuration file and loaded from the corresponding file). Additionally, this layer contains several tools which are useful during an experiment run, such as logging facilities or statistics collectors (accessible by both the Problem and the Model); a minimal sketch of the parameter registry idea is given after this list.
- Next, the Components layer contains the models, problems and workers, i.e. the three major components required for the execution of one experiment. The problem and model classes are organized following specific hierarchies, using inheritance to facilitate their further extensions.
- Finally, the Experiment layer includes the configuration files, along with all the required inputs (such as the files containing the dataset, the files containing the saved model checkpoints with the weights to be loaded etc.) and outputs (logs from the experiment run, CSV files gathering the collected statistics, files containing the checkpoints of the best obtained model).
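
To give a feel for the second layer, below is a minimal sketch of a singleton-style parameter registry; the class name ``ParamRegistry`` and its methods are hypothetical illustrations of the idea, not the actual MI-Prometheus API.

.. code-block:: python

    class ParamRegistry:
        """Minimal sketch of a singleton registry of parameters (hypothetical API)."""
        _instance = None

        def __new__(cls):
            # Always hand back the same instance, so every worker, model and
            # problem sees the same registry of parameters.
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._params = {}
            return cls._instance

        def add_params(self, params):
            # Merge a dictionary of parameters into the registry.
            self._params.update(params)

        def __getitem__(self, key):
            return self._params[key]

    # Usage: both "handles" point to the same underlying registry.
    registry_a = ParamRegistry()
    registry_b = ParamRegistry()
    registry_a.add_params({'training': {'batch_size': 64}})
    assert registry_b['training']['batch_size'] == 64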


.. figure:: ../img/layers.png
:scale: 50 %
:alt: Mi-Prometheus is constituted of 4 main inter-connected layers.
:align: center

From an architectural point of view, MI-Prometheus can be seen as four stacked layers of interconnected modules.
Architecture of the MI-Prometheus framework.


The layers are as follows:

- The lowest layer is formed by the external libraries that MI-Prometheus relies on, primarily PyTorch, NumPy and CUDA. Additionally, our basic workers rely on TensorBoardX, enabling the export of collected statistics, models and their parameters (weights, gradients) to TensorBoard. Optionally, some models and problems might depend on other external libraries. For instance, the framework currently incorporates problems and models from PyTorch’s wrapper to the TorchVision package.
- The second layer includes all the utilities that we have developed internally, such as the Parameter Registry (a singleton offering access to the registry of parameters), the Application State (another singleton representing the current state of application, e.g. whether the computations should be done on GPUs or not), factories used by the workers for instantiating the problem and model classes (indicated by the configuration file and loaded from the corresponding file). Additionally, this layer contains several tools, which are useful during an experiment run, such as logging facilities or statistics collectors (accessible by both the Problem and the Model).
- Next, the Components layer contains the models, problems and workers, i.e. the three major components required for the execution of one experiment. The problem and model classes are organized following specific hierarchies, using inheritance to facilitate their further extensions.
- Finally, the Experiment layer includes the configuration files, along with all the required inputs (such as the files containing the dataset, the files containing the saved model checkpoints with the weights to be loaded etc.) and outputs (logs from the experiment run, CSV files gathering the collected statistics, files containing the checkpoints of the best obtained model).


.. See http://docutils.sourceforge.net/docs/ref/rst/directives.html for a breakdown of the options
88 changes: 88 additions & 0 deletions docs/source/mip_primer/4_workers_explained.rst
@@ -0,0 +1,88 @@
Workers Explained
===================
`@author: Tomasz Kornuta & Vincent Marois`

The Workers are scripts which execute a certain task given a Model and a Problem.
They implement either the training procedure (Trainers) or the testing procedure (Tester) and support both CPUs & GPUs.

.. figure:: ../img/worker_basic_class_diagram.png
:scale: 50 %
:alt: Class diagram of the workers.
:align: center

The class inheritance of the workers. The Trainers & the Tester classes inherit from a Worker class, to follow OOP best practices.

Trainers
^^^^^^^^^^

There are two types of Trainers: **Online Trainer** and **Offline Trainer**.

The **Offline Trainer** is based on epochs and validates the model on the validation set at the end of each epoch. Thus, it is well-suited for finite-size datasets, such as MNIST.

While an epoch seems natural for all finite-size datasets, it makes less sense for problems which have a very large, almost infinite dataset (like algorithmic tasks, which generate data `on-the-fly`).
This is why we also developed the **Online Trainer**, which, instead of looping on epochs, iterates directly on episodes (we call an iteration on a single batch an episode).

By default, the **Online Trainer** validates the model every `n` episodes on a subset of the validation set, whereas the **Offline Trainer** validates the model on the whole validation set at the end of every epoch.
The Offline Trainer can also validate the model every `n` episodes on a subset of the validation set (we refer to this as partial validation), and both trainers validate the model on the whole validation set at the end of training.
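
As a rough sketch of these two validation schedules (``train_step``, ``validate_batch`` and ``validate_full`` are hypothetical callables standing in for the real worker methods):

.. code-block:: python

    def offline_training(loader, n_epochs, train_step, validate_full):
        # Offline Trainer: loop over epochs, full validation after every epoch.
        for epoch in range(n_epochs):
            for batch in loader:
                train_step(batch)
            validate_full()

    def online_training(loader, episode_limit, interval, train_step,
                        validate_batch, validate_full):
        # Online Trainer: loop directly over episodes, partial validation
        # (on a single batch of the validation set) every ``interval`` episodes.
        for episode, batch in enumerate(loader):
            train_step(batch)
            if episode % interval == 0:
                validate_batch()
            if episode + 1 >= episode_limit:
                break
        validate_full()  # both trainers validate on the whole set at the end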

Tester
^^^^^^^^^^

The third Worker is **Tester**, which loads a trained model and iterates over the test set once, collecting all the specified statistics (mean loss, accuracy etc.).

Both the Trainers and the **Tester** share a similar logic of operation. They both also support CPU and GPU working modes.
The user can activate this by passing the `--gpu` argument when running a given worker from the command line, which will result in moving the tensors to GPU (e.g. `torch.FloatTensor` to `torch.cuda.FloatTensor`), thus allowing the Model to use CUDA and perform its computations on GPU.
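
In plain PyTorch terms, this switch amounts to something like the following sketch (the argument parsing and the toy model are illustrative, not the exact worker code):

.. code-block:: python

    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', action='store_true', help='Move computations to CUDA.')
    args = parser.parse_args()

    # Pick the device once; everything else is moved onto it.
    device = torch.device('cuda' if args.gpu and torch.cuda.is_available() else 'cpu')

    model = torch.nn.Linear(10, 2).to(device)   # toy stand-in for the Model
    batch = torch.randn(4, 10).to(device)       # e.g. FloatTensor -> cuda.FloatTensor
    predictions = model(batch)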


We can distinguish two main phases of functioning for the workers: the initialization and the iteration over the batches of samples produced by the Problem (each such iteration on a single batch is called an Episode).

Initialization:
^^^^^^^^^^^^^^^

.. figure:: ../img/initialization_sequence_diagram.png
:scale: 50 %
:alt: The most important interactions between Worker, Model & Problem during the initialization phase.
:align: center

The most important interactions between Worker, Model & Problem during the initialization phase.


After loading the configuration file(s) in the Parameter Registry, the worker initializes the logger, creates an output experiment folder and output CSV files, exports the current experiment settings (content of the Parameter Registry) to a file and (optionally) initializes a TensorBoard logger.

Next, it instantiates the problem and model classes using specialized factories. At that point, the Tester also loads the model weights from the checkpoints file indicated by one of the command line arguments (which is optional for the Trainers).

In order to ensure that the Problem and the Model are compatible, both basic workers perform an automated handshaking, to check whether the definitions (i.e. name, type and shape when relevant) of the inputs produced by the Problem match the required definitions of the Model inputs.
They also verify if the definitions of the model’s predictions match the definitions of the Problem targets and are compatible with the used loss function.
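
A toy sketch of what such a handshake boils down to, assuming the data definitions are stored as dictionaries mapping names to expected sizes and types (the layout is illustrative, not the exact MI-Prometheus format):

.. code-block:: python

    # Hypothetical data definitions; the real format may differ.
    problem_output_definitions = {
        'images':  {'size': [-1, 3, 224, 224], 'type': 'torch.Tensor'},
        'targets': {'size': [-1], 'type': 'torch.Tensor'},
    }
    model_input_definitions = {
        'images': {'size': [-1, 3, 224, 224], 'type': 'torch.Tensor'},
    }

    def handshake(produced, required):
        # Every definition required by the Model must be produced by the Problem.
        for name, definition in required.items():
            if name not in produced:
                raise KeyError("Problem does not produce required input '%s'." % name)
            if produced[name] != definition:
                raise ValueError("Definition mismatch for '%s': %s vs %s."
                                 % (name, produced[name], definition))

    handshake(problem_output_definitions, model_input_definitions)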


Iterations over the batches of samples:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: ../img/episode_sequence_diagram.png
:scale: 50 %
:alt: The interactions between the Worker, Problem and Model during a single episode, which are shared between the Trainer and the Tester.
:align: center

The interactions between the Worker, Problem and Model during a single episode, which are shared between the Trainer and the Tester.


In every episode, the Worker retrieves a batch of samples from the Problem, inputs it to the Model, collects the Model’s predictions and passes them back to the Problem in order to compute the loss (and other statistics, such as accuracy).

At the end of the episode, all events and collected statistics are logged to the experiment folder & files.
The Trainers perform several additional computations afterwards. First of all, they perform the model optimization, i.e. updating the model weights using error backpropagation and an optimizer (indicated in the configuration file). They also validate the Model, as explained above.

If visualization is active, the Trainers also display the current behavior of the Model, through a visualization window specific to the Model.
Finally, they also export the Model along with the collected statistics to a checkpoint file.
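
Stripped of logging, visualization and checkpointing, the core of a single training episode looks roughly as follows (the toy stand-ins for the Problem and the Model are illustrative, not the actual classes):

.. code-block:: python

    import torch

    model = torch.nn.Linear(10, 2)                     # toy stand-in for the Model
    loss_function = torch.nn.CrossEntropyLoss()        # loss indicated by the Problem
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def get_batch():
        # The Problem would produce a batch of samples here.
        return torch.randn(8, 10), torch.randint(0, 2, (8,))

    for episode in range(100):
        inputs, targets = get_batch()                  # 1. batch from the Problem
        predictions = model(inputs)                    # 2. pass it to the Model
        loss = loss_function(predictions, targets)     # 3. loss (and other statistics)

        optimizer.zero_grad()                          # 4. Trainer-only: optimization
        loss.backward()
        optimizer.step()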

Terminal conditions:
^^^^^^^^^^^^^^^^^^^^

Training ends when one of the following conditions is met:

- The epoch limit is reached (used by default by the **Offline Trainer**),
- The episode limit is reached (used by default by the **Online Trainer**),
- The validation loss goes below a certain threshold. Depending on the Trainer, we consider:
+ average loss over the entire validation set calculated at the end of every epoch for the **Offline Trainer**,
+ partial validation loss (loss on a single batch) calculated every *partial iteration interval* for the **Online Trainer**.

It is worth mentioning that both trainers can use both limits -- the user simply has to set the adequate parameters in a configuration file.
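
A condensed sketch of such a termination check (the parameter names and default threshold are illustrative):

.. code-block:: python

    def should_stop(episode, epoch, validation_loss,
                    episode_limit=None, epoch_limit=None, loss_stop=1e-2):
        # Stop when any configured limit is reached or the validation loss
        # drops below the threshold.
        if episode_limit is not None and episode >= episode_limit:
            return True
        if epoch_limit is not None and epoch >= epoch_limit:
            return True
        return validation_loss < loss_stop
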
30 changes: 30 additions & 0 deletions docs/source/mip_primer/5_grid_workers.rst
@@ -0,0 +1,30 @@

Grid Workers Explained
======================
`@author: Tomasz Kornuta & Vincent Marois`

There are five Grid Workers, i.e. scripts which manage sets of experiments on grids of CPUs/GPUs.
These are:

- two Grid Trainers (separate versions for collections of CPUs and GPUs) spawning several trainings in parallel,
- two Grid Testers (similarly),
- a single Grid Analyzer, which collects the results of several trainings & tests in a given experiment directory into a single CSV file.


.. figure:: ../img/worker_grid_class_diagram.png
:scale: 50 %
:alt: Class diagram of the grid workers.
:align: center

The class inheritance of the grid workers. The Trainers & the Tester classes inherit from a base Worker class, to follow OOP best practices.


The Grid Trainers and Testers in fact spawn several instances of base Trainers and Testers respectively.
The CPU & GPU versions execute different operations, i.e. the CPU grid workers assign one processor to each child process, whereas the GPU ones assign a single GPU instead.

Fig. 7 presents the most important sections of the grid trainer configuration files. The grid tasks section defines the grid of experiments that need to be executed, reusing the mechanism of default configuration nesting.
Additionally, in the grid settings section, the user needs to define the number of repetitions of each experiment, as well as the maximum number of authorized concurrent runs (which will later be compared to the number of available CPUs/GPUs).
Optionally, the user might overwrite some parameters of a given experiment (in the `overwrite` section) or of all experiments at once (`grid_overwrite`).
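
To make the overwriting mechanism concrete, the following Python sketch shows how a per-experiment configuration could be assembled from these sections (the keys, file names and merging helper are illustrative, not the exact grid configuration format):

.. code-block:: python

    # Illustrative nesting/overwrite logic; all keys and paths are hypothetical.
    grid_overwrite = {'training': {'terminal_conditions': {'episode_limit': 10000}}}

    grid_tasks = [
        {'default_configs': 'mnist/model_a.yaml',
         'overwrite': {'training': {'optimizer': {'lr': 0.01}}}},
        {'default_configs': 'mnist/model_b.yaml'},
    ]

    def deep_update(base, overrides):
        # Recursively merge ``overrides`` into ``base`` (later values win).
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                deep_update(base[key], value)
            else:
                base[key] = value
        return base

    for task in grid_tasks:
        config = {'training': {'optimizer': {'lr': 0.001}}}  # stands in for the loaded default configs
        deep_update(config, grid_overwrite)                  # applied to all experiments
        deep_update(config, task.get('overwrite', {}))       # applied to this experiment only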

As a result of running these Grid Trainers and Testers, the user ends up with an experiment directory containing several models and statistics collected during several training, validation and test repetitions.
The role of the last script, Grid Analyzer, is to iterate through those directories, collecting all statistics and merging them into a single file, which facilitates further analysis of the results, comparison of the models' performance, etc.
11 changes: 11 additions & 0 deletions docs/source/mip_primer/6_helpers.rst
@@ -0,0 +1,11 @@
Helpers Explained
===================
`@author: Tomasz Kornuta & Vincent Marois`

A Helper is an application that is useful for running experiments, but is independent of (external to) the Workers.
Currently MI-Prometheus offers two types of helpers:

- **Problem Initializer**, responsible for initializing a problem (i.e. downloading the required data from the Internet or generating all samples) in advance, before the real experiment starts.
- **Index Splitter**, responsible for generating files with indices that split a given dataset (in fact, a set of indices) into two. The resulting files can later be used in training/verification testing when using ``SubsetRandomSampler`` (see the sketch below).

We expect this list to grow soon.
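
For instance, once such index files have been produced, the resulting index sets could be consumed roughly as follows (the toy dataset and index values are made up for illustration):

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

    dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

    # Indices that an Index Splitter-style helper could have written to two files.
    training_indices = list(range(0, 80))
    validation_indices = list(range(80, 100))

    training_loader = DataLoader(dataset, batch_size=16,
                                 sampler=SubsetRandomSampler(training_indices))
    validation_loader = DataLoader(dataset, batch_size=16,
                                   sampler=SubsetRandomSampler(validation_indices))
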
2 changes: 1 addition & 1 deletion docs/source/notes/1_installation.rst
@@ -22,7 +22,7 @@ If you plan to develop and introduce changes, please call the following command

python setup.py develop

This will enable you to change the code of the existing problems/models/workers and still be able to run them by calling the associated 'mip-*' commands.
This will enable you to change the code of the existing problems/models/workers and still be able to run them by calling the associated ``mip-*`` commands.
More on that subject can be found in the following blog post on dev_mode_.

.. _guide: https://github.com/pytorch/pytorch#installation