Update config v2 doc (#3711)
Co-authored-by: QuanluZhang <z.quanluzhang@gmail.com>
kvartet and QuanluZhang authored Jun 8, 2021
1 parent eb65bc3 commit d1b1e7b
Showing 7 changed files with 354 additions and 112 deletions.
10 changes: 10 additions & 0 deletions docs/en_US/TrainingService/Overview.rst
@@ -68,3 +68,13 @@ Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse m
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have a different path (even on a different machine). Local mode is the only training service that shares one ``codeDir`` across all trials; other training services copy ``codeDir`` from the shared copy prepared in step 1, and each trial gets an independent working directory. We strongly advise users not to rely on the sharing behavior of local mode, as it will make your experiments difficult to scale to other training services.

Step 3. **Collect metrics.** NNI then monitors the trial, updates its recorded status (e.g., from ``WAITING`` to ``RUNNING``\ , or from ``RUNNING`` to ``SUCCEEDED``\ ), and collects its metrics. Currently, most training services are implemented in an "active" way, i.e., the training service calls the RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine that runs the NNI manager to be reachable from the worker nodes.


Training Service Under Reuse Mode
---------------------------------

When reuse mode is enabled, a cluster, such as a remote machine or a compute instance on AML, launches a long-running environment, and NNI submits trials to this environment iteratively, which saves the time of creating new jobs. For instance, using the OpenPAI training platform under reuse mode avoids the overhead of repeatedly pulling Docker images, creating containers, and downloading data.

In reuse mode, users need to make sure that each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).

.. note:: Currently, only the `Local <./LocalMode.rst>`__, `Remote <./RemoteMachineMode.rst>`__, `OpenPAI <./PaiMode.rst>`__ and `AML <./AMLMode.rst>`__ training services support reuse mode. For the Remote and OpenPAI training platforms, you can enable reuse mode manually as described in the `experiment configuration reference <../reference/experiment_config.rst>`__. The AML training service is implemented under reuse mode, so reuse is its default behavior and there is no need to enable it manually.
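As a sketch, enabling reuse mode for a remote training service in a V2 experiment configuration might look like the snippet below. The ``reuseMode`` field and the machine settings are taken from the V2 reference; the host, user, and key path are placeholders, so check them against the reference before copying.

.. code-block:: yaml

   trainingService:
     platform: remote
     reuseMode: true              # keep the remote environment alive across trials
     machineList:
       - host: 192.0.2.10         # placeholder address
         user: alice              # placeholder user name
         sshKeyFile: ~/.ssh/id_rsa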
6 changes: 4 additions & 2 deletions docs/en_US/Tutorial/ExperimentConfig.rst
@@ -1,5 +1,7 @@
Experiment Config Reference
===========================
Experiment Config Reference (legacy)
====================================

This is the previous version (V1) of the experiment configuration specification. It is still supported for now, but we recommend using `the new version of the experiment configuration (V2) <../reference/experiment_config.rst>`_.

A config file is needed when creating an experiment. The path of the config file is provided to ``nnictl``.
The config file is in YAML format.
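As a sketch, a minimal V1 config file for a local experiment might look like this; the values are illustrative, and the full set of fields is documented below.

.. code-block:: yaml

   authorName: default
   experimentName: example_mnist
   trialConcurrency: 1
   maxExecDuration: 1h
   maxTrialNum: 10
   trainingServicePlatform: local
   searchSpacePath: search_space.json
   useAnnotation: false
   tuner:
     builtinTunerName: TPE
   trial:
     command: python3 mnist.py
     codeDir: .
     gpuNum: 0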
2 changes: 1 addition & 1 deletion docs/en_US/Tutorial/HowToUseSharedStorage.rst
@@ -7,7 +7,7 @@ All the information generated by the experiment will be stored under ``/nni`` fo
All the output produced by the trial will be located under the ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
This saves you from searching for experiment-related information in various places.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
We will develop more practical features based on shared storage in the future.
We will develop more practical features based on shared storage in the future. The config reference can be found `here <../reference/experiment_config.html#sharedstorageconfig>`_.

.. note::
Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian, for remote training services.
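As a sketch, a V2 experiment configuration might declare an NFS shared storage as below. The field names follow the ``sharedStorage`` section of the config reference, and the server address and mount paths are placeholders; verify both against the reference.

.. code-block:: yaml

   sharedStorage:
     storageType: NFS
     nfsServer: 192.0.2.20           # placeholder NFS server address
     exportedDirectory: /export/nni  # directory exported by the NFS server
     localMountPoint: /mnt/nfs/nni   # mount point on the local machine
     remoteMountPoint: /mnt/nfs/nni  # mount point on the remote machine
     localMounted: nnimount          # let NNI mount it automatically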
3 changes: 3 additions & 0 deletions docs/en_US/Tutorial/QuickStart.rst
@@ -149,6 +149,9 @@ Three steps to start an experiment
codeDir: .
gpuNum: 0
.. _nniignore:

.. Note:: If you are planning to use remote machines or clusters as your :doc:`training service <../TrainingService/Overview>`, we limit the number of files to 2000 and the total size to 300MB to avoid putting too much pressure on the network. If your ``codeDir`` contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
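For instance, a ``.nniignore`` that keeps datasets and training artifacts out of the upload might look like this; the patterns are illustrative, not prescriptive.

.. code-block:: text

   # exclude datasets and artifacts from the codeDir upload
   data/
   logs/
   checkpoints/
   *.ckpt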

*Example:* :githublink:`config.yml <examples/trials/mnist-pytorch/config.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
6 changes: 3 additions & 3 deletions docs/en_US/reference.rst
@@ -5,12 +5,12 @@ References
:maxdepth: 2

nnictl Commands <Tutorial/Nnictl>
Experiment Configuration <Tutorial/ExperimentConfig>
Experiment Configuration V2 <reference/experiment_config>
Experiment Configuration <reference/experiment_config>
Experiment Configuration (legacy) <Tutorial/ExperimentConfig>
Search Space <Tutorial/SearchSpaceSpec>
NNI Annotation <Tutorial/AnnotationSpec>
SDK API References <sdk_reference>
Supported Framework Library <SupportedFramework_Library>
Launch from python <Tutorial/HowToLaunchFromPython>
Launch from Python <Tutorial/HowToLaunchFromPython>
Shared Storage <Tutorial/HowToUseSharedStorage>
Tensorboard <Tutorial/Tensorboard>