This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Add docs for nniignore and improve docs of training service #2561

Merged

ultmaster merged 14 commits into microsoft:master from ultmaster:doc-improvement-2006

Jun 29, 2020

Contributor

ultmaster commented Jun 16, 2020

Add docs for nniignore.
Fix the weird margin in toc.
Improve the format of quickstart.
versionCheck default should be true.
Update docs of training service.


          Improve docs

a4e6e11

squirrelsc reviewed

docs/en_US/Tutorial/QuickStart.md Outdated

ultmaster added 3 commits

June 16, 2020 15:48


          All using # to start

e70df2d


          uploaded -> excluded

50c3f37


          Set versionCheck default to true

e12ef5a

chicm-ms requested a review from QuanluZhang

June 17, 2020 02:51

QuanluZhang reviewed

docs/en_US/Tutorial/QuickStart.md

		@@ -85,7 +105,7 @@ If you want to use NNI to automatically train your model and find the optimal hy
		+ }
		```

Contributor

QuanluZhang Jun 17, 2020

do you think this part is a little misleading?

QuanluZhang reviewed

docs/en_US/Tutorial/QuickStart.md Outdated

@@ -133,31 +153,37 @@ trial:
                 gpuNum: 0
               ```
-              Note, **for Windows, you need to change the trial command from `python3` to `python`**.
+              ```eval_rst
+              .. Note:: If you are planning to use remote machines or clusters as your :doc:`training service <../TrainingService/SupportTrainingService>`, to avoid too much pressure on network, we limit the number of files to 2000 and total size to 300MB. If your codeDir contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`_.

Contributor

QuanluZhang Jun 17, 2020

suggest to give a single example for this .nniignore file, but not in this file.
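
To make the limits in the quoted note concrete, here is a minimal Python sketch that counts the files and bytes that would remain in a code directory after applying `.nniignore` patterns. It is an illustration only, not NNI's implementation: the function names are hypothetical, the matching is a simplified subset of gitignore semantics (no negation, no anchoring), and 300MB is assumed to mean 300 * 1024 * 1024 bytes.

```python
import fnmatch
import os

MAX_FILES = 2000               # file-count limit quoted in the note
MAX_BYTES = 300 * 1024 * 1024  # 300MB limit from the note (byte interpretation is an assumption)


def load_patterns(code_dir):
    """Read patterns from .nniignore, skipping blank lines and '#' comments."""
    path = os.path.join(code_dir, ".nniignore")
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(rel_path, patterns):
    """Simplified matching: a pattern matches the whole relative path
    or any single path component (real gitignore rules are richer)."""
    parts = rel_path.split("/")
    for pattern in patterns:
        pattern = pattern.rstrip("/")
        if fnmatch.fnmatch(rel_path, pattern):
            return True
        if any(fnmatch.fnmatch(part, pattern) for part in parts):
            return True
    return False


def summarize(code_dir):
    """Return (file_count, total_bytes) for files that are not ignored."""
    patterns = load_patterns(code_dir)
    count, size = 0, 0
    for root, _dirs, files in os.walk(code_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, code_dir).replace(os.sep, "/")
            if is_ignored(rel, patterns):
                continue
            count += 1
            size += os.path.getsize(full)
    return count, size


def within_limits(code_dir):
    """Check the remaining files against both limits."""
    count, size = summarize(code_dir)
    return count <= MAX_FILES and size <= MAX_BYTES
```

With a `.nniignore` containing the two lines `data/` and `*.log`, everything under `data/` and every log file would be left out of the count in this sketch.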

ultmaster added 4 commits

June 19, 2020 17:53


          Add .nniignore example

67be48f


          Rename training service docs

01898a8


          Merge branch 'master' of https://github.com/microsoft/nni into doc-im…

783f7e9

…provement-2006


          Update training service docs

4fa6cad

ultmaster changed the title ~~Add docs for nniignore and several minor improvements~~ Add docs for nniignore and improve docs of training service

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              <img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
+              </p>
+              According to the architecture shown in [Overview](../Overview), training service (platform) is actually responsible for two events: 1) initiates a new trial; 2) collecting metrics and communicating with NNI core (NNI manager). Note that a lot is going on here. To demonstrated how training service works, we show the workflow of training service from the very beginning to the timing when first trial succeeds.

Contributor

QuanluZhang Jun 29, 2020

"Note that a lot is going on here" what is the meaning of this sentence?

Contributor

QuanluZhang Jun 29, 2020

to the timing -> to the moment

ultmaster added 2 commits

June 29, 2020 10:48


          Merge branch 'master' of https://github.com/microsoft/nni into doc-im…

48ca1fa

…provement-2006


          Fix step numbering

df6e66f

squirrelsc approved these changes

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated


		According to the architecture shown in [Overview](../Overview), training service (platform) is actually responsible for two events: 1) initiates a new trial; 2) collecting metrics and communicating with NNI core (NNI manager). Note that a lot is going on here. To demonstrated how training service works, we show the workflow of training service from the very beginning to the timing when first trial succeeds.

		Step 1. Validate config and prepare the training platform. Training service will first check whether the training platform user specifies is valid (e.g., is there anything wrong with authentication). After that, training service will starts to prepare for the experiment by making the code directory (`codeDir`) accessible to training platform.

Contributor

QuanluZhang Jun 29, 2020

will starts -> will start

docs/en_US/TrainingService/Overview.md Outdated

+              Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies another few files (including parameters, launch script and etc.) onto training platform. After that, NNI launches the trial through subprocess, SSH, RESTful API, and etc.
+              ```eval_rst
+              .. Warning:: The working directory of trial command has the exact same content as ``codeDir``, but can have a differen path (even on differen machines) Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copies a ``codeDir`` from the shared copy prepared in step 1 and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.

Contributor

QuanluZhang Jun 29, 2020

the exact same -> exact the same

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              .. Warning:: The working directory of trial command has the exact same content as ``codeDir``, but can have a differen path (even on differen machines) Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copies a ``codeDir`` from the shared copy prepared in step 1 and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
+              ```
+              Step 3. **Collect metrics.**  NNI will then monitors the status of trial, updates the status (e.g., from `WAITING` to `RUNNING`, `RUNNING` to `SUCCEEDED`) recorded, and also collects the metrics. Currently, most training services are implemented in an "active" way, i.e., training service will call the RESTful API on NNI manager to update the metrics. Note that this usually requires the machine that runs NNI manager to be at least accessible to the worker node.

Contributor

QuanluZhang Jun 29, 2020

will then monitors -> then monitors
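
The status bookkeeping described in the quoted Step 3 can be pictured with a small Python sketch. It models only the two transitions named in the text (`WAITING` to `RUNNING`, `RUNNING` to `SUCCEEDED`) and omits any other statuses. The class and method names are hypothetical and do not correspond to NNI's actual code or REST API.

```python
# Transitions mentioned in the quoted text; other statuses are intentionally left out.
ALLOWED_TRANSITIONS = {
    "WAITING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED"},
}


class TrialRecord:
    """Hypothetical record a manager might keep for one trial."""

    def __init__(self, trial_id):
        self.trial_id = trial_id
        self.status = "WAITING"
        self.metrics = []

    def update_status(self, new_status):
        """Accept only the transitions listed in ALLOWED_TRANSITIONS."""
        if new_status not in ALLOWED_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"invalid transition {self.status} -> {new_status}")
        self.status = new_status

    def report_metric(self, value):
        """Store a metric pushed by the worker while the trial is running."""
        if self.status != "RUNNING":
            raise ValueError("metrics are only accepted while RUNNING")
        self.metrics.append(value)


trial = TrialRecord("trial-0")
trial.update_status("RUNNING")
trial.report_metric(0.91)
trial.update_status("SUCCEEDED")
```

In the "active" design the quoted text describes, the calls to `update_status` and `report_metric` would be triggered by requests from the worker node, which is why the manager's machine has to be reachable from it.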

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              |TrainingService|Brief Introduction|
+              |---|---|
+              |[__Local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|

Contributor

QuanluZhang Jun 29, 2020

.md file reference cannot be used in table, the reference cannot be properly rendered. you can use readthedocs url

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              |---|---|
+              |[__Local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
+              |[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
+              |[__Pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|

Contributor

QuanluZhang Jun 29, 2020

Pai -> PAI

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              |[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
+              |[__Pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
+              |[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
+              |[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|

Contributor

QuanluZhang Jun 29, 2020

missed one, i.e., DLTS

ultmaster added 2 commits

June 29, 2020 12:32


          Resolve comments

1e451b2


          Fix typo

8ed269b

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated


		Next, users should prepare code directory, which is specified as `codeDir` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a `.nniignore` file that works like a `.gitignore` file. For more details on how to write this file, see the [git documentation](https://git-scm.com/docs/gitignore#_pattern_format).

		In case users intend to use large files in their experiment (like large-scaled datasets) and they are not using local mode, they can either: 1) download the data before each trial launches by putting it into trial command; or 2) use a shared storage that is accessible to worker nodes. NNI has no configuration or helper functions for that, and users need to do everything in trial command (concatenating several commands with `&&` if necessary).

Contributor

QuanluZhang Jun 29, 2020

this part is a little negative, I think in doc of each training service we talked about how to use shared storage.
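
As a rough illustration of the `&&` approach mentioned in the quoted paragraph, the following Python sketch assembles a trial command that downloads and unpacks data before training. The helper name, the URL, and the file names are all hypothetical. The resulting string is what one might place in the trial command field of the experiment config.

```python
import shlex


def build_trial_command(setup_steps, train_command):
    """Join setup steps and the training command with '&&',
    so that a failing step stops the rest of the chain."""
    return " && ".join(list(setup_steps) + [train_command])


# Hypothetical dataset location and file names, for illustration only.
dataset_url = "https://example.com/dataset.tar.gz"
download = "wget -q " + shlex.quote(dataset_url) + " -O dataset.tar.gz"
unpack = "tar -xzf dataset.tar.gz"

command = build_trial_command([download, unpack], "python3 train.py")
print(command)
```

Because every trial runs the whole chain, the download is repeated per trial. A shared storage mount, the other option in the quoted text, avoids that cost.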

QuanluZhang reviewed

docs/en_US/TrainingService/Overview.md Outdated


		## How to use Training Service?

		Training service needs to be chosen and configured properly in configuration YAML file. See [tutorial](../Tutorial/QuickStart) and [reference](../Tutorial/ExperimentConfig) on how to write this file.

Contributor

QuanluZhang Jun 29, 2020

-> Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also could refer to reference for more details of the specification of the experiment configuration file.

SparkSnail reviewed

docs/en_US/TrainingService/Overview.md Outdated


		Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md) and [FrameworkController](./FrameworkControllerMode.md). These are called built-in training services.

		If the computing resource customers try to use is not listed above, NNI provides interface that allows users can build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.

Contributor

SparkSnail Jun 29, 2020

can -> to

SparkSnail reviewed

docs/en_US/TrainingService/Overview.md Outdated

+              <img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
+              </p>
+              According to the architecture shown in [Overview](../Overview), training service (platform) is actually responsible for two events: 1) initiating a new trial; 2) collecting metrics and communicating with NNI core (NNI manager). To demonstrated in detail how training service works, we show the workflow of training service from the very beginning to the moment when first trial succeeds.

Contributor

SparkSnail Jun 29, 2020

monitor trial job status

ultmaster added 2 commits

June 29, 2020 16:08


          Resolve comments

41b3c52


          Fix broken link

c3812d3

QuanluZhang approved these changes

SparkSnail approved these changes

ultmaster merged commit 25c4c3b into microsoft:master
