This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Commit
Refine document for new user to submit job (#2278)
1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.
squirrelsc authored Mar 7, 2019
1 parent a236afa commit c67ab37
Showing 38 changed files with 380 additions and 291 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug-report.md
@@ -23,7 +23,7 @@ Please fill this for deployment related issues:

<!--User job related issues
GitHub is not the right place for support requests.
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/openpai) and the [troubleshooting guide](https://github.com/Microsoft/pai/blob/master/docs/job_log.md and https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md#debug).
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/openpai) and the [troubleshooting guide](https://github.com/Microsoft/pai/blob/master/docs/job_log.md and https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md#how-to-debug-a-job).
-->

**How to reproduce it**:
40 changes: 11 additions & 29 deletions README.md
@@ -15,7 +15,7 @@ OpenPAI is an open source platform that provides complete AI model training and
1. [Why choose OpenPAI](#why-choose-openpai)
1. [Get started](#get-started)
1. [Deploy OpenPAI](#deploy-openpai)
1. [Train model](#train-models)
1. [Train models](#train-models)
1. [Administration](#administration)
1. [Reference](#reference)
1. [Get involved](#get-involved)
@@ -93,9 +93,9 @@ As various hardware environments and different use scenarios, default configurat

### Validate deployment

After deployment, it's recommended to [validate key components of OpenPAI](docs/pai-management/doc/validate-deployment.md) in health status. After validation is success, [submit a "hello world" job](examples/README.md#quickstart) and check if it works end-to-end.
After deployment, it's recommended to [validate key components of OpenPAI](docs/pai-management/doc/validate-deployment.md) to confirm they are healthy. Once validation succeeds, [submit a hello-world job](docs/user/training.md) and check that it works end-to-end.

### Train users before "train model"
### Train users before "train models"

The common practice on OpenPAI is to submit job requests and wait for jobs to receive computing resources and execute. This is a different experience from assigning dedicated servers to each person. People may feel the computing resources are not under their control, and the learning curve may be steeper than running jobs on dedicated servers. But sharing resources on OpenPAI can improve productivity significantly and save time on maintaining environments.

@@ -109,41 +109,23 @@ If FAQ doesn't resolve it, refer to [here](#get-involved) to ask question or sub

## Train models

Like all machine learning platforms, OpenPAI is a production tool. To maximize utilization, it's recommended to submit training jobs and OpenPAI will allocate resource to run it. If there are too many jobs, some jobs may wait in the queue. This is different with training models on dedicated servers for each person, and it needs a bit more knowledge about how to submit/manage training jobs on OpenPAI.
Like all machine learning platforms, OpenPAI is a productivity tool. To maximize utilization, it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some are queued until enough resources are available, and OpenPAI chooses the server(s) to run each job. This is different from running code on dedicated servers, and it needs a bit more knowledge about how to submit and manage training jobs on OpenPAI.

OpenPAI also supports to allocate on demand resource to users, and users can use SSH or Jupyter like on a physical server, refer to [here](examples/jupyter/README.md) about how to use OpenPAI like this way. Though it's not efficient to resources, but it also saves cost on setup and managing environments on physical servers.
Note, besides queued jobs, OpenPAI also supports allocating resources on demand. Users can connect via SSH or Jupyter as if on a physical server; refer to [here](examples/jupyter/README.md) for how to use OpenPAI this way. Though it's not an efficient use of resources, it saves the cost of setting up and managing environments on physical servers.

### Train first model on OpenPAI
### Submit training jobs

Follow [here](examples/README.md#quickstart) to create the first job definition. Then [submit the job via web portal](docs/submit_from_webportal.md). It's a very simple job, as it downloads data and code from internet, and doesn't copy model back. It's used to understand OpenPAI job definition and familiar with Web portal.
Follow [submitting a hello-world job](docs/user/training.md) to learn more about training models on OpenPAI. It's a very simple job, used to understand the OpenPAI job definition and to get familiar with the web portal.

### Learn deeper on job definition
### OpenPAI VS Code Client

* Choose training environment. OpenPAI uses [Docker](https://www.docker.com/) to provide runtime environment.

Refer to [here](https://hub.docker.com/r/ufoym/deepo) to find more deep learning environments, for example, `ufoym/deepo:pytorch-py36-cu90`.

Note, this docker doesn't include openssh-server, curl. So, if SSH is necessary with those docker images, it needs to add `apt install openssh-server curl` in command field.

* Put code and data in. OpenPAI creates a clean environment as docker image. The data and code may not be in the docker. So it needs to use command field to copy data and code into docker before training. The command field supports to join multiple commands with `&&`. If extra system or Python components are needed, they can be installed in the command by `apt install` or `python -m pip install` as well.

There are some suggested approach to exchange data with running environment, but it's better to check with administrators of OpenPAI, which kind of storage is supported, and recommended approach to access it.

* Copy model back. It's similar with above topic, if code and data can copy into docker, model can also be copied back.

* Running distributed training job. OpenPAI can allocate multiple environments to one job to support distributed training.

Learn more about job definition, refer to [here](docs/job_tutorial.md#write-a-job-json-configuration-file-).
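The bullets above can be made concrete with a sketch of a v1-style job definition, built here as a Python dict. This is an illustrative example under assumptions, not a definitive config: the field names follow the job tutorial's JSON schema and the image tag is the one mentioned above, but the job name, repository URL, and training commands are hypothetical placeholders.

```python
import json

# A sketch of an OpenPAI v1-style job definition. The repository URL and
# script names below are hypothetical; replace them with your own.
job_config = {
    "jobName": "pytorch-hello-world",          # must be unique per user
    "image": "ufoym/deepo:pytorch-py36-cu90",  # Docker image providing the runtime
    "taskRoles": [
        {
            "name": "train",
            "taskNumber": 1,    # >1 requests multiple containers (distributed training)
            "cpuNumber": 4,
            "memoryMB": 8192,
            "gpuNumber": 1,
            # The command field joins setup and training steps with `&&`.
            # openssh-server/curl are installed first because the deepo
            # images don't include them.
            "command": "apt update && apt install -y openssh-server curl"
                       " && git clone https://github.com/example/mnist-example.git"
                       " && cd mnist-example && python train.py",
        }
    ],
}

print(json.dumps(job_config, indent=2))
```

Since the container starts from a clean image, everything the job needs (code, data, extra `apt`/`pip` packages) has to be fetched or installed inside that single command chain.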

### OpenPAI client

Rather than web portal, and [RESTful API](docs/rest-server/API.md), OpenPAI have a friendly client tool for user. It's an extension of Visual Studio Code, called [OpenPAI VS Code Client](contrib/pai_vscode/VSCodeExt.md). It can submit job, simulate job running locally, manage multiple OpenPAI environments, and so on.
[OpenPAI VS Code Client](contrib/pai_vscode/VSCodeExt.md) is a friendly, GUI-based client tool for OpenPAI. It's an extension of Visual Studio Code. It can submit jobs, simulate jobs running locally, manage multiple OpenPAI environments, and so on.

### Troubleshooting job failure

The web portal and job logs are helpful for analyzing job failures, and OpenPAI supports SSH into the job environment for debugging.

Refer to [here](docs/job_tutorial.md#how-to-debug-the-job-) for more information of troubleshooting job failure on OpenPAI. It's recommended to get code succeeded locally, then submit to OpenPAI. So that it doesn't need to troubleshoot code problems remotely.
Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failures. It's recommended to get code running successfully locally, then submit it to OpenPAI. This reduces the need to troubleshoot remotely.

## Administration

@@ -152,7 +134,7 @@ Refer to [here](docs/job_tutorial.md#how-to-debug-the-job-) for more information

## Reference

* [Job definition](docs/job_tutorial.md#write-a-job-json-configuration-file-)
* [Job definition](docs/job_tutorial.md)
* [RESTful API](docs/rest-server/API.md)
* Design documents could be found [here](docs).

22 changes: 12 additions & 10 deletions contrib/pai_vscode/VSCodeExt.md
@@ -1,20 +1,22 @@
# OpenPAI VS Code Client

## Installation
OpenPAI VS Code Client can submit AI jobs, simulate jobs running locally, manage HDFS files, etc. It's an extension of Visual Studio Code.

Visual Studio Code is a popular, free, lightweight but powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux.

Visual Studio Code is a free, lightweight but powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux. Go to the [Visual Studio Code official site](https://code.visualstudio.com/) to install it and learn more.
## Installation

OpenPAI Client is a VS Code extension to connect PAI clusters, submit AI jobs, and manage files on HDFS, etc. You need to install the extension in VS code before using it.
1. Download and install [Visual Studio Code](https://code.visualstudio.com/).

To install the OpenPAI Client:
2. Install **OpenPAI Client**.

1. Launch VS Code.
2. Click the "Extensions" icon in Activity Bar or press **Ctrl+Shift+X** to bring up the Extensions view.
3. Input **openpai** in the text box, the OpenPAI VS Code Client will appear in the result list.
4. Click the **Install** button. The extension will be installed.
5. After a successful installation, you will see an introduction page. Follow the instructions there and try the PAI client.
1. Launch VS Code.
2. Click the "Extensions" icon in Activity Bar or press **Ctrl+Shift+X** to bring up the Extensions view.
3. Input **openpai** in the text box, the OpenPAI VS Code Client will appear in the result list.
4. Click the **Install** button. The extension will be installed.
5. After a successful installation, you will see an introduction page. Follow the instructions there and try the PAI client.

![Extension](./assets/ext-install-1.png)
![Extension](./assets/ext-install-1.png)

## Next step

Binary file removed docs/images/PAI_submit_online_1.png
Binary file not shown.
Binary file removed docs/images/PAI_submit_online_2.png
Binary file not shown.
Binary file removed docs/images/PAI_submit_online_3.png
Binary file not shown.
Binary file removed docs/images/PAI_submit_online_4.png
Binary file not shown.
45 changes: 23 additions & 22 deletions docs/job_log.md
@@ -1,22 +1,23 @@

# How to diagnose job problems through logs <a name="cluster_configuration"></a>
# How to diagnose job problems through logs

## Table of Contents
- [1 Diagnostic job failure reason](#job)
- [1.1 View job's launcher AM log](#amlog)
- [1.2 View job's each task container log](#tasklog)
- [1.3 Job exitStatus Convention](#exit)
- [2 Diagnostic job retried many times reason](#retry)
- [How to diagnose job problems through logs](#how-to-diagnose-job-problems-through-logs)
- [1 Diagnose job failure reason](#1-diagnose-job-failure-reason)
- [1.1 View job launcher AM log](#11-view-job-launcher-am-log)
- [1.2 View job each task container log](#12-view-job-each-task-container-log)
- [1.3 Job exitStatus Convention](#13-job-exitstatus-convention)
- [2 Diagnostic job retried many times reason](#2-diagnostic-job-retried-many-times-reason)
- [Note:](#note)

## 1 Diagnose job failure reason <a name="job"></a>
## 1 Diagnose job failure reason

An OpenPAI job is launched by the [framework launcher](../subprojects/frameworklauncher/yarn/README.md), and each task container is managed by the launcher application master (AM).

The launcher AM manages each job's tasks according to the customized feature requirements. You can refer to [frameworklauncher architecture](../subprojects/frameworklauncher/yarn/doc/USERMANUAL.md#Architecture) to understand the relationship between them.

When diagnosing job problems through logs, we should first check the job launcher AM log (to get the main reason), then zoom in on each task container log.

### 1.1 View job launcher AM log <a name="amlog"></a>
### 1.1 View job launcher AM log

Check the summary, and pay attention to the highlights.

@@ -105,24 +106,24 @@ Log example:

Please pay attention to these lines to diagnose the job failure reason:

| line head | above example log info |
| --- | --- |
| [ExitDiagnostics] | ExitStatus undefined in Launcher, maybe UserApplication itself failed.|
| [ExitCode] | 134|
| Exception message | No such object: cntk-test-4621-17223-container_e9878_1532412068340_0018_01_000002. |
| Shell output | [DEBUG] EXIT signal received in yarn container, exiting ...[DEBUG] cntk-test-4621-17223-container_e9878_1532412068340_0018_01_000002 does not exist.|
|ContainerLogHttpAddress| ```http://10.151.40.165:8042/node/containerlogs/container_e9878_1532412068340_0018_01_000002/core/ ```|
|AppCacheNetworkPath|10.151.40.165:/var/lib/hadoopdata/nm-local-dir/usercache/core/appcache/application_1532412068340_0018|
|ContainerLogNetworkPath|10.151.40.165:/var/lib/yarn/userlogs/application_1532412068340_0018/container_e9878_1532412068340_0018_01_000002|
|[ApplicationCompletionReason]| [g2p_train]: FailedTaskCount 1 has reached MinFailedTaskCount 1.|
| line head | above example log info |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| [ExitDiagnostics] | ExitStatus undefined in Launcher, maybe UserApplication itself failed. |
| [ExitCode] | 134 |
| Exception message | No such object: cntk-test-4621-17223-container_e9878_1532412068340_0018_01_000002. |
| Shell output | [DEBUG] EXIT signal received in yarn container, exiting ...[DEBUG] cntk-test-4621-17223-container_e9878_1532412068340_0018_01_000002 does not exist. |
| ContainerLogHttpAddress | ```http://10.151.40.165:8042/node/containerlogs/container_e9878_1532412068340_0018_01_000002/core/ ``` |
| AppCacheNetworkPath | 10.151.40.165:/var/lib/hadoopdata/nm-local-dir/usercache/core/appcache/application_1532412068340_0018 |
| ContainerLogNetworkPath | 10.151.40.165:/var/lib/yarn/userlogs/application_1532412068340_0018/container_e9878_1532412068340_0018_01_000002 |
| [ApplicationCompletionReason] | [g2p_train]: FailedTaskCount 1 has reached MinFailedTaskCount 1. |

We can get the following information:
1. The UserApplication itself failed.
2. The reason is that cntk-test-4621-17223-container_e9878_1532412068340_0018_01_000002 does not exist.
3. Then we can visit ```http://10.151.40.165:8042/node/containerlogs/container_e9878_1532412068340_0018_01_000002/core/``` as in step 1.2 to find the task failure reason.
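The fields highlighted in the table can also be pulled out of an AM log programmatically. Below is a minimal sketch, assuming the log follows the `Name: value` line layout shown above; the embedded log text is an abridged, hypothetical excerpt of the example log.

```python
import re

# Abridged, hypothetical AM log excerpt mirroring the fields in the table above.
am_log = """\
[ExitDiagnostics]: ExitStatus undefined in Launcher, maybe UserApplication itself failed.
[ExitCode]: 134
ContainerLogHttpAddress: http://10.151.40.165:8042/node/containerlogs/container_e9878_1532412068340_0018_01_000002/core/
"""

def extract_field(log: str, name: str):
    """Return the value after `name:` on its line, or None if absent."""
    m = re.search(rf"^{re.escape(name)}:\s*(.+)$", log, re.MULTILINE)
    return m.group(1).strip() if m else None

exit_code = extract_field(am_log, "[ExitCode]")
log_url = extract_field(am_log, "ContainerLogHttpAddress")
print(exit_code)  # 134
```

Extracting `ContainerLogHttpAddress` this way gives the URL to visit in step 1.2 without scanning the whole log by hand.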


### 1.2 View job each task container log <a name="tasklog"></a>
### 1.2 View job each task container log

- Check the log of the failed task that triggered the whole attempt failure, i.e.

@@ -134,11 +135,11 @@

![PAI_job_retry](./images/PAI_job_retry.png)

### 1.3 Job exitStatus Convention <a name="exit"></a>
### 1.3 Job exitStatus Convention

You can check all the defined ExitStatus values (ExitType, ExitDiagnostics) in the framework launcher [USERMANUAL.md](../subprojects/frameworklauncher/yarn/doc/USERMANUAL.md#ExitStatus_Convention).
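As a side note on reading raw codes: the `[ExitCode] 134` in the earlier AM log example is consistent with the common Unix shell convention that codes above 128 mean the process was killed by signal `code - 128` (134 = 128 + 6, i.e. SIGABRT). This is a general convention, not something defined by the launcher, so treat it as a hint alongside the ExitStatus convention above.

```python
import signal

def describe_exit_code(code: int) -> str:
    # Common Unix convention: codes above 128 mean "killed by signal (code - 128)".
    if code > 128:
        return f"terminated by signal {signal.Signals(code - 128).name}"
    return f"exited with status {code}"

print(describe_exit_code(134))  # terminated by signal SIGABRT
```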

## 2 Diagnostic job retried many times reason <a name="retry"></a>
## 2 Diagnostic job retried many times reason

If the framework retried many times, check the other attempts by searching for the FrameworkName in the YARN web UI:
