Commit 1c88b61
merge changes
squirrelsc committed Mar 7, 2019
2 parents bfad5de + a236afa

Showing 44 changed files with 877 additions and 247 deletions.
57 changes: 57 additions & 0 deletions docs/job_debugging.md
@@ -0,0 +1,57 @@
## Job Debugging

When you submit a job, set the following property in the `jobEnvs` of your job config. If the job's user command fails, the container is kept for one week so that you can SSH into it and debug. After debugging, stop the job manually to release the system resources.

- One week is the default reservation time; your cluster admin can configure it (see the sketch below).
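
For cluster admins, a hedged sketch of where this default lives — the key name matches the commented-out entry added to the example `services-configuration.yaml` later in this commit:

```yaml
rest-server:
  # Job debugging reservation time in seconds; 604800 seconds = 1 week.
  debugging-reservation-seconds: 604800
```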


### Submit a debugging job

#### Submit a job through a JSON file

If you submit the job through a JSON file and want to enable the job debugging feature, set the following configuration in the job's JSON file.
```JSON
{
  "jobEnvs": {
    "isDebug": true
  }
}
```
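
For context, here is a hedged sketch of where `jobEnvs` sits in a complete job config. The surrounding field names follow the OpenPAI job config format from `job_tutorial.md`; the values are illustrative only:

```JSON
{
  "jobName": "debug-example",
  "image": "your.registry/your-image:latest",
  "taskRoles": [
    {
      "name": "main",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 8192,
      "command": "python train.py"
    }
  ],
  "jobEnvs": {
    "isDebug": true
  }
}
```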

#### Submit a job through the web portal

If you submit the job through the web portal and want to enable the job debugging feature, set the following configuration in `jobEnvs`.

![webportal_submit_job](./pic/webportal-job-debugging.png)


### Debug your job after it fails


If the job fails and the user command exits with a non-zero code, the job's container is reserved for debugging.

In the web portal, the job's status remains running. `TODO: show the debugging job with a specific tag in the web portal.`

![webportal_reserved_job_status](./pic/webportal-job-debugging-status.png)

You will find the following lines in your job's log:

```text
job has finished with exit code 2
=============================================================================
====== The job container failed, so it will be reserved for 1 week ======
====== After debugging, please stop the job manually. ======
=============================================================================
```
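
You can then attach to the reserved container over SSH. A hedged sketch of the flow — the host, port, and key path below are placeholders; copy the real SSH endpoint from the job detail page:

```sh
# Placeholders: take the real host, port, and key from the job's SSH info.
ssh -i ~/.ssh/pai_job_key -p 34567 root@10.0.0.42

# Inside the container you can re-run the failed command, inspect logs,
# and examine the environment before stopping the job.
```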


### Stop your job manually after debugging

You should stop the reserved job manually; otherwise, the occupied resources will not be freed.

![webportal_reserved_job_stop](./pic/webportal-job-debugging-stop.png)
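
Besides the web portal, you can stop the job through the REST server. A hedged `curl` sketch — the exact route depends on your REST server version, and the address, names, and token below are placeholders:

```sh
# Hedged sketch: check your cluster's REST API docs for the exact route.
curl -X PUT "http://<pai-master>/rest-server/api/v1/user/<username>/jobs/<jobname>/executionType" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"value": "STOP"}'
```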


### Job failed due to system error

A job that fails due to a system error, such as excessive file system usage, cannot be reserved by OpenPAI.
55 changes: 28 additions & 27 deletions docs/job_tutorial.md

Large diffs are not rendered by default.

Binary file added docs/pic/webportal-job-debugging-status.png
Binary file added docs/pic/webportal-job-debugging-stop.png
Binary file added docs/pic/webportal-job-debugging.png
5 changes: 4 additions & 1 deletion docs/webportal/PLUGINS.md
@@ -17,10 +17,13 @@ webportal:
  plugins:
  - title: Marketplace
    uri: /scripts/plugins/marketplace.bundle.js
    config:
      repo: Microsoft/pai
```
- The `title` field is the title of the web portal plugin listed in the menu; administrators can customize it for the same plugin with different configurations.
- The `uri` field is the entry file of the web portal plugin, usually provided by the plugin developer. It may be an absolute URL or a root-relative URL, depending on how the web portal plugin is deployed.
- The `config` field is a key-value dictionary that configures the web portal plugin; the available configs are listed in each plugin's own documentation.

In addition, you can lock the plugin version if the `uri` refers to the Internet; follow the [Publish](#publish) section to move the online web portal plugin offline.

@@ -50,7 +53,7 @@ If any other PAI configuration is needed, please open an issue, PAI developers will

### Provide Plugin Configurations

Ask system administration to set the query string of the entry file, like `http://example.com/github-plugin.js?repo=Microsoft%2Fpai`, `document.currentScript.src` would help you get the full uri of the script, including the query string.
The config of the plugin will be set as the query string of the entry file, like `http://example.com/github-plugin.js?repo=Microsoft%2Fpai`; `document.currentScript.src` helps you get the full URI of the script, including the query string.
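
For example, a minimal sketch of reading the config inside a plugin entry file (standard browser APIs only; the `repo` key matches the Marketplace example above):

```js
// Read the script URL at load time; document.currentScript is only
// set while the script is executing synchronously.
const scriptSrc = document.currentScript.src;

// Parse plugin config values out of the query string.
const params = new URL(scriptSrc).searchParams;
const repo = params.get('repo'); // e.g. "Microsoft/pai"
```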

### Migrate Current AI Web Tools to PAI Web Portal Plugin

4 changes: 3 additions & 1 deletion examples/cluster-configuration/services-configuration.yaml
@@ -82,10 +82,12 @@ rest-server:
default-pai-admin-username: your_default_pai_admin_username
# database admin password
default-pai-admin-password: your_default_pai_admin_password
# The REST server retrieves marketplace templates from the GitHub repository configured below
#github-owner: Microsoft
#github-repository: pai
#github-path: marketplace
# Job debugging reservation time in seconds (604800 seconds = 1 week).
#debugging-reservation-seconds: 604800

# uncomment following section if you want to customize the port of web portal
# webportal:
1 change: 1 addition & 0 deletions paictl.py
@@ -99,3 +99,4 @@ def main(args):

setup_logging()
main(sys.argv[1:])

8 changes: 2 additions & 6 deletions src/drivers/build/install-ib-drivers
@@ -27,11 +27,6 @@ lspci | grep -qE '(Network|Infiniband) controller.*Mellanox.*ConnectX' ||
# This script is used for installation of InfiniBand drivers
KERNEL_FULL_VERSION=`uname -r`

[[ -f /lib/modules/$KERNEL_FULL_VERSION/.modules-prepared ]] || {
echo Modules not yet prepared
exit 1
}

HOSTNAME=`hostname`
# HACK: using last octet of the host's IP
LAST_OCTET=`host $HOSTNAME | head -n1 | sed 's/^.*\.//'`
@@ -43,7 +38,7 @@ CURRENT_DRIVER=/var/drivers/mellanox/current
if [[ ! -f /var/drivers/mellanox/$MLNX_OFED_STRING/mlnxofedinstall ]]; then
[[ -f /tmp/$MLNX_OFED_STRING-ext.tgz ]] ||
{
./ /mlnx_add_kernel_support.sh -y -m ./$MLNX_OFED_STRING --make-tgz || exit $?
./mlnx_add_kernel_support.sh -y -m ./$MLNX_OFED_STRING --make-tgz || exit $?
}
mkdir -p /var/drivers/mellanox/$MLNX_OFED_STRING || exit $?
tar -xvf /tmp/$MLNX_OFED_STRING-ext.tgz -C /var/drivers/mellanox/$MLNX_OFED_STRING --strip 1 || exit $?
@@ -166,3 +161,4 @@ ibdev2netdev || exit $?
# Final check
ibPresent
echo ibPresent exit value: $?

22 changes: 12 additions & 10 deletions src/hadoop-ai/build/build-pre.sh
@@ -19,31 +19,33 @@

pushd $(dirname "$0") > /dev/null

hadoopBinaryDir="/hadoop-binary/"
hadoopBinaryDir="/hadoop-binary"

hadoopBinaryPath="${hadoopBinaryDir}hadoop-2.9.0.tar.gz"
cacheVersion="${hadoopBinaryDir}12940533-12933562-docker_executor-done"
# When changing the patch id, please update it.
patchId="12940533-12933562-docker_executor"

hadoopBinaryPath="${hadoopBinaryDir}/hadoop-2.9.0.tar.gz"
cacheVersion="${hadoopBinaryDir}/${patchId}-done"

echo "hadoopbinarypath:${hadoopBinaryDir}"

[[ -f $cacheVersion ]] && [[ -f $hadoopBinaryPath ]] && [[ $cacheVersion -ot $hadoopBinaryPath ]] &&
echo "Hadoop binary path: ${hadoopBinaryDir}"

[[ -f ${cacheVersion} ]] && [[ -f ${hadoopBinaryPath} ]] && [[ ${cacheVersion} -ot ${hadoopBinaryPath} ]] &&
{
echo "Hadoop ai with patch 12940533-12933562-docker_executor has been built"
echo "Hadoop ai with patch ${patchId} has been built"
echo "Skip this build precess"
exit 0
}

[[ ! -f "$hadoopBinaryPath" ]] ||
[[ ! -f "${hadoopBinaryPath}" ]] ||
{

rm -rf $hadoopBinaryPath
rm -rf ${hadoopBinaryPath}

}

# When changing the patch id, please update the filename here.
rm ${hadoopBinaryDir}/*-done
touch $cacheVersion
touch ${cacheVersion}

docker build -t hadoop-build -f hadoop-ai .

10 changes: 10 additions & 0 deletions src/hadoop-data-node/deploy/hadoop-data-node.yaml.template
@@ -48,6 +48,8 @@ spec:
name: host-confg-volume
- mountPath: /var/lib/hadoopdata
name: hadoop-tmp-storage
- mountPath: /var/log/hadoop
name: log-dir
readinessProbe:
exec:
command:
@@ -77,6 +79,11 @@ spec:
value: datanode-start-service.sh
- name: HADOOP_DATANODE_DATA_DIR
value: {{ mount_points|join(",") }}
# Rolling File Appender; by default it keeps at most 256M*20=5G of logs.
- name: HADOOP_ROOT_LOGGER
value: INFO,console,RFA
- name: HADOOP_LOG_DIR
value: /var/log/hadoop
- name: POD_IP
valueFrom:
fieldRef:
@@ -98,6 +105,9 @@ spec:
- name: host-confg-volume
configMap:
name: host-configuration
- name: log-dir
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/pai-service-log/data-node
tolerations:
- key: node.kubernetes.io/memory-pressure
operator: "Exists"
10 changes: 10 additions & 0 deletions src/hadoop-name-node/deploy/hadoop-name-node.yaml.template
@@ -41,6 +41,8 @@ spec:
name: hadoop-name-node-config-volume
- mountPath: /var/lib/hadoopdata
name: hadoop-tmp-storage
- mountPath: /var/log/hadoop
name: log-dir
readinessProbe:
exec:
command:
@@ -57,6 +59,11 @@ spec:
value: namenode-generate-script.sh
- name: START_SERVICE
value: namenode-start-service.sh
# Rolling File Appender; by default it keeps at most 256M*20=5G of logs.
- name: HADOOP_ROOT_LOGGER
value: INFO,console,RFA
- name: HADOOP_LOG_DIR
value: /var/log/hadoop
{%- if cluster_cfg['cluster']['common']['qos-switch'] == "true" %}
resources:
limits:
@@ -74,3 +81,6 @@ spec:
- name: hadoop-tmp-storage
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/hadooptmp/namenode
- name: log-dir
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/pai-service-log/name-node
10 changes: 10 additions & 0 deletions src/hadoop-node-manager/deploy/hadoop-node-manager.yaml.template
@@ -66,6 +66,8 @@ spec:
name: host-confg-volume
- mountPath: /var/lib/hadoopdata
name: hadoop-tmp-storage
- mountPath: /var/log/hadoop
name: log-dir
readinessProbe:
exec:
command:
@@ -107,6 +109,11 @@ spec:
value: "3072"
- name: NV_DRIVER
value: /var/drivers/nvidia/current
# Rolling File Appender; by default it keeps at most 256M*20=5G of logs.
- name: YARN_ROOT_LOGGER
value: INFO,console,RFA
- name: YARN_LOG_DIR
value: /var/log/hadoop
- name: POD_IP
valueFrom:
fieldRef:
@@ -144,6 +151,9 @@ spec:
- name: hadoop-tmp-storage
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/hadooptmp/nodemanager
- name: log-dir
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/pai-service-log/node-manager
tolerations:
- key: node.kubernetes.io/memory-pressure
operator: "Exists"
src/hadoop-resource-manager/deploy/hadoop-resource-manager.yaml.template
@@ -52,6 +52,8 @@ spec:
name: yarn-resourcemanager-storage
- mountPath: /var/lib/hadoopdata
name: hadoop-tmp-storage
- mountPath: /var/log/hadoop
name: log-dir
- mountPath: /exclude-path
name: hadoop-resource-manager-exclude-nodes
readinessProbe:
@@ -78,6 +80,11 @@ spec:
value: resourcemanager-generate-script.sh
- name: START_SERVICE
value: resourcemanager-start-service.sh
# Rolling File Appender; by default it keeps at most 256M*20=5G of logs.
- name: YARN_ROOT_LOGGER
value: INFO,console,RFA
- name: YARN_LOG_DIR
value: /var/log/hadoop
{%- if cluster_cfg['cluster']['common']['qos-switch'] == "true" %}
resources:
limits:
@@ -99,10 +106,11 @@ spec:
- "-p {{ cluster_cfg[ "hadoop-resource-manager" ]["yarn_exporter_port"] }}"
livenessProbe:
httpGet:
path: /
path: '/healthz'
port: {{ cluster_cfg[ "hadoop-resource-manager" ]["yarn_exporter_port"] }}
initialDelaySeconds: 5
timeoutSeconds: 1
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 10
imagePullSecrets:
- name: {{ cluster_cfg["cluster"]["docker-registry"]["secret-name"] }}
volumes:
@@ -115,6 +123,9 @@ spec:
- name: hadoop-tmp-storage
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/hadooptmp/resourcemanager
- name: log-dir
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/pai-service-log/resource-manager
- name: hadoop-resource-manager-exclude-nodes
configMap:
name: exclude-file
