This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Job Debugging] Basic Implement Of Job Debugging. (#2272)
ydye authored Mar 7, 2019
1 parent 647a87d commit a236afa
Showing 15 changed files with 105 additions and 3 deletions.
57 changes: 57 additions & 0 deletions docs/job_debugging.md
@@ -0,0 +1,57 @@
## Job Debugging

To enable job debugging, set the following property in the ```jobEnvs``` section of the job config when you submit the job. If the job's user command fails, the container is kept for 1 week so that you can SSH into it and debug. After debugging, stop the job manually to release its system resources.

- 1 week (604800 seconds) is the default value. Your cluster admin can change it through `debugging-reservation-seconds` in the rest-server configuration.
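
As a quick check on the default: the rest-server configuration in this change sets `debugging-reservation-seconds` to 604800, which is exactly one week. A minimal Python sketch of the conversion follows; the helper name `reservation_period` is hypothetical and not part of OpenPAI.

```python
from datetime import timedelta

# Default value taken from src/rest-server/config/rest-server.yaml in this change.
DEFAULT_RESERVATION_SECONDS = 604800


def reservation_period(seconds: int = DEFAULT_RESERVATION_SECONDS) -> timedelta:
    """Convert a reservation length in seconds to a timedelta (hypothetical helper)."""
    return timedelta(seconds=seconds)


print(reservation_period().days)  # 7
```

A cluster admin who wants a shorter window, for example one day, would configure 86400 seconds instead.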


### Submit a debugging job

#### ```Submit job through json file```

If you submit the job through a JSON file and want to enable job debugging for it, add the following configuration to the job's JSON file.
```JSON
{
  "jobEnvs": {
    "isDebug": true
  }
}
```
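
If you generate job configs programmatically, the flag can be added before the file is written. The sketch below is only an illustration: `enable_debugging`, `is_debug_enabled` and the `jobName` placeholder are assumptions, while `jobEnvs` and `isDebug` come from this document. The check mirrors the strict comparison in `job.js` from this change (`data.jobEnvs.isDebug === true`), so only a boolean `true` enables debugging, not the string `"true"`.

```python
import copy
import json


def enable_debugging(job_config: dict) -> dict:
    """Return a copy of job_config with jobEnvs.isDebug set to true."""
    updated = copy.deepcopy(job_config)
    updated.setdefault("jobEnvs", {})["isDebug"] = True
    return updated


def is_debug_enabled(job_config: dict) -> bool:
    """Mirror the strict check in job.js: only a boolean true counts."""
    job_envs = job_config.get("jobEnvs") or {}
    return job_envs.get("isDebug") is True


# "jobName" and its value are placeholders for illustration only.
config = enable_debugging({"jobName": "example-job"})
print(json.dumps(config, indent=2))
```

Working on a deep copy keeps the caller's original config untouched, which is convenient when the same base config is submitted both with and without debugging.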

#### ```Submit job through webportal```

If you submit the job through the webportal and want to enable job debugging for it, set the following configuration in ```jobEnvs```.

![webportal_submit_job](./pic/webportal-job-debugging.png)


### Debug your job after it fails


If the job fails and the user command exits with a non-zero code, the job's container is reserved for debugging.

In the webportal, the job's status is still shown as running. ```TODO: show the debugging job with a specific tag in webportal.```

![webportal_reserved_job_status](./pic/webportal-job-debugging-status.png)

You will find the following log in your job.

```text
job has finished with exit code 2
=============================================================================
====== The job container failed, so it will be reserved for 1 week ======
====== After debugging, please stop the job manually. ======
=============================================================================
```
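
The reservation behind this message comes from the wait loop that this change adds to `dockerContainerScript.mustache`: the container keeps sleeping in 20-second steps while the file `/alive/yarn_$PAI_CONTAINER_ID` was modified less than 30 seconds ago and the configured reservation time has not elapsed. The following Python model of that loop is a sketch for reasoning about its timing, not code used by OpenPAI; `reservation_elapsed` and the `heartbeat_age` callback are hypothetical names, and time is simulated rather than slept.

```python
from typing import Callable


def reservation_elapsed(
    reservation_seconds: int,
    heartbeat_age: Callable[[int], int],
    poll_seconds: int = 20,
    max_heartbeat_age: int = 30,
) -> int:
    """Return how many seconds the container would be kept alive.

    heartbeat_age(t) gives the age, in seconds, of the alive file after
    t seconds of waiting (a stand-in for `date +%s` minus `stat -c %Y`).
    """
    slept = 0
    while heartbeat_age(slept) < max_heartbeat_age and slept < reservation_seconds:
        slept += poll_seconds  # the shell script calls `sleep 20` here
    return slept


# Alive file is always fresh: the full reservation time is used.
print(reservation_elapsed(604800, lambda t: 0))
```

Two consequences follow from the loop's shape: the reservation ends in multiples of the 20-second polling step, and it ends early as soon as the alive file is no longer being refreshed, which is presumably what happens once the job is stopped.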


### Stop your job manually after debugging

You must stop the reserved job manually. Otherwise, the resources it occupies will not be freed until the reservation expires.

![webportal_reserved_job_stop](./pic/webportal-job-debugging-stop.png)


### Job failed due to system error

OpenPAI cannot reserve a job that failed due to a system error, such as excessive file system usage.
1 change: 1 addition & 0 deletions docs/job_tutorial.md
@@ -124,6 +124,7 @@ Below please find the detailed explanation for each of the parameters in the con
| `retryCount` | Integer, optional | Job retry count, no less than 0 |
| `jobEnvs` | Object, optional | Job env parameters, key-value pairs, available in job container and **no substitution allowed** |
| `jobEnvs.paiAzRDMA` | Boolean, optional | If your cluster is Azure RDMA capable, you can set this parameter to make your container Azure RDMA capable. For usage, see this [job example](../examples/azure-rdma-inte-mpi-benchmark-with-horovod-image) |
| `jobEnvs.isDebug` | Boolean, optional | If this flag is set to ```true``` and the user's command exits with a non-zero value, the failed container is reserved for job debugging. [More detail](./job_debugging.md) |

For more details on explanation, please refer to [frameworklauncher usermanual](../subprojects/frameworklauncher/yarn/doc/USERMANUAL.md).

Binary file added docs/pic/webportal-job-debugging-status.png
Binary file added docs/pic/webportal-job-debugging-stop.png
Binary file added docs/pic/webportal-job-debugging.png
4 changes: 3 additions & 1 deletion examples/cluster-configuration/services-configuration.yaml
@@ -82,10 +82,12 @@ rest-server:
default-pai-admin-username: your_default_pai_admin_username
# database admin password
default-pai-admin-password: your_default_pai_admin_password
# rest server would achieve marketplace template from below configed github repository
# rest server would achieve marketplace template from below configed github repository
#github-owner: Microsoft
#github-repository: pai
#github-path: marketplace
# Job Debugging Reservation Seconds.
#debugging-reservation-seconds: 604800

# uncomment following section if you want to customize the port of web portal
# webportal:
1 change: 1 addition & 0 deletions paictl.py
@@ -99,3 +99,4 @@ def main(args):

setup_logging()
main(sys.argv[1:])

8 changes: 8 additions & 0 deletions src/rest-server/config/rest-server.md
@@ -19,6 +19,7 @@ other config fields are optional, includes:
- `github-owner: Microsoft` The marketplace repo owner in GitHub
- `github-repository: pai` The marketplace repo name
- `github-path: marketplace` The marketplace path in the repo
- `debugging-reservation-seconds: 604800` The number of seconds to reserve a failed job container for debugging.
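
The value must be a positive integer, which `rest_server.py` checks in `validation_pre` in this change. A standalone sketch of that check is given below so it can be tried outside the cluster tooling; the function name `validate_reservation_seconds` is hypothetical, and catching `TypeError` (for example for a `None` value) is an addition of this sketch that the original check does not make.

```python
def validate_reservation_seconds(value):
    """Return (ok, error_message) for a debugging-reservation-seconds value."""
    message = '"debugging-reservation-seconds" should be a positive integer.'
    try:
        seconds = int(value)
    except (TypeError, ValueError):
        # Non-numeric strings raise ValueError; None raises TypeError.
        return False, message
    if seconds <= 0:
        return False, message
    return True, None


print(validate_reservation_seconds("604800"))  # (True, None)
```

Returning a tuple of status and message follows the convention that `validation_pre` already uses for the other rest-server fields.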

## Generated Configuration <a name="G_Config"></a>

@@ -34,6 +35,7 @@ rest-server:
github-owner: Microsoft
github-repository: pai
github-path: marketplace
debugging-reservation-seconds: 604800
```
## Table <a name="T_Config"></a>
@@ -99,4 +101,10 @@ rest-server:
<td>cluster_cfg["rest-server"]["etcd-uris"]</td>
<td>String</td>
</tr>
<tr>
<td>rest-server.debugging-reservation-seconds</td>
<td>com["rest-server"]["debugging-reservation-seconds"]</td>
<td>cluster_cfg["rest-server"]["debugging-reservation-seconds"]</td>
<td>String</td>
</tr>
</table>
1 change: 1 addition & 0 deletions src/rest-server/config/rest-server.yaml
@@ -20,3 +20,4 @@ jwt-secret: pai-secret
github-owner: Microsoft
github-repository: pai
github-path: marketplace
debugging-reservation-seconds: 604800
7 changes: 7 additions & 0 deletions src/rest-server/config/rest_server.py
@@ -29,6 +29,12 @@ def validation_pre(self):
            return False, '"default-pai-admin-username" is required in rest-server'
        if 'default-pai-admin-password' not in self.service_configuration:
            return False, '"default-pai-admin-password" is required in rest-server'
        try:
            reservation_time = int(self.service_configuration['debugging-reservation-seconds'])
        except ValueError:
            return False, '"debugging-reservation-seconds" should be a positive integer.'
        if reservation_time <= 0:
            return False, '"debugging-reservation-seconds" should be a positive integer.'

        return True, None

@@ -50,6 +56,7 @@ def run(self):
        service_object_model['github-owner'] = self.service_configuration['github-owner']
        service_object_model['github-repository'] = self.service_configuration['github-repository']
        service_object_model['github-path'] = self.service_configuration['github-path']
        service_object_model['debugging-reservation-seconds'] = self.service_configuration['debugging-reservation-seconds']
        service_object_model['etcd-uris'] = ','.join('http://{0}:4001'.format(host['hostip'])
                                                     for host in machine_list
                                                     if host.get('k8s-role') == 'master')
2 changes: 2 additions & 0 deletions src/rest-server/deploy/rest-server.yaml.template
@@ -57,6 +57,8 @@ spec:
value: {{ cluster_cfg['layout']['kubernetes']['api-servers-url'] }}
- name: AZ_RDMA
value: "{{ cluster_cfg['cluster']['common']['az-rdma']}}"
- name: DEBUGGING_RESERVATION_SECONDS
value: "{{ cluster_cfg['rest-server']['debugging-reservation-seconds']}}"
{% if cluster_cfg['rest-server']['github-owner'] %}
- name: GITHUB_OWNER
value: {{ cluster_cfg['rest-server']['github-owner'] }}
6 changes: 4 additions & 2 deletions src/rest-server/src/config/paiConfig.js
@@ -26,18 +26,20 @@ try {
  paiMachineList = yaml.safeLoad(fs.readFileSync('/pai-cluster-config/layout.yaml', 'utf8'))['machine-list'];
} catch (err) {
  paiMachineList = [];
  logger.info('Unable to load machine list from cluster-configuration.');
  logger.info(err.stack);
  logger.warn('Unable to load machine list from cluster-configuration.');
  logger.warn('The machine list will be initialized as an empty list.');
}

let paiConfigData = {
  machineList: paiMachineList,
  debuggingReservationSeconds: Number(process.env.DEBUGGING_RESERVATION_SECONDS || '604800'),
};


// define the schema for pai configuration
const paiConfigSchema = Joi.object().keys({
  machineList: Joi.array(),
  debuggingReservationSeconds: Joi.number().integer().positive(),
}).required();

2 changes: 2 additions & 0 deletions src/rest-server/src/models/job.js
@@ -445,6 +445,8 @@ class Job {
'azRDMA': azureEnv.azRDMA === 'false' ? false : true,
'paiMachineList': paiConfig.machineList,
'reqAzRDMA': data.jobEnvs && data.jobEnvs.paiAzRDMA === true ? true : false,
'isDebug': data.jobEnvs && data.jobEnvs.isDebug === true ? true : false,
'debuggingReservationSeconds': paiConfig.debuggingReservationSeconds,
});
return dockerContainerScript;
}
18 changes: 18 additions & 0 deletions src/rest-server/src/templates/dockerContainerScript.mustache
@@ -202,6 +202,24 @@ else
  wait $user_command_pid
  user_command_exitcode=$?
  echo "job has finished with exit code $user_command_exitcode"
  {{# isDebug }}
  if [[ $user_command_exitcode -ne 0 ]]; then
    echo "============================================================================="
    echo "====== The job container failed, so it will be reserved for 1 week ======"
    echo "====== After debugging, please stop the job manually. ======"
    echo "============================================================================="

    sleep_time={{ debuggingReservationSeconds }}
    sleep_count=0

    while [ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -lt 30 ] && \
          [ "$sleep_count" -lt "$sleep_time" ]; do
      sleep 20
      sleep_count=$((sleep_count+20))
    done

  fi
  {{/ isDebug }}
  exit $user_command_exitcode
fi

1 change: 1 addition & 0 deletions src/rest-server/test/setup.js
@@ -28,6 +28,7 @@ process.env.DEFAULT_PAI_ADMIN_PASSWORD = 'adminis';
process.env.YARN_URI = 'http://yarn.test.pai:8088';
process.env.K8S_APISERVER_URI = 'http://kubernetes.test.pai:8080';
process.env.AZ_RDMA = 'false';
process.env.DEBUGGING_RESERVATION_SECONDS = '604800';


// module dependencies
