This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Job Debugging] Basic Implement Of Job Debugging. #2272

Merged
merged 9 commits into from
Mar 7, 2019
57 changes: 57 additions & 0 deletions docs/job_debugging.md
@@ -0,0 +1,57 @@
## Job Debugging

To enable job debugging, set the following property in the `jobEnvs` field of the job config when you submit the job. If the job's user command then fails, the container is kept for 1 week, and you can debug it after connecting to it over SSH. After debugging, stop the job manually so that the system resources are released.

- 1 week is the default value. Your cluster admin can change it through the `debugging-reservation-seconds` setting of the rest-server.
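As a quick check on the default, the sketch below converts a reservation length in seconds into days and hours. The value 604800 is the `debugging-reservation-seconds` default that appears in the rest-server configuration of this change; the helper name `describe_reservation` is made up for illustration and is not part of OpenPAI.

```python
from datetime import timedelta

# Default taken from the rest-server configuration in this change.
DEFAULT_RESERVATION_SECONDS = 604800


def describe_reservation(seconds: int) -> str:
    """Render a reservation length as whole days and hours (hypothetical helper)."""
    delta = timedelta(seconds=seconds)
    hours = delta.seconds // 3600  # leftover seconds within the last day
    return f"{delta.days} day(s), {hours} hour(s)"


print(describe_reservation(DEFAULT_RESERVATION_SECONDS))  # 7 day(s), 0 hour(s)
```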


### Submit a debugging job

#### ```Submit job through json file```

If you submit the job through a JSON file and want to enable job debugging for it, add the following configuration to the job's JSON file.
```JSON
{
  "jobEnvs": {
    "isDebug": true
  }
}
```
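If the job JSON is generated by a script rather than written by hand, the flag can be added programmatically. The following is a minimal sketch under that assumption: `enable_debugging` is a hypothetical helper, and the `jobName` field in the sample config is only a placeholder.

```python
import json


def enable_debugging(job_config: dict) -> dict:
    """Return a copy of job_config with jobEnvs.isDebug set to True (hypothetical helper)."""
    updated = dict(job_config)
    # Copy jobEnvs so that the caller's dictionary is left untouched.
    job_envs = dict(updated.get("jobEnvs") or {})
    job_envs["isDebug"] = True
    updated["jobEnvs"] = job_envs
    return updated


# Placeholder job config; only the jobEnvs part matters here.
job = {"jobName": "example-job", "jobEnvs": {"paiAzRDMA": False}}
print(json.dumps(enable_debugging(job), indent=2))
```

Existing `jobEnvs` entries are preserved, so the flag can be combined with other environment parameters.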

#### ```Submit job through webportal```

If you submit the job through the webportal and want to enable job debugging for it, set the following configuration in ```jobEnvs```.

![webportal_submit_job](./pic/webportal-job-debugging.png)


### Debugging your job after it fails


If a job fails and its user command exits with a non-zero code, the job's container is reserved for debugging.

In the webportal, the job's status remains running. ```TODO: show the debugging job with a specific tag in webportal.```

![webportal_reserved_job_status](./pic/webportal-job-debugging-status.png)

You will find the following log in your job.

```text
job has finished with exit code 2
=============================================================================
====== The job container failed, so it will be reserved for 1 week ======
====== After debugging, please stop the job manually. ======
=============================================================================
```
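The reservation behaviour behind this log can be sketched as follows. This is a simplified Python model of the shell loop that this change adds to the container script, not the script itself: the script sleeps in 20-second steps while the file `/alive/yarn_$PAI_CONTAINER_ID` was modified less than 30 seconds ago and the reservation time has not elapsed. The name `simulate_reservation` and its parameters are illustrative, and reading the file's modification time as a liveness signal is an interpretation.

```python
from typing import Iterable


def simulate_reservation(file_ages: Iterable[int],
                         reservation_seconds: int,
                         freshness_window: int = 30,
                         step: int = 20) -> int:
    """Return how many seconds the container would sleep before exiting.

    file_ages yields, for each check, the age in seconds of the liveness
    file (current time minus its modification time).
    """
    slept = 0
    for age in file_ages:
        # Stop once the file is stale or the reservation is used up.
        if age >= freshness_window or slept >= reservation_seconds:
            break
        slept += step  # stands in for `sleep 20` in the shell loop
    return slept


# The file is fresh for two checks and stale on the third.
print(simulate_reservation([5, 10, 45], reservation_seconds=604800))  # 40
```

Because the elapsed time is only checked between sleeps, the loop can overshoot the configured reservation by up to one step.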


### Stop your job manually after debugging

You must stop the reserved job manually; otherwise, the resources it occupies will not be released.

![webportal_reserved_job_stop](./pic/webportal-job-debugging-stop.png)


### Job failed due to system error

A job that fails because of a system error, such as excessive file system usage, cannot be reserved by OpenPAI.
1 change: 1 addition & 0 deletions docs/job_tutorial.md
@@ -124,6 +124,7 @@ Below please find the detailed explanation for each of the parameters in the con
| `retryCount` | Integer, optional | Job retry count, no less than 0 |
| `jobEnvs` | Object, optional | Job env parameters, key-value pairs, available in job container and **no substitution allowed** |
| `jobEnvs.paiAzRDMA` | Boolean, optional | If your cluster is Azure RDMA capable, you can set this parameter to make your container Azure RDMA capable. To learn how to use Azure RDMA, follow this [job example](../examples/azure-rdma-inte-mpi-benchmark-with-horovod-image) |
| `jobEnvs.isDebug` | Boolean, optional | When this flag is set to ```true``` and the user's command exits with a non-zero value, the failed container is reserved for job debugging. [More detail](./job_debugging.md) |

For more details on explanation, please refer to [frameworklauncher usermanual](../subprojects/frameworklauncher/yarn/doc/USERMANUAL.md).
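Note that `jobEnvs.isDebug` in the table above is a Boolean: in this change, `job.js` enables debugging only when `data.jobEnvs.isDebug === true`, so a string such as `"true"` does not turn it on. The sketch below mirrors that strict check in Python; `is_debug_enabled` is an illustrative name, not an OpenPAI function.

```python
def is_debug_enabled(job_config: dict) -> bool:
    """Mirror the strict `=== true` comparison: only the Boolean True enables debugging."""
    job_envs = job_config.get("jobEnvs")
    if not isinstance(job_envs, dict):
        return False
    # `is True` rejects truthy stand-ins such as "true" or 1.
    return job_envs.get("isDebug") is True


print(is_debug_enabled({"jobEnvs": {"isDebug": True}}))    # True
print(is_debug_enabled({"jobEnvs": {"isDebug": "true"}}))  # False
```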

Binary file added docs/pic/webportal-job-debugging-status.png
Binary file added docs/pic/webportal-job-debugging-stop.png
Binary file added docs/pic/webportal-job-debugging.png
4 changes: 3 additions & 1 deletion examples/cluster-configuration/services-configuration.yaml
@@ -82,10 +82,12 @@ rest-server:
default-pai-admin-username: your_default_pai_admin_username
# database admin password
default-pai-admin-password: your_default_pai_admin_password
# rest server would achieve marketplace template from below configed github repository
# rest server would achieve marketplace template from below configed github repository
#github-owner: Microsoft
#github-repository: pai
#github-path: marketplace
# Job Debugging Reservation Seconds.
#debugging-reservation-seconds: 604800

# uncomment following section if you want to customize the port of web portal
# webportal:
1 change: 1 addition & 0 deletions paictl.py
@@ -99,3 +99,4 @@ def main(args):

setup_logging()
main(sys.argv[1:])

8 changes: 8 additions & 0 deletions src/rest-server/config/rest-server.md
@@ -19,6 +19,7 @@ other config fields are optional, includes:
- `github-owner: Microsoft` The marketplace repo owner in GitHub
- `github-repository: pai` The marketplace repo name
- `github-path: marketplace` The marketplace path in the repo
- `debugging-reservation-seconds: 604800` The number of seconds for which a failed job's container is reserved for debugging.
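The `debugging-reservation-seconds` value is expected to be a positive integer. The sketch below restates the check that this change adds to `rest_server.py` as a standalone function. The name `check_reservation_seconds` is hypothetical, and rejecting a missing (`None`) value through `TypeError` is a choice made in this sketch.

```python
from typing import Optional, Tuple

ERROR_MESSAGE = '"debugging-reservation-seconds" should be a positive integer.'


def check_reservation_seconds(raw_value) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if raw_value is a positive integer, else (False, message)."""
    try:
        seconds = int(raw_value)
    except (TypeError, ValueError):
        # TypeError: value missing (None); ValueError: non-numeric string.
        return False, ERROR_MESSAGE
    if seconds <= 0:
        return False, ERROR_MESSAGE
    return True, None


print(check_reservation_seconds(604800))  # (True, None)
print(check_reservation_seconds("abc")[0])  # False
```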

## Generated Configuration <a name="G_Config"></a>

@@ -34,6 +35,7 @@ rest-server:
github-owner: Microsoft
github-repository: pai
github-path: marketplace
debugging-reservation-seconds: 604800
```

## Table <a name="T_Config"></a>
@@ -99,4 +101,10 @@ rest-server:
<td>cluster_cfg["rest-server"]["etcd-uris"]</td>
<td>String</td>
</tr>
<tr>
<td>rest-server.debugging-reservation-seconds</td>
<td>com["rest-server"]["debugging-reservation-seconds"]</td>
<td>cluster_cfg["rest-server"]["debugging-reservation-seconds"]</td>
<td>String</td>
</tr>
</table>
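At runtime, the value in the last row reaches the REST server as the `DEBUGGING_RESERVATION_SECONDS` environment variable (set in the deployment template of this change), and `paiConfig.js` falls back to `604800` when the variable is unset or empty. Below is a rough Python rendering of that lookup, offered as a sketch rather than the server's actual code; `reservation_seconds_from_env` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

# Fallback used when the variable is unset or empty, as in paiConfig.js.
DEFAULT_RESERVATION_SECONDS = "604800"


def reservation_seconds_from_env(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read the reservation length from the environment, defaulting to one week."""
    env = os.environ if environ is None else environ
    raw = env.get("DEBUGGING_RESERVATION_SECONDS") or DEFAULT_RESERVATION_SECONDS
    seconds = int(raw)  # raises ValueError for non-integer input
    if seconds <= 0:
        raise ValueError("DEBUGGING_RESERVATION_SECONDS must be a positive integer")
    return seconds


print(reservation_seconds_from_env({}))  # 604800
print(reservation_seconds_from_env({"DEBUGGING_RESERVATION_SECONDS": "3600"}))  # 3600
```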
1 change: 1 addition & 0 deletions src/rest-server/config/rest-server.yaml
@@ -20,3 +20,4 @@ jwt-secret: pai-secret
github-owner: Microsoft
github-repository: pai
github-path: marketplace
debugging-reservation-seconds: 604800
7 changes: 7 additions & 0 deletions src/rest-server/config/rest_server.py
@@ -29,6 +29,12 @@ def validation_pre(self):
            return False, '"default-pai-admin-username" is required in rest-server'
        if 'default-pai-admin-password' not in self.service_configuration:
            return False, '"default-pai-admin-password" is required in rest-server'
        try:
            reservation_time = int(self.service_configuration['debugging-reservation-seconds'])
        except (TypeError, ValueError):
            # int() raises TypeError for an empty (None) value and ValueError for a non-numeric string.
            return False, '"debugging-reservation-seconds" should be a positive integer.'
        if reservation_time <= 0:
            return False, '"debugging-reservation-seconds" should be a positive integer.'

        return True, None

@@ -50,6 +56,7 @@ def run(self):
        service_object_model['github-owner'] = self.service_configuration['github-owner']
        service_object_model['github-repository'] = self.service_configuration['github-repository']
        service_object_model['github-path'] = self.service_configuration['github-path']
        service_object_model['debugging-reservation-seconds'] = self.service_configuration['debugging-reservation-seconds']
        service_object_model['etcd-uris'] = ','.join('http://{0}:4001'.format(host['hostip'])
                                                     for host in machine_list
                                                     if host.get('k8s-role') == 'master')
2 changes: 2 additions & 0 deletions src/rest-server/deploy/rest-server.yaml.template
@@ -57,6 +57,8 @@ spec:
value: {{ cluster_cfg['layout']['kubernetes']['api-servers-url'] }}
- name: AZ_RDMA
value: "{{ cluster_cfg['cluster']['common']['az-rdma']}}"
- name: DEBUGGING_RESERVATION_SECONDS
value: "{{ cluster_cfg['rest-server']['debugging-reservation-seconds']}}"
{% if cluster_cfg['rest-server']['github-owner'] %}
- name: GITHUB_OWNER
value: {{ cluster_cfg['rest-server']['github-owner'] }}
6 changes: 4 additions & 2 deletions src/rest-server/src/config/paiConfig.js
@@ -26,18 +26,20 @@ try {
paiMachineList = yaml.safeLoad(fs.readFileSync('/pai-cluster-config/layout.yaml', 'utf8'))['machine-list'];
} catch (err) {
paiMachineList = [];
logger.info('Unable to load machine list from cluster-configuration.');
logger.info(err.stack);
logger.warn('Unable to load machine list from cluster-configuration.');
logger.warn('The machine list will be initialized as an empty list.');
}

let paiConfigData = {
machineList: paiMachineList,
debuggingReservationSeconds: Number(process.env.DEBUGGING_RESERVATION_SECONDS || '604800'),
};


// define the schema for pai configuration
const paiConfigSchema = Joi.object().keys({
machineList: Joi.array(),
debuggingReservationSeconds: Joi.number().integer().positive(),
}).required();


2 changes: 2 additions & 0 deletions src/rest-server/src/models/job.js
@@ -445,6 +445,8 @@ class Job {
'azRDMA': azureEnv.azRDMA === 'false' ? false : true,
'paiMachineList': paiConfig.machineList,
'reqAzRDMA': data.jobEnvs && data.jobEnvs.paiAzRDMA === true ? true : false,
'isDebug': data.jobEnvs && data.jobEnvs.isDebug === true ? true : false,
'debuggingReservationSeconds': paiConfig.debuggingReservationSeconds,
});
return dockerContainerScript;
}
18 changes: 18 additions & 0 deletions src/rest-server/src/templates/dockerContainerScript.mustache
@@ -202,6 +202,24 @@ else
  wait $user_command_pid
  user_command_exitcode=$?
  echo "job has finished with exit code $user_command_exitcode"
  {{# isDebug }}
  if [[ $user_command_exitcode -ne 0 ]]; then
    echo "============================================================================="
    echo "====== The job container failed, so it will be reserved for 1 week ======"
    echo "====== After debugging, please stop the job manually. ======"
    echo "============================================================================="

    sleep_time={{ debuggingReservationSeconds }}
    sleep_count=0

    # Sleep in 20-second steps while /alive/yarn_$PAI_CONTAINER_ID was modified
    # less than 30 seconds ago and the reservation time has not elapsed.
    while [ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -lt 30 ] && \
      [ "$sleep_count" -lt "$sleep_time" ]; do
      sleep 20
      sleep_count=$((sleep_count+20))
    done

  fi
  {{/ isDebug }}
  exit $user_command_exitcode
fi

1 change: 1 addition & 0 deletions src/rest-server/test/setup.js
@@ -28,6 +28,7 @@ process.env.DEFAULT_PAI_ADMIN_PASSWORD = 'adminis';
process.env.YARN_URI = 'http://yarn.test.pai:8088';
process.env.K8S_APISERVER_URI = 'http://kubernetes.test.pai:8080';
process.env.AZ_RDMA = 'false';
process.env.DEBUGGING_RESERVATION_SECONDS = '604800';


// module dependencies