Runtime component for deep learning workloads
To better support deep learning workloads, OpenPAI implements "PAI Runtime", a module that provides runtime support to job containers.
One major feature of PAI Runtime is the instantiation of runtime environment variables. PAI Runtime provides several built-in runtime environment variables, including the container's role name and index, and the IPs and ports of all containers used in the job. With these environment variables and Framework Controller, users can onboard custom workloads (e.g., MPI, TensorBoard) without involving (or modifying) the OpenPAI platform itself. OpenPAI further allows users to define custom runtime environment variables tailored to their workloads.
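For example, a distributed job can discover its peers from these variables. The sketch below is illustrative only: the variable names follow OpenPAI's documented naming pattern, but the exact set available in a container depends on the job config, so treat `PAI_TASK_ROLE_NAME`, `PAI_TASK_INDEX`, `PAI_TASK_ROLE_TASK_COUNT_worker`, and `PAI_HOST_IP_worker_<i>` as assumptions to verify against your cluster's documentation.

```python
import os

# Illustrative sketch: discover this container's identity and its peers
# from PAI runtime environment variables. Names assume a task role
# called "worker"; verify the exact variable names on your cluster.
role = os.environ.get("PAI_TASK_ROLE_NAME", "worker")   # role of this container
index = int(os.environ.get("PAI_TASK_INDEX", "0"))      # index within the role

# Number of tasks in the "worker" role, then each peer's IP.
n_tasks = int(os.environ.get("PAI_TASK_ROLE_TASK_COUNT_worker", "1"))
peers = [os.environ.get(f"PAI_HOST_IP_worker_{i}", "?") for i in range(n_tasks)]

print(f"{role}[{index}] sees workers at: {peers}")
```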
Another major feature of PAI Runtime is the introduction of the "PAI runtime plugin". A runtime plugin provides a way for users to customize the runtime behavior of a job container; essentially, it is a generic mechanism for injecting code during container initialization or termination. OpenPAI implements several built-in plugins for commonly desired features, including a storage plugin that mounts a remote storage service into the job container, an SSH plugin that enables SSH access to each container, and a failure analysis plugin that diagnoses the failure reason when a container fails. We envision more features being implemented through the plugin mechanism; a minimal sketch of the plugin idea follows the feature list below.
The current runtime features include:
- Runtime environment variables: prepare OpenPAI runtime environment variables for each job container
- Failure analysis: report possible job failure reasons based on known failure patterns
- Storage plugin: automatically mount remote storage according to the storage config
- SSH plugin: support SSH access to job containers
- Cmd plugin: run customized commands before/after the job
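To make the plugin mechanism concrete, here is a minimal sketch, assuming a plugin receives its parameters from the job config and contributes commands that run around the user's job. The `CmdPlugin` class, its parameter keys, and the `wrap` method are hypothetical illustrations, not the actual plugin API.

```python
# Minimal sketch of the plugin idea: a plugin receives parameters from
# the job config and contributes commands that run before and after the
# user's own commands. Class and field names here are hypothetical.
class CmdPlugin:
    def __init__(self, parameters: dict):
        self.pre = parameters.get("preCommands", [])    # run at container init
        self.post = parameters.get("postCommands", [])  # run at container exit

    def wrap(self, user_commands: list) -> list:
        # Inject the configured commands around the user's job commands.
        return self.pre + user_commands + self.post

# Hypothetical usage: commands the runtime would assemble into the job script.
plugin = CmdPlugin({"preCommands": ["echo setup"], "postCommands": ["echo cleanup"]})
print(plugin.wrap(["python train.py"]))
```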
To build the openpai-runtime Docker image, run:

```sh
docker build -f ./build/openpai-runtime.dockerfile .
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.