AWS Summit Seoul 2019 - How to scale your ML job with Amazon EKS
- SlideShare: https://pt.slideshare.net/awskorea/amazon-eks-lg-aws-summit-seoul-2019
- Video: https://www.youtube.com/watch?v=egv2TlfLL1Y&list=PLORxAVAC5fUWyB6Hsk9ibYJHw97k1h6s9&index=46
- setup.sh: setup script to run before the demo
- train.py: model script to train
- train-oom.py: model script to demonstrate an OOM error
- example.yaml: example of a Kubernetes job
- oom.yaml: example of an OOM-killed job
- experiments.yaml: list of experiments to train
- run-experiments.py: Python code to run the experiments
- Dockerfile: file to build the Docker image hongkunyoo/eks-ml
- Dockerfile.oom: file to build the OOM training image hongkunyoo/eks-ml:oom
Install awscli and configure your ACCESS_KEY and SECRET_KEY.
The IAM user should have the following permissions:
- EKS full access
- CloudFormation full access
- EC2 full access
- IAM full access
These permissions are broad, close to administrator access. Use them at your own risk.
Run setup.sh as a sudo user.
sudo ./setup.sh
Apply the first example job
kubectl apply -f example.yaml
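A Kubernetes Job such as example.yaml typically declares the container image, the command, and resource requests and limits. The actual contents of example.yaml are not shown here, so this is a hedged sketch that builds a comparable batch/v1 Job manifest in Python. The job name, the `python train.py` entrypoint, the train.py arguments, and the resource values are all hypothetical. kubectl accepts JSON manifests as well as YAML, so the printed output could be piped to `kubectl apply -f -`.

```python
import json
from typing import Dict, List


def make_training_job(name: str, image: str, args: List[str],
                      memory_limit: str = "1Gi", cpu_limit: str = "1") -> Dict:
    """Return a batch/v1 Job manifest that runs a training container once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # Job pods need Never or OnFailure
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "command": ["python", "train.py"],  # assumed entrypoint
                        "args": args,
                        "resources": {
                            # Exceeding limits.memory gets the container OOM-killed.
                            "requests": {"memory": memory_limit, "cpu": cpu_limit},
                            "limits": {"memory": memory_limit, "cpu": cpu_limit},
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and arguments; train.py's real CLI may differ.
    job = make_training_job("example", "hongkunyoo/eks-ml", ["--epochs", "1"])
    print(json.dumps(job, indent=2))
```

Setting requests equal to limits keeps the sketch simple; a real job may choose different values.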
Run all experiments
python run-experiments.py
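The logic of run-experiments.py and the schema of experiments.yaml are not described here. As a minimal sketch of one plausible approach, assuming each experiment is a name plus a list of train.py arguments, the script below renders one Job per experiment and pipes it to `kubectl apply -f -`. The `EXPERIMENTS` list, the helpers `job_name`, `make_job`, and `submit`, and the `--submit` flag are all hypothetical, and the YAML file is replaced by an inline list to avoid a PyYAML dependency.

```python
import json
import re
import subprocess
import sys
from typing import Dict, List

# Hypothetical stand-in for experiments.yaml; the real schema may differ.
EXPERIMENTS: List[Dict] = [
    {"name": "lr-0.01", "args": ["--lr", "0.01"]},
    {"name": "lr-0.001", "args": ["--lr", "0.001"]},
]


def job_name(experiment_name: str) -> str:
    """Build a Kubernetes-safe name: lowercase alphanumerics and '-'."""
    name = re.sub(r"[^a-z0-9-]+", "-", experiment_name.lower()).strip("-")
    # Keep it short enough to be reused as a label value (63 chars max).
    return f"train-{name}"[:63].rstrip("-")


def make_job(experiment: Dict, image: str = "hongkunyoo/eks-ml") -> Dict:
    """Render one batch/v1 Job that runs train.py with the experiment's args."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name(experiment["name"])},
        "spec": {
            "backoffLimit": 0,
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "train",
                    "image": image,
                    "command": ["python", "train.py"],  # assumed entrypoint
                    "args": experiment.get("args", []),
                }],
            }},
        },
    }


def submit(job: Dict) -> None:
    """Pipe a manifest to kubectl; requires kubectl configured for the cluster."""
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=json.dumps(job), text=True, check=True)


if __name__ == "__main__":
    for job in (make_job(e) for e in EXPERIMENTS):
        if "--submit" in sys.argv:
            submit(job)
        else:
            print(json.dumps(job))  # dry run: print manifests only
```

Run without arguments, the sketch only prints the manifests; passing `--submit` applies them, assuming kubectl already points at the EKS cluster created by setup.sh.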