This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Update config files for g4 #20

Merged
merged 18 commits into from
May 12, 2020


Contributor

@ChaiBapchya ChaiBapchya commented Apr 17, 2020

Config files for G4 instance on MXNet CI [unix-gpu slaves]

UNIX AMI Creation changes

G4 instances have NVIDIA Tesla T4 GPUs, which need a newer NVIDIA driver

  1. Refactored setup
  2. Updates:
    • nvidia-driver 418 -> 440†
    • CUDA driver 10.1 -> 10.2
    • Docker version 18 -> 19, Docker Compose
  3. Upgrade host OS Ubuntu 16.04 -> 18.04
  4. Update instance type [p3/g3 -> g4]

† G4 instances require driver version 418.87 or later.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
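The driver requirement in the footnote can be checked on a running instance. The following is a minimal sketch, not part of this PR: the `version_ge` helper and the `MIN_DRIVER_VERSION` variable are hypothetical names, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Minimum driver version for G4 instances, taken from the footnote above.
MIN_DRIVER_VERSION="418.87"

# version_ge A B: succeeds when version A >= version B.
# Hypothetical helper; assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only query the driver when nvidia-smi is present, so the script is
# harmless on machines without a GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed" "$MIN_DRIVER_VERSION"; then
        echo "driver $installed is new enough for G4"
    else
        echo "driver $installed is older than $MIN_DRIVER_VERSION" >&2
    fi
fi
```

A check like this could run at the end of AMI creation to catch an image built with the old 418 driver before it is used on a G4 instance.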

Autoscaling Lambda function changes

  1. Added environment variables [test + prod] for G4:
    • launch template
    • EXECUTORS_PER_LABEL
    • WARM_POOL_SIZE
    • MINIMUM_QUEUE_TIMES_SEC
    • CCACHE_EFS_DNS
    • MAXIMUM_STARTUP_TIME_SEC
    • MANAGED_JENKINS_NODE_LABELS
  2. Reduced executors per label for linux-cpu from 3 to 2.
    This resolves the "can't connect to linux-cpu" error
    by reducing the number of parallel jobs per instance.
  3. Updated state from 'starting' to 'pending' [as 'starting' is an incorrect state]
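The last item can be illustrated with a small check. This is a sketch under an assumption: that the state in question is compared against EC2 instance state names, where `pending` is a valid lifecycle state and `starting` is not. The function name `is_ec2_instance_state` is hypothetical and does not come from the Lambda code.

```shell
#!/bin/sh
# is_ec2_instance_state NAME: succeeds when NAME is one of the EC2
# instance lifecycle state names. Hypothetical helper for illustration.
is_ec2_instance_state() {
    case "$1" in
        pending|running|shutting-down|terminated|stopping|stopped)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Show why a filter on 'starting' would never match any instance.
for state in starting pending; do
    if is_ec2_instance_state "$state"; then
        echo "$state: valid EC2 instance state"
    else
        echo "$state: not an EC2 instance state"
    fi
done
```

If the assumption holds, a lookup keyed on `starting` would silently match no instances, which is consistent with the fix described above.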

@leezu
Contributor

leezu commented Apr 17, 2020

I suggest having a single folder for all GPU instances. Rather than introducing a new folder for G4, consolidate the existing two folders into a single one.

@ChaiBapchya
Contributor Author

Sure makes sense!

@marcoabreu
Contributor

Why is it necessary at all? You can create an image on one machine type and just run it on all.

@leezu
Contributor

leezu commented Apr 17, 2020

G4 needs a more recent driver, so the AMI requires an update. Yes, there should be only a single AMI.

@marcoabreu
Contributor

Sure, but why not just simply update the existing setup script instead of introducing a new one?

@leezu
Contributor

leezu commented Apr 17, 2020

Yes, that's what I suggested in the first comment above #20 (comment)

@ChaiBapchya
Contributor Author

@marcoabreu @leezu Updated with a single folder for all GPU instances.
Made a note of what needs to be updated in the README.md file of slave-creation-unix folder.

[Resolved, outdated review thread on tools/jenkins-slave-creation-unix/README.md]
@josephevans
Contributor

@ChaiBapchya After our testing on Friday, I think we should also disable automatic Ubuntu updates. We know there is some fragility around the nvidia driver (if gcc is updated, for example, the driver stops working on the DLAMI based on Ubuntu.)

@ChaiBapchya
Contributor Author

https://askubuntu.com/questions/1059971/disable-updates-from-command-line-in-ubuntu-16-04

Adding

systemctl disable --now apt-daily{,-upgrade}.{timer,service}
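The brace expression in the command above expands (in bash) to four unit names. As a minimal sketch, the snippet below spells those units out and reports their enablement state without changing anything. The function names `apt_daily_units` and `report_apt_daily_units` are hypothetical; `systemctl is-enabled` is the only systemd call used.

```shell
#!/bin/sh
# The four units named by apt-daily{,-upgrade}.{timer,service}, written
# out explicitly so this also works in shells without brace expansion.
apt_daily_units() {
    for base in apt-daily apt-daily-upgrade; do
        for kind in timer service; do
            printf '%s.%s\n' "$base" "$kind"
        done
    done
}

# Print the enablement state of each unit. Read-only: nothing is
# enabled, disabled, started, or stopped here.
report_apt_daily_units() {
    for unit in $(apt_daily_units); do
        # is-enabled exits non-zero for disabled units, hence `|| true`.
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        printf '%s: %s\n' "$unit" "${state:-unknown}"
    done
}

# Skip the report on hosts without systemd.
if command -v systemctl >/dev/null 2>&1; then
    report_apt_daily_units
fi
```

Running the report before and after the `systemctl disable --now` command is one way to confirm that the unattended apt timers are actually off on the AMI.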

@ChaiBapchya
Contributor Author

@leezu @josephevans Plz help review/merge.
Thanks!

@leezu leezu merged commit 639cd92 into apache:master May 12, 2020
@ChaiBapchya ChaiBapchya deleted the g4_conf branch July 26, 2020 07:33