Skip to content

Helmut Hoffer von Ankershoffen experimenting with arm64 based NVIDIA Jetson (Nano and AGX Xavier) edge devices running Kubernetes (K8s) for machine learning (ML) including Jupyter Notebooks, TensorFlow Training and TensorFlow Serving using CUDA for smart IoT.

License

Notifications You must be signed in to change notification settings

helmut-hoffer-von-ankershoffen/jetson

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NVIDIA Jetson Nano and NVIDIA Jetson AGX Xavier for Kubernetes (K8s) and machine learning (ML) for smart IoT

Experimenting with arm64 based NVIDIA Jetson (Nano and AGX Xavier) edge devices running Kubernetes (K8s) for machine learning (ML) including Jupyter Notebooks, TensorFlow Training and TensorFlow Serving using CUDA for smart IoT.

Hints:

  • Assumes an NVIDIA Jetson Nano, TX1, TX2 or AGX Xavier as edge device.
  • Assumes a macOS workstation for development such as MacBook Pro
  • Assumes access to a bare-metal Kubernetes cluster the Jetson devices can join e.g. set up using Project Max.
  • Assumes basic knowledge of Ansible, Docker and Kubernetes.

Mission

  • Evaluate feasibility and complexity of Kubernetes + machine learning on edge devices for future smart IoT projects
  • Provide patterns to the Jetson community on how to use Ansible and Kubernetes with Jetson devices for automation and orchestration
  • Lower the entry barrier for some machine learning students by using cheap edge devices available offline with this starter instead of requiring commercial clouds available online
  • Provide a modern Jupyter based infrastructure for students of the Stanford CS229 course using Octave as kernel
  • Remove some personal rust regarding deep learning, multi ARM ,-) bandits, artificial intelligence in general and have fun

alt text

Features

  • basics: Prepare hardware including shared shopping list
  • basics: Automatically provision requirements on macOS device for development
  • basics: Manually provision root os using NVIDIA JetPack and balenaEtcher on Nanos
  • basics: Works with latest JetPack 4.2.1 and default container runtime
  • basics: Works with official NVIDIA Container Runtime, NGC and nvcr.io/nvidia/l4t-base base image provided by NVIDIA
  • basics: Automatically setup secure ssh access
  • basics: Automatically setup sudo
  • basics: Automatically setup basic packages for development, DevOps and CloudOps
  • basics: Automatically setup LXDE for decreased memory consumption
  • basics: Automatically setup VNC for headless access
  • basics: Automatically setup RDP for headless access
  • basics: Automatically setup performance mode including persistent setting of max frequencies for CPU+GPU+EMC to boost throughput
  • basics: Automatically build custom kernel as required by Docker + Kubernetes + Weave networking and allowing boot from USB 3.0 / SSD drive
  • basics: Automatically provision Jetson Xaviers (in addition to Jetson Nanos) including automatic setup of guest VM for building custom kernel and rootfs and headless oem-setup using a USB-C/serial connection from the guest VM to the Xavier device
  • basics: Automatically setup NVMe / SSD drive and use for Docker container images and volumes used by Kubernetes on Jetson Xaviers as part of provisioning
  • k8s: Automatically join Kubernetes cluster max as worker node labeled as jetson:true and jetson_model:[nano,xavier] - see Project Max reg. max
  • k8s: Automatically build and deploy CUDA deviceQuery as pod in Kubernetes cluster to validate access to GPU and correct labeling of jetson based Kubernetes nodes
  • k8s: Build and deploy using Skaffold, a custom remote builder for Jetson devices and kustomize for configurable Kubernetes deployments without the need to write Go templates in Helm for Tiller
  • k8s: Integrate Container Structure Tests into workflows
  • ml: Automatically repack CUDNN, TensorRT and support libraries including python bindings that are bundled with JetPack on Jetson host as deb packages using dpkg-repack as part of provisioning for later use in Docker builds
  • ml: Automatically build Docker base image jetson/[nano,xavier]/ml-base including CUDA, CUDNN, TensorRT, TensorFlow and Anaconda for arm64
  • ml: Automatically build and deploy Jupyter server for data science with support for CUDA accelerated Octave, Seaborn, TensorFlow and Keras as pod in Kubernetes cluster running on jetson node
  • ml: Automatically build Docker base image jetson/[nano,xavier]/tensorflow-serving-base using bazel and the latest TensorFlow core including support for CUDA capabilities of Jetson edge devices
  • ml: Automatically build and deploy simple Python/Fast API based webservice as facade of TensorFlow Serving for getting predictions inc. health check for K8s probes, interactive OAS3 API documentation, request/response validation, access of TensorFlow Serving via REST and alternatively gRPC
  • ml: Provide variants of Docker images and Kubernetes deployments for Jetson AGX Xavier using auto-activation of Skaffold profiles and Kustomize overlays
  • ml: Build out end to end example application using S3 compatible shared model storage provided by minio for deploying models trained using TensorFlow for inference using TensorFlow serving
  • ml: Automatically setup machine learning workflows using Kubeflow
  • optional: Semi-automatically setup USB 3.0 / SSD boot drive given existing installation on micro SD card on Jetson Nanos
  • optional: Automatically cross-build arm64 Docker images on macOS for Jetson devices using buildkit and buildx
  • optional: Automatically setup firewall on host using ufw for basic security
  • community: Publish versioned images on Docker Hub per release and provide Skaffold profiles to pull from Docker Hub instead of having to build images before deploy in Kubernetes cluster
  • community: Author a series of blog posts explaining how to set up ML in Kubernetes on Jetson devices based on this starter, gives more context and explains the toolset used.
  • ml: Deploy Polarize.AI ml training and inference tiers on Jetson nodes (separate project)

Hardware

Shopping list (for Jetson Nano)

Total ca. $210 including options.

Hints:

  • Assumes a USB mouse, USB keyboard, HDMI cable and monitor is given as required for initial boot of the root OS
  • Assumes an existing bare-metal Kubernetes cluster including an Ethernet switch to connect to - have a look at Project Max for how to build one with mini PCs or Project Ceil for the Raspberry PI variant.
  • Except for injecting the SSD into its enclosure and shortening one jumper (as described below) no assembly is required.

Shopping list (for Jetson Xavier)

Total ca. $767.

Hints:

  • A USB mouse, USB keyboard, HDMI cable and monitor is not required for initial boot of the root OS as the headless OEM-setup is supported
  • Assumes an existing bare-metal Kubernetes cluster including an Ethernet switch to connect to - have a look at Project Max for how to build one with mini PCs or Project Ceil for the Raspberry PI variant.
  • Except for installing the NVMe SSD (as described below) no assembly is required.

Install NVMe SSD (required)

As the eMMC soldered onto a Xavier board is 32GB only an SSD is required to provide adequate disk space for Docker images and volume.

  1. Install the NVMe SSD by connecting to the M.2 port of the daughterboard as described in this article - you can skip the section about setting up the installed SSD in the referenced article as integration (mounting, creation of filesystem, synching of files etc.) is done automatically during provisioning as described in part 2 below.
  2. Provisioning as described below will automatically integrate the SSD to provide the /var/lib/docker directory

Bootstrap macOS development environment

  1. Execute make bootstrap-environment to install requirements on your macOS device and setup hostnames such as nano-one.local in your /etc/hosts

Hint:

  • During bootstrap you will have to enter the passsword of the current user on your macOS device to allow software installation multiple times
  • The system / security settings of your macOS device must allow installation of software coming from outside of the macOS AppStore
  • make is used as a facade for all workflows triggered from the development workstation - execute make help on your macOS device to list all targets and see the Makefile in this directory for details
  • Ansible is used for configuration management on macOS and edge devices - all Ansible roles are idempotent thus can be executed repeatedly e.g. after you make changes in the configuration in workflow/provision/group_vars/all.yml
  • Have a look at workflow/requirements/macOS/ansible/requirements.yml for a list of packages and applications installed

Provision Jetson devices and join Kubernetes cluster as nodes

Part 1 (for Jetson Nanos): Manually flash root os, create provision account as part of oem-setup and automatically establish secure access

  1. Execute make nano-image-download on your macOS device to download and unzip the NVIDIA Jetpack image into workflow/provision/image/
  2. Start the balenaEtcher application and flash your micro sd card with the downloaded image
  3. Wire the nano up with the ethernet switch of your Kubernetes cluster
  4. Insert the designated micro sd card in your NVIDIA Jetson nano and power up
  5. Create a user account with username provision and "Administrator" rights via the UI and set nano-one as hostname - wait until the login screen appears
  6. Execute make setup-access-secure and enter the password you set for the provision user in the step above when asked - passwordless ssh access and sudo will be set up

Hints:

  • The balenaEtcher application was installed as part of bootstrap - see above
  • Step 5 requires to wire up your nano with a USB mouse, keyboard and monitor - after that you can unplug and use the nano with headless access using ssh, VNC, RDP or http
  • In case of a Jetson Nano you might experience intermittent operation - make sure to provide an adequate power supply - the best option is to put a Jumper on pins J48 and use a DC barrel jack power supply (with 5.5mm OD / 2.1mm ID / 9.5mm length, center pin positive) than can supply 5V with 4A - see NVIDIA DevTalk for a schema of the board layout.
  • Depending on how the DHCP server of your cluster is configured you might have to adapt the IP address of nano-one.local in workflow/requirements/generic/ansible/playbook.yml - run make requirements-hosts after updating the IP address which in turn updates the /etc/hosts file of your mac
  • SSH into your Nano using make nano-one-ssh - your ssh public key was uploaded in step 6 above so no password is asked for

Part 1 (for Jetson AGX Xaviers): Provision guest VM (Ubuntu), build custom Kernel and rootfs in guest VM, flash Xavier, create provision account on Xavier as part of oem-setup and automatically establish secure access

  1. Execute make guest-sdk-manager-download on your macOS device and follow instructions shown to download the NVIDIA SDK Manager installer into workflow/guest/download/ - if you fail to download the NVIDIA SDK manager you will be instructed in the next step on how to do it.
  2. Execute make guest-build to automatically a) create, boot and provision a Ubuntu guest VM on your macOS device using Vagrant, Virtual Box and Ansible and b) build a custom kernel and rootfs inside the guest VM for flashing the Xavier device - the Linux kernel is built with requirements for Kubernetes + Weave networking - such as activating IP sets and the Open vSwitch datapath module, SDK components are added to the rootfs for automatic installation during provisioning (see part 2). You will be prompted to download SDK components via the NVIDIA SDK manager that was automatically installed during provisioning of the guest VM - please do as instructed on-screen.
  3. Execute make guest-flash to flash the Xavier with the custom kernel and rootfs - wire up the Xavier with your macOS device using USB-C and enter the recovery mode by powering up and pressing the buttons as described in the printed user manual that was part of your Jetson AGX Xavier shipment before execution
  4. Execute make guest-oem-setup to start the headless oem-setup process. Follow the on-screen instructions to setup your locale and timezone, create a user account called provision and set an initial password - press the reset button of your Xavier after flashing before triggering the oem-setup.
  5. Execute make setup-access-secure and enter the password you set for the user provision in the step above when asked - passwordless ssh access and sudo will be set up

Hints:

  • There is no need to wire up your Xavier with a USB mouse, keyboard and monitor for oem-setup as headless oem-setup is implemented as described in step 4
  • Depending on how the DHCP server of your cluster is configured you might have to adapt the IP address of xavier-one.local in workflow/requirements/generic/ansible/playbook.yml - you can check the assigned IP after step 4 by logging in as user provision with the password you set, executing ifconfig eth0 | grep 'inet' and checking the IP address shown - run make requirements-hosts after updating the IP address which in turn updates the /etc/hosts file of your mac
  • SSH into your Xavier using make xavier-one-ssh - your ssh public key was uploaded in step 5 above so no password is asked for

Part 2 (for all Jetson devices): Automatically provision tools, services, kernel and Kubernetes on Jetson host

  1. Execute make provision - amongst others services will provisioned, kernel will be compiled (on Nanos only), Kubernetes cluster will be joined

Hints:

  • You might want to check settings in workflow/provision/group_var/all.yml
  • You might want to use make provision-nanos and make provision-xaviers to provision one type of Jetson model only
  • For Nanos the Linux kernel is built with requirements for Kubernetes + Weave networking on device during provisioning - such as activating IP sets and the Open vSwitch datapath module - thus initial provisioning takes ca. one hour - for Xaviers a custom kernel is built in the guest VM and flashed in part 1 (see above)
  • Execute kubectl get nodes to check that your edge devices joined your cluster and are ready
  • To easily inspect the Kubernetes cluster in depth execute the lovely click from the terminal which was installed as part of bootstrap - enter nodes to list the nodes with the nano being one of them - see Click for details
  • VNC into your Nano by starting the VNC Viewer application which was installed as part of bootstrap and connect to nano-one.local:5901 or xavier-one.local:5901 respectively - the password is secret
  • For Xaviers the NVMe SSD is automatically integrated during provisioning to provide /var/lib/docker. For Nanos you can optionally use a SATA SSD as boot device as described below.
  • If you want to provision step by step execute make help | grep "provision-" and execute the desired make target e.g. make provision-kernel
  • Provisioning is implemented idempotent thus it is safe to repeat provisioning as a whole or individual steps

Part 3 (for all devices): Automatically build Docker images and deploy services to Kubernetes cluster

  1. Execute make ml-base-build-and-test once to build the Docker base image jetson/[nano,xavier]/ml-base, test via container structure tests and push to the private registry of your cluster, which, amongst others, includes CUDA, CUDNN, TensorRT, TensorFlow, python bindings and Anaconda - have a look at the directory workflow/deploy/ml-base for details - most images below derive from this image
  2. Execute make device-query-deploy to build and deploy a pod into the Kubernetes cluster that queries CUDA capabilities thus validating GPU and Tensor Core access from inside Docker and correct labeling of Jetson/arm64 based Kubernetes nodes - execute make device-query-log-show to show the result after deploying
  3. Execute make jupyter-deploy to build and deploy a Jupyter server as a Kubernetes pod running on nano supporting CUDA accelerated TensorFlow + Keras including support for Octave as an alternative Jupyter Kernel in addition to iPython - execute make jupyter-open to open a browser tab pointing to the Jupyter server to execute the bundled Tensorflow Jupyter notebooks for deep learning
  4. Execute make tensorflow-serving-base-build-and-test once to build the TensorFlow Serving base image jetson/[nano,xavier]/tensorflow-serving-base test via container structure tests and push to the private registry of your cluster - have a look at the directory workflow/deploy/tensorflow-serving-base for details - most images below derive from this image
  5. Execute make tensorflow-serving-deploy to build and deploy TensorFlow Serving plus a Python/Fast API based Webservice for getting predictions as a Kubernetes pod running on nano - execute make tensorflow-serving-docs-open to open browser tabs pointing to the interactive OAS3 documentation Webservice API; execute make tensorflow-serving-health-check to check the health as used in K8s readiness and liveness probes; execute make tensorflow-serving-predict to get predictions

Hints:

  • To target Xavier devices for build, test, deploy and publish execute export JETSON_MODEL=xavier in your shell before executing the make ... commands which will auto-activate the matching Skaffold profiles (see below) - if JETSON_MODEL is not set the Nanos will be targeted
  • All deployments automatically create a private Kubernetes namespace using the pattern jetson-$deployment - e.g. jetson-jupyter for the Jupyter deployment - thus easing inspection in the Kubernetes dashboard, click or similar
  • All deployments provide a target for deletion such as make device-query-delete which will automatically delete the respective namespace, persistent volumes, pods, services, ingress and loadbalancer on the cluster
  • All builds on the Jetson device run as user build which was created during provisioning - use make nano-one-ssh-build or make xavier-one-ssh-build to login as user build to monitor intermediate results
  • Remote building on the Jetson device is implemented using Skaffold and a custom builder - have a look at workflow/deploy/device-query/skaffold.yaml and workflow/deploy/tools/builder for the approach
  • Skaffold supports a nice build, deploy, tail, watch cycle - execute make device-query-dev as an example
  • Containers mount devices /dev/nv* at runtime to access the GPU and Tensor Cores - see workflow/deploy/device-query/kustomize/base/deployment.yaml for details
  • Kubernetes deployments are defined using kustomize - you can thus define kustomize overlays for deployments on other clusters or scale-out easily
  • Kustomize overlays can be easily referenced using Skaffold profiles - have a look at workflow/deploy/device-query/skaffold.yaml and workflow/deploy/device-query/kustomize/overlays/xavier for an example - in this case the xavier profile is auto-activated respecting the JETSOON_MODEL environment variable (see above) with the profile in turn activating the xavier Kustomize overlay
  • All containers provide targets for Google container structure tests - execute make device-query-build-and-test as an example
  • For easier consumption all container images are published on Docker Hub - if you want to publish your own create a file called .docker-hub.auth in this directory (see docker-hub.auth.template) and execute the approriate make target, e.g. make ml-base-publish
  • If you did not use Project Max to provision your bare-metal Kubernetes cluster make sure your cluster provides a DNSMASQ and DHCP server, firewalling, VPN and private Docker registry as well as support for Kubernetes features such as persistent volumes, ingress and loadbalancing as required during build and in deployments - adapt occurrences of the name max-one accordingly to wire up with your infrastructure
  • The webservice of tensorflow-serving accesses TensorFlow Serving via its REST or alternatively the Python bindings of the gRPC API - have a look at the directory workflow/deploy/tensorflow-serving/src/webservice for details of the implementation

Optional: Configure swap

  1. Execute provision-swap

Hints:

  • You can configure the required swap size in workflow/provision/group_vars/all.yml

Optional (Jetson Nanos only): Boot from USB 3.0 SATA SSD - for advanced users only

  1. Wire up your SATA SSD with one of the USB 3.0 ports, unplug all other block devices connected via USB ports and reboot via make nano-one-reboot
  2. Set the ssd.id_serial_short of the SSD in workflow/provision/group_vars given the info provided by executing make nano-one-ssd-id-serial-short-show
  3. Execute nano-one-ssd-prepare to assign the stable device name /dev/ssd, wipe and partition the SSD, create an ext4 filesystem and sync the micro SD card to the SSD
  4. Set the ssd.uuid of the SSD in workflow/provision/group_vars given the info provided by executing make nano-one-ssd-uuid-show
  5. Execute nano-one-ssd-activate to configure the boot menu to use the SSD as the default boot device and reboot

Hints:

  • A SATA SSD connected via USB 3.0 is more durable and typically faster by a factor of three relative to a micro sd card - tested using a WD Blue M.2 SSD versus a SanDisk Extreme microSD card
  • The micro sd card is mounted as /mnt/mmc after step 5 in case you want to update the kernel later which now resides in /mnt/mmc/boot/Image
  • You can wire up additional block devices via USB again after step 5 as the UUID is used for referencing the SSD designated for boot (wip)

Optional: Cross-build Docker images for Jetson on macOS (wip)

  1. Execute make l4t-deploy to cross-build Docker image on macOS using buildkit based on official base image from NVIDIA and deploy - functionality is identical to device-query - see above

Hints:

  • Have a look at the custom Skaffold builder for Mac in workflow/deploy/l4t/builder.mac on how this is achieved
  • A daemon.json switching on experimental Docker features on your macOS device required for this was automatically installed as part of bootstrap - see above

Optional: Setup firewall (wip)

  1. Execute provision-firewall

Additional labeled references