From 30431bbf9aa053d4756592d8619586d5a6d8ac03 Mon Sep 17 00:00:00 2001
From: Hye Soo Yang
Date: Mon, 25 Feb 2019 16:39:56 -0800
Subject: [PATCH 1/3] Adding an RFC for TF on Demand Project.

---
 rfcs/20190225-tf-on-demand.md | 191 ++++++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 rfcs/20190225-tf-on-demand.md

diff --git a/rfcs/20190225-tf-on-demand.md b/rfcs/20190225-tf-on-demand.md
new file mode 100644
index 000000000..c5d55e676
--- /dev/null
+++ b/rfcs/20190225-tf-on-demand.md
@@ -0,0 +1,191 @@
+# TensorFlow On Demand

| Status        | Proposed |
:-------------- |:---------------------------------------------------- |
| **Author(s)** | Gunhan Gulsoy (gunan@google.com), Hye Soo Yang (hyey@google.com) |
| **Sponsor**   | Gunhan Gulsoy (gunan@google.com) |
| **Collaborators** | Christoph Goern (goern@redhat.com), Subin Modeel (smodeel@redhat.com) |
| **Updated**   | 2019-02-25 |

## Objective

This document proposes a system to build optimized binaries in the cloud and deliver the artifacts to users of TensorFlow. With this system, we aim to achieve the following:

* Hide the complexity of TF's build system from beginner and intermediate users.
* Improve the out-of-the-box performance for users.
* Build on the strengths of TensorFlow's community partners.

## Motivation

### Overview

Building TF is difficult, and the resulting friction is apparent in our users' experience. For a successful build, users need to be aware of requirements and configuration options that they may not be familiar with. This can be extremely challenging for beginner and intermediate users.

Currently, TF Python release artifacts are hosted on PyPI as the single source of TF binaries for download. However, PyPI is quite limited in capability; it cannot recognize a user's machine configuration or deliver binaries tailored to it.

Historically, this has been a problem when trying to satisfy the following requirements for our artifacts:

* **Portability**: The artifacts should be able to run on as many platforms as possible.
* **Performance**: The artifacts should run as fast as possible.
* **Size**: The artifacts should be as small as possible in size.

To achieve the above, we propose a system that takes inputs from the user, builds artifacts based on those inputs, and sends the output back to the user. We are proposing two user endpoints to the system:

1. Command line tool (Python)
1. Web interface

An overview of the system:

![drawing](https://docs.google.com/drawings/d/11jAVBtR4nV4bkDVW1WrHhLXdzX2ZouOPwsztW2hZQz4/export/png)

A system flow chart summarizing the ordering of events for all use cases of the system:

![drawing](https://docs.google.com/drawings/d/1lSSHaYktst8MNTqF860cPlkfbgc3Ft_t_8PWzH0VkW4/export/png)


### Collaboration with Red Hat

Red Hat has already built a system to build and deliver TF in the cloud. Reference artifacts released by Red Hat are listed below:

* [tensorflow-build-s2i](https://github.com/thoth-station/tensorflow-build-s2i)
* [tensorflow-release-job](https://github.com/thoth-station/tensorflow-release-job)
* [tensorflow-serving-build](https://github.com/thoth-station/tensorflow-serving-build)

We propose collaborating with the Red Hat team and leveraging the team's open-source software as much as possible. The server side of the project, in particular, can draw heavily on the team's involvement, since their backend GKE cluster already has much of the required functionality built and available.
(Please refer to the "Server" block in the diagram above for details on the server-side event flow.) Additional support and functionality will be built around and on top of this cluster to match what TensorFlow currently supports and should support moving forward.

## Design Proposal

### Front-end

**Web UI (download.tensorflow.org)**

We propose a simple web interface that allows users to manually enter the necessary system specs and desired build options for requesting a custom TF binary. The UI will be straightforward and easy to use. A sample mock-up UI is shown below:

![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)

Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user with a URL for downloading the binary. If it does not exist, the system will ask the user for an email address to which a download link for the newly built binary will be sent.

Before sending out the binary, the system will first check whether the requested configuration is supported by comparing it against what is tested in CI. It will then inform the user accordingly.

**Command Line Tool (TF Downloader Binary)**

We propose a simple binary that will detect and fill out most of the inputs the Web UI requires, and then send the build request to the backend. A sample execution (not final) is shown below:

```
> python tfdownloader.py
Downloader will now detect your system:
 - Detecting CPU……. Found Intel Core i7-8700K
 - Detecting GPU……. Found NVIDIA GTX 980
 - Detecting CUDA…… Found 9.2
 - Detecting cuDNN….. Found 7.4
 - Detecting Distribution……. Found Ubuntu 18.04

Requesting TF build with options:
  CPU options: -mavx2
  GPU enabled: yes
  CUDA version: 9.2
  cuDNN version: 7.4
  CUDA compute capability: 5.2
  GCC version: 7.3
```

It is possible that the system this script runs on is not capable of running TF at all. Such incompatibilities, and what the script can do about them, fall into two failure modes:

* *Hard Constraint Failures*
    * 32-bit OS
    * GPU too old (CUDA compute capability older than 3.0)
* *Soft Constraint Failures*
    * CUDA version too old
    * cuDNN missing
    * Other runtime libraries missing or too old (libstdc++)
    * Unsupported Python version found

In hard constraint failure cases, the tool will point users to cloud options. In soft constraint failure cases, the tool will offer alternative build configurations (e.g., CPU only) or ask users to install the missing software.

We propose distributing this binary through PyPI. However, please note that the distribution method for this binary is still under discussion.

### Back-end

**TF Builder Service**

The TF builder service will require a proto containing the build options. This proto will define all options the system recognizes for building TF. Once a request is received, the system will check the cache to see whether such a package has already been built. If so, it will send back the URL for downloading that package.
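The exact keying scheme for the cache is not specified in this proposal. As a minimal sketch of the idea (all function names below are hypothetical), the service could hash a deterministic serialization of the build-options proto defined below and use the digest as the `build_options_key` stored in the package cache:

```
import hashlib


def cache_key(build_options_bytes: bytes) -> str:
    # Hash a deterministic serialization of the build options; the digest
    # doubles as the build_options_key in the package cache datastore.
    return hashlib.sha256(build_options_bytes).hexdigest()


def get_or_build(build_options_bytes: bytes, cache: dict, build_fn) -> str:
    # Return a download URL, invoking the build pipeline only on a cache miss.
    key = cache_key(build_options_bytes)
    url = cache.get(key)
    if url is None:
        url = build_fn(build_options_bytes)  # configure + bazel build + upload
        cache[key] = url
    return url
```

For identical requests to hit the same cache entry, the serialization fed to the hash would need to be canonical (for example, fields written in a fixed order).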
A sample proto (not final) is shown below:

```
message BuildOptions {
  string version = 1;
  enum CpuOptions {
    SSE3 = 0;
    SSE4 = 1;
    SSE4_1 = 2;
    SSE4_A = 3;
    SSE4_2 = 4;
    AVX = 5;
    AVX2 = 6;
    AVX512F = 7;
  }
  repeated CpuOptions cpu_options = 2;

  // CUDA options
  bool cuda_enabled = 3;
  enum NvidiaGPUGeneration {
    KEPLER = 0;
    MAXWELL = 1;
    PASCAL = 2;
    VOLTA = 3;
    TURING = 4;
  }
  NvidiaGPUGeneration gpu_generation = 4;
  string cuda_version = 5;
  string cudnn_version = 6;
  // ...
  // Free-form string of options to append to the bazel command.
  string extra_options = 100;
}
```

If no such package exists in the cache, the system will execute the following commands, with the appropriate flags and environment variables, to build it:

```
git checkout
yes "" | ./configure
bazel build tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package
```

Once the build is complete, the package will be stored in the cache before a download link is sent to the user.

**Package Cache**

We will design the package cache as a simple GCS bucket backed by a simple database. The database can have the following schema:

```
PackageCacheDataStore {
  // This is the primary key to our datastore.
  // As build options form such a complex data type,
  // this will be the serialized build options fed through a hash function.
  string build_options_key;

  // The path to the artifact. Will look like:
  // gs://tensorflow-artifacts/on-demand//packagename.whl
  string artifact_location;

  // Possible bookkeeping fields (may or may not be included).
  date last_built;
  date last_downloaded;
  int download_count;
}
```

### Other Details

**Bug Filing**

With the complicated build system TF currently has, there are undoubtedly going to be issues when building TF. In such cases, we would like the system to go through a process in which it will:

1. Prepare full reproduction instructions.
1. File the bug with the most relevant teams.
    * e.g., bugs related to the build system will be filed with the appropriate TF teams, while server-side bugs will be filed with Red Hat's support team. (This is just an example and is not final.)
1. While the bug is being filed, unblock the user by recommending alternate download options.

From b8b1067c205af8b6671dd9fcbce5b2ec1ea3389a Mon Sep 17 00:00:00 2001
From: Hye Soo Yang
Date: Mon, 25 Feb 2019 17:10:44 -0800
Subject: [PATCH 2/3] modified one line in tf-on-demand md file.

---
 rfcs/20190225-tf-on-demand.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/rfcs/20190225-tf-on-demand.md b/rfcs/20190225-tf-on-demand.md
index c5d55e676..9a4212877 100644
--- a/rfcs/20190225-tf-on-demand.md
+++ b/rfcs/20190225-tf-on-demand.md
@@ -59,13 +59,13 @@ We propose collaborating with the Red Hat team and leveraging the team's open-so

**Web UI (download.tensorflow.org)**

-We propose a simple web interface that allows users to manually enter the necessary system specs and desired build options for requesting a custom TF binary. The UI will be straightforward and easy to use. A sample mock-up UI is shown below:
+We propose a simple web interface that allows users to manually enter the necessary system specs and desired build options for requesting a custom TF binary. The UI will be straightforward and easy to use. A sample mock-up UI is shown below. (Please refer to the [PyTorch](https://pytorch.org) download UI as an additional reference.)
![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)

Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user with a URL for downloading the binary. If it does not exist, the system will ask the user for an email address to which a download link for the newly built binary will be sent.

-Before sending out the binary, the system will first check whether the requested configuration is supported by comparing it against what is tested in CI. It will then inform the user accordingly. 
+Before sending out the binary, the system will first check whether the requested configuration is supported by comparing it against what is tested in CI. It will then inform the user accordingly.

**Command Line Tool (TF Downloader Binary)**

@@ -186,6 +186,6 @@ PackageCacheDataStore {

With the complicated build system TF currently has, there are undoubtedly going to be issues when building TF. In such cases, we would like the system to go through a process in which it will:

1. Prepare full reproduction instructions.
-1. File the bug with the most relevant teams. 
+1. File the bug with the most relevant teams.
    * e.g., bugs related to the build system will be filed with the appropriate TF teams, while server-side bugs will be filed with Red Hat's support team. (This is just an example and is not final.)
1. While the bug is being filed, unblock the user by recommending alternate download options.

From 1f6dae299abd236614d66ed730710a5183b501ec Mon Sep 17 00:00:00 2001
From: Hye Soo Yang
Date: Tue, 9 Apr 2019 15:42:44 -0700
Subject: [PATCH 3/3] Changing RFC status from PROPOSED to ACCEPTED.

---
 rfcs/20190225-tf-on-demand.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/20190225-tf-on-demand.md b/rfcs/20190225-tf-on-demand.md
index 9a4212877..ae32a801b 100644
--- a/rfcs/20190225-tf-on-demand.md
+++ b/rfcs/20190225-tf-on-demand.md
@@ -1,6 +1,6 @@
# TensorFlow On Demand

-| Status        | Proposed |
+| Status        | Accepted |
:-------------- |:---------------------------------------------------- |
| **Author(s)** | Gunhan Gulsoy (gunan@google.com), Hye Soo Yang (hyey@google.com) |
| **Sponsor**   | Gunhan Gulsoy (gunan@google.com) |