RFC: TF on Demand Project #69
# TensorFlow On Demand

| Status            | Accepted |
| :---------------- | :------- |
| **Author(s)**     | Gunhan Gulsoy (gunan@google.com), Hye Soo Yang (hyey@google.com) |
| **Sponsor**       | Gunhan Gulsoy (gunan@google.com) |
| **Collaborators** | Christoph Goern (goern@redhat.com), Subin Modeel (smodeel@redhat.com) |
| **Updated**       | 2019-02-25 |
## Objective

This document proposes a system to build optimized binaries in the cloud and deliver the artifacts to users of TensorFlow. With this system, we aim to achieve the following:

* Hide the complexity of the TF build system from beginner and intermediate users.
* Improve out-of-the-box performance for users.
* Build on the strengths of TensorFlow's community partners.
## Motivation

### Overview

Building TF is difficult, and the friction present during the build phase is apparent in our user experience. For a successful build, users need to be aware of requirements and configurations that they might not be familiar with. This can be extremely challenging for beginner and intermediate users.

Currently, TF python release artifacts are hosted on PyPI as a single source of TF binaries for download. However, PyPI is quite limited in capability; it cannot recognize a user's machine configuration or deliver builds tailored to it.

Historically, this has been a problem when trying to satisfy the following requirements for our artifacts:

* **Portability**: The artifacts should be able to run on as many platforms as possible.
* **Performance**: The artifacts should run as fast as possible.
* **Size**: The artifacts should be as small as possible in size.
To achieve the above, we propose a system that takes inputs from the user, builds artifacts based on those inputs, and sends the output back to the user. We are proposing two user endpoints to the system:

1. Command line tool (python)
1. Web interface

An overview of the system:

![drawing](https://docs.google.com/drawings/d/11jAVBtR4nV4bkDVW1WrHhLXdzX2ZouOPwsztW2hZQz4/export/png)

A system flow chart summarizing the ordering of events for all use cases of the system:

![drawing](https://docs.google.com/drawings/d/1lSSHaYktst8MNTqF860cPlkfbgc3Ft_t_8PWzH0VkW4/export/png)
### Collaboration with Red Hat

Red Hat has already built a system to build and deliver TF on the cloud. Reference artifacts released by Red Hat are listed below:

* [tensorflow-build-s2i](https://github.com/thoth-station/tensorflow-build-s2i)
* [tensorflow-release-job](https://github.com/thoth-station/tensorflow-release-job)
* [tensorflow-serving-build](https://github.com/thoth-station/tensorflow-serving-build)

We propose collaborating with the Red Hat team and leveraging the team's open-source software as much as possible. The server side of the project, in particular, can have concentrated involvement from the team, given the backend GKE cluster that already has a lot of functionality built and available. (Please refer to the "Server" block in the diagram above for details on the server-side event flow.) Additional support and functionality will be built around and on top of it to match what TensorFlow currently supports and should support moving forward.
## Design Proposal

### Front-end

**Web UI (download.tensorflow.org)**

We propose a simple web interface that allows users to manually enter the necessary system specs and desired build options for requesting a custom TF binary. The UI will be straightforward and easy to use. A sample mock-up UI is shown below. (Please refer to the [PyTorch](https://pytorch.org) download UI as an additional reference.)

![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)
Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user a URL for downloading the binary. If it does not exist, it will ask the user for an email address to which a link for downloading the newly built binary will be sent.
Before sending out the binary, the system will first check whether the binary is supported (or unsupported) by comparing the configuration against what is tested in CI. It will then inform the user accordingly.
**Command Line Tool (TF Downloader Binary)**

We propose a simple binary that will detect and fill out most of the inputs that the Web UI requires. It will then send the build request to the backend. A sample execution (not final) is shown below:
```
> python tfdownloader.py
Downloader will now detect your system:
- Detecting CPU……. Found Intel Core i7-8700K
- Detecting GPU……. Found NVIDIA GTX 980
- Detecting CUDA…… Found 9.2
- Detecting cuDNN….. Found 7.4
- Detecting Distribution……. Found Ubuntu 18.04

Requesting TF build with options:
CPU options: -mavx2
GPU enabled: yes
CUDA version: 9.2
cuDNN version: 7.4
CUDA compute capability: 5.2
GCC version: 7.3
```
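As a rough illustration of the detection step, the snippet below parses `/proc/cpuinfo`-style text on Linux for the CPU model and SIMD flags. It is a sketch under stated assumptions, not the actual downloader; real detection would also need to cover the GPU, CUDA, cuDNN, and the distribution.

```python
def detect_cpu(cpuinfo_text: str) -> dict:
    """Extract the CPU model name and SIMD feature flags from /proc/cpuinfo text."""
    model = ""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("model name"):
            model = line.split(":", 1)[1].strip()
        elif line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    # Keep only the instruction-set extensions the build system cares about.
    simd = [f for f in ("sse4_2", "avx", "avx2", "avx512f") if f in flags]
    return {"model": model, "simd": simd}


# On a real system: detect_cpu(open("/proc/cpuinfo").read())
sample = "model name\t: Intel Core i7-8700K\nflags\t\t: fpu mmx sse4_2 avx avx2\n"
```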
It is possible that the system this script runs on is incompatible with running TF. Such incompatibilities, and what the script is able to do about them, can be categorized into two failure modes:
* *Hard Constraint Failures*
    * 32-bit OS
    * GPU too old (CUDA compute capability older than 3.0)
* *Soft Constraint Failures*
    * CUDA version too old
    * cuDNN missing
    * Other runtime libraries missing or too old (libstdc++)
    * Unsupported python version found

In hard constraint failure cases, the tool will point users to cloud options. In soft constraint failure cases, the tool will offer alternative build configurations (CPU only) or request that users install the missing software.
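The triage above can be sketched as a simple classifier. The field names (`os_bits`, `compute_capability`, `cudnn`) and the exact checks are assumptions drawn from the lists, not the tool's real interface.

```python
def classify_failures(spec: dict) -> tuple:
    """Split a detected system spec into (hard_failures, soft_failures)."""
    hard, soft = [], []
    if spec.get("os_bits") == 32:
        hard.append("32-bit OS")
    # Only flag the GPU if one was detected; a CPU-only build stays possible.
    cc = spec.get("compute_capability")
    if cc is not None and cc < 3.0:
        hard.append("GPU too old (compute capability < 3.0)")
    if spec.get("cuda_enabled") and not spec.get("cudnn"):
        soft.append("cuDNN missing")
    return hard, soft
```

A hard failure would route the user to cloud options; a soft failure would trigger the CPU-only fallback or an install suggestion.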
We propose distributing this binary through PyPI. However, please note that the binary distribution method is still under discussion.
### Back-end

**TF Builder Service**

The TF builder service will require a proto with build options. This proto will define all options the system recognizes for building TF. Once a request is received, the system will check the cache to see if such a package has already been built. If so, it will send back the URL for downloading the package. A sample proto (not final) is shown below:
```
message BuildOptions {
  string version = 1;
  enum CpuOptions {
    SSE3 = 0;
    SSE4 = 1;
    SSE4_1 = 2;
    SSE4_A = 3;
    SSE4_2 = 4;
    AVX = 5;
    AVX2 = 6;
    AVX512F = 7;
  }
  repeated CpuOptions cpu_options = 2;

  // CUDA options
  bool cuda_enabled = 3;
  enum NvidiaGPUGeneration {
    KEPLER = 0;
    MAXWELL = 1;
    PASCAL = 2;
    VOLTA = 3;
    TURING = 4;
  }
  NvidiaGPUGeneration gpu_generation = 4;
  string cuda_version = 5;
  string cudnn_version = 6;
  // ……
  // Free-form string of options to append to bazel.
  string extra_options = 100;
}
```
If no such package exists in the cache, the system will execute the following commands, with the appropriate flags and environment variables, to build it fresh.
```
git checkout <tag>
yes "" | ./configure
bazel build tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package
```
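The `yes "" | ./configure` step accepts answers from environment variables, so the service would translate the request proto into those variables first. The sketch below maps build options onto variables TF's configure script reads (`TF_NEED_CUDA`, `TF_CUDA_VERSION`, `CC_OPT_FLAGS`, etc.); exact variable coverage varies across TF versions, so treat this as an illustration rather than a definitive mapping.

```python
def make_configure_env(opts: dict) -> dict:
    """Map build options onto env vars that ./configure reads non-interactively."""
    env = {
        "TF_NEED_CUDA": "1" if opts.get("cuda_enabled") else "0",
        "CC_OPT_FLAGS": " ".join(opts.get("cpu_options", [])),
    }
    if opts.get("cuda_enabled"):
        env["TF_CUDA_VERSION"] = opts["cuda_version"]
        env["TF_CUDNN_VERSION"] = opts["cudnn_version"]
        env["TF_CUDA_COMPUTE_CAPABILITIES"] = opts["compute_capability"]
    return env
```

The builder would export these variables into the build environment before piping empty answers into `./configure`, so every prompt falls back to the pre-set value.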
Once the build is complete, the package will be stored in the cache before a download link is sent to the user.
**Package Cache**

We will design the package cache as a simple GCS bucket backed by a simple database. The database can have the following schema:
```
PackageCacheDataStore {
  // This is the primary key to our datastore.
  // As build_options is such a complex data type,
  // this will be the build_options fed through a hash function.
  string build_options_key;

  // The path to the artifact. Will look like:
  // gs://tensorflow-artifacts/on-demand/<hash>/packagename.whl
  string artifact_location;

  // Maybe, maybe not
  date last_built;
  date last_downloaded;
  int download_count;
}
```
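A sketch of how `build_options_key` and `artifact_location` could be derived, assuming a SHA-256 hash over a canonical serialization of the build options so that equivalent option sets map to the same GCS prefix. The hash function and bucket layout follow the sample schema and are not final.

```python
import hashlib
import json

GCS_PREFIX = "gs://tensorflow-artifacts/on-demand"


def build_options_key(options: dict) -> str:
    """Hash a canonical serialization so equivalent option sets share a key."""
    canonical = json.dumps(options, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def artifact_location(options: dict, wheel_name: str) -> str:
    """Build the GCS path for a cached wheel, per the schema's comment."""
    return f"{GCS_PREFIX}/{build_options_key(options)}/{wheel_name}"
```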
### Other Details

**Bug Filing**
With the complicated build system TF currently has, there are undoubtedly going to be issues when building TF. In such cases, we would like the system to go through a process in which it will:

1. Prepare full reproduction instructions.
1. File the bug to the most relevant teams.
    * e.g., bugs related to the build system will be filed to the appropriate TF teams, while server-side bugs will be filed to Red Hat's support team. (This is just an example and is not final.)
1. While the bug is being filed, unblock the user by recommending alternate download options.
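The first two steps might look like the sketch below, which bundles a failed build's inputs into reproduction instructions and picks a destination team. The report layout, stage names, and routing rule are purely illustrative assumptions.

```python
def make_bug_report(options: dict, failing_stage: str, log_tail: str) -> dict:
    """Bundle everything needed to reproduce a failed on-demand build."""
    repro_steps = [
        "git checkout " + options.get("version", "<tag>"),
        'yes "" | ./configure',
        "bazel build tensorflow/tools/pip_package:build_pip_package",
    ]
    # Illustrative routing: build-system failures go to TF, the rest server-side.
    team = "tensorflow-build" if failing_stage == "bazel" else "server-side"
    return {
        "title": f"On-demand build failure at stage: {failing_stage}",
        "build_options": options,
        "repro_steps": repro_steps,
        "log_excerpt": log_tail,
        "assigned_team": team,
    }
```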