RFC: TF on Demand Project #69
Conversation
> If no such package exists in the cache, then the system will execute the following commands, with the appropriate flags and environment variables, to build it from source.
>
> ```
> git checkout <tag>
> ```
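For illustration, the build step above could be composed programmatically along these lines. This is a minimal sketch; the configure step, bazel flags, and output path are assumptions, not the system's actual configuration:

```python
# Hypothetical sketch: compose the command sequence the build system might
# run for a given git tag. The configure/bazel details are illustrative
# assumptions, not the actual on-demand build pipeline.

def build_commands(tag, bazel_opts=()):
    """Return the command sequence for building a TF wheel at a git tag."""
    return [
        ["git", "checkout", tag],
        ["./configure"],  # answers supplied via environment variables
        ["bazel", "build", *bazel_opts,
         "//tensorflow/tools/pip_package:build_pip_package"],
        ["./bazel-bin/tensorflow/tools/pip_package/build_pip_package",
         "/tmp/tensorflow_pkg"],
    ]

cmds = build_commands("v1.13.1", bazel_opts=["--config=opt"])
print(cmds[0])  # ['git', 'checkout', 'v1.13.1']
```

Each inner list could then be passed to something like `subprocess.run`, with the environment variables mentioned above merged into its `env`.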
Hi,
Curious to know whether this `<tag>` could reference any branch known to the tensorflow repo (maybe impossible-to-reach forks?), or will it be limited to released tensorflow versions and release candidates? I think it would be pretty interesting to be able to build custom tensorflow binaries based on a PR/fork submitted for consideration to the tensorflow GitHub repo.
It can reference any branch.
> ```
> - Detecting GPU……. Found NVIDIA GTX 980
> - Detecting CUDA…… Found 9.2
> - Detecting cuDNN….. Found 7.4
> - Detecting Distribution……. Found Ubuntu 18.04
> ```
Some of these can be found at https://github.com/thoth-station/thamos/blob/master/thamos/discover.py#L33
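As a rough illustration of the kind of detection logic linked above, here is a sketch that parses CUDA and distribution information from tool output. The command outputs and field names are sample assumptions, not the actual detection code:

```python
import re

def parse_nvcc_version(nvcc_output: str) -> str:
    """Extract the CUDA release version from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find CUDA version in nvcc output")
    return match.group(1)

def parse_os_release(contents: str) -> str:
    """Extract a human-readable distribution name from /etc/os-release."""
    fields = dict(
        line.split("=", 1) for line in contents.splitlines() if "=" in line
    )
    return fields.get("PRETTY_NAME", "unknown").strip('"')

# Sample inputs for illustration only.
sample_nvcc = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 9.2, V9.2.148"
)
sample_os = 'NAME="Ubuntu"\nPRETTY_NAME="Ubuntu 18.04.6 LTS"\nVERSION_ID="18.04"'

print(parse_nvcc_version(sample_nvcc))  # 9.2
print(parse_os_release(sample_os))      # Ubuntu 18.04.6 LTS
```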
> **Package Cache**
>
> We will design the package cache as a simple GCS bucket supported by a simple database. The database can have the following schema:
How long will the built artifacts be available for download? Will it be possible to list the available builds (e.g. have a listing of `gs://tensorflow-artifacts/on-demand/`)?
IMHO it's worth considering keeping the build options somewhere handy (e.g. publishing a well-defined proto or JSON/YAML alongside the resulting wheel files, or keeping it directly in the wheel files) so that a user can see the build configuration of a wheel file before downloading or cross-checking it. This is also handy for automated systems that would like to consume such metadata when using the built wheel files.
Another option (instead of creating metadata files) is to keep a prefix consisting of the build configuration options, as we did originally here; the issue is in maintaining such a prefix and keeping it parseable.
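The metadata-sidecar idea could look roughly like this sketch, which emits a JSON file next to the wheel. The field names are illustrative assumptions, not a proposed schema:

```python
import json

# Hypothetical sketch of a metadata sidecar published alongside a wheel,
# as suggested above. Field names are illustrative assumptions.
def build_metadata(tf_version, build_options):
    return {
        "tensorflow_version": tf_version,
        "build_options": build_options,
    }

meta = build_metadata(
    "1.13.1",
    {"cuda": "9.2", "cudnn": "7.4", "os": "ubuntu-18.04", "python": "3.6"},
)
sidecar = json.dumps(meta, sort_keys=True, indent=2)
print(sidecar)
```

A consumer (human or automated) could then fetch the small JSON object and inspect the build configuration before downloading the wheel itself.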
@hyeygit ?
For now, we plan to have the built artifacts be available indefinitely. In the future, however, we may think about cleaning them up.
Regarding storing the build options, we are still actively discussing this, trying to come up with a solution that would be manageable and scalable. The second option you provided has actually been brought up. I agree that giving users the option to retrieve the build configuration of a wheel file would be handy. We will make sure to consider the first option.
> ```
> string build_options_key;
>
> // The path to the artifact. Will look like:
> // gs://tensorflow-artifacts/on-demand/<hash>/packagename.whl
> ```
It would be great to follow the wheel file name convention as defined in the PEP standards around wheel files. The file name is relevant when installing a wheel, and Python's toolchain uses it during installation to verify that the given wheel file fits the platform. One can then directly install the downloaded wheel file; installing via HTTPS is also an option (if the listing supports it).
With the above, wheel files will carry the tensorflow version. It's worth considering putting versions built with the same configuration options into the same prefix (the description leaves this unclear: the build system would overwrite `packagename.whl` if a different tensorflow version is built, since only the hash of `build_options_key` distinguishes these builds). This way one can use the proposed system as a TensorFlow-specific PyPI, by providing a URL like `https://<path-to-gs>/tensorflow-artifacts/on-demand/<hash>` when doing `pip install --index-url <URL> --extra-index-url https://pypi.org/simple`, or in other tools that deal with multi-package-source index configuration (such as Pipenv). Package updates and new version releases for specific build configuration options would then work out of the box.
EDIT: maybe I misunderstand `packagename.whl` in the description (if it is a correctly named wheel file).
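To make the naming discussion concrete, here is a sketch of deriving a deterministic `build_options_key` hash (the GCS prefix) and a PEP 427 style wheel file name. The option fields, tags, and hash truncation are illustrative assumptions:

```python
import hashlib
import json

def options_hash(build_options):
    """Deterministic hash of the build options; could serve as the GCS prefix."""
    canonical = json.dumps(build_options, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def wheel_name(version, python_tag, abi_tag, platform_tag):
    """PEP 427 file name: {dist}-{version}-{python}-{abi}-{platform}.whl"""
    return f"tensorflow-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"

opts = {"cuda": "9.2", "cudnn": "7.4", "march": "native"}
prefix = options_hash(opts)
name = wheel_name("1.13.1", "cp36", "cp36m", "linux_x86_64")
print(f"gs://tensorflow-artifacts/on-demand/{prefix}/{name}")
```

Because the version lives in the file name rather than the hash, two tensorflow versions built with identical options land in the same prefix without overwriting each other, which is exactly the property the comment above asks for.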
> * *Hard Constraint Failures*
>   * 32-bit OS
>   * GPU too old (CUDA Compute Capability older than 3.0)
People with old GPUs can still build TF for CPUs, right?
Always. And <3.0 was never supported for GPU anyway.
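A minimal sketch of the hard-constraint check being discussed, with the CPU-only fallback from the exchange above. The messages and thresholds are illustrative assumptions:

```python
# Hypothetical sketch of the hard-constraint check for a requested GPU
# build. Messages and the CPU-only fallback suggestion are assumptions.

MIN_COMPUTE_CAPABILITY = 3.0

def check_gpu_build(os_bits, compute_capability):
    """Classify a requested GPU build as ok or a hard failure."""
    if os_bits == 32:
        return ("hard", "32-bit OS is unsupported")
    if compute_capability < MIN_COMPUTE_CAPABILITY:
        # Per the discussion above, a CPU-only build is still possible.
        return ("hard", "GPU too old for a GPU build; try a CPU-only build")
    return ("ok", "")

print(check_gpu_build(64, 2.1)[0])  # hard
print(check_gpu_build(64, 5.2)[0])  # ok
```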
> Building TF is difficult, and the friction present during the build phase is apparent in our user experiences. For a successful build, users need to be aware of requirements and configurations that they might not be familiar with. This can be extremely challenging for beginner and intermediate users.
>
> Currently, TF Python release artifacts are hosted on PyPI as the single source of TF binaries for download. However, PyPI is quite limited in capability; it is unable to recognize user machine configurations and/or deploy builds tailored to them.
Does this proposal mean that TensorFlow binaries will no longer be released on PyPI?
> Does this proposal mean that TensorFlow binaries will no longer be released on PyPI?

Guess not.
These builds are for specific combinations of OS/Python/CUDA/arch etc. PyPI currently lacks the capability to differentiate wheels based on platform.
PyPI will continue to have the tensorflow wheels with a generic arch.
From my point of view it's worth thinking about publishing TF wheels in a different thread: there are a few issues, like naming (mentioned by Subin above), including build-time configuration in the wheel file, etc.
I am sure we cannot change the way PyPI works short term, so we need to come up with a clever structure of index hosts and directories.
> A system flow chart summarizing the ordering of events for all use cases of the system:
>
> ![drawing](https://docs.google.com/drawings/d/1lSSHaYktst8MNTqF860cPlkfbgc3Ft_t_8PWzH0VkW4/export/png)
The diagrams seem to be low resolution...
All diagrams are updated with better resolution!
> ![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)
>
> Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user a URL to the binary for download. If it does not exist, it will ask the user for an email address to receive a link for downloading the newly built binary.
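The cache-check flow quoted above could be sketched as follows, using an in-memory dict standing in for the real database and GCS bucket. Names and the response shape are illustrative assumptions:

```python
# Hypothetical sketch of the backend cache-check flow described in the
# RFC. A dict stands in for the database/GCS bucket; the response shape
# is an illustrative assumption.

cache = {}  # build_options_key hash -> artifact URL

def handle_request(options_key, email=None):
    """Return a download URL if cached; otherwise ask for an email / enqueue."""
    if options_key in cache:
        return {"status": "ready", "url": cache[options_key]}
    if email is None:
        return {"status": "needs_email"}
    # A real system would enqueue a build job and mail the link on completion.
    return {"status": "building", "notify": email}

cache["abc123"] = "gs://tensorflow-artifacts/on-demand/abc123/tensorflow.whl"
print(handle_request("abc123"))
print(handle_request("def456", email="user@example.com"))
```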
To me, the binary collection is a fixed one: a fixed combination of platforms/configurations from the build system. So, for configurations not supported by the default build system, people need to build on their own, right?
And what about the addons (contrib), as they are moving out of the TF repo? Are they included in the selectable configurations?
As new combinations are requested we can expand the list.
The initial version will include core tensorflow only (and no addons like contrib). However, as @sub-mod explained above, we will consider including addons and/or sub-projects based on incoming requests.
For unsupported configurations, the tool will suggest the best-known alternative solutions. Depending on the case (and most likely), the user will need to build the wheel on their own.
> In hard constraint failure cases, the tool will point users to cloud options. In soft constraint failure cases, the tool will offer alternative build configurations (CPU only) or request users to install the missing software.
>
> We propose distributing this binary through PyPI. However, please note that the binary distribution method is still under discussion.
Is it possible to integrate this into the existing TF PyPI packages? People may get confused when there are too many PyPI packages to choose from.
Check this ML thread: https://www.mail-archive.com/wheel-builders@python.org/msg00334.html
PyPI lacks the ability to distinguish different build configurations. The system will not be integrated with PyPI (and therefore it won't introduce confusion to users trying to download TF PyPI packages).
> ![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)
>
> Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user a URL to the binary for download. If it does not exist, it will ask the user for an email address to receive a link for downloading the newly built binary.
Why don't we expect basically ~all configurations that don't have active CI to be broken? That is, what happens with random compile errors that will show up in a lot of different combinations of libc/compiler/cuda?
The system will automatically file bugs for such compile errors and suggest alternative solutions to the users to help complete the build.
We currently run a large set of builds on tf-nightly across multiple platforms to make sure the builds are successful and to catch any such random compile errors upon failures. Based on our observation, we do not anticipate running into these issues, at least not too frequently.
Upon receiving these compiler error bugs, we will triage them and aim to improve the system appropriately.
Looping in @gunan for any additional explanation.
Would there be a way to request a specific version of gcc to be used for compilation?
Yes, we can do that. Also, for CentOS/RHEL we can install a certain devtoolset based on user input. We build the Docker image first based on the user's input, and then we compile TF in this Docker image.
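Generating the build image from user input could look something like this sketch, which renders a Dockerfile string. The base images, package names, and the `scl enable` shell trick are illustrative assumptions, not the actual build tooling:

```python
# Hypothetical sketch: render a build-image Dockerfile from user input,
# as described above. Base images and package names are assumptions.

def render_dockerfile(os_image, devtoolset=None):
    lines = [f"FROM {os_image}"]
    if devtoolset:
        # CentOS/RHEL path: install and activate the requested devtoolset.
        lines.append(f"RUN yum install -y {devtoolset}")
        lines.append(
            f'SHELL ["scl", "enable", "{devtoolset}", "--", "/bin/bash", "-c"]'
        )
    lines.append("# ... TF build dependencies and build steps would follow ...")
    return "\n".join(lines)

print(render_dockerfile("centos:7", devtoolset="devtoolset-7"))
```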
Are there any plans to support libtensorflow as well? We are thinking of distributing a C++ application using the C libtensorflow as a backend, and having a simple way of compiling the library for the user's system would be a huge help.
> ### Other Details
>
> **Bug Filing**
What is the plan to allow feature requests/PR submissions? For example, how would I request for the system to build a binary with compiler flags for the latest Xeon instruction set?
And how frequently will the build system compiler be updated to support updated `-m` flags?
As long as the compiler flags are available for the OS + gcc combination, they can be passed in to the build environment.
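Passing user-selected CPU features through as compiler flags could be sketched like this; the feature-to-flag table and the `--copt` convention are illustrative assumptions about how flags would reach the build:

```python
# Hypothetical sketch: translate a user's CPU feature selection into
# compiler flags for the build environment. Flag sets are illustrative.

CPU_FEATURE_FLAGS = {
    "sse4.2": ["-msse4.2"],
    "avx2": ["-mavx2", "-mfma"],
    "avx512": ["-mavx512f"],
}

def copt_args(features):
    """Translate selected CPU features into bazel --copt arguments."""
    args = []
    for feature in features:
        for flag in CPU_FEATURE_FLAGS[feature]:  # KeyError -> unsupported
            args.append(f"--copt={flag}")
    return args

print(copt_args(["avx2"]))  # ['--copt=-mavx2', '--copt=-mfma']
```

Whether a given flag is usable would still depend on the gcc version inside the chosen build image, which is the point the reply above makes.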
OC?
I meant OS (operating system).
@hyeygit Please push a change with "Status: Accepted" and I will merge this RFC.
Summary

* Adding an RFC for TF on Demand Project.
* Modified one line in tf-on-demand md file.
* Changing RFC status from PROPOSED to ACCEPTED.
Comment period is open until 2019-03-13
Objective
This document proposes a system to build optimized binaries in the cloud and deliver the artifacts to users of TensorFlow. With this system, we aim to achieve the following: