RFC: TF on Demand Project #69
Conversation
> If no such package exists in the cache, then the system will execute the following commands, with the appropriate flags and environment variables, to build it from source.
>
> ```
> git checkout <tag>
> ```
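For illustration, the build step above could be composed programmatically along these lines. This is a minimal sketch; the configure step, bazel flags, and output path are assumptions, not the system's actual configuration:

```python
# Hypothetical sketch: compose the command sequence the build system might
# run for a given git tag. The configure/bazel details are illustrative
# assumptions, not the actual on-demand build pipeline.

def build_commands(tag, bazel_opts=()):
    """Return the command sequence for building a TF wheel at a git tag."""
    return [
        ["git", "checkout", tag],
        ["./configure"],  # answers supplied via environment variables
        ["bazel", "build", *bazel_opts,
         "//tensorflow/tools/pip_package:build_pip_package"],
        ["./bazel-bin/tensorflow/tools/pip_package/build_pip_package",
         "/tmp/tensorflow_pkg"],
    ]

cmds = build_commands("v1.13.1", bazel_opts=["--config=opt"])
print(cmds[0])  # ['git', 'checkout', 'v1.13.1']
```

Each inner list could then be passed to something like `subprocess.run`, with the environment variables mentioned above merged into its `env`.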
Hi,
Curious to know whether this `<tag>` could reference any branch known to the tensorflow repo (maybe impossible-to-reach forks?), or will it be limited to released tensorflow versions and release candidates? I think it would be pretty interesting to be able to build custom tensorflow binaries based on a PR/fork submitted for consideration to the tensorflow GitHub repo.
It can reference any branch.
> ```
> - Detecting GPU……. Found NVIDIA GTX 980
> - Detecting CUDA…… Found 9.2
> - Detecting cuDNN….. Found 7.4
> - Detecting Distribution……. Found Ubuntu 18.04
> ```
Some of these can be found at https://github.com/thoth-station/thamos/blob/master/thamos/discover.py#L33
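As a rough illustration of the kind of detection logic linked above, here is a sketch that parses CUDA and distribution information from tool output. The command outputs and field names are sample assumptions, not the actual detection code:

```python
import re

def parse_nvcc_version(nvcc_output: str) -> str:
    """Extract the CUDA release version from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find CUDA version in nvcc output")
    return match.group(1)

def parse_os_release(contents: str) -> str:
    """Extract a human-readable distribution name from /etc/os-release."""
    fields = dict(
        line.split("=", 1) for line in contents.splitlines() if "=" in line
    )
    return fields.get("PRETTY_NAME", "unknown").strip('"')

# Sample inputs for illustration only.
sample_nvcc = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 9.2, V9.2.148"
)
sample_os = 'NAME="Ubuntu"\nPRETTY_NAME="Ubuntu 18.04.6 LTS"\nVERSION_ID="18.04"'

print(parse_nvcc_version(sample_nvcc))  # 9.2
print(parse_os_release(sample_os))      # Ubuntu 18.04.6 LTS
```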
> **Package Cache**
>
> We will design the package cache as a simple GCS bucket supported by a simple database. The database can have the following schema:
How long will the built artifacts be available for download? Will it be possible to list the available builds (e.g. have a listing of `gs://tensorflow-artifacts/on-demand/`)?
IMHO it's worth considering keeping the build options somewhere handy (e.g. publishing a well-defined proto or JSON/YAML alongside the resulting wheel files, or keeping it directly in the wheel files) so that a user can see the build configuration of a wheel file before downloading or cross-checking it. This is also handy for automated systems that would like to consume such metadata when using the built wheel files.
Another option (instead of creating metadata files) is to keep a prefix consisting of the build configuration options, as we did originally here; the issue is in maintaining such a prefix and keeping it parseable.
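The metadata-sidecar idea could look roughly like this sketch, which emits a JSON file next to the wheel. The field names are illustrative assumptions, not a proposed schema:

```python
import json

# Hypothetical sketch of a metadata sidecar published alongside a wheel,
# as suggested above. Field names are illustrative assumptions.
def build_metadata(tf_version, build_options):
    return {
        "tensorflow_version": tf_version,
        "build_options": build_options,
    }

meta = build_metadata(
    "1.13.1",
    {"cuda": "9.2", "cudnn": "7.4", "os": "ubuntu-18.04", "python": "3.6"},
)
sidecar = json.dumps(meta, sort_keys=True, indent=2)
print(sidecar)
```

A consumer (human or automated) could then fetch the small JSON object and inspect the build configuration before downloading the wheel itself.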
@hyeygit ?
For now, we plan to have the built artifacts be available indefinitely. In the future, however, we may think about cleaning them up.
Regarding storing the build options, we are still actively discussing this, trying to come up with a solution that would be manageable and scalable. The second option you provided has actually been brought up. I agree that giving users the option to retrieve the build configuration of a wheel file would be handy. We will make sure to consider the first option.
> ```
> string build_options_key;
>
> // The path to the artifact. Will look like:
> // gs://tensorflow-artifacts/on-demand/<hash>/packagename.whl
> ```
It would be great to follow the wheel file name convention as defined in the PEP standards around wheel files. The file name is relevant when installing a wheel, and Python's toolchain uses it during installation to verify that the given wheel file fits the platform. One can then directly install the downloaded wheel file; installing via HTTPS is also an option (if the listing supports it).
With the above, wheel files will carry the tensorflow version. It's worth considering putting versions built with the same configuration options into the same prefix (the description leaves this unclear: the build system would overwrite `packagename.whl` if a different tensorflow version is built, since only the hash of `build_options_key` distinguishes these builds). This way one can use the proposed system as a TensorFlow-specific PyPI, by providing a URL like `https://<path-to-gs>/tensorflow-artifacts/on-demand/<hash>` when doing `pip install --index-url <URL> --extra-index-url https://pypi.org/simple`, or in other tools that deal with multi-package-source index configuration (such as Pipenv). Package updates and new version releases for specific build configuration options would then work out of the box.
EDIT: maybe I misunderstand `packagename.whl` in the description (if it is a correctly named wheel file).
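To make the naming discussion concrete, here is a sketch of deriving a deterministic `build_options_key` hash (the GCS prefix) and a PEP 427 style wheel file name. The option fields, tags, and hash truncation are illustrative assumptions:

```python
import hashlib
import json

def options_hash(build_options):
    """Deterministic hash of the build options; could serve as the GCS prefix."""
    canonical = json.dumps(build_options, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def wheel_name(version, python_tag, abi_tag, platform_tag):
    """PEP 427 file name: {dist}-{version}-{python}-{abi}-{platform}.whl"""
    return f"tensorflow-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"

opts = {"cuda": "9.2", "cudnn": "7.4", "march": "native"}
prefix = options_hash(opts)
name = wheel_name("1.13.1", "cp36", "cp36m", "linux_x86_64")
print(f"gs://tensorflow-artifacts/on-demand/{prefix}/{name}")
```

Because the version lives in the file name rather than the hash, two tensorflow versions built with identical options land in the same prefix without overwriting each other, which is exactly the property the comment above asks for.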
> * *Hard Constraint Failures*
>   * 32-bit OS
>   * GPU too old (CUDA Compute Capability older than 3.0)
People with old GPUs can still build TF for CPUs, right?
Always. And <3.0 was never supported for GPU anyway.
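A minimal sketch of the hard-constraint check being discussed, with the CPU-only fallback from the exchange above. The messages and thresholds are illustrative assumptions:

```python
# Hypothetical sketch of the hard-constraint check for a requested GPU
# build. Messages and the CPU-only fallback suggestion are assumptions.

MIN_COMPUTE_CAPABILITY = 3.0

def check_gpu_build(os_bits, compute_capability):
    """Classify a requested GPU build as ok or a hard failure."""
    if os_bits == 32:
        return ("hard", "32-bit OS is unsupported")
    if compute_capability < MIN_COMPUTE_CAPABILITY:
        # Per the discussion above, a CPU-only build is still possible.
        return ("hard", "GPU too old for a GPU build; try a CPU-only build")
    return ("ok", "")

print(check_gpu_build(64, 2.1)[0])  # hard
print(check_gpu_build(64, 5.2)[0])  # ok
```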
> Building TF is difficult, and the friction present during the build phase is apparent in our user experiences. For a successful build, users need to be aware of requirements and configurations that they might not be familiar with. This can be extremely challenging for beginner and intermediate users.
>
> Currently, TF Python release artifacts are hosted on PyPI as the single source of TF binaries for download. However, PyPI is quite limited in capability; it is unable to recognize user machine configurations and/or deploy builds tailored to them.
Does this proposal mean that TensorFlow binaries will no longer be released on PyPI?
> Does this proposal mean that TensorFlow binaries will no longer be released on PyPI?

Guess not.
These builds are for specific combinations of OS/Python/CUDA/arch etc. PyPI currently lacks the capability to differentiate wheels based on platform.
PyPI will continue to have the tensorflow wheels with a generic arch.
From my point of view it's worth thinking about publishing TF wheels in a different thread: there are a few issues, like naming (mentioned by Subin above), including build-time configuration in the wheel file, etc.
I am sure we cannot change the way PyPI works short term, so we need to come up with a clever structure of index hosts and directories.
> A system flow chart summarizing the ordering of events for all use cases of the system:
>
> ![drawing](https://docs.google.com/drawings/d/1lSSHaYktst8MNTqF860cPlkfbgc3Ft_t_8PWzH0VkW4/export/png)
The diagrams seem to be low resolution...
All diagrams are updated with better resolution!
> ![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)
>
> Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user a URL to the binary for download. If it does not exist, it will ask the user for an email address to receive a link for downloading the newly built binary.
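The cache-check flow quoted above could be sketched as follows, using an in-memory dict standing in for the real database and GCS bucket. Names and the response shape are illustrative assumptions:

```python
# Hypothetical sketch of the backend cache-check flow described in the
# RFC. A dict stands in for the database/GCS bucket; the response shape
# is an illustrative assumption.

cache = {}  # build_options_key hash -> artifact URL

def handle_request(options_key, email=None):
    """Return a download URL if cached; otherwise ask for an email / enqueue."""
    if options_key in cache:
        return {"status": "ready", "url": cache[options_key]}
    if email is None:
        return {"status": "needs_email"}
    # A real system would enqueue a build job and mail the link on completion.
    return {"status": "building", "notify": email}

cache["abc123"] = "gs://tensorflow-artifacts/on-demand/abc123/tensorflow.whl"
print(handle_request("abc123"))
print(handle_request("def456", email="user@example.com"))
```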
To me, the binary collection is a fixed one: a fixed combination of platforms/configurations from the build system. So, for configurations not supported by the default build system, people need to build on their own, right?
And what about the addons (contrib), as they are moving out of the TF repo? Are they included in the selectable configurations?
As new combinations are requested we can expand the list.
The initial version will include core tensorflow only (and no addons like contrib). However, as @sub-mod explained above, we will consider including addons and/or sub-projects based on incoming requests.
For unsupported configurations, the tool will suggest the best-known alternative solutions. Depending on the case (and most likely), the user will need to build the wheel on their own.
> In hard constraint failure cases, the tool will point users to cloud options. In soft constraint failure cases, the tool will offer alternative build configurations (CPU only) or request users to install the missing software.
>
> We propose distributing this binary through PyPI. However, please note that the binary distribution method is still under discussion.
Is it possible to integrate this into the existing TF PyPI packages? People may get confused when there are too many PyPI packages to choose from.
Check this ML thread: https://www.mail-archive.com/wheel-builders@python.org/msg00334.html
PyPI lacks the ability to distinguish different build configurations. The system will not be integrated with PyPI (and therefore it won't introduce confusion to users trying to download TF PyPI packages).
> ![drawing](https://docs.google.com/drawings/d/1Krze2no7zjfqe7nldOm-ArECOGXVVjgtk0VMXCabzkw/export/png)
>
> Once all fields are filled in, the system backend will check whether the corresponding binary has already been built and is present in the cache. If it exists, the system will provide the user a URL to the binary for download. If it does not exist, it will ask the user for an email address to receive a link for downloading the newly built binary.
Why don't we expect basically ~all configurations that don't have active CI to be broken? That is, what happens with random compile errors that will show up in a lot of different combinations of libc/compiler/cuda?
The system will automatically file bugs for such compile errors and suggest alternative solutions to the users to help complete the build.
We currently run a large set of builds on tf-nightly across multiple platforms to make sure the builds are successful and to catch any such random compile errors upon failures. Based on our observation, we do not anticipate running into these issues, at least not too frequently.
Upon receiving these compiler error bugs, we will triage them and aim to improve the system appropriately.
Looping in @gunan for any additional explanation.
Would there be a way to request a specific version of gcc to be used for compilation?
Yes, we can do that. Also, for CentOS/RHEL we can install a certain devtoolset based on user input. We build the Docker image first based on the user's input, and then we compile TF in this Docker image.
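Generating the build image from user input could look something like this sketch, which renders a Dockerfile string. The base images, package names, and the `scl enable` shell trick are illustrative assumptions, not the actual build tooling:

```python
# Hypothetical sketch: render a build-image Dockerfile from user input,
# as described above. Base images and package names are assumptions.

def render_dockerfile(os_image, devtoolset=None):
    lines = [f"FROM {os_image}"]
    if devtoolset:
        # CentOS/RHEL path: install and activate the requested devtoolset.
        lines.append(f"RUN yum install -y {devtoolset}")
        lines.append(
            f'SHELL ["scl", "enable", "{devtoolset}", "--", "/bin/bash", "-c"]'
        )
    lines.append("# ... TF build dependencies and build steps would follow ...")
    return "\n".join(lines)

print(render_dockerfile("centos:7", devtoolset="devtoolset-7"))
```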
Are there any plans to support libtensorflow as well? We are thinking of distributing a C++ application using the C libtensorflow as a backend, and having a simple way of compiling the library for the user's system would be a huge help.
> ### Other Details
>
> **Bug Filing**
What is the plan to allow feature requests/PR submissions? For example, how would I request for the system to build a binary with compiler flags for the latest Xeon instruction set?
And how frequently will the build system compiler be updated to support updated `-m` flags?
As long as the compiler flags are available for the OS + gcc combination, they can be passed in to the build environment.
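Passing user-selected CPU features through as compiler flags could be sketched like this; the feature-to-flag table and the `--copt` convention are illustrative assumptions about how flags would reach the build:

```python
# Hypothetical sketch: translate a user's CPU feature selection into
# compiler flags for the build environment. Flag sets are illustrative.

CPU_FEATURE_FLAGS = {
    "sse4.2": ["-msse4.2"],
    "avx2": ["-mavx2", "-mfma"],
    "avx512": ["-mavx512f"],
}

def copt_args(features):
    """Translate selected CPU features into bazel --copt arguments."""
    args = []
    for feature in features:
        for flag in CPU_FEATURE_FLAGS[feature]:  # KeyError -> unsupported
            args.append(f"--copt={flag}")
    return args

print(copt_args(["avx2"]))  # ['--copt=-mavx2', '--copt=-mfma']
```

Whether a given flag is usable would still depend on the gcc version inside the chosen build image, which is the point the reply above makes.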
OC?
I meant OS (operating system).
@hyeygit Please push a change with "Status: Accepted" and I will merge this RFC.
Summary

* Adding an RFC for TF on Demand Project.
* Modified one line in tf-on-demand md file.
* Changing RFC status from PROPOSED to ACCEPTED.
Comment period is open until 2019-03-13
Objective
This document proposes a system to build optimized binaries in the cloud and deliver the artifacts to users of TensorFlow. With this system, we aim to achieve the following: