{lib}[foss/2022a] TensorFlow v2.9.1 w/ Python 3.10.4 + CUDA 11.7.0 #16620
Conversation
…tches: TensorFlow-2.9.1_fix-protobuf-include-def.patch
Currently there is a problem with nsync_cv.h not being found from a dependency during the build. This might have to be solved similarly to how it was solved for protobuf.
@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3444607748
bleep, bloop, I'm just a bot (boegelbot v20200716.01)
I've seen this error before, and I don't remember the fix.
My current plan is to create a genrule that symlinks the needed protobuf includes from PROTOBUF_INCLUDE_PATH as a dependency for those rules where similar errors occur.
I expect it to be slow going.
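The symlinking step of that plan can be sketched in Python (what the genrule's command would effectively do). This is a hypothetical helper for illustration, not code from the PR; the function name and arguments are assumptions.

```python
import os


def symlink_protobuf_includes(include_root, dest_root, headers):
    """Symlink protobuf headers from an external include tree (e.g.
    $PROTOBUF_INCLUDE_PATH) into the build tree, mirroring the relative
    layout that '#include "google/protobuf/..."' expects.

    Hypothetical helper, not part of the actual PR. 'headers' are paths
    relative to include_root, e.g. "google/protobuf/io/coded_stream.h".
    Returns the list of created link paths.
    """
    linked = []
    for rel in headers:
        src = os.path.join(include_root, rel)
        dst = os.path.join(dest_root, rel)
        # recreate the google/protobuf/... directory structure in the sandbox
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if not os.path.lexists(dst):
            os.symlink(src, dst)
        linked.append(dst)
    return linked
```

A genrule would do the same with `ln -s` in its `cmd`, declaring the linked headers as `outs` so dependent rules see them on their include path.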
On Wed 16 Nov 2022, 18:24, Alexandre Strube ***@***.***> wrote:
… In file included from tensorflow/core/platform/protobuf.cc:16:
./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
28 | #include "google/protobuf/io/coded_stream.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
I've seen this error before and I don't remember the fix
Adding `"coded_stream": ("google/protobuf/io/coded_stream.h", []),` right after the `port_def` line fixes this.
Adding the aforementioned line gets further, but then I get this:
One possible reason for the latest error you write about is that the includes in coded_stream.h need to be in the list of imports that is the second value in the include tuples. I've seen the following include errors so far:
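The "include tuples" mentioned above have a `(header, imports)` shape. As a sketch of that structure (the dict and function names here are illustrative; the real table lives in TensorFlow's Bazel glue for the system protobuf):

```python
# Assumed shape of the include table discussed above (names illustrative):
INCLUDES = {
    "port_def": ("google/protobuf/port_def.inc", []),
    # the fix from the comment above, added right after the port_def entry:
    "coded_stream": ("google/protobuf/io/coded_stream.h", []),
}


def resolve_include(name):
    """Return (header, transitive_imports) for a registered include.

    If a header itself pulls in other protobuf headers, those must appear
    in the second tuple element; otherwise the same 'No such file or
    directory' error just moves one level deeper.
    """
    return INCLUDES[name]
```

This is why adding one entry "goes further": each newly registered header can expose missing entries for the headers it includes in turn.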
What I don't understand is this PR, tensorflow/tensorflow#44144, which seems to have removed many of those include files that we now want to link to.
At that time it should have been sufficient to rebuild with an empty disk_cache, but either something has changed or that cache is very persistent. I've been building on different nodes and in different module trees with similar symptoms.
This issue, tensorflow/tensorflow#37861, describes our issue quite well. However, while the symptom is the same, it doesn't seem to be the same cause. CPATH is specified correctly in the builds, but include files are not found anyway. The command that is run uses(?) nvcc. Perhaps nvcc doesn't honor CPATH? I've got no idea. But it seems that going back to symlinking files is probably the wrong approach and should perhaps be avoided.
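For reference, GCC treats CPATH as a path-separated list of directories searched as if each were passed with -I; a tool that ignores the variable simply never performs this lookup. A minimal Python sketch of that lookup (hypothetical helper, for illustration only):

```python
import os


def resolve_via_cpath(header, cpath):
    """Mimic how a compiler that honors CPATH locates a header: scan the
    path-separated directories in order and return the first match, or
    None if the header is not found.

    Illustrative helper. A tool that ignores CPATH (as nvcc is suspected
    to above) effectively always behaves as if this returned None.
    """
    for d in cpath.split(os.pathsep):
        if not d:
            continue  # empty entries mean the current directory for GCC; skipped here
        candidate = os.path.join(d, header)
        if os.path.isfile(candidate):
            return candidate
    return None
```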
I couldn't care less about the protobuf version, and I would be happy to have a second protobuf on the system only for TensorFlow. They mention protobuf 3.6.1 in their CI build.
As it also depends on protobuf-python, I created both from 3.19.4, just removing the checksums and changing versions. protobuf-python 3.6.1 needs a little patch, which I made here: python3 patch for protobuf-python 3.6.0. But then I get this:
…which is correct: protobuf 3.6.1 doesn't have this. So I tried 3.8.0 now to see if there's any improvement, which made me arrive at... the same place. ./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
28 | #include "google/protobuf/io/coded_stream.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated. I don't know what to do further.
google/protobuf/io/coded_stream.h exists at $PROTOBUF_INCLUDE_PATH. As such, CPATH=$PROTOBUF_INCLUDE_PATH:$CPATH should either be specified or be used by the build tool. From the build logs, with Bazel I learnt that the --action_env command line option should be used to specify this, and at least on my system it is, and CPATH is used for some actions. However, not for all. Right now I'm trying to find the first commit that introduced the error; then I should at least know which rule to patch. This is how it has been done in the past: tensorflow/tensorflow#43019. Other avenues are:
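The --action_env mechanism described above can be sketched as follows. Bazel only forwards client environment variables to build actions when told to; `--action_env=NAME` (with no `=value`) takes the value from the client environment. The command shape below is illustrative, not the PR's actual invocation, and the target name is made up:

```python
import os


def bazel_build_cmd(target, passthrough=("CPATH",)):
    """Assemble an illustrative 'bazel build' invocation that forwards
    selected client environment variables to build actions.

    With '--action_env=NAME' (no '=value'), Bazel takes the value from the
    client environment. Actions whose toolchain wrappers scrub the
    environment still won't see the variable, matching the 'used for some
    actions, but not all' behaviour described above.
    """
    cmd = ["bazel", "build"]
    for name in passthrough:
        if name in os.environ:
            cmd.append("--action_env=" + name)
    cmd.append(target)
    return cmd
```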
I started getting back into the TensorFlow ECs now and got 2.7.1 working on all platforms. Then I went to the next version we have, 2.8.4, which doesn't have a CUDA EC, so I added it and am testing it now. I run into the same issue you see here:
So the issue appeared somewhere between 2.7.1 and 2.8.4, and only for the CUDA version; the non-CUDA version compiles fine for me. Previously I was able to "solve" the issue by replacing "
Happy new year! Just as an update on this: on the last (work) day of last year I was able to figure out why the includes were missing. I summarized that in an issue against Bazel, pointing out the commit I found. I'm also currently working on a fix which will solve the issue here and in other TF ECs by changing the EasyBlock, so I'd ask for a bit of patience while I'm testing this.
This patch doesn't actually solve the problem and is the wrong solution anyway.
Happy new year, and thanks for your effort! I started bisecting myself, but I kept running into other bugs for commits between releases, and feedback was so slow that I got tired of it and haven't touched it in a few weeks. I wouldn't mind taking a look at something untested, and I could run tests on my end as well.
@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3836480475
bleep, bloop, I'm just a bot (boegelbot v20200716.01)
I know your pain. It helped that I knew the bug was already in 2.8.4 but not in 2.7.1, that I had 2 powerful machines (unintentional pun: PPC machines) for compiling, and that I was also using ccache with the same hacked ad-hoc config.
Let me finish my changes to 2.8.4 first (which lacks the CUDA version and runs into the same issue), and then we can tackle this one once the underlying issue is fixed.
Closing in favor of #17092
(created using `eb --new-pr`)