{lib}[foss/2022a] TensorFlow v2.9.1 w/ Python 3.10.4 + CUDA 11.7.0 #16620
Conversation
…tches: TensorFlow-2.9.1_fix-protobuf-include-def.patch
Currently there is a problem with nsync_cv.h not being found from a dependency during the build. This might have to be solved similarly to how it was solved for protobuf.
@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3444607748
bleep, bloop, I'm just a bot (boegelbot v20200716.01)
I've seen this error before, and I don't remember the fix.
My current plan is to create a genrule that symlinks the needed protobuf includes from PROTOBUF_INCLUDE_PATH as a dependency for those rules where similar errors occur.
I expect it to be slow going.
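The symlinking step of that plan can be sketched in Python (what the genrule's command would effectively do). This is a hypothetical helper for illustration, not code from the PR; the function name and arguments are assumptions.

```python
import os


def symlink_protobuf_includes(include_root, dest_root, headers):
    """Symlink protobuf headers from an external include tree (e.g.
    $PROTOBUF_INCLUDE_PATH) into the build tree, mirroring the relative
    layout that '#include "google/protobuf/..."' expects.

    Hypothetical helper, not part of the actual PR. 'headers' are paths
    relative to include_root, e.g. "google/protobuf/io/coded_stream.h".
    Returns the list of created link paths.
    """
    linked = []
    for rel in headers:
        src = os.path.join(include_root, rel)
        dst = os.path.join(dest_root, rel)
        # recreate the google/protobuf/... directory structure in the sandbox
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if not os.path.lexists(dst):
            os.symlink(src, dst)
        linked.append(dst)
    return linked
```

A genrule would do the same with `ln -s` in its `cmd`, declaring the linked headers as `outs` so dependent rules see them on their include path.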
On Wed 16 Nov 2022, 18:24, Alexandre Strube ***@***.***> wrote:
… In file included from tensorflow/core/platform/protobuf.cc:16:
./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
28 | #include "google/protobuf/io/coded_stream.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
I've seen this error before and I don't remember the fix
Adding `"coded_stream": ("google/protobuf/io/coded_stream.h", []),` right after the `port_def` line fixes this.
Adding the aforementioned line gets further, but then I get this:
One possible reason for the latest error you write about is that the includes in coded_stream.h need to be in the list of imports that is the second value in the include tuples. I've seen the following include errors so far:
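The "include tuples" mentioned above have a `(header, imports)` shape. As a sketch of that structure (the dict and function names here are illustrative; the real table lives in TensorFlow's Bazel glue for the system protobuf):

```python
# Assumed shape of the include table discussed above (names illustrative):
INCLUDES = {
    "port_def": ("google/protobuf/port_def.inc", []),
    # the fix from the comment above, added right after the port_def entry:
    "coded_stream": ("google/protobuf/io/coded_stream.h", []),
}


def resolve_include(name):
    """Return (header, transitive_imports) for a registered include.

    If a header itself pulls in other protobuf headers, those must appear
    in the second tuple element; otherwise the same 'No such file or
    directory' error just moves one level deeper.
    """
    return INCLUDES[name]
```

This is why adding one entry "goes further": each newly registered header can expose missing entries for the headers it includes in turn.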
What I don't understand is this PR, tensorflow/tensorflow#44144, which seems to have removed many of those include files that we now want to link to.
At that time it should have been sufficient to rebuild with an empty disk_cache, but either something has changed or that cache is very persistent. I've been building on different nodes and in different module trees with similar symptoms.
This issue, tensorflow/tensorflow#37861, describes our issue quite well. However, while the symptom is the same, it doesn't seem to be the same cause. CPATH is specified correctly in the builds, but include files are not found anyway. The command that is run uses(?) nvcc. Perhaps nvcc doesn't honor CPATH? I've got no idea. But it seems that going back to symlinking files is probably the wrong approach and should perhaps be avoided.
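For reference, GCC treats CPATH as a path-separated list of directories searched as if each were passed with -I; a tool that ignores the variable simply never performs this lookup. A minimal Python sketch of that lookup (hypothetical helper, for illustration only):

```python
import os


def resolve_via_cpath(header, cpath):
    """Mimic how a compiler that honors CPATH locates a header: scan the
    path-separated directories in order and return the first match, or
    None if the header is not found.

    Illustrative helper. A tool that ignores CPATH (as nvcc is suspected
    to above) effectively always behaves as if this returned None.
    """
    for d in cpath.split(os.pathsep):
        if not d:
            continue  # empty entries mean the current directory for GCC; skipped here
        candidate = os.path.join(d, header)
        if os.path.isfile(candidate):
            return candidate
    return None
```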
I couldn't care less about the protobuf version, and I would be happy to have a second protobuf on the system only for TensorFlow. They mention protobuf 3.6.1 in their CI build.
As it also depends on protobuf-python, I created both from 3.19.4, just removing the checksums and changing versions. protobuf-python 3.6.1 needs a little patch, which I made here: python3 patch for protobuf-python 3.6.0. But then I get this:
…which is correct: protobuf 3.6.1 doesn't have this. So I tried 3.8.0 now to see if there's any improvement, which made me arrive at... the same place. ./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
28 | #include "google/protobuf/io/coded_stream.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated. I don't know what to do further.
google/protobuf/io/coded_stream.h exists at $PROTOBUF_INCLUDE_PATH. As such, CPATH=$PROTOBUF_INCLUDE_PATH:$CPATH should either be specified or be used by the build tool. From the build logs, with Bazel I learnt that the --action_env command line option should be used to specify this, and at least on my system it is, and CPATH is used for some actions. However, not for all. Right now I'm trying to find the first commit that introduced the error; then I should at least know which rule to patch. This is how it has been done in the past: tensorflow/tensorflow#43019. Other avenues are:
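The --action_env mechanism described above can be sketched as follows. Bazel only forwards client environment variables to build actions when told to; `--action_env=NAME` (with no `=value`) takes the value from the client environment. The command shape below is illustrative, not the PR's actual invocation, and the target name is made up:

```python
import os


def bazel_build_cmd(target, passthrough=("CPATH",)):
    """Assemble an illustrative 'bazel build' invocation that forwards
    selected client environment variables to build actions.

    With '--action_env=NAME' (no '=value'), Bazel takes the value from the
    client environment. Actions whose toolchain wrappers scrub the
    environment still won't see the variable, matching the 'used for some
    actions, but not all' behaviour described above.
    """
    cmd = ["bazel", "build"]
    for name in passthrough:
        if name in os.environ:
            cmd.append("--action_env=" + name)
    cmd.append(target)
    return cmd
```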
I started getting back into the TensorFlow ECs now and got 2.7.1 working on all platforms. Then I went to the next version we have, 2.8.4, which doesn't have a CUDA EC, so I added it and am testing it now. I run into the same issue you see here:
So the issue appeared somewhere between 2.7.1 and 2.8.4, and only for the CUDA version; the non-CUDA version compiles fine for me. Previously I was able to "solve" the issue by replacing "
Happy new year! Just as an update on this: on the last (work) day of last year I was able to figure out why the includes were missing. I summarized that in an issue against Bazel, pointing out the commit I found. I'm also currently working on a fix which will solve the issue here and in other TF ECs by changing the EasyBlock, so I'd ask for a bit of patience while I'm testing this.
This patch doesn't actually solve the problem and is the wrong solution anyway.
Happy new year, and thanks for your effort! I started bisecting myself, but I kept running into other bugs for commits between releases, and feedback was so slow that I got tired of it and haven't touched it in a few weeks. I wouldn't mind taking a look at something untested, and I could run tests on my end as well.
@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3836480475
bleep, bloop, I'm just a bot (boegelbot v20200716.01)
I know your pain. It helped that I knew the bug was already in 2.8.4 but not in 2.7.1, that I had 2 powerful machines (unintentional pun: PPC machines) for compiling, and that I was also using ccache with the same hacked ad-hoc config.
Let me finish my changes to 2.8.4 first (which lacks the CUDA version and runs into the same issue), and then we can tackle this one once the underlying issue is fixed.
Closing in favor of #17092
(created using `eb --new-pr`)