add and fix patches for PyTorch 1.9.0 on POWER #15919

Merged

Conversation

Flamefire
Contributor

@Flamefire Flamefire commented Jul 28, 2022

(created using eb --new-pr)

The patch introduced a copy-and-paste bug, resulting in:

Vec256<float> C10_ALWAYS_INLINE pow(const Vec256<float>& exp) const {
    return {Sleef_powf4_u10vsx(_vec0, b._vec0), Sleef_powf4_u10vsx(_vec1, b._vec1)};
}

I.e., b was used instead of exp, so both calls should take exp._vec0 and exp._vec1.

Although I left that function untouched in the referenced PR, the change seems to be required: test_binary_ufuncs fails without it and succeeds with it.
I added a commit to make the change correct.
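
The bug class described above can be illustrated with a small Python stand-in. This is only a sketch, not the VSX code from the patch: `TwoHalfVec`, `pow_copy_paste_bug`, and the use of `math.pow` in place of `Sleef_powf4_u10vsx` are hypothetical names and substitutions chosen for illustration. It shows a value stored as two halves, a `pow` that consistently uses its declared parameter `exp`, and a copy-pasted variant that still refers to `b`.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TwoHalfVec:
    """Hypothetical stand-in for a vector stored as two halves (_vec0, _vec1)."""

    vec0: Tuple[float, ...]
    vec1: Tuple[float, ...]

    def pow(self, exp: "TwoHalfVec") -> "TwoHalfVec":
        # Correct: every reference uses the declared parameter `exp`.
        return TwoHalfVec(
            tuple(math.pow(x, e) for x, e in zip(self.vec0, exp.vec0)),
            tuple(math.pow(x, e) for x, e in zip(self.vec1, exp.vec1)),
        )

    def pow_copy_paste_bug(self, exp: "TwoHalfVec") -> "TwoHalfVec":
        # Buggy on purpose: the body was copied from a function whose
        # parameter was called `b`, but here the parameter is `exp`.
        # Assuming no global `b` exists, Python raises NameError when called.
        return TwoHalfVec(
            tuple(math.pow(x, e) for x, e in zip(self.vec0, b.vec0)),  # noqa: F821
            tuple(math.pow(x, e) for x, e in zip(self.vec1, b.vec1)),  # noqa: F821
        )
```

Python reports the stray name only when the method is actually called, whereas a C++ compiler rejects an undeclared identifier at build time, assuming no other `b` is in scope.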

Micket
Micket previously approved these changes Jul 28, 2022
Contributor

@Micket Micket left a comment

lgtm

@Micket Micket added the bug fix label Jul 28, 2022
@Micket Micket added this to the next release (4.6.1?) milestone Jul 28, 2022
@Flamefire
Contributor Author

The build now works, although 2 tests fail on PPC:

== building and installing PyTorch/1.9.0-foss-2020b...
== fetching files...
== ... (took 1 secs)
== creating build dir, resetting environment...
== unpacking...
== ... (took 3 secs)
== patching...
== ... (took 5 secs)
== preparing...
== ... (took 1 min 33 secs)
== configuring...
== ... (took 4 secs)
== building...
== ... (took 54 mins 31 secs)
== testing...
== ... (took 3 hours 1 min 22 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/s3248973-EasyBuild/PyTorch/1.9.0/foss-2020b): build failed (first 300 chars): 2 tests (out of 55803) failed:
* distributed/rpc/test_tensorpipe_agent
* test_binary_ufuncs (took 3 hours 57 mins 42 secs)
== Results of the build can be found in the log file(s) /tmp/easybuild-tmplog/easybuild-PyTorch-1.9.0-20220728.161426.vOeAA.log

@Flamefire Flamefire marked this pull request as draft July 29, 2022 12:14
Micket
Micket previously approved these changes Jul 29, 2022
Contributor

@Micket Micket left a comment

lgtm (mainly because I don't know any better) (I'll try to find a machine to run a build report from)

@Micket
Contributor

Micket commented Jul 29, 2022

Oops, missed that it was marked draft!

@Flamefire Flamefire marked this pull request as ready for review July 29, 2022 13:59
@Flamefire
Contributor Author

Done now. Just noticed a mistake and wanted to wait for the custom test first. Started a job testing this on PPC now.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/a839027a57069711c8a3ed1f61691fdb for a full test report.

@boegel boegel changed the title from "PyTorch 1.9.0: Fix PPC patch" to "fix patch for PyTorch 1.9.0 on POWER" Aug 3, 2022
@Flamefire Flamefire changed the title from "fix patch for PyTorch 1.9.0 on POWER" to "Add and fix patches for PyTorch 1.9.0 on POWER" Aug 5, 2022
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
taurusml2 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/11cef6c1050b7665d4124aaed082ee68 for a full test report.

@Flamefire
Contributor Author

OK, this works now. The failing easyconfig is PyTorch-1.9.0-fosscuda-2020b-imkl.eb, which is expected because IMKL isn't available for PPC.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 4 out of 4 (3 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/d0f4b4c8ce7f13c1410ea14a74ae7e07 for a full test report.

@boegel
Member

boegel commented Sep 10, 2022

One test, test_autograd, fails for PyTorch/1.9.0-fosscuda-2020b + PyTorch-1.9.0-fosscuda-2020b-imkl.eb on our Intel Cascade Lake V100 system (test report coming up). Perhaps this one is also flaky:

======================================================================
FAIL: test_thread_shutdown (__main__.TestAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_autograd.py", line 4360, in test_thread_shutdown
    self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")
AssertionError: Regex didn't match: 'PYTORCH_API_USAGE torch.autograd.thread_shutdown' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE tensor.create\n'

----------------------------------------------------------------------
Ran 951 tests in 23.361s

FAILED (failures=1, skipped=312, expected failures=1)
test_autograd failed!
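
To make the failure above easier to read, here is a minimal sketch of the check that `assertRegex` performs, using the pattern and the captured output copied from the traceback. The helper name `has_thread_shutdown_event` and the constants are hypothetical; the real test lives in test_autograd.py and is not reproduced here.

```python
import re

# Pattern the test expects to find, copied from the traceback.
API_USAGE_PATTERN = "PYTORCH_API_USAGE torch.autograd.thread_shutdown"

# Output captured in the failing run, copied from the traceback.
CAPTURED_OUTPUT = (
    "PYTORCH_API_USAGE torch.python.import\n"
    "PYTORCH_API_USAGE c10d.python.import\n"
    "PYTORCH_API_USAGE tensor.create\n"
)


def has_thread_shutdown_event(output: str) -> bool:
    """Mirror unittest's assertRegex: pass if the pattern matches anywhere."""
    return re.search(API_USAGE_PATTERN, output) is not None
```

With the captured output the helper returns False, because the three logged events do not include the thread shutdown event; the traceback alone does not say why that event was missing.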

@boegel
Member

boegel commented Sep 11, 2022

Test report by @boegel
FAILED
Build succeeded for 2 out of 4 (3 easyconfigs in total)
node3307.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/3fe5074fc31e2fd614025199814ce7ab for a full test report.

@boegel boegel changed the title from "Add and fix patches for PyTorch 1.9.0 on POWER" to "add and fix patches for PyTorch 1.9.0 on POWER" Sep 11, 2022
Member

@boegel boegel left a comment

I won't let the failing test_autograd test block this PR, since it also occurs without the changes in this PR.

@Flamefire Is this something worth following up on?

@boegel
Member

boegel commented Sep 11, 2022

Going in, thanks @Flamefire!

@Flamefire
Contributor Author

Yes, it looks worth double-checking at least: #16233

@Flamefire Flamefire deleted the 20220728155542_new_pr_PyTorch190 branch September 11, 2022 08:49
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 4 (3 easyconfigs in total)
taurusml27 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/5b6c26a1ef289988bf456f65e14d2ce3 for a full test report.
