create $XDG_CACHE_HOME for PyTorch tests #2806

Merged
merged 2 commits into easybuilders:develop from pytorch_create_cache_dir on Nov 23, 2022

Conversation

Flamefire
Contributor

The path must exist or PyTorch will show errors/warnings like:

UserWarning: Specified kernel cache directory could not be created! This disables kernel caching.
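As a rough illustration of the idea (not the actual easyblock change in this PR), a test setup could create the directory before exporting the variable. The helper name `ensure_xdg_cache_home` and the `xdg-cache-home` subdirectory are hypothetical, and the sketch assumes Python 3:

```python
import os
import tempfile


def ensure_xdg_cache_home(base_dir):
    """Point $XDG_CACHE_HOME at a directory under base_dir and make sure it exists.

    Hypothetical helper for illustration only; not taken from the easyblock.
    """
    # Assumed directory name; any writable location would do.
    cache_dir = os.path.join(base_dir, "xdg-cache-home")
    # exist_ok=True makes the call safe to repeat if the directory is already there.
    os.makedirs(cache_dir, exist_ok=True)
    # Export the variable only after the directory exists, so child processes
    # (such as the PyTorch test runner) see a usable cache location.
    os.environ["XDG_CACHE_HOME"] = cache_dir
    return cache_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = ensure_xdg_cache_home(tmp)
        print("XDG_CACHE_HOME exists:", os.path.isdir(path))
```

Creating the directory first and setting the variable second avoids a window in which the variable points at a path that does not exist yet.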

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.10.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml22 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/c761c3d11bf2a0140a56aa2e933ccefd for a full test report.

@boegel boegel added the bug fix label Oct 18, 2022
@boegel boegel added this to the next release (4.6.2?) milestone Oct 18, 2022
@boegel
Member

boegel commented Oct 19, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/db3002446c8f4cdc98e99d5ba0d5a7e8 for a full test report.

@boegel
Member

boegel commented Oct 19, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3539.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/9c5ce58f5e0f91440e4e211a565f5234 for a full test report.

@Flamefire
Contributor Author

I'm not sure why the 2 ECs failed for you, but I'm quite certain it's not due to the change here, which should be correct by inspection (and I guess some documentation may tell us that $XDG_CACHE_HOME must exist, so this fixes a bug).

Especially as the test build on PPC passed I'd assume this is ok. ;-)

@boegel
Member

boegel commented Oct 20, 2022

@Flamefire I agree with you, but I'm being cautious here: we're very close to the next EasyBuild release, and I don't want to merge a PR last-minute which breaks the installation of PyTorch.

I wouldn't expect that making sure that $XDG_CACHE_HOME exists causes trouble, but it does seem like the behavior is slightly different when $XDG_CACHE_HOME does exist (kernel caching is not disabled), so it doesn't seem impossible to me that this affects a handful of tests...

@akesandgren
Contributor

akesandgren commented Oct 28, 2022

My testbuild of PyTorch-1.10.0-fosscuda-2020b.eb hangs on "python -s -c from multiprocessing.resource_tracker import main;main(26)" (for 11h then I killed it...)

(with this easyblock, but I do not believe that it is related)
Will try again...

Same problem again without this change.

@Flamefire
Contributor Author

Could this now be merged?

@branfosj
Member

Test report by @branfosj

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.10.0-foss-2021a.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/f31aeda337f3cec583eed3ac1525dd8d for a full test report.

@branfosj
Member

Test report by @branfosj

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.9.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), 1 x NVIDIA Tesla P100-PCIE-16GB, 470.57.02, Python 3.6.8
See https://gist.github.com/ed8b85d63e0b5a7b44b1075285fcf52b for a full test report.

Member

@branfosj branfosj left a comment

I'm happy to go ahead and merge this.

The failures are in tests that we've seen issues with in various PyTorch versions, so I do not think they are sufficient reason to prevent merging this.

@branfosj
Member

Going in, thanks @Flamefire!

@branfosj branfosj merged commit 4dc6537 into easybuilders:develop Nov 23, 2022
@Flamefire Flamefire deleted the pytorch_create_cache_dir branch November 23, 2022 17:12
4 participants