HIP issue list as discussed in the offline meeting #2973

KaiSzuttor · 2019-07-05T10:01:20Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/134161 (connected to Lbgpu node vel #2878)
unreadable error message at runtime

KaiSzuttor · 2019-07-23T07:45:19Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/137077

mkuron · 2019-07-23T08:26:25Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/134161

That one was due to incorrect #pragma unroll use and caused reduced performance (but no crash) on CUDA too. Fixed by #2982.

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/137077

Please merge the latest master branch; that issue has been fixed since #2937.

mkuron · 2019-07-31T15:10:03Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/142467

That one was reproducible on an Nvidia 2080, but we don't have that in CI. So one more case where AMD actually helped find a bug that can affect Nvidia too.

jngrad · 2019-12-13T15:34:12Z

Memory access fault by GPU node-4 (Agent handle: 0x55be70acd450) on address 0x7f70b50ec000. Reason: Page not present or supervisor privilege.
- https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/191997
- https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/191285

jngrad · 2020-03-12T16:15:35Z

The ROCm library rocFFT broke on multiple occasions:

- ROCm 2.9: rocFFT from ROCm 2.9.0 is broken docker#139, had to write a complex fix and wait for a patch upstream
- ROCm 3.0: Update images and add Ubuntu 20.04 docker#157, url in /etc/apt/sources.list.d/rocm.list was wrong
- ROCm 3.1: Support for ROCm 3.1.0 docker#156, we have to wait for a patch upstream

fweik · 2020-04-01T11:04:22Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/218922

mkuron · 2020-04-01T11:14:20Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/218922

That's an odd one. Why does that test even call into HIP code?

       Start   1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1
  1/149 Test   #1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 ..............***Exception: Child aborted  1.33 sec
Memory access fault by GPU node-5 (Agent handle: 0x563058748d20) on address 0x7f94057e0000. Reason: Page not present or supervisor privilege.

jngrad · 2020-04-01T17:41:46Z

It happened again today, dedicated ticket: #3620

fweik · 2020-04-01T18:49:11Z

@mkuron I think the GPU initialization code is always run, to detect the devices present and so on, but I haven't checked.

mkuron · 2020-04-02T08:25:31Z

ROCm 3.3 was released last night (not sure what happened to 3.2). It's installed for testing on lama. It's still broken in multiple ways:

- `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` required because the hipcc_cmake_linker_helper has not been fixed.
- tests succeed, but hang during `HSA::hsa_shut_down()`

At least they fixed the cudaMemcpyToSymbol thing that broke the EK and LB tests. The shutdown thing is probably our own fault though; it is probably related to the order of destruction for static/global variables and library unloading.
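To illustrate the suspected destruction-order problem, here is a minimal sketch. It is not ESPResSo code and does not use the HIP or HSA API: `FakeRuntime`, `FakeBuffer`, `RuntimeGuard` and `run` are hypothetical names made up for illustration. C++ destroys objects in reverse order of construction, so whichever object is created first is torn down last. If a device resource is created before the object that shuts the runtime down, the resource is released only after the runtime is already gone.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records events so the destruction order can be inspected afterwards.
using EventLog = std::vector<std::string>;

// Hypothetical stand-in for a GPU runtime; this is not the HIP or HSA API.
struct FakeRuntime {
  EventLog log;
  bool alive = true;
  void shut_down() {
    alive = false;
    log.push_back("runtime shut down");
  }
};

// Hypothetical device buffer that reports whether it was released in time.
struct FakeBuffer {
  explicit FakeBuffer(FakeRuntime &runtime) : rt(runtime) {
    assert(rt.alive); // allocation only makes sense on a live runtime
    rt.log.push_back("buffer allocated");
  }
  ~FakeBuffer() {
    rt.log.push_back(rt.alive ? "buffer freed" : "buffer freed after shutdown");
  }
  FakeRuntime &rt;
};

// Shuts the runtime down when the guard itself is destroyed.
struct RuntimeGuard {
  explicit RuntimeGuard(FakeRuntime &runtime) : rt(runtime) {}
  ~RuntimeGuard() { rt.shut_down(); }
  FakeRuntime &rt;
};

// Objects in each branch are destroyed in reverse order of construction.
EventLog run(bool guard_first) {
  FakeRuntime rt;
  if (guard_first) {
    RuntimeGuard guard(rt); // constructed first, destroyed last
    FakeBuffer buffer(rt);  // freed while the runtime is still alive
  } else {
    FakeBuffer buffer(rt);  // constructed first, destroyed last
    RuntimeGuard guard(rt); // shuts the runtime down before the buffer is freed
  }
  return rt.log;
}
```

In the real case the objects would be static/global variables in different translation units or libraries, where the construction order is much harder to control than in this single-function sketch.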

jngrad · 2020-09-16T14:58:30Z

DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895: took me 5 man-hours to troubleshoot, and probably as much for @mkuron; we ended up using lama as a runner in CI

mkuron · 2020-09-16T15:34:49Z

#3895

No idea about that one, it's either a hardware or driver issue, and a heisenbug too. We don't have anyone here who understands the Barnes-Hut code, so our debugging abilities are rather limited.

jngrad · 2020-09-23T21:38:03Z

Build system breaks on ROCm 3.8 #3909: broken AMD packages on lama, fixed by re-installing packages

jngrad · 2020-10-16T12:52:44Z

the lama backup broke down several times before, during and after the summer school (Sep 30, twice on Oct 6, Oct 12), putting us in a situation where CI could not pass and PRs could not get merged

mkuron · 2020-10-16T13:51:02Z

lama

The first two cases were due to a broken SSD, the other one was due to a crashed graphics driver.

KaiSzuttor · 2020-10-19T06:55:51Z

as long as we do not have redundancy for testing rocm, we cannot use it in CI.

Closes #2973, closes #3895, follow-up to espressomd/docker#190

Due to their fast release cycle, ROCm packages are not stable enough for ESPResSo. We are currently supporting ROCm 3.0 to 3.7, which means supporting two compilers (HCC and HIP-Clang) and keeping patches for each ROCm release in our codebase. Maintaining these patches and the CMake build system for ROCm is time-consuming. The ROCm job also has a tendency to break the CI pipeline (#2973), sometimes due to random or irreproducible software bugs in ROCm, sometimes due to failing hardware in both the main CI runner and backup CI runner. The frequency of false positives in CI is too large compared to the number of true positives. The last true positives were 5da80a9 (April 2020) and #2973 (comment) (July 2019). There are no known users of ESPResSo on AMD GPUs according to the [May 2020 user survey](https://lists.nongnu.org/archive/html/espressomd-users/2020-05/msg00001.html). The core team has decided to drop support for ROCm ([ESPResSo meeting 2020-10-20](https://github.com/espressomd/espresso/wiki/Espresso-meeting-2020-10-20)).

The Intel image cannot be built automatically in the espressomd/docker pipeline. Re-building it manually is a time-consuming procedure, usually several hours, due to long response time from the licensing server and the size of the Parallel Studio XE source code. When a new Intel compiler version is available, it usually requires the expertise of two people to update the dockerfile. The core team has decided to remove the Intel job from CI and use the Fedora job to test ESPResSo with MPICH.

jngrad mentioned this issue Oct 22, 2020

Remove ROCm integration #3966

Merged

kodiakhq bot closed this as completed in #3966 Oct 23, 2020
