DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895
Currently affecting the following PRs: #3891, #3896. These tests fail randomly since yesterday.
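For context, the failing tests exercise the DipolarBarnesHutGpu and DipolarDirectSumGpu magnetostatics solvers. Below is a minimal sketch of the kind of check involved (comparing the two GPU solvers on a small random system of dipoles), assuming the espressomd 4.1 Python API and a working GPU device; the particle count and the Barnes-Hut parameters (`epssq`, `itolsq`) are illustrative assumptions, not the actual test-suite settings.

```python
import numpy as np
import espressomd
import espressomd.magnetostatics

# Small random system of point dipoles (parameters are illustrative only).
np.random.seed(42)
system = espressomd.System(box_l=[10.0, 10.0, 10.0])
system.time_step = 0.01
system.cell_system.skin = 0.4
system.periodicity = [False, False, False]  # open boundaries for the direct sum

n_part = 100
system.part.add(pos=np.random.random((n_part, 3)) * system.box_l,
                dip=np.random.random((n_part, 3)) - 0.5,
                rotation=n_part * [(1, 1, 1)])

# Dipolar direct sum on the GPU, used here as the reference solution.
dds = espressomd.magnetostatics.DipolarDirectSumGpu(prefactor=1.0)
system.actors.add(dds)
system.integrator.run(0)
forces_dds = np.copy(system.part[:].f)
system.actors.remove(dds)

# Barnes-Hut dipolar sum on the GPU with assumed tree parameters.
bh = espressomd.magnetostatics.DipolarBarnesHutGpu(prefactor=1.0, epssq=100.0,
                                                   itolsq=4.0)
system.actors.add(bh)
system.integrator.run(0)
forces_bh = np.copy(system.part[:].f)

print("max force deviation:", np.max(np.abs(forces_bh - forces_dds)))
```

On a working GPU the two solvers should roughly agree within the tree tolerance; the CI failures reported here appeared randomly rather than systematically.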
The failures are reproducible with the docker image on […].
Reproducible as far back as 79b53f6. Before that, we used a different docker image and a different compiler (HCC).
Hmm, then why did it just start coming up in CI this week? @psci2195, any ideas what might be causing it? I don't know enough about the algorithm to understand your code.
coyote10 was just rebooted and the issue appears resolved. Looks like a hardware or driver glitch.
Rebooting the runner seems to have fixed the issue. Before the reboot: […]
After the reboot: all good.
Rebooting the runner only made the issue irreproducible in docker in an SSH terminal; it is still failing in CI pipelines. The 4.1.4 release might be affected too. In docker in an SSH terminal, there is a random MPI deadlock with […].
We're temporarily running the ROCm jobs on […].
We exchanged one Vega 56 from […]. We will leave the GPUs in this configuration for now. If the issue resurfaces on the Vega 56 but not on the other card in […], we will know the problem follows that GPU.
Failing again on […].
Closes #2973, closes #3895, follow-up to espressomd/docker#190.

Due to their fast release cycle, ROCm packages are not stable enough for ESPResSo. We currently support ROCm 3.0 to 3.7, which means supporting two compilers (HCC and HIP-Clang) and keeping patches for each ROCm release in our codebase. Maintaining these patches and the CMake build system for ROCm is time-consuming. The ROCm job also has a tendency to break the CI pipeline (#2973), sometimes due to random or irreproducible software bugs in ROCm, sometimes due to failing hardware in both the main and the backup CI runners. The rate of false positives in CI is too high compared to the number of true positives: the last true positives were 5da80a9 (April 2020) and #2973 (comment) (July 2019). There are no known users of ESPResSo on AMD GPUs according to the [May 2020 user survey](https://lists.nongnu.org/archive/html/espressomd-users/2020-05/msg00001.html). The core team has decided to drop support for ROCm ([ESPResSo meeting 2020-10-20](https://github.com/espressomd/espresso/wiki/Espresso-meeting-2020-10-20)).

The Intel image cannot be built automatically in the espressomd/docker pipeline. Re-building it manually is a time-consuming procedure, usually taking several hours, due to the long response time of the licensing server and the size of the Parallel Studio XE source code. When a new Intel compiler version becomes available, updating the dockerfile usually requires the expertise of two people. The core team has decided to remove the Intel job from CI and use the Fedora job to test ESPResSo with MPICH.
https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/13105