Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Update CI docker and suppress warnings #2333

Merged
merged 23 commits into from
Aug 28, 2023
Merged

[CI] Update CI docker and suppress warnings #2333

merged 23 commits into from
Aug 28, 2023

Conversation

junliume
Copy link
Collaborator

No description provided.

@junliume
Copy link
Collaborator Author

junliume commented Aug 20, 2023

@JehandadKhan @atamazov
We have seen many issues with this update and some of them look like:

[2023-08-20T06:24:00.398Z] /long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1034:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
[2023-08-20T06:24:00.398Z]       __hip_assert(false && "invalid tile size");
[2023-08-20T06:24:00.398Z]                                                 ^

another issue:

[2023-08-20T06:24:00.885Z] MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.7.23302, MIOpen version 3.00.0.a9f822391
[2023-08-20T06:24:00.885Z] MIOpen(HIP): Info [PrintVersionImpl] COMgr v.2.5.0, USE_HIP_PCH: 1
[2023-08-20T06:24:00.885Z] /home/jenkins/workspace/MLLibs_MIOpen_ci_rocm57/src/include/miopen/hipoc_kernel.hpp:61:36: runtime error: constructor call on misaligned address 0x7fff9a5b698c for type 'const void *', which requires 8 byte alignment
[2023-08-20T06:24:00.885Z] 0x7fff9a5b698c: note: pointer points here

which might be sanitizer checks related. by making this change

static const int second_index = (sizeof(T) + padding + 7) & ~7;

is not helping unfortunately.

@atamazov
Copy link
Contributor

@junliume BTW I see that submodule fin updated from ebf9b355da09e10cb77e4b163a02e57be954095e to b2f3f4db3c3d7dd757e6d9e68719a780d8114dfa. Diff seems empty. Please revert this change.

@cderb
Copy link
Contributor

cderb commented Aug 21, 2023

I see that submodule fin updated from ebf9b355da09e10cb77e4b163a02e57be954095e to b2f3f4db3c3d7dd757e6d9e68719a780d8114dfa. Diff seems empty. Please revert this change.

This is ok @atamazov. The branch update returns Fin to the tip of fin develop after it was branched for PR #2295.

@cderb
Copy link
Contributor

cderb commented Aug 21, 2023

We have seen many issues with this update and some of them look like:

@junliume I've been addressing many of the rocm 5.7 related make check issues in an MIOpen-internal branch, I'll send you.

@atamazov
Copy link
Contributor

@junliume

@JehandadKhan @atamazov We have seen many issues with this update and some of them look like:

[2023-08-20T06:24:00.398Z] /long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1034:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
[2023-08-20T06:24:00.398Z]       __hip_assert(false && "invalid tile size");
[2023-08-20T06:24:00.398Z]                                                 ^

This seems like a a problem in the implementation of the precompiled HIP headers. I do not know what __hip_assert is, but I suspect that it should not expand to nothing when NDEBUG is defined. If I am right, then the simplest way to fix this in the HIP compiler/library is to define that macro like this state-of-the-art thing:

#ifndef NDEBUG
#define __hip_assert(ex) ... // normal definition of assert
#else
#define __hip_assert(ex) (void(0))
#endif

The code like (void(0)); should not produce any warning. Alternatively, hipRTC_header.h should observe NDEBUG at line 1034 and do not use __hip_assert when NDEBUG is defined.

Again, AFAICS this happens in the PCH, not in our code. I recommend informing the guys who are responsible for PCH, then silence the warning in MIOpen permanently (I do not think it's worth a full-featured "workaround" stuff) and forgetting about this.

@JehandadKhan
Copy link
Collaborator

@junliume

@JehandadKhan @atamazov We have seen many issues with this update and some of them look like:

[2023-08-20T06:24:00.398Z] /long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1034:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
[2023-08-20T06:24:00.398Z]       __hip_assert(false && "invalid tile size");
[2023-08-20T06:24:00.398Z]                                                 ^

This seems like a a problem in the implementation of the precompiled HIP headers. I do not know what __hip_assert is, but I suspect that it should not expand to nothing when NDEBUG is defined. If I am right, then the simplest way to fix this in the HIP compiler/library is to define that macro like this state-of-the-art thing:

#ifndef NDEBUG
#define __hip_assert(ex) ... // normal definition of assert
#else
#define __hip_assert(ex) (void(0))
#endif

The code like (void(0)); should not produce any warning. Alternatively, hipRTC_header.h should observe NDEBUG at line 1034 and do not use __hip_assert when NDEBUG is defined.

Again, AFAICS this happens in the PCH, not in our code. I recommend informing the guys who are responsible for PCH, then silence the warning in MIOpen permanently (I do not think it's worth a full-featured "workaround" stuff) and forgetting about this.

@atamazov I agree, this should be fixed in the runtime headers

@junliume
Copy link
Collaborator Author

@junliume BTW I see that submodule fin updated from ebf9b355da09e10cb77e4b163a02e57be954095e to b2f3f4db3c3d7dd757e6d9e68719a780d8114dfa. Diff seems empty. Please revert this change.

This is to keep track of FIN's develop branch. I see this happens several times:

  1. When updating MIOpen, FIN commit hash is updated to a "branch" commit hash,
  2. When that branch commit hash is committed to FIN, there will be no difference between its develop branch and the one used to commit in MIOpen.

I think we need a standard procedure to update FIN :) notably, Clang formatting will change one file in FIN and I wanted to make that change a while ago.

new(buffer) T(x); // NOLINT (clang-analyzer-cplusplus.PlacementNew)
new(buffer + second_index) U(y);
}
char buffer[second_index + sizeof(U)] = {};
alignas(U) char buffer[second_index + sizeof(U)] = {};
Copy link
Contributor

@atamazov atamazov Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may change the layout of kernel arguments in memory. Without alignas(U), the required buffer alignment is 1 (i.e., no alignment is required). Therefore, the alignment of KernelArgsPair instances is also 1, and several such instances reside in memory without any gaps.

With alignas(U), the padding required for KernelArgsPair is alginof(U), which may lead to gaps between instances of KernelArgsPair.

If you see kernel failures, then please revert this change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atamazov I do see lots of failures on gfx906 with

handlehip.cpp:80: Failed getting available memory: invalid argument

However, it looks more likely a runtime issue since it cannot be reproduced on other newer ASICs. But I will revert this change and try again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junliume This is not related and I see the same on navi21.

Reverting this change won't resolve the issue with getting available memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junliume It seems like this change is indeed correct and should be kept.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atamazov can you verify it happens on Navi21?
For Vega nodes some have problems but some other nodes do not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junliume Of course, this happens on Navi21, as I've reported a while ago at #2307 (comment) where you can find the dirty hacks for this. I am working on more or less regular W/A which should be suitable for merging into develop.

@atamazov
Copy link
Contributor

@junliume I am working on this. Please expect update soon.

@atamazov
Copy link
Contributor

atamazov commented Aug 24, 2023

@junliume I think that https://github.com/atamazov/MIOpen/blob/ci_rocm57_ata1 is ready for merging here.

  • Fixed potential UB in KernelArgs detected by sanitizer.
  • WORKAROUND_FAULTY_HIPMEMGETINFO_VEGA_NAVI2X introduced.
    • If hipMemGetInfo fails, issues warning message and returns "fake" 16 GiB value. For now, this allows for the continued use of gfx906 and Nav21 nodes in development and testing.
    • W/A is GPU-agnostic because we do not know the GPU type in the place where the W/A is applied.
    • W/A may be considered as potentially dangerous because it always camouflages hipMemGetInfo failures. However, we're safe as long as hipMemGetInfo works fine (as it should on all officially supported GPUs, as the HIP runtime must be well tested before use with MIOpen).
  • [CI] Fix: Applied WORKAROUND_ISSUE_1148 to a 'Fp16 Hip All gfx1030' stage.
    • This is expected to prevent random failures on Navi21 nodes. My Navi21 system fails quite consistently (but randomly ;) without this W/A.
  • [CI] Quality: Removed unnecessary params.TARGET_NAVI21 from AnyGPU stages.
  • Merged recent changes from ci_rocm57, except b59403d
  • Merged recent changes from develop, conflicts resolved.

🌀 Testing

Tested on Navi21 and gfx906/60 (Radeon VII) systems. Note that my Navi21 system has the "hipMemGetInfo" issue, but my Radeon VII doesn't. Tried the followed tests on both systems:

  • Debug build with sanitizers, Smoke FP32 test - PASSED
  • Release build, Full FP16 test - PASSED

@atamazov
Copy link
Contributor

@junliume
Copy link
Collaborator Author

  • WORKAROUND_FAULTY_HIPMEMGETINFO_VEGA_NAVI2X

@atamazov I am not reproducing the hipGetMemoryInfo() error with 5.7 RC3 docker (at least so far), but indeed many CI nodes are not as stable now, and I have asked these base OS to be updated.

@atamazov
Copy link
Contributor

@junliume Are you going to merge https://github.com/atamazov/MIOpen/blob/ci_rocm57_ata1 into this branch? If yes, then I will double-check if hipMemGetInfo problem is resolved in RC3 and narrow (or remove) the W/A.

@junliume
Copy link
Collaborator Author

junliume commented Aug 25, 2023

@junliume Are you going to merge https://github.com/atamazov/MIOpen/blob/ci_rocm57_ata1 into this branch? If yes, then I will double-check if hipMemGetInfo problem is resolved in RC3 and narrow (or remove) the W/A.

I wish to and need to figure out how to sync from ci_rocm57_ata1 ...

Update: done.

@atamazov
Copy link
Contributor

@junliume hipMemGetInfo problem is NOT resolved in 5.7 RC3. This PR can be merged as is.

@atamazov
Copy link
Contributor

@junliume @JehandadKhan [Notice] Interesting detail:
Both RC1 and RC3 report HIP version 5.7.31921 in hip_version.h which is not good. However, CMake (hip-config-version.cmake) reports different numbers:

  • RC1: 5.7.23302
  • RC3: 5.7.23341

"Thanks" to some problems with hip_version.h in some old versions of ROCm we use CMake machinery as a workaround, and the library is able to distinguish between RC1 and RC3 (even if this is not necessary yet).

@junliume
Copy link
Collaborator Author

Ping @JehandadKhan and @atamazov for review after CI has passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants