hip mfma tests #246

Open
wants to merge 92 commits into base: develop

Conversation

@CRobeck CRobeck (Member) commented May 20, 2022

This PR adds a basic functionality test that exercises the matrix cores on AMD gfx908 and gfx90a hardware for dense matrix products.

CRobeck and others added 30 commits April 26, 2022 12:07
@MrBurmark (Member)

This is looking much better.
The main thing to do now is to convert it to run in parallel on the GPU. I think it's fine if the per-thread work and the block size differ between tunings, as long as they're still similar enough to be considered different tunings of the same algorithm.

constexpr Index_type Ne = m_Ne;
constexpr Index_type NeNe = m_Ne * m_Ne;

dim3 gridDim (1, 1, 1);
Member
Is this right?

Member Author
@CRobeck CRobeck Jul 12, 2022

The mfma instructions operate on a per-wavefront basis, as opposed to per-thread. We're using 4 groups of 16 threads for each outer product, so we only need a single block per grid.
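For context, here's a minimal single-wavefront sketch of that arrangement (this is not the PR's kernel; the kernel name is made up, and the lane-to-element mapping is my reading of the documented v_mfma_f32_16x16x4f32 register layout, so it should be checked against the ISA/intrinsics docs):

// Sketch only (not the PR's kernel): one wavefront of 64 threads, arranged as
// 16x4, computes a single 16x16 = (16x4)*(4x16) product with one MFMA issue.
#include <hip/hip_runtime.h>

using vec4f = __attribute__((__vector_size__(4 * sizeof(float)))) float;

__global__ void mfma_16x16x4_sketch(const float* A,  // 16x4, row-major
                                    const float* B,  // 4x16, row-major
                                    float* D)        // 16x16, row-major
{
  vec4f d = {0.0f, 0.0f, 0.0f, 0.0f};  // accumulator; C assumed to be zero

  const int mn = threadIdx.x;  // 0..15: row of A, column of B and D
  const int k  = threadIdx.y;  // 0..3 : shared inner index

  const float a = A[mn * 4 + k];   // this lane's A[m][k]
  const float b = B[k * 16 + mn];  // this lane's B[k][n]

  d = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, d, 0, 0, 0);

  // Each lane holds 4 consecutive rows of one column of D.
  for (int i = 0; i < 4; ++i) {
    D[(4 * threadIdx.y + i) * 16 + threadIdx.x] = d[i];
  }
}

// Launch geometry matching the comment above: 4 groups of 16 threads,
// a single block per grid.
// dim3 blockDim(16, 4, 1);
// dim3 gridDim (1, 1, 1);
// hipLaunchKernelGGL(mfma_16x16x4_sketch, gridDim, blockDim, 0, 0, dA, dB, dD);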

Member
I'm worried that once we saturate the flops of a single CU we'll be leaving the rest of the GPU's flops on the table.

hipDeviceProp_t devProp;
hipGetDeviceProperties(&devProp, 0);
std::string gcnArchName(devProp.gcnArchName);
std::string hipArch = gcnArchName.substr(0, 6);
Member
Is substr(0, 6) the right thing for all architectures? Aren't there 7-character GPU names like gfx10##?

Member Author
@CRobeck CRobeck Jul 15, 2022

Right. Currently we're only using it to test for gfx908 and gfx90a features, but if we want to use this function more generally (e.g. for testing xnack-ness) I suppose we should really grab the entire string and then sub-select from it based on what we're going to query. Full names look like:
gfx908:sramecc-:xnack-
gfx1010:sramecc-:xnack-
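Something like the following (hypothetical helper name, plain string splitting on ':') would cover the more general case:

// Hypothetical sketch: keep the full gcnArchName (e.g. "gfx908:sramecc-:xnack-")
// and split it on ':' into the base architecture plus feature flags, instead of
// a fixed substr(0, 6). Helper name and interface are assumptions, not PR code.
#include <hip/hip_runtime.h>
#include <sstream>
#include <string>
#include <vector>

inline void parseGcnArchName(std::string& baseArch,
                             std::vector<std::string>& features)
{
  hipDeviceProp_t devProp;
  hipGetDeviceProperties(&devProp, 0);

  std::stringstream ss(devProp.gcnArchName);
  std::string token;
  bool first = true;
  while (std::getline(ss, token, ':')) {
    if (first) { baseArch = token; first = false; }  // "gfx908", "gfx1010", ...
    else       { features.push_back(token); }        // "sramecc-", "xnack-", ...
  }
}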

Comment on lines +39 to +44
#define MAT_FUSED_MUL_ADD_BODY \
Real_type dot = 0; \
for (Index_type k = 0; k < Ne; ++k) { \
dot += A[row*Ne + k + ii*(Ne*Ne)] * B[k*Ne + col + ii*(Ne*Ne)]; \
} \
D[row*Ne + col + ii*(Ne*Ne)] = dot; \
Member
@MrBurmark MrBurmark Jul 14, 2022

Is this supposed to be doing D = A*B or D = A*B+C?

Member Author
@CRobeck CRobeck Jul 14, 2022

The mfma instructions are for computing D = A x B + C, but we're assuming for this case that C is zero and can be ignored. It's defined this way so we can extend it to future cases with a non-trivial C matrix.
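For illustration, a non-trivial C would presumably just change the initialization of the accumulator; a sketch (hypothetical macro name, C assumed to use the same layout as A, B, and D):

// Hypothetical variant (not in this PR): the only change for a non-trivial C
// is the initialization of the accumulator.
#define MAT_FUSED_MUL_ADD_BODY_WITH_C \
  Real_type dot = C[row*Ne + col + ii*(Ne*Ne)]; \
  for (Index_type k = 0; k < Ne; ++k) { \
    dot += A[row*Ne + k + ii*(Ne*Ne)] * B[k*Ne + col + ii*(Ne*Ne)]; \
  } \
  D[row*Ne + col + ii*(Ne*Ne)] = dot;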

Member

The name of the kernel is confusing, then, if we're not actually doing the ADD part.


startTimer();
for (Index_type irep = 0; irep < run_reps; ++irep) {
for(Index_type ii = 0; ii != (N/(Ne*Ne)); ++ii){
Member

Is N/(Ne*Ne) the number of elements? Should we make it a named quantity?

@@ -122,7 +122,7 @@ void DEL_DOT_VEC_2D::runHipVariantImpl(VariantID vid)

const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);

-  hipLaunchKernelGGL((lambda_hip_forall<block_size, decltype(deldotvec2d_lambda)>),
+  hipLaunchKernelGGL((lambda_hip_forall_1D<block_size, decltype(deldotvec2d_lambda)>),
Member

Should this update be in another PR to keep this one focused on the kernel? I try to keep CUDA and HIP in sync.

Member Author

I can revert it and/or move it to a new branch. It got pulled in from one of your review comments, and I foresee adding a lambda_hip_forall variant that uses 2D thread indexing, so I wanted to get ahead of it.
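Roughly, such a 2D-indexed wrapper might look like this (hypothetical sketch patterned after the existing lambda_hip_forall kernel; the name and signature are guesses, not code from this PR):

// Hypothetical 2D-thread-indexed forall wrapper; names and signature assumed.
template < size_t block_size_x, size_t block_size_y, typename Lambda >
__launch_bounds__(block_size_x * block_size_y)
__global__ void lambda_hip_forall_2D(Index_type iend_x, Index_type iend_y,
                                     Lambda body)
{
  Index_type i = blockIdx.x * block_size_x + threadIdx.x;
  Index_type j = blockIdx.y * block_size_y + threadIdx.y;
  if (i < iend_x && j < iend_y) {
    body(i, j);
  }
}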

Member

Let's not rename it here.

Comment on lines +38 to +39
const Index_type N_Elem = N/(Ne*Ne);
for(Index_type ii = 0; ii != N_Elem; ++ii){
Member
@MrBurmark MrBurmark Jul 18, 2022

Do we want to parallelize over elements?

Member Author

Since this is meant to mirror an FE mass-matrix solve, we need to assume that each element is independent and shares no common data (they could share data, but this is the worst case).
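Given that independence, one way to parallelize over elements would be to map one block per element, e.g. (sketch only, reusing the indexing from MAT_FUSED_MUL_ADD_BODY; the kernel name and launch shape are assumptions, not the PR's code):

// Sketch only: distribute the independent elements across the grid by mapping
// one block per element (blockIdx.x == ii) and one thread per (row, col) entry.
__global__ void mat_fused_mul_add_per_element(const Real_type* A,
                                              const Real_type* B,
                                              Real_type* D,
                                              Index_type Ne)
{
  const Index_type ii  = blockIdx.x;   // element index; gridDim.x == N_Elem
  const Index_type row = threadIdx.y;  // 0..Ne-1
  const Index_type col = threadIdx.x;  // 0..Ne-1

  Real_type dot = 0;
  for (Index_type k = 0; k < Ne; ++k) {
    dot += A[row*Ne + k + ii*(Ne*Ne)] * B[k*Ne + col + ii*(Ne*Ne)];
  }
  D[row*Ne + col + ii*(Ne*Ne)] = dot;
}

// e.g. hipLaunchKernelGGL(mat_fused_mul_add_per_element,
//                         dim3(N_Elem), dim3(Ne, Ne), 0, 0, A, B, D, Ne);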

@rhornung67 rhornung67 mentioned this pull request Jul 10, 2023