Heyi fused grad accumulation #138

Open · wants to merge 6 commits into master

Conversation

eliotwang

hipblaslt implementation in fused_weight_gradient_dense

//cublasHandle_t handle = at::cuda::getCurrentCUDABlasHandle();
//if(g_hipblas_handle == nullptr)
// CHECK_HIPBLAS_ERROR(hipblasCreate(&g_hipblas_handle));
hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasHandle();

Since this is hipblasLt, I think we need to use getCurrentCUDABlasLtHandle?

Correct. On CUDA, the cuBLAS and cuBLASLt handles can be used interchangeably, but hipBLAS and hipBLASLt handles cannot.
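
For reference, a minimal sketch of the handle fix being suggested (the include path can vary between PyTorch versions, and this is not the PR's final code):

    #include <ATen/cuda/CUDAContext.h>

    // hipBLASLt entry points expect a hipblasLtHandle_t, so fetch the Lt handle
    // that PyTorch already manages instead of the plain hipBLAS handle.
    hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasLtHandle();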

hipblaslt_ext::Gemm gemm(
handle, transa, transb, HIP_R_16BF, HIP_R_16BF, HIP_R_32F, HIP_R_32F, HIPBLAS_COMPUTE_32F);

hipblaslt_ext::GemmEpilogue

This is using the hipblasLt Ext API. We haven't used that much; we mostly use the plain hipblasLt API. I will need to check with the hipblasLt team regarding the difference (and its performance implications). And since the hipblasLt API is more commonly used, I expect it to be more stable and have fewer bugs.
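
For reference, a rough sketch of what the plain (non-ext) hipblasLt call sequence looks like for this GEMM. The CHECK_HIPBLASLT_ERROR macro and the bf16-in / fp32-out types are taken from this PR; the operand names (d_a, d_b, d_weight), dimensions (m, n, k), alpha/beta, workspace, and stream are placeholders, not the final implementation:

    hipblasOperation_t transa = HIPBLAS_OP_T;
    hipblasOperation_t transb = HIPBLAS_OP_N;

    // Matmul descriptor with fp32 compute and fp32 scale type.
    hipblasLtMatmulDesc_t matmul;
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescCreate(&matmul, HIPBLAS_COMPUTE_32F, HIP_R_32F));
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescSetAttribute(
        matmul, HIPBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa)));
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescSetAttribute(
        matmul, HIPBLASLT_MATMUL_DESC_TRANSB, &transb, sizeof(transb)));

    // One layout per operand: bf16 inputs, fp32 accumulation/output.
    hipblasLtMatrixLayout_t layoutA, layoutB, layoutC;
    CHECK_HIPBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&layoutA, HIP_R_16BF, k, m, k));
    CHECK_HIPBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&layoutB, HIP_R_16BF, k, n, k));
    CHECK_HIPBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&layoutC, HIP_R_32F, m, n, m));

    // Heuristic query followed by the matmul itself.
    hipblasLtMatmulPreference_t pref;
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulPreferenceCreate(&pref));
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulPreferenceSetAttribute(
        pref, HIPBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &workspace_size, sizeof(workspace_size)));

    hipblasLtMatmulHeuristicResult_t heuristicResult;
    int returnedAlgoCount = 0;
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulAlgoGetHeuristic(
        handle, matmul, layoutA, layoutB, layoutC, layoutC,
        pref, 1, &heuristicResult, &returnedAlgoCount));

    // d_weight is used as both C and D so the gradient is accumulated in place.
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmul(
        handle, matmul, &alpha, d_a, layoutA, d_b, layoutB, &beta,
        d_weight, layoutC, d_weight, layoutC, &heuristicResult.algo,
        workspace, workspace_size, stream));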

@eliotwang (Author)

update summary:

  1. replace the hipblaslt_ext API with the hipblaslt API
  2. create and destroy the handle per call
  3. add test_weight_grad.py to tests/L0/run_transformer/, following the conventions of the existing tests there

hipblasLtMatmulDesc_t matmul;
CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescCreate(&matmul, HIPBLAS_COMPUTE_32F, HIP_R_32F));
CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescSetAttribute(
matmul, HIPBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(int32_t)));

Not sure whether transa is actually an int32_t, but I guess we can just use sizeof(transa).
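
A minimal sketch of the suggested tweak (assuming transa is declared as hipblasOperation_t, which is the type hipblasLt expects for this attribute):

    // sizeof(transa) stays correct even if the declared type of transa changes later.
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulDescSetAttribute(
        matmul, HIPBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa)));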

const int request_solutions = 1;
hipblasLtMatmulHeuristicResult_t heuristicResult[request_solutions];
int returnedAlgoCount = 0;
CHECK_HIPBLASLT_ERROR(hipblasLtMatmulAlgoGetHeuristic(handle,

@jeffdaily Do we do any autotuning on the hipblasLt path? I am wondering if we need to do autotuning here.
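
For illustration, requesting more than one candidate from the heuristic is what a simple autotuning pass could start from; the layout and preference names below are placeholders, not this PR's code:

    // Ask the heuristic for several candidate algorithms instead of just one; an
    // autotuning pass could time each candidate and cache the fastest, whereas the
    // current code simply takes the first result.
    const int request_solutions = 8;  // illustrative count
    hipblasLtMatmulHeuristicResult_t heuristicResult[request_solutions];
    int returnedAlgoCount = 0;
    CHECK_HIPBLASLT_ERROR(hipblasLtMatmulAlgoGetHeuristic(
        handle, matmul, layoutA, layoutB, layoutC, layoutC,
        pref, request_solutions, heuristicResult, &returnedAlgoCount));
    // heuristicResult[0..returnedAlgoCount-1] hold the library's ranked suggestions.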

import math, pdb
from torch.testing._internal import common_utils

torch.backends.cuda.matmul.allow_tf32 = False

@jeffdaily Is this already False by default on ROCm?
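
For a quick check of the effective default on a given build (purely illustrative):

    import torch

    # Print the current TF32 flags; whether they have any effect on a given
    # ROCm GPU is a separate question, but this shows what the build defaults to.
    print("matmul allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
    print("cudnn  allow_tf32:", torch.backends.cudnn.allow_tf32)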

@eliotwang (Author)

update summary:

  1. Use getCurrentCUDABlasLtHandle to get the hipblasLt handle;
  2. Update the test files to fit the unittest infrastructure and style used by Apex: https://github.com/eliotwang/apex/blob/heyi_fused_grad_accumulation/tests/L0/run_transformer/test_weight_grad.py
    run the test with: python tests/L0/run_test.py --include run_transformer
  3. Replace the custom cosine-similarity check with torch.allclose and set a tolerance threshold for each test case;
  4. Set TENSILE_DB=0x8000 to distinguish which kernel is called through hipblas vs. hipblaslt;
    hipblas:
    Cijk_Ailk_Bjlk_BBS_BH_MT128x64x16_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB512_LPA0_LPB16_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA2_NLCB2_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT8_16_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW4_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGMn16
    hipblaslt:
    Cijk_Ailk_Bjlk_BBS_BH_Bias_AS_SAV_UserArgs_MT64x32x32_MI16x16x1_SN_LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_EPS0_GRVWA8_GRVWB4_GSUAMB_ISA942_IU1_K1_LBSPPA512_LBSPPB256_LBSPPM0_LPA32_LPB16_LPM0_LRVW4_LWPMn1_MIAV0_MIWT2_1_MO40_NTn1_NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS16_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SIA3_SS1_SPO0_SRVW0_SSO0_SVW2_TLDS0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS64_WG32_8_1
  5. Remove dc_tensor and use d_weight as both the input and the output;
  6. Remove the batch_count > 1 case;
  7. Replace sizeof(int32_t) with sizeof(transa);
  8. Rename variables to "gradient of output" and "gradient of weight" to avoid confusion.

else:
    print("========FAIL======")

grad_weight = grad_weight.view(-1)

We don't need to reshape the tensor now as torch.allclose can take 2d tensors.
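
A minimal sketch of the comparison without the reshape (tensor names, shapes, and tolerances are illustrative):

    import torch

    # grad_weight: result from the fused path; ref_grad_weight: reference matmul.
    # Both can stay 2-D; no .view(-1) is needed for torch.allclose.
    grad_weight = torch.randn(256, 512)
    ref_grad_weight = grad_weight + 1e-5 * torch.randn_like(grad_weight)

    assert torch.allclose(grad_weight, ref_grad_weight, rtol=1e-3, atol=1e-3)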

@wenchenvincent

@pruthvistony @jithunnair-amd Do we have any CI setup for Apex?

@pragupta

@pruthvistony @jithunnair-amd Do we have any CI setup for Apex?

@wenchenvincent -- there's no CI for apex.

@pragupta

Regarding the tolerance for the UTs, I looked at the matmul UTs in PyTorch and it seems there are toleranceOverride decorators which define the tolerance level for various operators and dtypes. Please see lines like these in this file: https://github.com/pytorch/pytorch/blob/68272ab5967f448ed6d2986039a0ef0ddf0e1b37/test/test_matmul_cuda.py#L119
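
For reference, this is roughly the pattern those PyTorch tests use; the test body, class name, and tolerance values below are made up for illustration, and the decorators require PyTorch's device-type test framework (instantiate_device_type_tests) rather than plain unittest:

    import torch
    from torch.testing._internal.common_device_type import (
        dtypes,
        instantiate_device_type_tests,
        onlyCUDA,
        tol,
        toleranceOverride,
    )
    from torch.testing._internal.common_utils import TestCase, run_tests


    class TestWeightGradTolerance(TestCase):
        # Loosen the comparison only for the low-precision dtypes; fp32 keeps the
        # framework's default precision for assertEqual.
        @onlyCUDA
        @toleranceOverride({torch.bfloat16: tol(atol=1e-2, rtol=1e-2),
                            torch.float16: tol(atol=1e-3, rtol=1e-3)})
        @dtypes(torch.bfloat16, torch.float16, torch.float32)
        def test_weight_grad_matches_reference(self, device, dtype):
            grad_output = torch.randn(128, 256, device=device, dtype=dtype)
            inputs = torch.randn(128, 512, device=device, dtype=dtype)
            grad_weight = grad_output.t() @ inputs
            reference = (grad_output.float().t() @ inputs.float()).to(dtype)
            # assertEqual picks up the dtype-specific atol/rtol from the
            # toleranceOverride decorator above.
            self.assertEqual(grad_weight, reference)


    instantiate_device_type_tests(TestWeightGradTolerance, globals())

    if __name__ == "__main__":
        run_tests()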
