Use rocprofv2 instead of rocprof. #1672
base: develop
Conversation
Abstract the boilerplate for collecting results from a process.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
Don't run attention in perfRunner.py on gfx110x.
Don't run the CK benchmarking for gfx110x, because ck-benchmark-driver won't compile.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.
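The new flag could be declared along these lines; a minimal argparse sketch, where only the flag name --rocprof-version comes from the PR, and the accepted values and default spelling are assumptions:

```python
import argparse

# Hypothetical sketch of the new option; only the flag name
# "--rocprof-version" is from the PR. Choices and the exact default
# spelling are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--rocprof-version",
                    choices=["1", "2"],
                    default="2",  # the PR changes the default to rocprofv2
                    help="1 = rocprof, 2 = rocprofv2")

args = parser.parse_args([])      # no flag given, so the default applies
print(args.rocprof_version)       # → 2
```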
 MIOPENDRIVER = '/opt/rocm/bin/MIOpenDriver'
-BENCHMARKING_RESULT_FILE_NAME = 'results.stats.csv'
+BENCHMARKING_METRICS_FILE_NAME = 'results.csv'
+BENCHMARKINGV1_RESULT_FILE_NAME = 'results.stats.csv'
Of course things move around for rocprofv2, and there's no way I found to make them the same. In particular, that "pmc_1" directory inserts itself because of either --kernel-trace or the stats in -i, I forget which.
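Since rocprofv2 can nest its CSVs one directory deeper, the lookup might be absorbed with a small search helper. A sketch under that assumption -- the "pmc_1" name comes from the comment above, while the function and its logic are illustrative:

```python
import glob
import os

def find_results_csv(out_dir, filename="results.csv"):
    """Illustrative helper: look for the profiler CSV at the top level
    first, then under the pmc_1 subdirectory that rocprofv2 inserts."""
    direct = os.path.join(out_dir, filename)
    if os.path.exists(direct):
        return direct
    # rocprofv2 output lands one level deeper, e.g. out_dir/pmc_1/results.csv
    matches = glob.glob(os.path.join(out_dir, "pmc_1", filename))
    return matches[0] if matches else None
```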
mlir/utils/performance/perfRunner.py
@@ -129,6 +139,9 @@ def create_paths(config_file_path, mlir_build_dir_path) -> Paths:

 # utility functions.
+def getNanoSeconds(fileName):
+    pass
I'm going to assign V1 or V2 to getNanoSeconds. I don't really need this "pass" implementation.
mlir/utils/performance/perfRunner.py
 if not os.path.exists(fileName):
-    result = "NaN"
-    return result
+    return np.nan
We had not been consistent and used "nan", "NaN", and np.nan in different places.
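The point of standardizing on np.nan is that a float NaN composes with arithmetic and NaN-aware checks, while the strings "nan"/"NaN" need ad-hoc comparisons at every call site. A tiny stand-alone illustration (the function is a stand-in, not the PR's code; float("nan") is used so the sketch stays dependency-free, and np.nan is the same kind of IEEE-754 NaN):

```python
import math

# np.nan is just an IEEE-754 float NaN; float("nan") stands in for it here.
NAN = float("nan")

def get_time_or_nan(path_exists, value=None):
    """Stand-in for the missing-file branch of getNanoSeconds: return one
    consistent float sentinel instead of the strings "nan"/"NaN"."""
    if not path_exists:
        return NAN
    return value

# A float NaN is detectable with math.isnan (or np.isnan); string
# sentinels would silently fail such checks.
print(math.isnan(get_time_or_nan(False)))   # → True
print(get_time_or_nan(False) == "NaN")      # → False: strings don't mix
```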
mlir/utils/performance/perfRunner.py
 with open(fileName, 'r') as csv_file:
     reader = csv.DictReader(csv_file, delimiter=',')
     header = reader.fieldnames
     if 'LDSBankConflict' not in header:
         return np.nan

-    result = []
+    sum = 0
Counting the rows and accumulating the sum, vs accumulating a list and then calling sum and len on it.
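The two styles being compared, sketched on made-up data -- the column name comes from the diff; everything else is illustrative:

```python
import csv
import io

# Made-up CSV standing in for a rocprof results file.
sample = "Name,LDSBankConflict\nk1,3\nk2,5\nk3,4\n"

# Old style: accumulate a list, then call sum() and len() on it.
values = [float(row['LDSBankConflict'])
          for row in csv.DictReader(io.StringIO(sample))]
mean_old = sum(values) / len(values)

# New style: count rows and accumulate the total in one pass, which also
# lets the row count be checked against the expected number of repeats.
total = 0.0
rows = 0
for row in csv.DictReader(io.StringIO(sample)):
    total += float(row['LDSBankConflict'])
    rows += 1
mean_new = total / rows

print(mean_old, mean_new)   # → 4.0 4.0
```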
@@ -209,10 +278,10 @@ def getMilliseconds(output):

     return float(result.group(1))

-def runPipeline(proc_specs):
+def runPipeline(proc_specs, initial_stdin=subprocess.DEVNULL):
"initial_stdin" exists to send some mlir text into the first stage, see below.
@@ -1048,50 +1117,7 @@ def findRunCommand(filename):
     print("WARNING: cannot find valid RUN command in ", filename)
     return None, None

-# Extract testVector and test function name from the test file
-def getFusionTestInfo(filename, paths: Paths):
I had missed getFusionTestInfo when I updated everything to use runPipeline. When I started updating it, I noticed that everything up to the tuningKey process was identical with runFusionKernel, so I abstracted the common part into makeBasicFusionPipeline.
Then I realised that it was doing redundant work, because we'd collect tests and call getFusionTestInfo on them, then loop through the collected tests and call runFusionKernel on those, and that would do the basic pipeline twice. I recast it to do the basics once and save the mlir, and merged the collection and running loops so it doesn't have to save the mlir for very long. More notes below.
def runFusionKernel(mlirfile, rocmlirGenArgs, paths: Paths):
Now takes as input a file of the mlir from the basic-fusion-pipeline.
if not futName:
    print("\tCannot find rocmlir-gen with -fut")
    continue
# Prepare test cases
Merged the two loops into one, with the split-k hack moved up first. Go through each .mlir file, extract the useful RUN: command if present, and run makeBasicFusionPipeline on it to produce the initial mlir code. A temp file holds that code, because I couldn't find a good way to save it as a string and then send it to another pipeline later without having a file.
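The temp-file handoff might look roughly like this; a sketch under the assumptions above, where the helper name and the mlir content are made up:

```python
import tempfile

def save_pipeline_output(mlir_text):
    """Hypothetical helper: park the basic-fusion-pipeline's mlir in a
    named temp file so a later pipeline can consume it as stdin."""
    tmp = tempfile.NamedTemporaryFile(mode="w+", suffix=".mlir", delete=False)
    tmp.write(mlir_text)
    tmp.flush()
    tmp.seek(0)   # rewind so the next consumer reads from the start
    return tmp

tmp = save_pipeline_output("module {}\n")   # placeholder mlir text
print(tmp.read())   # the saved text is ready to feed another pipeline
```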
op = 'gemm'
config = GemmConfiguration.fromCommandLine(commandLine, arch, numCU)

# Find the best perf_config
Look for the perf-config for the kernel configuration, or make a dummy NaN one.
perfResults[testVector] = oneEntry
continue

# Run fusion test
Run the kernel, reusing the mlir in the temp file, and record its time. We anticipate duplicates -- e.g., in the bert tests there are 24 .mlir files but only eight unique kernels -- and just take the best-performing one. Given natural variation, times will be close but the winning file can differ from run to run.
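The keep-the-best policy might reduce to a guarded dictionary update like this; the field names are borrowed from the diff, and the comparison direction (higher TFlops wins) is an assumption:

```python
# Several .mlir files can lower to the same kernel (same testVector key);
# each new timing only replaces the stored entry when it performs better.
# Field names come from the diff; the helper itself is hypothetical.
def record_best(perfResults, testVector, oneEntry):
    best = perfResults.get(testVector)
    if best is None or oneEntry['TFlops'] > best['TFlops']:
        perfResults[testVector] = oneEntry

perfResults = {}
record_best(perfResults, "k0", {'FileName': 'a.mlir', 'TFlops': 9.0})
record_best(perfResults, "k0", {'FileName': 'b.mlir', 'TFlops': 11.0})
record_best(perfResults, "k0", {'FileName': 'c.mlir', 'TFlops': 10.5})
print(perfResults["k0"]['FileName'])   # → b.mlir
```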
oneEntry['Fusion/MLIR'] = oneEntry['TFlops']/oneEntry['MLIR TFlops']
oneEntry['FileName'] = filename
perfResults[testVector] = oneEntry
# Run gemm or conv op with the same configuration
Run generated kernel in the usual way for reference.
-xdlop_supported_gpus_str = xdlop_supported_gpus[0]
-for gpu in xdlop_supported_gpus[1:]:
-    xdlop_supported_gpus_str += '|' + gpu
+xdlop_supported_gpus_str = '|'.join(xdlop_supported_gpus)
Idiom.
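For the record, the two forms produce the same string; the GPU names here are illustrative values, not taken from the PR:

```python
xdlop_supported_gpus = ['gfx908', 'gfx90a', 'gfx940']   # illustrative list

# Old form: seed with the first element and append '|' + gpu in a loop.
looped = xdlop_supported_gpus[0]
for gpu in xdlop_supported_gpus[1:]:
    looped += '|' + gpu

# Idiomatic form from the diff: a single join, same result.
joined = '|'.join(xdlop_supported_gpus)
print(looped == joined, joined)   # → True gfx908|gfx90a|gfx940
```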
p1.kill()
print("MIOpen tuning timed out")
_, errs = p1.communicate()
runPipeline([MIOpenDriverCommand])
Missed another one. Last one, I swear.
@@ -1285,7 +1308,7 @@ def getNumCU(chip):
     rocminfo = subprocess.check_output("/opt/rocm/bin/rocminfo",
                                        stderr=subprocess.PIPE)
 except subprocess.CalledProcessError as e:
-    print(e.stderr.decode('utf-8'))
+    print(f"Process error: {e.stderr.decode('utf-8')}")
Still trying to identify the cause of the intermittent rocminfo failures. Highlight this case in the log file.
mlir/utils/performance/perfRunner.py
parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
Swap functions, options, and filenames based on the option.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##           develop    #1672   +/-   ##
===========================================
- Coverage    78.04%   77.94%   -0.11%
===========================================
  Files           98       98
  Lines        26501    26501
  Branches      3809     3809
===========================================
- Hits         20682    20655     -27
- Misses        4293     4313     +20
- Partials      1526     1533     +7

View full report in Codecov by Sentry.
(Haven't fully read the PR, minor thoughts)
@@ -894,6 +964,7 @@ def benchmarkExternal(cls, commandLine, paths: Paths, arch, numCU):
     benchmarkArgs = config.generateMlirDriverCommandLine("")
     # remove the result file generated by rocprof in previous benchmarking
     os.system("rm -f "+BENCHMARKING_RESULT_FILE_NAME)
+    os.system("rm -f "+BENCHMARKING_METRICS_FILE_NAME)
We can call rm just once.
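That is, both stale files fit in one invocation, roughly like this; the filenames below are stand-ins for the two constants:

```python
import os
import pathlib
import tempfile

# Stand-in files for BENCHMARKING_RESULT_FILE_NAME and
# BENCHMARKING_METRICS_FILE_NAME, created in a scratch directory.
d = pathlib.Path(tempfile.mkdtemp())
results = d / "results.stats.csv"
metrics = d / "results.csv"
results.touch()
metrics.touch()

# One rm call removes both leftovers instead of two separate os.system calls.
os.system(f"rm -f {results} {metrics}")
print(results.exists(), metrics.exists())   # → False False
```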
... why are we keeping rocprof v1 support?
Keeping rocprof V1 in case we have discrepancies to investigate (though since we don't keep much history, that's probably not a big concern) and in case rocprofv2 stops supporting an architecture before we do. Mostly because it was pretty easy to do and might be helpful. |
mlir/utils/performance/perfRunner.py
return result

def getNanoSecondsV2(fileName):
Overall comment: perhaps class Profiler: is in order here, instead of hot-swapping functions onto variables?
Made Profiler classes to handle the V1/V2 switch more cleanly. Made tuningRunner.py use Profiler to get consistent arguments. Some places in tuningRunner.py use runPipeline, some don't yet.
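The shape of that refactor might be, very roughly, the following; the class and member names are guesses based on the discussion, not the PR's actual code:

```python
# Each profiler version bundles its CLI name, extra options, result
# filename, and parsing logic, replacing the global hot-swapping.
# All names here are hypothetical; the real PR's classes may differ.
class Profiler:
    command = None
    options = []
    result_file = None

    def get_nanoseconds(self, file_name):
        raise NotImplementedError

class ProfilerV1(Profiler):
    command = 'rocprof'
    result_file = 'results.stats.csv'

    def get_nanoseconds(self, file_name):
        ...  # parse the rocprof v1 stats CSV

class ProfilerV2(Profiler):
    command = 'rocprofv2'
    options = ['--kernel-trace']
    result_file = 'results.stats.csv'

    def get_nanoseconds(self, file_name):
        ...  # parse the v2 CSV, accounting for MLIR_N_REPEATS

def make_profiler(version: str) -> Profiler:
    return ProfilerV2() if version == '2' else ProfilerV1()

print(make_profiler('2').command)   # → rocprofv2
```

This keeps getNanoSeconds, the result filename, and the rocprof options consistent with each other by construction, which is the advantage over swapping globals.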