add more capture replay controls (#337)

* minor changes to kernel arg maps
* add more capture replay controls
* simplify capture replay controls
* move image metadata capturing
* fix capture replay scripts
* fix CL_PROGRAM_BINARIES query
* verified image capture and playback is working
* fix copyright date after rebase
* fix docs and tidy up a few more things
* remove stale comment
* disable logging in several cases when capture is skipped
  These were a little too verbose in common cases.
* move buffer and image dumping for replay back into replay directory
intel · Feb 18, 2024 · bf71ea5
1 parent ed72581
commit bf71ea5
Showing 8 changed files with 895 additions and 682 deletions.
diff --git a/docs/capture_single_kernels.md b/docs/capture_single_kernels.md
@@ -17,52 +17,61 @@ To replay the captured kernels, you will need the following Python packages:
 
 ## Step by Step for Automatic Capturing
 
-* Set one of the two controls:
-  * `DumpReplayKernelName`, if you want to capture a kernel by its name.
-  * `DumpReplayKernelEnqueue`, if you want to capture a kernel by its enqueue number.
-* Then, simply run the program as usual!
-* Example on Linux: `CLI_DumpReplayKernelName=${NameOfKernel} cliloader /path/to/executable`
+1. Set the top-level control to enable kernel capturing and replay: `CaptureReplay`
+2. Set any additional controls to capture a specific range of kernels, or specific kernel names.  For example:
+    * `CaptureReplayMinEnqueue` and `CaptureReplayMaxEnqueue`, to capture a specific range of kernel enqueues.
+    * `CaptureReplayKernelName`, to capture a specific kernel name.
+    * `CaptureReplayUniqueKernels`, to capture only unique kernel and dispatch parameter combinations.
+    * `CaptureReplayNumKernelEnqueuesSkip`, to skip initial captures.
+    * `CaptureReplayNumKernelEnqueuesCapture`, to capture a limited number of kernel enqueues.
+3. Then, simply run the program as usual!
+
+For more details, please see the Capture and Replay Controls section in the [controls](controls.md) documentation.
 
 ## Step by Step for Automatic Capturing and Validation
 
-* Copy the [capture_and_validate.py](../scripts/capture_and_validate.py) script to the place where you run the app from.
-  * Not strictly necessary, but makes life easier.
-* Run this script with the following arguments:
-  - One of `--num EnqueueNumberToBeCaptured` or `--name NameOfKernelToBeCaptured`
-  - `-cli "/path/to/cliloader"`
-  - `--p "/path/to/program"`
-  - `--a ArgsForProgram`
+Use the [capture_and_validate.py](../scripts/capture_and_validate.py) script to capture a workload and validate that the replayed results match.
+
+Arguments for the capture and validate script are:
 
-Please make sure to follow this order of arguments!
+* `-c` or `--cliloader`: Path to `cliloader`.  This can be a full path, or a relative path, or just `cliloader` if `cliloader` is already in the system path.
+* `-p` or `--program`: The command to execute the program to capture.
+* `-a` or `--args`: Any optional arguments to pass to the program to capture.
+* Either one of:
+    * `-k` or `--kernel_name`: The kernel name to capture.
+    * `-n` or `--enqueue_number`: The enqueue number that should be captured.
 
-This will then run the program using `cliloader` with the given arguments, capture the the specified kernel, and verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
+The capture and validate script will then run the program using `cliloader` with the given arguments to capture the specified kernel or enqueue number.
+The script will then verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
 If the buffers don't agree, it will show a message in the terminal.
 
 ## Supported Features
 
 * OpenCL Buffers
   * These may be aliased, then only one buffer is used.
     * Only true if the buffers use the same memory address, so not when using sub-buffers and having offsets.
-  * `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, nullptr)`.
+  * `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, NULL)`.
   * Device only buffers, i.e. those with `CL_MEM_HOST_NO_ACCESS`.  When kernel capture is enabled, any device-only access flags are removed.
 * OpenCL Images
+  * 2D and 3D images are supported.
 * OpenCL Samplers
-* Build/replay from source
-* Build/replay from a device binary
+* OpenCL Kernels from source or IL
+* OpenCL Kernels from device binary
 
 ## Limitations (incomplete)
 
-* Does not work with OpenCL pipes
-* Untested for out-of-order queues
-* Sub-buffers are not dealt with explicitly, this may affect the results for both debugging and performance
-* The capture and validate script doesn't work with GUI apps
+* Does not work with OpenCL SVM or USM.
+* Does not work with OpenCL pipes.
+* Untested for out-of-order queues.
+* Sub-buffers are not dealt with explicitly; this may affect the results for both debugging and performance.
+* The capture and validate script may not work with some GUI apps.
 
 ## Advice
 
-* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`
-* Minimize usage of other controls, to prevent unexpected behavior.
+* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`.
+* Minimize usage of other controls, to prevent unexpected behavior, however:
   * Consider enabling `InitializeBuffers` for more predictable results between runs.
-  * Only set one of `DumpReplayKernelName` and `DumpReplayKernelEnqueue`.
+* When executing the capture and validate script, consider removing any other kernel captures, or verify that the validate script is using the correct capture.
 * Always make sure to check if your results make sense.
 * For some apps using `cliloader` doesn't work properly.  If this happens for your application, please try other [install](install.md) options.
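
The numbered capture steps in this document can be sketched as a small Python helper that builds the environment for a capture run. This is only a sketch under assumptions: it assumes controls are passed as environment variables with the `CLI_` prefix, as in the older `CLI_DumpReplayKernelName=${NameOfKernel} cliloader /path/to/executable` example, and that boolean controls are enabled with `1`. The helper name `capture_env` and the kernel name `my_kernel` are hypothetical.

```python
import os


def capture_env(controls, base_env=None):
    """Return a copy of the environment with CLI_-prefixed controls added.

    Assumption: controls are read from environment variables named
    CLI_<ControlName>, and booleans are written as 1 or 0.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in controls.items():
        if isinstance(value, bool):
            value = int(value)
        env["CLI_" + name] = str(value)
    return env


# An empty base environment is used here only to keep the printed result short.
env = capture_env(
    {
        "CaptureReplay": True,                       # top-level control
        "CaptureReplayKernelName": "my_kernel",      # hypothetical kernel name
        "CaptureReplayNumKernelEnqueuesCapture": 1,  # capture one enqueue
    },
    base_env={},
)
print(env)
```

For a real run, the result of `capture_env({...})` with the default base environment could be passed as `env=` to `subprocess.run(["cliloader", "/path/to/executable"], ...)`.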
 

diff --git a/docs/controls.md b/docs/controls.md
@@ -477,14 +477,6 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump
 
 If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump kernel ISA binaries for every kernel, if supported.  Currently, kernel ISA binaries are only supported for Intel GPU devices.  Kernel ISA binaries can be decoded into ISA text with a disassembler.  The filename will have the form "CLI\_\<Program Number\>\_\<Unique Program Hash Code\>\_\<Compile Count\>\_\<Unique Build Options Hash Code\>\_\<Device Type\>\_\<Kernel Name\>.isabin".
 
-##### `DumpReplayKernelEnqueue` (int)
-
-If set to a positive value, the Intercept Layer for OpenCL Applications will dump in /Replay/Enqueue\_*/ a standalone (i.e. runs completely independent from the original program from which is was captured) playable set of files for the specified enqueue number which can be used for debugging or profiling. When a program was build from source code, it will dump that one, otherwise it will dump the device binary. It is advised to not use this setting directly, but use /scripts/capture\_and\_validate.py.
-
-##### `DumpReplayKernelName` (string)
-
-If set, the Intercept Layer for OpenCL Applications for dump the specified kernel the first time it is encountered so that it can be replayed independently. It is advised to not use this setting directly, but use /scripts/capture\_and\_validate.py
-
 ### Controls for Emulating Features
 
 ##### `Emulate_cl_khr_extended_versioning` (bool)
@@ -613,6 +605,36 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will try
 
 If set to a nonzero value, the Intercept Layer for OpenCL Applications will try to automatically partition parent devices into sub-devices with the specified number of compute units.
 
+### Capture and Replay Controls
+
+##### `CaptureReplay` (bool)
+
+This is the top-level control for kernel capture and replay.
+
+##### `CaptureReplayMinEnqueue` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the enqueue counter is greater than or equal to this value.
+
+##### `CaptureReplayMaxEnqueue` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will stop kernel capture and replay when the enqueue counter is greater than this value, meaning that only enqueues less than or equal to this value will be captured.
+
+##### `CaptureReplayKernelName` (string)
+
+If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the kernel name equals this name.
+
+##### `CaptureReplayUniqueKernels` (bool)
+
+If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay if the kernel signature (i.e. hash + kernel name) has not been seen already.
+
+##### `CaptureReplayNumKernelEnqueuesSkip` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will skip this many kernel enqueues before enabling kernel capture and replay.
+
+##### `CaptureReplayNumKernelEnqueuesCapture` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will only capture this many kernel enqueues.
+
 ### AubCapture Controls
 
 ##### `AubCapture` (bool)

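One way to read the Capture and Replay Controls added in `docs/controls.md` is as a chain of filters applied to each kernel enqueue. The sketch below is illustrative only: the class `CaptureFilter`, its field names, the default values, and the order in which the checks are applied are all assumptions, not the Intercept Layer's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CaptureFilter:
    # Hypothetical mirror of the documented controls; defaults are assumptions.
    min_enqueue: int = 0               # CaptureReplayMinEnqueue (inclusive)
    max_enqueue: int = 2**32 - 1       # CaptureReplayMaxEnqueue (inclusive)
    kernel_name: str = ""              # CaptureReplayKernelName
    unique_kernels: bool = False       # CaptureReplayUniqueKernels
    num_skip: int = 0                  # CaptureReplayNumKernelEnqueuesSkip
    num_capture: int = 2**32 - 1       # CaptureReplayNumKernelEnqueuesCapture
    _seen: set = field(default_factory=set)
    _skipped: int = 0
    _captured: int = 0

    def should_capture(self, enqueue_counter, name, signature):
        """Return True if this enqueue passes every filter (assumed order)."""
        if not (self.min_enqueue <= enqueue_counter <= self.max_enqueue):
            return False
        if self.kernel_name and name != self.kernel_name:
            return False
        if self.unique_kernels:
            if signature in self._seen:
                return False
            self._seen.add(signature)
        if self._skipped < self.num_skip:
            self._skipped += 1
            return False
        if self._captured >= self.num_capture:
            return False
        self._captured += 1
        return True


flt = CaptureFilter(min_enqueue=2, max_enqueue=5, kernel_name="vadd",
                    num_skip=1, num_capture=2)
captured = [n for n in range(8) if flt.should_capture(n, "vadd", ("hash", "vadd"))]
print(captured)  # prints [3, 4]
```

In this reading, enqueues 0, 1, 6 and 7 fall outside the inclusive range, enqueue 2 is consumed by the skip count, and enqueue 5 is rejected because the capture limit of two has already been reached.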
diff --git a/intercept/scripts/run.py b/intercept/scripts/run.py
@@ -1,7 +1,8 @@
-
+#
 # Copyright (c) 2023-2024 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
+#
 
 import numpy as np
 import pyopencl as cl
@@ -18,9 +19,16 @@ def get_image_metadata(idx: int):
     with open(filename) as metadata:
         lines = metadata.readlines()
 
-    shape = [int(lines[0]),
-                   int(lines[1]),
-                   int(lines[2])]
+    image_type = int(lines[8])
+    if image_type in [cl.mem_object_type.IMAGE1D]:
+        shape = [int(lines[0])]
+    elif image_type in [cl.mem_object_type.IMAGE2D]:
+        shape = [int(lines[0]), int(lines[1])]
+    elif image_type in [cl.mem_object_type.IMAGE3D]:
+        shape = [int(lines[0]), int(lines[1]), int(lines[2])]
+    else:
+        print('Unsupported image type for playback!')
+        shape = [int(lines[0]), int(lines[1]), int(lines[2])]
 
     format = cl.ImageFormat(int(lines[7]), int(lines[6]))
     return format, shape
@@ -42,6 +50,12 @@ def sampler_from_string(ctx, sampler_descr):
                     help='How often the kernel should be enqueued')
 args = parser.parse_args()
 
+# Read the enqueue number from the file
+with open('./enqueueNumber.txt') as file:
+    enqueue_number = file.read().splitlines()[0]
+
+padded_enqueue_num = str(enqueue_number).rjust(4, "0")
+
 arguments = {}
 argument_files = gl.glob("./Argument*.bin")
 for argument in argument_files:
@@ -51,10 +65,11 @@ def sampler_from_string(ctx, sampler_descr):
 buffer_idx = []
 input_buffers = {}
 output_buffers = {}
-buffer_files = gl.glob("./Buffer*.bin")
+buffer_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.bin")
 input_buffer_ptrs = defaultdict(list)
 for buffer in buffer_files:
-    idx = int(re.findall(r'\d+', buffer)[0])
+    start = buffer.find("_Arg_")
+    idx = int(re.findall(r'\d+', buffer[start:])[0])
     buffer_idx.append(idx)
     input_buffers[idx] = np.fromfile(buffer, dtype='uint8').tobytes()
     input_buffer_ptrs[arguments[idx]].append(idx)
@@ -63,10 +78,11 @@ def sampler_from_string(ctx, sampler_descr):
 image_idx = []
 input_images = {}
 output_images = {}
-image_files = gl.glob("./Image*.raw")
+image_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.raw")
 input_images_ptrs = defaultdict(list)
 for image in image_files:
-    idx = int(re.findall(r'\d+', image)[0])
+    start = image.find("_Arg_")
+    idx = int(re.findall(r'\d+', image[start:])[0])
     image_idx.append(idx)
     input_images[idx] = np.fromfile(image, dtype='uint8').tobytes()
     input_images_ptrs[arguments[idx]].append(idx)
@@ -86,13 +102,12 @@ def sampler_from_string(ctx, sampler_descr):
 
 # Check if all input pointer addresses are unique
 if len(tmp_args) != len(set(tmp_args)):
-    print("Some of the buffers are aliasing, we will replicate this behavior")
+    print("Some of the buffers are aliasing, we will replicate this behavior.")
 
 ctx = cl.create_some_context()
 queue = cl.CommandQueue(ctx)
 devices = ctx.get_info(cl.context_info.DEVICES)
 
-# TODO Samplers
 samplers = {}
 sampler_files = gl.glob("./Sampler*.txt")
 for sampler in sampler_files:
@@ -120,19 +135,19 @@ def sampler_from_string(ctx, sampler_descr):
     gpu_images[idx] = cl.Image(ctx, mf.COPY_HOST_PTR, format, shape, hostbuf=input_images[idx])
 
 with open("buildOptions.txt", 'r') as file:
-    flags = [line.rstrip() for line in file]
-    print(f"Using flags: {flags}")
+    options = [line.rstrip() for line in file]
+    print(f"Using build options: {options}")
 
-with open('knlName.txt') as file:
-        knl_name = file.read()
+with open('kernelName.txt') as file:
+    kernel_name = file.read()
 
 if os.path.isfile("kernel.cl"):
-    print("Using kernel source code")
+    print("Using kernel source")
     with open("kernel.cl", 'r') as file:
         kernel = file.read()
-    prg = cl.Program(ctx, kernel).build(flags)
+    prg = cl.Program(ctx, kernel).build(options)
 else:
-    print("Using device binary")
+    print("Using kernel device binary")
     binary_files = gl.glob("./DeviceBinary*.bin")
     binaries = []
     for file in binary_files:
@@ -141,50 +156,49 @@ def sampler_from_string(ctx, sampler_descr):
     # Try the binaries to find one that works
     for idx in range(len(binaries)):
         try:
-            prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(flags)
-            getattr(prg, knl_name)
+            prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(options)
+            getattr(prg, kernel_name)
             break
         except Exception as e:
             pass
 
-knl = getattr(prg, knl_name)
+kernel = getattr(prg, kernel_name)
 for pos, argument in arguments.items():
-    knl.set_arg(pos, argument)
+    kernel.set_arg(pos, argument)
 
 for pos, buffer in gpu_buffers.items():
     for idx in pos:
-        knl.set_arg(idx, buffer)
+        kernel.set_arg(idx, buffer)
 
 for pos, image in gpu_images.items():
-    knl.set_arg(pos, image)
+    kernel.set_arg(pos, image)
 
 for pos, size in local_sizes.items():
-    knl.set_arg(pos, cl.LocalMemory(size))
+    kernel.set_arg(pos, cl.LocalMemory(size))
 
 for pos, sampler in samplers.items():
-    knl.set_arg(pos, sampler)
+    kernel.set_arg(pos, sampler)
 
 gws = []
 lws = []
-gws_offset = []
+gwo = []
 
 with open("worksizes.txt", 'r') as file:
     lines = file.read().splitlines()
 
 gws.extend([int(value) for value in lines[0].split()])
 lws.extend([int(value) for value in lines[1].split()])
-gws_offset.extend([int(value) for value in lines[2].split()])
+gwo.extend([int(value) for value in lines[2].split()])
 
-print(f"Global Worksize: {gws}")
-print(f"Local Worksize: {lws}")
-print(f"Global Worksize Offsets: {gws_offset}")
+print(f"Global Work Size: {gws}")
+print(f"Local Work Size: {lws}")
+print(f"Global Work Offsets: {gwo}")
 
 if lws == [0] or lws == [0, 0] or lws == [0, 0, 0]:
     lws = None
 
 for _ in range(args.repetitions):
-    cl.enqueue_nd_range_kernel(queue, knl, gws, lws, gws_offset)
-
+    cl.enqueue_nd_range_kernel(queue, kernel, gws, lws, gwo)
 
 for pos in gpu_buffers.keys():
     if len(pos) == 1:
@@ -196,9 +210,15 @@ def sampler_from_string(ctx, sampler_descr):
 for pos in gpu_images.keys():
     cl.enqueue_copy(queue, output_images[pos], gpu_images[pos], region=shape, origin=(0,0,0))
 
+if not os.path.exists("./Test"):
+    os.makedirs("./Test")
+
 for pos, cpu_buffer in output_buffers.items():
-    cpu_buffer.tofile("output_buffer" + str(pos) + ".bin")
+    outbuf = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Buffer.bin"
+    print(f"Writing buffer output to file: {outbuf}")
+    cpu_buffer.tofile(outbuf)
 
 for pos, cpu_image in output_images.items():
-    cpu_image.tofile("output_image" + str(pos) + ".raw")
-
+    outimg = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Image.raw"
+    print(f"Writing image output to file: {outimg}")
+    cpu_image.tofile(outimg)
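
The change to how `run.py` extracts the argument index is easiest to see on a concrete file name. The sketch below assumes that input dumps in `./Pre` follow the same naming pattern the script uses for its `./Test` output (`Enqueue_<padded number>_Kernel_<name>_Arg_<index>_Buffer.bin`); the helper names `padded_enqueue` and `arg_index` and the kernel name `vadd2` are hypothetical.

```python
import re


def padded_enqueue(enqueue_number):
    """Zero-pad the enqueue number to four characters, as run.py does."""
    return str(enqueue_number).rjust(4, "0")


def arg_index(path):
    """Return the argument index that follows the "_Arg_" marker in a file name."""
    start = path.find("_Arg_")
    if start < 0:
        raise ValueError(f"no _Arg_ marker in {path!r}")
    # Search only from the marker onward, so digits in the enqueue number
    # or in the kernel name are not mistaken for the argument index.
    return int(re.findall(r"\d+", path[start:])[0])


name = "./Pre/Enqueue_" + padded_enqueue(12) + "_Kernel_vadd2_Arg_3_Buffer.bin"

# Taking the first run of digits in the whole path picks up the enqueue number.
naive = int(re.findall(r"\d+", name)[0])

print(naive, arg_index(name))  # prints: 12 3
```

Searching from `_Arg_` matters because the new file names carry the enqueue number first, and kernel names may themselves contain digits.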