add more capture replay controls (#337)
* minor changes to kernel arg maps

* add more capture replay controls

* simplify capture replay controls

* move image metadata capturing

* fix capture replay scripts

* fix CL_PROGRAM_BINARIES query

* verified image capture and playback is working

* fix copyright date after rebase

* fix docs and tidy up a few more things

* remove stale comment

* disable logging in several cases when capture is skipped

These were a little too verbose in common cases.

* move buffer and image dumping for replay back into replay directory
bashbaug authored Feb 18, 2024
1 parent ed72581 commit bf71ea5
Showing 8 changed files with 895 additions and 682 deletions.
57 changes: 33 additions & 24 deletions docs/capture_single_kernels.md
@@ -17,52 +17,61 @@ To replay the captured kernels, you will need the following Python packages:

## Step by Step for Automatic Capturing

* Set one of the two controls:
* `DumpReplayKernelName`, if you want to capture a kernel by its name.
* `DumpReplayKernelEnqueue`, if you want to capture a kernel by its enqueue number.
* Then, simply run the program as usual!
* Example on Linux: `CLI_DumpReplayKernelName=${NameOfKernel} cliloader /path/to/executable`
1. Set the top-level control to enable kernel capturing and replay: `CaptureReplay`
2. Set any additional controls to capture a specific range of kernels, or specific kernel names. For example:
* `CaptureReplayMinEnqueue` and `CaptureReplayMaxEnqueue`, to capture a specific range of kernel enqueues.
* `CaptureReplayKernelName`, to capture a specific kernel name.
* `CaptureReplayUniqueKernels`, to capture only unique kernel and dispatch parameter combinations.
* `CaptureReplayNumKernelEnqueuesSkip`, to skip initial captures.
* `CaptureReplayNumKernelEnqueuesCapture`, to capture a limited number of kernel enqueues.
3. Then, simply run the program as usual!

For more details, please see the Capture and Replay Controls section in the [controls](controls.md) documentation.
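The steps above can be sketched in Python. This is a minimal illustration, not part of the commit: it assumes the controls are passed as environment variables with the `CLI_` prefix (as in the earlier `CLI_DumpReplayKernelName=...` example) and that `cliloader` is on the system path. The helper names `capture_env` and `run_with_capture` are hypothetical.

```python
import os
import subprocess


def capture_env(controls, base_env=None):
    """Return a copy of the environment with each control exported as CLI_<Name>.

    Assumption: controls are read from CLI_-prefixed environment variables,
    and boolean controls are enabled with the value 1.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in controls.items():
        if isinstance(value, bool):
            value = int(value)
        env["CLI_" + name] = str(value)
    return env


def run_with_capture(program_argv, controls, cliloader="cliloader"):
    """Run the program under cliloader with the capture controls set (hypothetical helper)."""
    return subprocess.run([cliloader, *program_argv], env=capture_env(controls), check=False)
```

For example, `run_with_capture(["/path/to/executable"], {"CaptureReplay": True, "CaptureReplayKernelName": "NameOfKernel"})` would enable capturing for a single kernel name.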

## Step by Step for Automatic Capturing and Validation

* Copy the [capture_and_validate.py](../scripts/capture_and_validate.py) script to the place where you run the app from.
* Not strictly necessary, but makes life easier.
* Run this script with the following arguments:
- One of `--num EnqueueNumberToBeCaptured` or `--name NameOfKernelToBeCaptured`
- `-cli "/path/to/cliloader"`
- `--p "/path/to/program"`
- `--a ArgsForProgram`
Use the [capture_and_validate.py](../scripts/capture_and_validate.py) script to capture a workload and validate that the replayed results match.

Arguments for the capture and validate script are:

Please make sure to follow this order of arguments!
* `-c` or `--cliloader`: Path to `cliloader`. This can be a full path, or a relative path, or just `cliloader` if `cliloader` is already in the system path.
* `-p` or `--program`: The command to execute the program to capture.
* `-a` or `--args`: Any optional arguments to pass to the program to capture.
* Either one of:
* `-k` or `--kernel_name`: The kernel name to capture.
* `-n` or `--enqueue_number`: The enqueue number that should be captured.

This will then run the program using `cliloader` with the given arguments, capture the specified kernel, and verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
The capture and validate script will then run the program using `cliloader` with the given arguments to capture the specified kernel or enqueue number.
The script will then verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
If the buffers don't agree, it will show a message in the terminal.
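As a rough sketch of how the script might be invoked from Python, the helper below assembles a command line from the long option names listed above. The helper name `build_validate_command` is hypothetical, and two details are assumptions rather than documented behavior: that `--args` accepts the program arguments as the values following it, and that placing `--args` last is safe.

```python
import sys


def build_validate_command(program, kernel_name=None, enqueue_number=None,
                           cliloader="cliloader", program_args=(),
                           script="capture_and_validate.py"):
    """Assemble an argv list for capture_and_validate.py (hypothetical helper)."""
    # The documentation asks for either a kernel name or an enqueue number.
    if (kernel_name is None) == (enqueue_number is None):
        raise ValueError("pass exactly one of kernel_name or enqueue_number")

    cmd = [sys.executable, script, "--cliloader", cliloader, "--program", program]
    if kernel_name is not None:
        cmd += ["--kernel_name", kernel_name]
    else:
        cmd += ["--enqueue_number", str(enqueue_number)]
    if program_args:
        # Assumption: --args takes the program arguments as trailing values.
        cmd += ["--args", *program_args]
    return cmd
```

The resulting list can be passed to `subprocess.run`, which avoids shell quoting issues for paths that contain spaces.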

## Supported Features

* OpenCL Buffers
* These may be aliased, then only one buffer is used.
* Only true if the buffers use the same memory address, so not when using sub-buffers and having offsets.
* `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, nullptr)`.
* `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, NULL)`.
* Device only buffers, i.e. those with `CL_MEM_HOST_NO_ACCESS`. When kernel capture is enabled, any device-only access flags are removed.
* OpenCL Images
* 2D, and 3D images are supported.
* OpenCL Samplers
* Build/replay from source
* Build/replay from a device binary
* OpenCL Kernels from source or IL
* OpenCL Kernels from device binary

## Limitations (incomplete)

* Does not work with OpenCL pipes
* Untested for out-of-order queues
* Sub-buffers are not dealt with explicitly, this may affect the results for both debugging and performance
* The capture and validate script doesn't work with GUI apps
* Does not work with OpenCL SVM or USM.
* Does not work with OpenCL pipes.
* Untested for out-of-order queues.
* Sub-buffers are not dealt with explicitly; this may affect the results for both debugging and performance.
* The capture and validate script may not work with some GUI apps.

## Advice

* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`
* Minimize usage of other controls, to prevent unexpected behavior.
* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`.
* Minimize usage of other controls, to prevent unexpected behavior, however:
* Consider enabling `InitializeBuffers` for more predictable results between runs.
* Only set one of `DumpReplayKernelName` and `DumpReplayKernelEnqueue`.
* When executing the capture and validate script consider removing any other kernel captures, or verifying that the validate script is using the correct capture.
* Always make sure to check if your results make sense.
* For some apps using `cliloader` doesn't work properly. If this happens for your application, please try other [install](install.md) options.

38 changes: 30 additions & 8 deletions docs/controls.md
@@ -477,14 +477,6 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump

If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump kernel ISA binaries for every kernel, if supported. Currently, kernel ISA binaries are only supported for Intel GPU devices. Kernel ISA binaries can be decoded into ISA text with a disassembler. The filename will have the form "CLI\_\<Program Number\>\_\<Unique Program Hash Code\>\_\<Compile Count\>\_\<Unique Build Options Hash Code\>\_\<Device Type\>\_\<Kernel Name\>.isabin".

##### `DumpReplayKernelEnqueue` (int)

If set to a positive value, the Intercept Layer for OpenCL Applications will dump in /Replay/Enqueue\_*/ a standalone (i.e. runs completely independently of the original program from which it was captured) playable set of files for the specified enqueue number, which can be used for debugging or profiling. When a program was built from source code, the source will be dumped; otherwise the device binary will be dumped. It is advised to not use this setting directly, but to use /scripts/capture\_and\_validate.py.

##### `DumpReplayKernelName` (string)

If set, the Intercept Layer for OpenCL Applications will dump the specified kernel the first time it is encountered so that it can be replayed independently. It is advised to not use this setting directly, but to use /scripts/capture\_and\_validate.py.

### Controls for Emulating Features

##### `Emulate_cl_khr_extended_versioning` (bool)
@@ -613,6 +605,36 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will try

If set to a nonzero value, the Intercept Layer for OpenCL Applications will try to automatically partition parent devices into sub-devices with the specified number of compute units.

### Capture and Replay Controls

##### `CaptureReplay` (bool)

This is the top-level control for kernel capture and replay.

##### `CaptureReplayMinEnqueue` (cl_uint)

The Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the enqueue counter is greater than or equal to this value.

##### `CaptureReplayMaxEnqueue` (cl_uint)

The Intercept Layer for OpenCL Applications will stop kernel capture and replay when the enqueue counter is greater than this value, meaning that only enqueues less than or equal to this value will be captured.

##### `CaptureReplayKernelName` (string)

If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the kernel name equals this name.

##### `CaptureReplayUniqueKernels` (bool)

If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay if the kernel signature (i.e. hash + kernelname) has not been seen already.

##### `CaptureReplayNumKernelEnqueuesSkip` (cl_uint)

The Intercept Layer for OpenCL Applications will skip this many kernel enqueues before enabling kernel capture and replay.

##### `CaptureReplayNumKernelEnqueuesCapture` (cl_uint)

The Intercept Layer for OpenCL Applications will only capture this many kernel enqueues.
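One plausible reading of how these controls combine is sketched below. This is not the Intercept Layer's implementation: the class name `CaptureFilter`, the default values, the order of the checks, and the choice to apply the skip and capture counts only to enqueues that pass the other filters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CaptureFilter:
    """Hypothetical model of the Capture and Replay controls (defaults are assumed)."""
    enabled: bool = False            # CaptureReplay
    min_enqueue: int = 0             # CaptureReplayMinEnqueue (inclusive)
    max_enqueue: int = 2**32 - 1     # CaptureReplayMaxEnqueue (inclusive)
    kernel_name: str = ""            # CaptureReplayKernelName (empty means any)
    unique_kernels: bool = False     # CaptureReplayUniqueKernels
    num_skip: int = 0                # CaptureReplayNumKernelEnqueuesSkip
    num_capture: int = 2**32 - 1     # CaptureReplayNumKernelEnqueuesCapture
    _seen: set = field(default_factory=set)
    _matched: int = 0

    def should_capture(self, enqueue_counter, kernel_name, program_hash):
        if not self.enabled:
            return False
        if not (self.min_enqueue <= enqueue_counter <= self.max_enqueue):
            return False
        if self.kernel_name and kernel_name != self.kernel_name:
            return False
        if self.unique_kernels:
            # The kernel signature is described as hash + kernel name.
            signature = (program_hash, kernel_name)
            if signature in self._seen:
                return False
            self._seen.add(signature)
        # Assumption: skip/capture counts apply to enqueues passing the filters above.
        self._matched += 1
        if self._matched <= self.num_skip:
            return False
        return self._matched <= self.num_skip + self.num_capture
```

With `min_enqueue=2` and `max_enqueue=3`, this model captures exactly enqueues 2 and 3, matching the inclusive bounds described above.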

### AubCapture Controls

##### `AubCapture` (bool)
90 changes: 55 additions & 35 deletions intercept/scripts/run.py
@@ -1,7 +1,8 @@

#
# Copyright (c) 2023-2024 Intel Corporation
#
# SPDX-License-Identifier: MIT
#

import numpy as np
import pyopencl as cl
@@ -18,9 +19,16 @@ def get_image_metadata(idx: int):
with open(filename) as metadata:
lines = metadata.readlines()

shape = [int(lines[0]),
int(lines[1]),
int(lines[2])]
image_type = int(lines[8])
if image_type in [cl.mem_object_type.IMAGE1D]:
shape = [int(lines[0])]
elif image_type in [cl.mem_object_type.IMAGE2D]:
shape = [int(lines[0]), int(lines[1])]
elif image_type in [cl.mem_object_type.IMAGE3D]:
shape = [int(lines[0]), int(lines[1]), int(lines[2])]
else:
print('Unsupported image type for playback!')
shape = [int(lines[0]), int(lines[1]), int(lines[2])]

format = cl.ImageFormat(int(lines[7]), int(lines[6]))
return format, shape
@@ -42,6 +50,12 @@ def sampler_from_string(ctx, sampler_descr):
help='How often the kernel should be enqueued')
args = parser.parse_args()

# Read the enqueue number from the file
with open('./enqueueNumber.txt') as file:
enqueue_number = file.read().splitlines()[0]

padded_enqueue_num = str(enqueue_number).rjust(4, "0")

arguments = {}
argument_files = gl.glob("./Argument*.bin")
for argument in argument_files:
@@ -51,10 +65,11 @@ def sampler_from_string(ctx, sampler_descr):
buffer_idx = []
input_buffers = {}
output_buffers = {}
buffer_files = gl.glob("./Buffer*.bin")
buffer_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.bin")
input_buffer_ptrs = defaultdict(list)
for buffer in buffer_files:
idx = int(re.findall(r'\d+', buffer)[0])
start = buffer.find("_Arg_")
idx = int(re.findall(r'\d+', buffer[start:])[0])
buffer_idx.append(idx)
input_buffers[idx] = np.fromfile(buffer, dtype='uint8').tobytes()
input_buffer_ptrs[arguments[idx]].append(idx)
@@ -63,10 +78,11 @@ def sampler_from_string(ctx, sampler_descr):
image_idx = []
input_images = {}
output_images = {}
image_files = gl.glob("./Image*.raw")
image_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.raw")
input_images_ptrs = defaultdict(list)
for image in image_files:
idx = int(re.findall(r'\d+', image)[0])
start = image.find("_Arg_")
idx = int(re.findall(r'\d+', image[start:])[0])
image_idx.append(idx)
input_images[idx] = np.fromfile(image, dtype='uint8').tobytes()
input_images_ptrs[arguments[idx]].append(idx)
@@ -86,13 +102,12 @@ def sampler_from_string(ctx, sampler_descr):

# Check if all input pointer addresses are unique
if len(tmp_args) != len(set(tmp_args)):
print("Some of the buffers are aliasing, we will replicate this behavior")
print("Some of the buffers are aliasing, we will replicate this behavior.")

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
devices = ctx.get_info(cl.context_info.DEVICES)

# TODO Samplers
samplers = {}
sampler_files = gl.glob("./Sampler*.txt")
for sampler in sampler_files:
@@ -120,19 +135,19 @@ def sampler_from_string(ctx, sampler_descr):
gpu_images[idx] = cl.Image(ctx, mf.COPY_HOST_PTR, format, shape, hostbuf=input_images[idx])

with open("buildOptions.txt", 'r') as file:
flags = [line.rstrip() for line in file]
print(f"Using flags: {flags}")
options = [line.rstrip() for line in file]
print(f"Using build options: {options}")

with open('knlName.txt') as file:
knl_name = file.read()
with open('kernelName.txt') as file:
kernel_name = file.read()

if os.path.isfile("kernel.cl"):
print("Using kernel source code")
print("Using kernel source")
with open("kernel.cl", 'r') as file:
kernel = file.read()
prg = cl.Program(ctx, kernel).build(flags)
prg = cl.Program(ctx, kernel).build(options)
else:
print("Using device binary")
print("Using kernel device binary")
binary_files = gl.glob("./DeviceBinary*.bin")
binaries = []
for file in binary_files:
@@ -141,50 +156,49 @@ def sampler_from_string(ctx, sampler_descr):
# Try the binaries to find one that works
for idx in range(len(binaries)):
try:
prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(flags)
getattr(prg, knl_name)
prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(options)
getattr(prg, kernel_name)
break
except Exception as e:
pass

knl = getattr(prg, knl_name)
kernel = getattr(prg, kernel_name)
for pos, argument in arguments.items():
knl.set_arg(pos, argument)
kernel.set_arg(pos, argument)

for pos, buffer in gpu_buffers.items():
for idx in pos:
knl.set_arg(idx, buffer)
kernel.set_arg(idx, buffer)

for pos, image in gpu_images.items():
knl.set_arg(pos, image)
kernel.set_arg(pos, image)

for pos, size in local_sizes.items():
knl.set_arg(pos, cl.LocalMemory(size))
kernel.set_arg(pos, cl.LocalMemory(size))

for pos, sampler in samplers.items():
knl.set_arg(pos, sampler)
kernel.set_arg(pos, sampler)

gws = []
lws = []
gws_offset = []
gwo = []

with open("worksizes.txt", 'r') as file:
lines = file.read().splitlines()

gws.extend([int(value) for value in lines[0].split()])
lws.extend([int(value) for value in lines[1].split()])
gws_offset.extend([int(value) for value in lines[2].split()])
gwo.extend([int(value) for value in lines[2].split()])

print(f"Global Worksize: {gws}")
print(f"Local Worksize: {lws}")
print(f"Global Worksize Offsets: {gws_offset}")
print(f"Global Work Size: {gws}")
print(f"Local Work Size: {lws}")
print(f"Global Work Offsets: {gwo}")

if lws == [0] or lws == [0, 0] or lws == [0, 0, 0]:
lws = None

for _ in range(args.repetitions):
cl.enqueue_nd_range_kernel(queue, knl, gws, lws, gws_offset)

cl.enqueue_nd_range_kernel(queue, kernel, gws, lws, gwo)

for pos in gpu_buffers.keys():
if len(pos) == 1:
@@ -196,9 +210,15 @@ def sampler_from_string(ctx, sampler_descr):
for pos in gpu_images.keys():
cl.enqueue_copy(queue, output_images[pos], gpu_images[pos], region=shape, origin=(0,0,0))

if not os.path.exists("./Test"):
os.makedirs("./Test")

for pos, cpu_buffer in output_buffers.items():
cpu_buffer.tofile("output_buffer" + str(pos) + ".bin")
outbuf = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Buffer.bin"
print(f"Writing buffer output to file: {outbuf}")
cpu_buffer.tofile(outbuf)

for pos, cpu_image in output_images.items():
cpu_image.tofile("output_image" + str(pos) + ".raw")

outimg = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Image.raw"
print(f"Writing image output to file: {outimg}")
cpu_image.tofile(outimg)
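
The file name handling in this change can be illustrated in isolation. The sketch below mirrors the `_Arg_` index parsing and the `./Test` output naming from the diff; the helper names `arg_index` and `replay_output_name` are hypothetical, and the sample input path assumes that the files in `./Pre` follow the same naming pattern as the files written to `./Test`.

```python
import re


def arg_index(path):
    """Extract the kernel argument index that follows the "_Arg_" marker."""
    start = path.find("_Arg_")
    if start < 0:
        raise ValueError(f"no _Arg_ marker in {path!r}")
    return int(re.findall(r"\d+", path[start:])[0])


def replay_output_name(enqueue_number, kernel_name, pos, kind="Buffer"):
    """Build the ./Test output file name used for replayed buffers and images."""
    padded = str(enqueue_number).rjust(4, "0")
    ext = "bin" if kind == "Buffer" else "raw"
    return f"./Test/Enqueue_{padded}_Kernel_{kernel_name}_Arg_{pos}_{kind}.{ext}"


# Assumed example name; the first number in the path is the enqueue number,
# which is why the index search starts at the "_Arg_" marker.
sample = "./Pre/Enqueue_0007_Kernel_vec4_add_Arg_2_Buffer.bin"
first_number = int(re.findall(r"\d+", sample)[0])
```

Here `first_number` is the enqueue number rather than the argument index, whereas `arg_index(sample)` skips both the enqueue number and any digits in the kernel name.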