Scheduling Tips
Derek Gerstmann edited this page Mar 1, 2024 · 2 revisions
The following table outlines all the scheduling directives that can be combined and applied to a pipeline stage:
Scheduling Directive | Description | Typical Usage |
---|---|---|
compute_root() | Compute all of this function once ahead of time. Equivalent to `compute_at(LoopLevel::root())` | ... |
compute_at() | Schedule a function to be computed within the iteration over a given LoopLevel. | ... |
compute_with() | Schedule the iteration over this stage to be fused with another stage 's' from outermost loop to a given LoopLevel. | ... |
memoize() | Cache a computed version of this function across invocations of the Func. | ... |
async() | Produce this Func asynchronously in a separate thread. | ... |
split() | Split a dimension into inner and outer sub-dimensions with the given names, where the inner dimension iterates from 0 to factor-1. | ... |
fuse() | Join two dimensions into a single fused dimension. The fused dimension covers the product of the extents of the inner and outer dimensions given. | ... |
serial() | Mark a dimension to be traversed serially. This is the default. | ... |
parallel() | Mark a dimension to be traversed in parallel. | ... |
vectorize() | Mark a dimension to be computed all-at-once as a single vector. The dimension should have constant extent - e.g. because it is the inner dimension following a split by a constant factor. | ... |
unroll() | Mark a dimension to be completely unrolled. The dimension should have constant extent - e.g. because it is the inner dimension following a split by a constant factor. | ... |
partition() | Set the loop partition policy. Loop partitioning can be useful to optimize boundary conditions (such as clamp_edge). Loop partitioning splits a for loop into three for loops: a prologue, a steady-state, and an epilogue. The default policy is Auto. | ... |
never_partition() | Set the loop partition policy to Never for some number of Vars and RVars. | ... |
always_partition() | Set the loop partition policy to Always for a vector of Vars and RVars. | ... |
bound() | Statically declare that the range over which a function should be evaluated is given by the second and third arguments. This can let Halide perform some optimizations. | ... |
tile() | Split two dimensions at once by the given factors, and then reorder the resulting dimensions to be xi, yi, xo, yo from innermost outwards. This gives a tiled traversal. | ... |
reorder() | Reorder variables to have the given nesting order, from innermost out. | ... |
rename() | Rename a dimension. Equivalent to split with an inner size of one. | ... |
atomic() | Issue atomic updates for this Func. This allows parallelization on associative RVars. | ... |
specialize() | Specialize a Func. This creates a special-case version of the Func where the given condition is true. The most effective conditions are those of the form param == value, and boolean Params. | ... |
gpu_threads() | Tell Halide that the following dimensions correspond to GPU thread indices. This is useful if you compute a producer function within the block indices of a consumer function, and want to control how that function's dimensions map to GPU threads. If the selected target is not an appropriate GPU, this just marks those dimensions as parallel. | ... |
gpu_lanes() | The given dimension corresponds to the lanes in a GPU warp. GPU warp lanes are distinguished from GPU threads by the fact that all warp lanes run together in lockstep, which permits lightweight communication of data from one lane to another. | ... |
gpu_single_thread() | Tell Halide to run this stage using a single GPU thread and block. This is not an efficient use of your GPU, but it can be useful to avoid copy-back for intermediate update stages that touch a very small part of your Func. | ... |
gpu_blocks() | Tell Halide that the following dimensions correspond to GPU block indices. This is useful for scheduling stages that will run serially within each GPU block. If the selected target is not ptx, this just marks those dimensions as parallel. | ... |
gpu() | Tell Halide that the following dimensions correspond to GPU block indices and thread indices. If the selected target is not ptx, these just mark the given dimensions as parallel. The dimensions are consumed by this call, so do all other unrolling, reordering, etc first. | ... |
gpu_tile() | Short-hand for tiling a domain and mapping the tile indices to GPU block indices and the coordinates within each tile to GPU thread indices. Consumes the variables given, so do all other scheduling first. | ... |
allow_race_conditions() | Specify that race conditions are permitted for this Func, which enables parallelizing over RVars even when Halide cannot prove that it is safe to do so. Use this with great caution, and only if you can prove to yourself that this is safe, as it may result in a non-deterministic routine that returns different values at different times or on different machines. | ... |
hexagon() | Schedule for execution on Hexagon. When a loop is marked with Hexagon, that loop is executed on a Hexagon DSP. | ... |
prefetch() | Prefetch data written to or read from a Func or an ImageParam by a subsequent loop iteration, at an optionally specified iteration offset. | ... |
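
The loop nest that `split()` and `tile()` describe can be illustrated without Halide itself. The following is a minimal plain C++ sketch, not Halide code: the function `tiled_order` and its parameters are hypothetical names introduced for illustration, and it assumes both extents are exact multiples of the tile factors (tail handling is not modelled). It records the order in which a tiled traversal visits coordinates, with xi innermost, then yi, xo, and yo outermost.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Halide): returns the (x, y) coordinates
// in the order a tiled loop nest would visit them.
// Assumption: width % tx == 0 and height % ty == 0.
std::vector<std::pair<int, int>> tiled_order(int width, int height, int tx, int ty) {
    std::vector<std::pair<int, int>> order;
    order.reserve(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    for (int yo = 0; yo < height / ty; ++yo) {          // outermost loop
        for (int xo = 0; xo < width / tx; ++xo) {
            for (int yi = 0; yi < ty; ++yi) {           // inner loops run from 0 to factor-1
                for (int xi = 0; xi < tx; ++xi) {       // innermost loop
                    // Recombine the outer and inner indices into the original coordinate.
                    order.emplace_back(xo * tx + xi, yo * ty + yi);
                }
            }
        }
    }
    return order;
}
```

Calling `tiled_order(4, 4, 2, 2)` walks each 2x2 tile completely before moving on to the next tile in the row, which is the locality benefit a tiled traversal aims for.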
This page is meant to be a set of recipes and approaches to use when scheduling Halide pipelines. Several of these methods are covered in the tutorials.
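
As one small illustration of the directives in the table, the index arithmetic behind `fuse()` can be sketched in plain C++. This is not Halide code: `Unfused`, `unfuse`, and `fused_extent` are hypothetical names, and the sketch assumes both dimensions start at 0. The fused dimension runs over the product of the two extents, and each fused index maps back to exactly one (inner, outer) pair.

```cpp
// Hypothetical illustration (not part of the Halide API).
// Assumption: both original dimensions have a minimum of 0.
struct Unfused {
    int inner;
    int outer;
};

// Extent of the fused dimension: the product of the two original extents.
int fused_extent(int inner_extent, int outer_extent) {
    return inner_extent * outer_extent;
}

// Recover the original (inner, outer) indices from a fused index.
Unfused unfuse(int fused, int inner_extent) {
    return Unfused{fused % inner_extent, fused / inner_extent};
}
```

With an inner extent of 4 and an outer extent of 3, the fused dimension has 12 iterations, and fused index 5 corresponds to inner index 1 and outer index 1. A single fused dimension like this can then be handed to one directive, for example `parallel()`.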
... TBD ...