Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit the necessity the various control knobs in L0 adapter #1528

Open
pbalcer opened this issue Apr 18, 2024 · 5 comments
Open

Revisit the necessity the various control knobs in L0 adapter #1528

pbalcer opened this issue Apr 18, 2024 · 5 comments
Labels
level-zero L0 adapter specific issues

Comments

@pbalcer
Copy link
Contributor

pbalcer commented Apr 18, 2024

As mentioned in #1454, L0 adapter uses a lot of environment variables to control its behavior. This results in the codebase having a lot of conditional execution flows, making it hard to follow at times. It also makes it difficult for the user to understand which variables need to be set for best performance on their systems.
Ideally, the number of knobs controllable through env variables would be reduced to a minimum set where there are unavoidable trade-offs related to the platform and the environment. As a first step towards that, we need to revisit the existing L0 adapter variables to see which ones can be deprecated or removed entirely.

List of existing variables:

Environment Variable Comments
UR_L0_DEBUG Used to enable ZE validation layers and enable some debug output. Most of the functionality can be supplanted by the new logger.
UR_L0_LEAKS_DEBUG Used to enable leak tracking. Could be rolled into ZE_DEBUG and/or supplanted by the logger.
UR_L0_SERIALIZE Controls call serialization.
UR_L0_TRACK_INDIRECT_ACCESS_MEMORY This tracks all memory allocations live at the time of a kernel execution and defers memory deallocation until the kernel execution has finished. In its current form this has quite a bit of overhead, but could possibly be optimized to use some form of epoch-based reclamation, at which point we could deprecate the variable.
UR_L0_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING Is already marked as deprecated by SYCL.
UR_L0_USE_NATIVE_USM_MEMCPY2D This was an introduced as a workaround for a bug in the driver, see intel/llvm#9973. We should probably remove the variable and decide the behavior based on automatically detected L0 driver version.
UR_L0_MAX_NUMBER_OF_EVENTS_PER_EVENT_POOL Controls # of events created per event pool. This should probably remain with a sane default.
UR_L0_COMMANDLISTS_CLEANUP_THRESHOLD Threshold for cleaning up events in a command list. Same as above.
UR_L0_USE_COPY_ENGINE Controls which copy engines are used. This is sometimes useful for avoiding hardware limitations when SYCL/UR is running alongside something else.
UR_L0_USE_IMMEDIATE_COMMANDLISTS Overwrites the default type of command lists used for a queue. Should probably be removed at some point, but is currently useful for performance profiling. The workloads that want to use regular command lists should select them programmatically.
UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS This is needed to get best performance on some systems, but is a trade-off against correctness.
UR_L0_USE_DRIVER_INORDER_LISTS A recently introduced feature flag. Should be removed at some point once the feature is proven to be stable.
SYCL_USM_HOSTPTR_IMPORT This controls whether host buffers are imported into USM on creation. This has been shown to increase performance across some workloads. More evaluation is likely needed across wider range of benchmarks to see whether this can by enabled by default and the flag removed.
SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS This is enabled by default and is the correct behavior. Should probably be deprecated and then removed.
UR_L0_IN_ORDER_BARRIER_BY_SIGNAL This is an optimization that's enabled by default. Deprecate and remove.
UR_L0_DISABLE_EVENTS_CACHING Disables event caching. I can't find where this is needed or even possibly useful (except maybe for some debugging).
UR_L0_REUSE_DISCARDED_EVENTS This is enabled by default and probably could be deprecated.
SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST This is a disabled by default optimization that checks whether an event has completed prior to appending an operation on a command list that has the event as a dependency. Performance profiling needs to be done to determine whether we can enable or remove this option.
UR_L0_DEVICE_SCOPE_EVENTS Controls creation of proxy-host events to avoid host visible events. Disabled by default, and might not have any tangible benefits. Candidate for deprecation.
UR_L0_USE_COPY_ENGINE_FOR_FILL Controls whether the copy engine is used in memory fill operations. Defaults to off. Its fairly uninvasive, and likely useful for performance debugging.
SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION Controls page migration between subdevices. Defaults to enabled. This is probably tricky for the end-users to decide on. This should probably be a programmatic API.
UR_L0_USE_COPY_ENGINE_FOR_D2D_COPY Control whether the copy engine is used for device-to-device transfers. Defaults to off. This is again fairly uninvasive. Probably requires a lot of benchmarking to decide what's the best option for each platform.
UR_L0_EAGER_INIT This controls whether command lists are created upfront. Defaults to off. I think we should just enable it by default and remove the option. I'm not sure whether we save anything here but some negligible amount of memory by doing lazy initialization.
UR_L0_QUEUE_FINISH_HOLD_LOCK This is a recently introduced feature flag to avoid deadlocks between queue and its events. To be removed once the feature is proven to be stable.
UR_L0_COPY_BATCH_SIZE, UR_L0_BATCH_SIZE These variables control how batching is performed with normal command lists. Defaults to dynamic adjustment. This should probably be deprecated, but, on the other hand, might be useful for performance profiling. The code is fairly uninvasive.
UR_L0_USE_COMPUTE_ENGINE Same as UR_L0_USE_COPY_ENGINE, this is often useful in debugging for hardcoding which copy engines are used.
UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE Controls whether copy engine is used for inorder queues. Defaults to enabled. Should probably stay, like other similar options.
UR_L0_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD This is the number of events collected for the immediate command list prior to cleanup. Defaults to 1024. This cleanup can be painful, especially for out-of-order command lists. This should probably be removed, as users have no visibility on how to set this properly. Upcoming optimizations should negate any benefit from this variable.
UR_L0_USM_ALLOCATOR see below.
UR_L0_DISABLE_USM_ALLOCATOR see below.
UR_L0_USM_ALLOCATOR_TRACE Allocator related options are going to be eventually moved to UMF, so these variables will be removed or mapped to UMF ones once that happens.
UR_L0_USM_RESIDENT Controls whether allocations are forced to be resident for device/host/shared memory. Defaults to making device memory resident. This is something that ideally software should control though a programmatic API.
@pbalcer pbalcer added the level-zero L0 adapter specific issues label Apr 18, 2024
@igchor
Copy link
Member

igchor commented Apr 18, 2024

Allocator related options are going to be eventually moved to UMF, so these variables will be removed or mapped to UMF ones once that happens.

This will not be so simple. Those env variables are applicable to higher level than UMF memory pool (they allow to control pooling per memory-type, etc.). There has been some work to create a better abstractions for managing different memory types through a pool manager:

Unless we implement this usm pool manager in UMF (which we could do I think), I think those variables will have to stay in UR.

@MichalMrozek
Copy link
Contributor

I think env variables needs to be moved to single file and be added via wrapper, here is sample how it is done in compute-runtime:

https://github.com/intel/compute-runtime/blob/master/shared/source/debug_settings/debug_variables_base.inl

  • all env are read at init time
  • after that they are immutable and are not read , this is important for performance
  • each env has documentation what it does and how to use it
  • in one place it is very easy to parse all envs and try those out

Sample
DECLARE_DEBUG_VARIABLE(int32_t, EnableDirectSubmission, -1, "-1: default (disabled), 0: disable, 1:enable. Enables direct submission of command buffers bypassing KMD")

Declaring env variables basically means:

  • type is defined, here it is int32_t
  • name is defined, here it is EnableDirectSubmission
  • default value is defined , here it is -1
  • then there is description what it does and what each value do

@MichalMrozek
Copy link
Contributor

For deprecation, I would start with those that have high complexity impact, one that strives out quickly is
UR_L0_TRACK_INDIRECT_ACCESS_MEMORY

It does conditional execution in many places, in hot submit path as well.
Right now it is disabled, we may want to try deprecating it to see if anyone uses it.

@pbalcer
Copy link
Contributor Author

pbalcer commented Apr 19, 2024

I think env variables needs to be moved to single file ...

That's the idea behind #1454.

high complexity impact, one that strives out quickly is UR_L0_TRACK_INDIRECT_ACCESS_MEMORY

Yes, but I was unsure whether this is a feature anyone cares about. It seems odd to have a memory-safety related feature behind an env variable, so my tentative plan was to redesign this so that deferred deallocations are integrated in the normal memory management flow, without introducing any overheads. Should be doable.

@MichalMrozek
Copy link
Contributor

Well if it is not used today by default then I question whether it is really needed and it adds heavy complexity to the code.
In SYCL it is responsibility of the user to make sure nothing is touching the memory, so I do not see a need for defered destruction.

void sycl::free(void* ptr, const context& syclContext)

Frees an allocation. The memory pointed to by ptr must have been allocated using one of the USM allocation routines. syclContext must be the same context that was used to allocate the memory. The memory is freed without waiting for commands operating on it to be completed. If commands that use this memory are in-progress or are enqueued the behavior is undefined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
level-zero L0 adapter specific issues
Projects
None yet
Development

No branches or pull requests

3 participants