[CUDA] Initial support for dynamic shared memory #8466
Conversation
src/runtime/thread_storage_scope.h
Outdated
@@ -96,6 +98,8 @@ struct StorageScope {
       return "global" + tag;
     case StorageRank::kShared:
       return "shared" + tag;
+    case StorageRank::kDynShared:
Let us use the same storage scope as shared, but introduce a different tag, e.g. shared.dyn. The scope rank was used to rank the memory hierarchy and needs to also be made consistent with the thread rank. In this case, dynamic shared memory is a special case of shared memory.
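The suggestion above replaces a dedicated rank with a tagged scope. A minimal sketch of how a tagged scope string like `shared.dyn` might decompose into a base rank plus a tag (the helper name and the exact split rule are assumptions for illustration, loosely mirroring what `StorageScope::Create` does in the TVM runtime):

```python
def split_storage_scope(scope: str):
    """Hypothetical helper: split a storage scope string like 'shared.dyn'
    into (base rank, tag). Everything after the base scope name is the tag,
    including the leading dot, so 'shared' and 'shared.dyn' share a rank."""
    for base in ("global", "shared", "warp", "local"):
        if scope == base:
            return base, ""
        if scope.startswith(base + "."):
            return base, scope[len(base):]  # tag keeps the leading '.'
    raise ValueError(f"unknown storage scope: {scope}")

print(split_storage_scope("shared.dyn"))  # ('shared', '.dyn')
print(split_storage_scope("shared"))      # ('shared', '')
```

Under this scheme, passes that rank the memory hierarchy see only the base rank, while code that needs to special-case dynamic shared memory can inspect the tag.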
I changed dynamic shared memory to be a tagged memory for kShared, but I had to add a check `e->scope.tag != ".dyn"` in a few places in lower_device_storage_access_info.cc and storage_rewrite.cc to work around an error/segfault.
@@ -240,7 +240,7 @@ namespace attr {
  *
  * Call(f,
  *      [arg1, arg2, ..., arg_n,
- *       work_size_1, work_size_2, ... work_size_m])
+ *       work_size_1, work_size_2, ... work_size_m, dyn_shmem_size])
Let us introduce a special metadata attribute to indicate that dynamic shared memory is used. This is to make sure the calling convention stays backward compatible when dynamic shared memory is not provided.
constexpr const char* kDeviceUseDynSharedMemory = "tir.device_use_dyn_shared_memory";
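The updated calling convention can be sketched as follows: the dynamic shared memory size is appended after the work sizes, but only when the function carries the metadata attribute, so existing kernels keep the old argument layout (the helper name is hypothetical; only the argument ordering comes from the doc comment above):

```python
def pack_launch_args(args, work_sizes, dyn_shmem_size=None):
    """Sketch of the device call convention discussed above:
    [arg1..arg_n, work_size_1..work_size_m, (dyn_shmem_size)].

    The trailing dyn_shmem_size is only appended when the function is
    marked (e.g. via kDeviceUseDynSharedMemory), so kernels that never
    use dynamic shared memory keep the old convention unchanged."""
    packed = list(args) + list(work_sizes)
    if dyn_shmem_size is not None:
        packed.append(dyn_shmem_size)
    return packed

# old convention (no dynamic shared memory): unchanged
print(pack_launch_args([1, 2], [64, 64]))        # [1, 2, 64, 64]
# new convention: shared memory size appended last
print(pack_launch_args([1, 2], [64, 64], 4096))  # [1, 2, 64, 64, 4096]
```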
done
src/runtime/thread_storage_scope.h
Outdated
@@ -223,6 +232,7 @@ class ThreadAxisConfig {
       w.work_size[arg_index_map_[i]] = size;
     }
   }
+  w.dyn_shmem_size = static_cast<size_t>(x.values[base_ + arg_index_map_.size()].v_int64);
We need to consider backward compatibility for kernels where dynamic shared memory is not used.
Update https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L103 to add a use_dyn_shared_memory field (defaulting to false), and update the JSON reader to use DeclareOptionalField so the field is read in optionally. Only decode the value when the flag is true.
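The optional-field behavior being asked for can be sketched in Python (the helper name is hypothetical; the point is the DeclareOptionalField-style semantics, where a missing key falls back to a default instead of failing the load):

```python
import json

def load_function_info(text: str) -> dict:
    """Sketch of reading FunctionInfo JSON with an optional field.
    Old JSON blobs, written before the field existed, still load:
    the missing key just yields the default (False)."""
    raw = json.loads(text)
    return {
        "name": raw["name"],  # required field: absence is an error
        # optional field: default to False when absent (backward compat)
        "use_dyn_shared_memory": raw.get("use_dyn_shared_memory", False),
    }

old = load_function_info('{"name": "kernel0"}')
new = load_function_info('{"name": "kernel0", "use_dyn_shared_memory": true}')
print(old["use_dyn_shared_memory"], new["use_dyn_shared_memory"])  # False True
```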
done
src/runtime/file_utils.cc
Outdated
}

bool FunctionInfo::Load(dmlc::Stream* reader) {
  if (!reader->Read(&name)) return false;
  if (!reader->Read(&arg_types)) return false;
  if (!reader->Read(&thread_axis_tags)) return false;
  if (!reader->Read(&use_dyn_shared_memory)) return false;
@tqchen I didn't find a way to optionally read this attribute for the binary reader.
I think let us think about another way to do backward compat then, since binary backward compatibility is more important to us here.
Here is one solution that also helps future backward compatibility: let us rename thread_axis_tags => launch_param_tags, and use dyn_shared_memory as a tag for the launch parameter of dynamic shared memory. We can then change ThreadAxisConfig => LaunchParamConfig.
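The tag-based scheme can be sketched as follows. The sentinel tag string below is an assumption for illustration (the PR's actual spelling may differ); the key property is that old binaries simply never contain the tag, so they decode exactly as before, which is what preserves binary backward compatibility:

```python
# Hypothetical sentinel tag; the actual string used in TVM may differ.
DYN_SHMEM_TAG = "tir.use_dyn_shared_memory"

def parse_launch_params(launch_param_tags, values):
    """Sketch of a LaunchParamConfig-style reader: ordinary tags map to
    work sizes positionally; if the special dynamic-shared-memory tag is
    present (by convention, last), its value is the shmem size in bytes."""
    work_sizes, dyn_shmem_size = {}, 0
    for tag, value in zip(launch_param_tags, values):
        if tag == DYN_SHMEM_TAG:
            dyn_shmem_size = value  # trailing launch param, not a thread axis
        else:
            work_sizes[tag] = value
    return work_sizes, dyn_shmem_size

# old-style module: no sentinel tag, decodes as before
print(parse_launch_params(["blockIdx.x", "threadIdx.x"], [128, 64]))
# new-style module: sentinel tag appended last
print(parse_launch_params(["blockIdx.x", "threadIdx.x", DYN_SHMEM_TAG],
                          [128, 64, 4096]))
```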
Let us try the ^ change and do a test to make sure we can load back old cuda binaries
If we do the rename thread_axis_tags => launch_param_tags, the JSON reader would break backward compat at helper.DeclareField("launch_param_tags", &launch_param_tags);.

Or shall we do the rename but keep the old attribute name for the JSON reader/writer? e.g.

helper.DeclareField("thread_axis_tags", &launch_param_tags);

Or shall we have both new and old attribute names as optional fields? e.g.

helper.DeclareOptionalField("launch_param_tags", &launch_param_tags);
helper.DeclareOptionalField("thread_axis_tags", &launch_param_tags);  // for backward compatibility
Went with the latter solution, and confirmed that an old saved module can be loaded via both json and binary.
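The "latter solution" can be sketched as a dual-key read: both the new and old JSON keys are accepted, so modules saved before the rename (under thread_axis_tags) and after it (under launch_param_tags) both load. The helper name is hypothetical; it mirrors declaring both names as optional fields:

```python
import json

def load_launch_param_tags(text: str):
    """Sketch of reading launch param tags with both the new key and the
    old pre-rename key declared optional, preferring the new one."""
    raw = json.loads(text)
    if "launch_param_tags" in raw:      # new key, written by current code
        return raw["launch_param_tags"]
    return raw.get("thread_axis_tags", [])  # old key, for backward compat

print(load_launch_param_tags('{"thread_axis_tags": ["blockIdx.x"]}'))
print(load_launch_param_tags('{"launch_param_tags": ["blockIdx.x"]}'))
```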
We can remove the thread_axis_tags later
@tqchen The backward compat issue has been addressed.
* send dyn shmem size to runtime
* add dyn shared storage scope
* associate buffer var and its storage scope in split_host_device
* tried NVPTX but failed with INVALID_PTX error
* test stub
* dynamic shmem reduce working
* log2 issue fixed
* nvptx working
* refactor llvm shmem allocation
* make linkage argument
* support rocm too
* send dyn shmem param to hip runtime
* remove alloc map from split_host_device.cc
* remove attr::storage_scope from split_host_device
* lint fix
* formatting
* update calling convention doc
* minor update to test
* remove log
* remove kDynShared, dyn.shared -> shared.dyn
* support backward compat
* update json/binary reader/writer
* thread_axis_tags -> launch_param_tags
* ThreadAxisConfig -> LaunchParamConfig
* remove use_dynamic_shared_memory from FunctionInfo meta data
* revert change in test_tir_ir_builder.py
* make sure kUseDynamicSharedMemoryTag is the last tag
* remove continue
* update doc string following name change
* more comment update following name change

Co-authored-by: masa <masa@pop-os.localdomain>
Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
Only one allocation is allowed for now. See the test case for usage.
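The single-allocation restriction follows from how CUDA exposes dynamic shared memory: all extern __shared__ declarations alias one base pointer sized at launch, so supporting several logical buffers would require a pass that packs them at manually computed, aligned byte offsets within that one allocation. A hedged sketch of what such packing would look like (this pass is not part of the PR; the helper and alignment choice are illustrative):

```python
def assign_dyn_shmem_offsets(allocs, align=16):
    """Hypothetical sketch of packing several logical buffers into the
    single dynamic shared memory allocation CUDA provides: each buffer
    gets an aligned byte offset from the shared extern __shared__ base,
    and the running total becomes the dyn_shmem_size launch parameter."""
    offsets, offset = {}, 0
    for name, nbytes in allocs:
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += nbytes
    return offsets, offset  # per-buffer offsets, total bytes to request

offs, total = assign_dyn_shmem_offsets([("A", 512), ("B", 100), ("C", 256)])
print(offs, total)  # {'A': 0, 'B': 512, 'C': 624} 880
```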
@tqchen @junrushao1994 @vinx13