
[CUDA] Initial support for dynamic shared memory #8466

Merged: 30 commits into apache:main, Jul 22, 2021

Conversation

masahi (Member) commented Jul 14, 2021

Only one dynamic shared memory allocation is allowed for now. See the test case for usage.
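
For readers unfamiliar with the mechanism: dynamic shared memory is a single unsized buffer whose byte size is chosen at kernel launch rather than at compile time. Below is a minimal host-side sketch with the CUDA driver API; the function and parameter names are illustrative, not TVM's actual runtime code.

// In the generated kernel, the allocation lowers to something like:
//   extern __shared__ unsigned char buf_dyn_shmem[];
// The host supplies its size at launch time.

#include <cuda.h>
#include <cstddef>

CUresult LaunchWithDynShmem(CUfunction func, CUstream stream, void** args,
                            unsigned grid_x, unsigned block_x,
                            size_t dyn_shmem_size) {
  // The sharedMemBytes argument of cuLaunchKernel is the dynamic shared
  // memory size in bytes; static shared memory is accounted for separately.
  return cuLaunchKernel(func, grid_x, /*gridDimY=*/1, /*gridDimZ=*/1,
                        block_x, /*blockDimY=*/1, /*blockDimZ=*/1,
                        static_cast<unsigned>(dyn_shmem_size), stream,
                        args, /*extra=*/nullptr);
}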

@tqchen @junrushao1994 @vinx13

masahi marked this pull request as ready for review July 14, 2021 11:23
@@ -96,6 +98,8 @@ struct StorageScope {
return "global" + tag;
case StorageRank::kShared:
return "shared" + tag;
case StorageRank::kDynShared:
tqchen (Member) Jul 16, 2021


Let us use the same storage scope as shared, but introduce a different tag, e.g. shared.dyn. The scope rank was used to rank the memory hierarchy and needs to also be made consistent with the thread rank. In this case dynamic shared memory is a special case of shared memory.
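
A sketch of the tagged-scope scheme being suggested (simplified; TVM's real StorageScope has more ranks and stricter parsing): the rank stays kShared and everything after the base name becomes the tag.

#include <string>

enum class StorageRank { kGlobal, kShared, kLocal };

struct StorageScope {
  StorageRank rank{StorageRank::kGlobal};
  std::string tag;  // "" for "shared", ".dyn" for "shared.dyn"

  // Parse "shared" -> {kShared, ""} and "shared.dyn" -> {kShared, ".dyn"}.
  static StorageScope Create(const std::string& s) {
    StorageScope r;
    if (s.rfind("shared", 0) == 0) {  // prefix match
      r.rank = StorageRank::kShared;
      r.tag = s.substr(6);
    }
    return r;
  }

  // Round-trip back to the scope string (other ranks elided here).
  std::string to_string() const {
    return rank == StorageRank::kShared ? "shared" + tag : "global" + tag;
  }
};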

masahi (Member, Author) Jul 17, 2021


I changed dynamic shared memory to be a tagged memory for kShared, but I had to add a check e->scope.tag != ".dyn" in a few places in lower_device_storage_access_info.cc and storage_rewrite.cc to work around an error/segfault.
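
The workaround is essentially an early-out for the new tag in passes that try to plan or merge shared allocations; a sketch of the shape of the check (the surrounding pass context is elided and illustrative):

// Dynamic shared memory is sized by the runtime at launch, so the
// static planning passes must leave ".dyn"-tagged allocations alone.
if (e->scope.rank == StorageRank::kShared && e->scope.tag == ".dyn") {
  return;  // skip: nothing to plan statically for this buffer
}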

@@ -240,7 +240,7 @@ namespace attr {
*
* Call(f,
* [arg1, arg2, ..., arg_n,
* work_size_1, work_size_2, ... work_size_m])
* work_size_1, work_size_2, ... work_size_m, dyn_shmem_size])
tqchen (Member)


Let us introduce a special metadata attribute to indicate that dynamic shared memory is used. This is to make sure the calling convention stays backward compatible when dyn shared memory is not provided.

constexpr const char* kDeviceUseDynSharedMemory = "tir.device_use_dyn_shared_memory";
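
A sketch of how such an attribute could be attached when host and device code are split (illustrative stand-in types, not the exact pass code; the PR later replaced this attribute approach with a launch-param tag, see the launch_param_tags discussion further down):

#include <map>
#include <string>

// Minimal stand-in for a function's attribute dictionary.
using AttrMap = std::map<std::string, bool>;

// In split_host_device: mark device functions that allocate dynamic
// shared memory, so the host-side call packs dyn_shmem_size last.
void MarkDynSharedMemory(AttrMap* func_attrs, bool uses_dyn_shared_memory) {
  if (uses_dyn_shared_memory) {
    (*func_attrs)[kDeviceUseDynSharedMemory] = true;
  }
}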

masahi (Member, Author)


done

@@ -223,6 +232,7 @@ class ThreadAxisConfig {
w.work_size[arg_index_map_[i]] = size;
}
}
w.dyn_shmem_size = static_cast<size_t>(x.values[base_ + arg_index_map_.size()].v_int64);
tqchen (Member)


We need to consider backward compatibility for kernels where dynamic shared memory is not used.

Update https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L103 to add a use_dyn_shared_memory field (default to false), update the JSON reader to use DeclareOptionalField to optionally read this field in, and only decode the value when the flag is true.
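
A sketch of the suggested optional JSON field using dmlc's JSONObjectReadHelper (the struct is trimmed down to the relevant parts):

#include <dmlc/json.h>
#include <string>

struct FunctionInfoSketch {
  std::string name;
  bool use_dyn_shared_memory{false};  // default keeps old modules valid

  void Load(dmlc::JSONReader* reader) {
    dmlc::JSONObjectReadHelper helper;
    helper.DeclareField("name", &name);
    // Absent in modules saved before this change: must be optional.
    helper.DeclareOptionalField("use_dyn_shared_memory",
                                &use_dyn_shared_memory);
    helper.ReadAllFields(reader);
  }
};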

masahi (Member, Author)


done

src/tir/transforms/split_host_device.cc (resolved)
tqchen (Member) commented Jul 16, 2021

Thanks @masahi, I left some comments; also cc @Hzfengsy @vinx13 so they are aware.

tqchen added the status: need update label Jul 16, 2021
}

bool FunctionInfo::Load(dmlc::Stream* reader) {
if (!reader->Read(&name)) return false;
if (!reader->Read(&arg_types)) return false;
if (!reader->Read(&thread_axis_tags)) return false;
if (!reader->Read(&use_dyn_shared_memory)) return false;
masahi (Member, Author)


@tqchen I didn't find a way to optionally read this attribute for the binary reader.
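
The underlying problem: dmlc::Stream is a raw sequential binary stream with no field names, so there is no way to probe for an optional trailing field. On an old binary the extra read simply runs out of bytes:

// Old binaries end right after thread_axis_tags, so this read finds
// no bytes to consume, returns false, and fails the entire Load.
if (!reader->Read(&use_dyn_shared_memory)) return false;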

tqchen (Member)


Let us think about another way to do backward compat then, since binary backward compatibility is more important to us here.

Here is one solution that also helps future backward compatibility.

Let us rename thread_axis_tags => launch_param_tags, and use dyn_shared_memory as a tag for the launch parameter of dynamic shared memory. We can then change ThreadAxisConfig => LaunchParamConfig.
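
A sketch of the tag scheme (the constant name kUseDynamicSharedMemoryTag appears in the final commit list; its string value here is an assumption): because the request travels in the already-serialized string list, the binary layout of FunctionInfo does not change.

#include <string>
#include <vector>

// Assumed tag value; the PR defines a dedicated constant for this.
constexpr const char* kUseDynamicSharedMemoryTag = "tir.use_dyn_shared_memory";

struct LaunchParamConfigSketch {
  std::vector<std::string> thread_tags;
  bool use_dyn_shared_memory{false};

  explicit LaunchParamConfigSketch(
      const std::vector<std::string>& launch_param_tags) {
    for (const std::string& tag : launch_param_tags) {
      if (tag == kUseDynamicSharedMemoryTag) {
        // Kept as the last tag: dyn_shmem_size is appended after the
        // regular work sizes in the packed call.
        use_dyn_shared_memory = true;
      } else {
        thread_tags.push_back(tag);  // e.g. "blockIdx.x", "threadIdx.x"
      }
    }
  }
};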

tqchen (Member)


Let us try the ^ change and do a test to make sure we can load back old CUDA binaries.

masahi (Member, Author) Jul 20, 2021


If we do the rename thread_axis_tags => launch_param_tags, the JSON reader would break backward compat at helper.DeclareField("launch_param_tags", &launch_param_tags);.

Or shall we do the rename but keep the old attribute name for the JSON reader/writer? e.g.
helper.DeclareField("thread_axis_tags", &launch_param_tags);

Or shall we have both the new and old attribute names as optional fields, e.g.

helper.DeclareOptionalField("launch_param_tags", &launch_param_tags);
helper.DeclareOptionalField("thread_axis_tags", &launch_param_tags);  // for backward compatibility

masahi (Member, Author)


Went with the latter solution, and confirmed that an old saved module can be loaded via both the JSON and binary paths.

tqchen (Member)


We can remove the thread_axis_tags later


masahi (Member, Author) commented Jul 20, 2021

@tqchen The backward compat issue has been addressed.

masahi merged commit e95f10f into apache:main on Jul 22, 2021
masahi (Member, Author) commented Jul 22, 2021

Thanks @tqchen @vinx13 @yzh119 @Hzfengsy

ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
* send dyn shmem size to runtime

* add dyn shared storage scope

* associate buffer var and its storage scope in split_host_device

* tried NVPTX but failed with INVALID_PTX error

* test stub

* dynamic shmem reduce working

* log2 issue fixed

* nvptx working

* refactor llvm shmem allocation

* make linkage argument

* support rocm too

* send dyn shmem param to hip runtime

* remove alloc map from split_host_device.cc

* remove attr::storage_scope from split_host_device

* lint fix

* formatting

* update calling convention doc

* minor update to test

* remove log

* remove kDynShared, dyn.shared -> shared.dyn

* support backward compat

* update json/binary reader/writer

* thread_axis_tags -> launch_param_tags

* ThreadAxisConfig -> LaunchParamConfig

* remove use_dynamic_shared_memory from FunctionInfo meta data

* revert change in test_tir_ir_builder.py

* make sure kUseDynamicSharedMemoryTag is the last tag

* remove continue

* update doc string following name change

* more comment update following name change

Co-authored-by: masa <masa@pop-os.localdomain>
Co-authored-by: Masahiro Masuda <masahi129@gmail.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022 (same commit list as above).