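【Problem】
The `fc` kernel is implemented on top of `cl::Buffer` and performs poorly.
`softmax` performs poorly on 2-D tensors because parallelism is very low: for a tensor of shape 1x1000 with axis=1, for example, only one thread is assigned to the computation.
【Work in this PR】
`fc`: input/output/bias are stored as `cl::Image2d` and weight is stored as `cl::Buffer`; the weight is read as `half16`. For details see [OpenCL][Kernel] Use FC replace conv1x1 #6365. The corresponding unit test supports precision validation in both fp32 and fp16.
`softmax`: to address the poor performance on 2-D tensors, thread assignment is changed so that the data along axis is processed in blocks of 32. This uses local memory, and the core idea is a parallel reduce. To handle channel counts that are not divisible by 4 efficiently, a `mask` is used instead of if/else branches.
【Results】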
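The MobileNetV1 model contains one `fc` and one `softmax`. Kernel time was measured on 6 devices covering Mali and Adreno GPUs, as listed in the table (times in ms). `fc` is 1x to 3x faster, and `softmax` is 44% to 302% faster.
FC performance for different values of N was also tested separately on the 845.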
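To illustrate the softmax change described above, here is a minimal Python sketch of a blocked parallel reduce and of masking a vec4 tail. It is not the OpenCL kernel from this PR: the function names, the strided assignment of elements to 32 work-items, and the list standing in for local memory are all assumptions made for illustration.

```python
import math

BLOCK = 32  # assumed number of cooperating work-items per row ("blocks of 32")


def tree_reduce(local_mem, op):
    """Pairwise reduction over a scratch list, mimicking a local-memory reduce."""
    stride = len(local_mem) // 2
    while stride > 0:
        # On a GPU these iterations run in parallel, followed by a barrier.
        for lid in range(stride):
            local_mem[lid] = op(local_mem[lid], local_mem[lid + stride])
        stride //= 2
    return local_mem[0]


def blocked_softmax(row):
    """Softmax over one row, with the reductions split across BLOCK work-items."""
    # Pass 1: each work-item scans a strided slice and keeps a partial max.
    local_mem = [max(row[lid::BLOCK], default=-math.inf) for lid in range(BLOCK)]
    row_max = tree_reduce(local_mem, max)
    # Pass 2: each work-item accumulates a partial sum of exp(x - max).
    local_mem = [sum(math.exp(x - row_max) for x in row[lid::BLOCK])
                 for lid in range(BLOCK)]
    row_sum = tree_reduce(local_mem, lambda a, b: a + b)
    # Pass 3: normalize every element.
    return [math.exp(x - row_max) / row_sum for x in row]


def masked_vec4_sum(values, channels):
    """Sum the first `channels` entries of a buffer padded to a multiple of 4.

    Padding lanes are multiplied by a 0.0 mask instead of being skipped
    with an if/else.
    """
    total = 0.0
    for base in range(0, len(values), 4):
        for lane in range(4):
            mask = float(base + lane < channels)  # 1.0 if valid, 0.0 if padding
            total += values[base + lane] * mask
    return total
```

With a 1x1000 row, the single-thread version walks all 1000 elements serially, while in this scheme each of the 32 work-items handles about 32 elements and the partial results are combined in 5 reduction steps.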
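【TODO】
Both kernels produce 2-D outputs. When the dims of their output tensor are expanded to 4-D, 1 is padded on the low dimensions rather than on the high dimensions as defined in the opencl converter, so precision profile is affected; this remains to be fixed. The plan is to later unify the opencl converter to pad 1 on the low dimensions.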
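The difference between the two padding conventions can be sketched as follows. This is an illustration only: `pad_high` and `pad_low` are hypothetical helpers, not functions from the opencl converter, and the sketch assumes that "high dimensions" means the leading axes and "low dimensions" means the trailing axes.

```python
def pad_high(shape, rank=4):
    """Expand `shape` to `rank` dims by inserting 1s in front (leading axes)."""
    return (1,) * (rank - len(shape)) + tuple(shape)


def pad_low(shape, rank=4):
    """Expand `shape` to `rank` dims by appending 1s at the end (trailing axes)."""
    return tuple(shape) + (1,) * (rank - len(shape))


# A 2-D output of shape (1, 1000) under each convention:
high = pad_high((1, 1000))  # (1, 1, 1, 1000)
low = pad_low((1, 1000))    # (1, 1000, 1, 1)
```

Under these assumptions the same 2-D output ends up with two different 4-D shapes, which is the kind of mismatch a tool comparing tensors by their 4-D dims would see.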