Table of Contents
gm-ocl
开发记录- <2022-02-24 Thu>
- <2022-02-25 Fri>
- <2022-03-02 Wed>
- <2022-03-07 Mon>
- <2022-03-09 Wed>
- <2022-03-15 Tue>
- <2022-03-16 Wed>
- <2022-03-17 Thu>
- <2022-03-18 Fri>
- <2022-03-26 Sat>
- <2022-03-28 Mon>
- <2022-03-29 Tue>
- <2022-03-30 Wed>
- <2022-03-31 Thu>
- <2022-04-01 Fri>
- <2022-04-02 Sat>
- <2022-04-03 Sun>
- <2022-04-06 Wed>
- <2022-04-07 Thu>
- <2022-04-10 Sun>
- <2022-04-11 Mon>
- <2022-04-12 Tue>
- <2022-04-13 Wed>
- <2022-04-14 周四>
- <2022-04-18 周一>
- <2022-04-19 周二>
- <2022-04-21 周四>
- <2022-04-22 Fri>
- <2022-04-26 周二>
- <2022-04-27 周三>
- <2022-04-28 周四>
- <2022-05-05 周四>
- <2022-05-06 周五>
- <2022-05-18 Wed>
拷贝ImageMagick
中的m4/ax_have_opencl.m4
,在configure.ac
中添加:
# Enable support for OpenCL
no_cl=yes
AX_HAVE_OPENCL([C])
if test "X$no_cl" != 'Xyes'; then
MAGICK_FEATURES="OpenCL $MAGICK_FEATURES"
fi
在根目录下运行autoreconf -vif
重新生成configure
,但出现错误:
$ autoreconf -vif
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
autoreconf: running: libtoolize --copy --force
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
autoreconf: running: /usr/bin/autoconf --force
configure.ac:1057: error: possibly undefined macro: AC_MSG_RESULT
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1
试了比如安装pkg-config
,autoconf-archive
都不能解决,开发环境archlinux
的autoconf
的版本是2.71
,换到了ubuntu
的2.69
的环境下也不行。
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
$ autoconf --version
autoconf (GNU Autoconf) 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+/Autoconf: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>, <http://gnu.org/licenses/exceptions.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by David J. MacKenzie and Akim Demaille.
可以将ax_have_opencl.m4
的所有AC_MSG_RESULT
全部删除(注:注释掉AC_MSG_RESULT
的话还会出现其它错误导致不能生成configure
),虽然能成功生成configure
,但是运行./configure
时会出现如下错误:
$ ./configure
...
...
checking for gcc option to support OpenMP... -fopenmp
./configure: line 7812: syntax error near unexpected token `OpenCL,'
./configure: line 7812: ` AX_CHECK_FRAMEWORK(OpenCL,'
临时解决的方法是将configure
中的AX_CHECK_FRAMEWORK(...)
的代码段删除。
AX_CHECK_FRAMEWORK
的答案来自:
$ grep -rn 'AX_CHECK_FRAMEWORK'
ImageMagick/m4/ax_check_framework.m4:6:dnl @synopsis AX_CHECK_FRAMEWORK(framework, [action-if-found], [action-if-not-found])
ImageMagick/m4/ax_check_framework.m4:15:AC_DEFUN([AX_CHECK_FRAMEWORK],[
ImageMagick/m4/ax_have_opencl.m4:94: AX_CHECK_FRAMEWORK([OpenCL], [
即,将ImageMagick
的m4/ax_check_framework.m4
拷贝过来即可,惊奇的是AX_CHECK_FRAMEWORK
的问题解决,之前AC_MSG_RESULT
的问题同时也迎刃而解。
可以删除configure
文件后再运行autoreconf
来代替autoreconf -vif
,这样生成的configure
可以最大限度的与原configure
保持一致。
因为增加了两个新文件accelerate-private.h
和accelerate.c
,AccelerateResizeImage()
就位于其中,因为magick/Makefile.am
中没有增加这两个文件,所以在autoreconf
时生成的Makefile.in
就不知道新增了两个文件,因此在configure
时生成的Makefile
中就不会编译AccelerateResizeImage()
函数,因此出现的链接问题。
$ rm configure
$ autoreconf
configure.ac:119: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated. For more info, see:
configure.ac:119: https://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation
% lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Plus Graphics G1 (Ice Lake) (rev 07)
% clinfo
Number of platforms 0
% sudo pacman -S intel-compute-runtime
发现直接运行gm display
没有崩溃问题,调试时却经常发生,有时SIGABRT
,有时SIGSEGV
,猜测可能是同步问题:
/*
We need this to get a proper performance benchmark, the operations
are executed asynchronous.
*/
if (is_cpu == MagickFalse)
{
CacheInfo
*cache_info;
MagickCLCacheInfo
cl_info;
cache_info=(CacheInfo *) resizedImage->cache;
cl_info=GetCacheInfoOpenCL(cache_info);
if (cl_info != (MagickCLCacheInfo) NULL)
openCL_library->clWaitForEvents(cl_info->event_count,
cl_info->events);
}
if (i > 0)
StopAccelerateTimer(&timer);
if (bluredImage != (Image *) NULL)
DestroyImage(bluredImage);
if (unsharpedImage != (Image *) NULL)
DestroyImage(unsharpedImage);
if (resizedImage != (Image *) NULL)
DestroyImage(resizedImage);
经常遇到的是:resizedImage
为0
(SIGSEGV
),DestroyImage(resizedImage);
(SIGABRT
)。
在vscode
上的堆栈输出如下:
libc.so.6!__pthread_kill_implementation (Unknown Source:0)
libc.so.6!raise (Unknown Source:0)
libc.so.6!abort (Unknown Source:0)
libigdrcl.so![Unknown/Just-In-Time compiled code] (Unknown Source:0)
libOpenCL.so!clCreateBuffer (Unknown Source:0)
AcquireMagickCLCacheInfo(MagickCLDevice device, Quantum * pixels, const magick_int64_t length) (gm-ocl/magick/opencl.c:569)
GetAuthenticOpenCLBuffer(const Image * image, MagickCLDevice device, ExceptionInfo * exception) (gm-ocl/magick/pixel_cache.c:5252)
因为堆栈显示问题最终是出在__pthread_kill_implementation
上,所以我的调查方向一直在线程同步上,代码调试了一遍又一遍却始终没有找到问题。心灰意冷并已经产生从头再开始的想法了。
注意到:
libigdrcl.so![Unknown/Just-In-Time compiled code] (Unknown Source:0)
谷歌了下libigdrcl.so
,难道是安装intel
驱动的问题?目前使用的是intel-compute-runtime
,现改为intel-opencl-runtime
,参考自:“GPGPU - ArchWiki”。
$ sudo pacman -Rns intel-compute-runtime
$ yay -S intel-opencl-runtime
我对比了intel-compute-runtime
和intel-opencl-runtime
的clinfo
输出差异发现intel-compute-runtime
支持GPU
,intel-opencl-runtime
支持CPU
。
那代码到底有没有问题?clCreateBuffer()
调用失败的原因要不要深究?
// tasks.json
{
// See https://go.microsoft.com/fwlink/?LinkId=733558
// for the documentation about the tasks.json format
"version": "2.0.0",
"tasks": [
{
"label": "build with opencl",
"type": "shell",
"command": "make",
"problemMatcher": [],
"group": {
"kind": "build",
"isDefault": true
}
}
]
}
// launch.json
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "(gdb) Launch",
"type": "cppdbg",
"request": "launch",
"program": "${workspaceFolder}/utilities/gm",
"args": ["display", "~/temp/bg1a.jpg"],
"stopAtEntry": false,
"cwd": "${fileDirname}",
"environment": [
{
"name": "MAGICK_OCL_DEVICE",
"value": "true"
}
],
"externalConsole": false,
"MIMode": "gdb",
"setupCommands": [
{
"description": "Enable pretty-printing for gdb",
"text": "-enable-pretty-printing",
"ignoreFailures": true
},
{
"description": "Set Disassembly Flavor to Intel",
"text": "-gdb-set disassembly-flavor intel",
"ignoreFailures": true
}
]
}
]
}
正在用的dev.sh
内容如下:
#!/bin/bash
./configure CFLAGS='-g -O0' LDFLAGS='-ldl' --enable-opencl --prefix=$HOME/usr/local
使用MAGICK_OCL_DEVICE=GPU
且在已经安装了opencl-compute-runtime
的情况下会产生两个问题:
gm
运行卡死,无法操作,CPU
使用率居高不下,或者gm
运行崩溃,产生如下提示:
$ gm display ~/temp/bg1a.jpg
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)
我在这里找到了一些有用的信息:“crash in NEO::DrmAllocation::makeBOsResident
or in checkAllocationsForOverlapping
when using more than one opencl block in gnuradio gr-clenabled”。
下载了compute-runtime-22.09.22577
的源代码:
// compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
OsHandleStorage HostPtrManager::prepareOsStorageForAllocation(MemoryManager &memoryManager, size_t size, const void *ptr, uint32_t rootDeviceIndex) {
std::lock_guard<decltype(allocationsMutex)> lock(allocationsMutex);
auto requirements = HostPtrManager::getAllocationRequirements(rootDeviceIndex, ptr, size);
UNRECOVERABLE_IF(checkAllocationsForOverlapping(memoryManager, &requirements) == RequirementsStatus::FATAL);
auto osStorage = populateAlreadyAllocatedFragments(requirements);
if (osStorage.fragmentCount > 0) {
if (memoryManager.populateOsHandles(osStorage, rootDeviceIndex) != MemoryManager::AllocationStatus::Success) {
memoryManager.cleanOsHandles(osStorage, rootDeviceIndex);
osStorage.fragmentCount = 0;
}
}
return osStorage;
}
host_ptr_manager.cpp:250
就是:
UNRECOVERABLE_IF(checkAllocationsForOverlapping(memoryManager, &requirements) == RequirementsStatus::FATAL);
参考链接的:
Is flag CL_USE_HOST_PTR used to create buffers?
Are such buffers mapped (clEnqueueMapBuffer) and returned pointers used to create other buffers? Or passed as a ptr to EnqueueReadWriteBuffer/Image() calls?
我加了pixels
指针的输出发现:
$ gm display ~/temp/bg1a.jpg
14:45:53 0:1.237436 0.740u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dc333c70
14:45:53 0:1.244275 0.800u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648db4beac0
14:45:53 0:1.334968 1.370u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dd446b00
14:45:53 0:1.336437 1.390u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648db646ae0
14:45:53 0:1.432968 2.040u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dd446b00
14:45:53 0:1.433129 2.060u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648db646ae0
14:45:53 0:1.544831 2.780u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648de446b50
14:45:53 0:1.544873 2.790u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dc762150
14:45:53 0:1.659341 3.630u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648df446be0
14:45:53 0:1.659420 3.630u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dd935030
14:45:53 0:1.778589 4.520u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648df446be0
14:45:53 0:1.778667 4.530u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dd935030
14:45:54 0:2.242840 8.140u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x7f85b9757010
14:45:54 0:2.247728 8.200u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648dc8faa00
14:45:54 0:2.256726 8.280u 69908 opencl.c/CopyMagickCLCacheInfo/1552/User:
clEnqueueMapBuffer return pixels: 0x5648dc8faa00
14:46:12 0:20.409439 8.320u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x5648db646ae0
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)
发现0x5648db646ae0
指针在第三次调用clCreateBuffer
时崩溃。怎么再次崩溃时的输出不一样:
$ gm display ~/temp/bg1a.jpg
15:09:02 0:1.357361 1.150u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55728b02e720
15:09:02 0:1.363826 1.210u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55728a1b9570
15:09:02 0:1.460255 1.810u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55728ce2e740
15:09:02 0:1.461296 1.830u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55728b341620
15:09:02 0:1.552602 2.460u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55728c1415d0
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)
这次没有崩溃,但是却发现clEnqueueMapBuffer
返回的指针被clCreateBuffer
创建缓存却没有崩溃:
$ gm display ~/temp/bg1a.jpg
15:11:13 0:1.283573 1.150u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf5096990
15:11:13 0:1.289888 1.200u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf4161c00
15:11:13 0:1.385760 1.820u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf6296b20
15:11:13 0:1.392014 1.850u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf4161c00
15:11:14 0:1.479689 2.450u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf6296b20
15:11:14 0:1.485056 2.470u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf4161c00
15:11:14 0:1.581975 3.120u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf7296b60
15:11:14 0:1.582017 3.120u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:1.700207 4.010u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf8196bc0
15:11:14 0:1.700267 4.010u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:1.816203 4.880u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf8196bc0
15:11:14 0:1.816260 4.880u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:2.285724 8.520u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x7fe42d322010
15:11:14 0:2.290441 8.550u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf54eb600
15:11:14 0:2.300859 8.630u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
clEnqueueMapBuffer return pixels: 0x55dcf54eb600
15:11:25 0:13.121552 8.660u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf5570380
15:11:25 0:13.135136 8.670u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
clEnqueueMapBuffer return pixels: 0x55dcf5570380
15:11:25 0:13.291565 8.680u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer pixels: 0x55dcf54eb600
15:11:25 0:13.300287 8.690u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
clEnqueueMapBuffer return pixels: 0x55dcf54eb600
这跟上面链接的说法不符呀!难道是因为clEnqueueMapBuffer
中没有使用CL_USE_HOST_PTR
标记?所以不存在上面所说的这个问题?
怪事儿,我为什么没有在cl.h
里找到CL_USE_HOST_PTR
的定义?
在IM
上添加了相同的输出代码,表现和上面的类似,但IM
却能工作的很好,看来得从其它方法再入手了。
注:IM
的日志配置文件:usr/local/etc/ImageMagick-7/log.xml
。
还是因为内存重叠的原因:我在opencl.c
的AcquireMagickCLCacheInfo()
函数中调用clCreateBuffer()
之前添加了如下的输出代码:
LogMagickEvent(UserEvent, GetMagickModule(),
"clCreateBuffer - req: %d, pixels: %p, len: %d",
device->requested, pixels, length);
当出现问题时有如下log
:
[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
10:51:31 0:1.368105 1.030u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 1, pixels: 0x55d124b58490, len: 15728640
10:51:31 0:1.374360 1.060u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 2, pixels: 0x55d123c23730, len: 1536000
10:51:31 0:1.469657 1.670u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 2, pixels: 0x55d125c58510, len: 15728640
10:51:31 0:1.475944 1.720u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 3, pixels: 0x55d123c23730, len: 1536000
10:51:31 0:1.566470 2.310u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 3, pixels: 0x55d125c58510, len: 15728640
10:51:31 0:1.571902 2.360u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
clCreateBuffer - req: 4, pixels: 0x55d124b23740, len: 1536000
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)
注意看第一行和最后一行的输出发现,最后一行的地址加偏移已经大于第一行的地址,说明此时内存重叠,所以出现clCreateBuffer()
的调用崩溃。这个问题有点难解了,涉及要修改整个GM
的内存布局?
(> (+ #x55d124b23740 1536000) #x55d124b58490)
在前一篇的基础上继续分析,因为clCreateBuffer()
返回的地址即GetAuthenticOpenCLBuffer()
的返回值(它在ComputeResizeImage()
函数中被调用),当ComputeResizeImage()
结束时,调用clReleaseMemObject()
将会减少该内存计数,当计数为0
时clCreateBuffer()
创建的内存才被释放。
为了打印内存的引用计数,增加了clGetMemObjectInfo()
函数:
// opencl-private.h
typedef CL_API_ENTRY cl_int
(CL_API_CALL *MAGICKpfn_clGetMemObjectInfo)(cl_mem memobj,
cl_mem_info param_name,size_t param_value_size,void *param_value,
size_t *param_value_size_ret)
CL_API_SUFFIX__VERSION_1_0;
MAGICKpfn_clGetMemObjectInfo clGetMemObjectInfo;
比如ReleaseOpenCLMemObject()
函数被改成了:
MagickPrivate void ReleaseOpenCLMemObject(cl_mem memobj)
{
cl_uint refcnt=0;
openCL_library->clGetMemObjectInfo(memobj, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &refcnt, NULL);
LogMagickEvent(UserEvent, GetMagickModule(),
"b4 ReleaseOpenCLMemObject(%p) refcnt: %d", memobj, refcnt);
cl_int ret=openCL_library->clReleaseMemObject(memobj);
openCL_library->clGetMemObjectInfo(memobj, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &refcnt, NULL);
LogMagickEvent(UserEvent, GetMagickModule(),
"af ReleaseOpenCLMemObject(%p) refcnt: %d", memobj, refcnt);
}
我有输出如下:
[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
13:51:33 0:1.510679 1.480u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
clCreateBuffer -- req: 1, pixles: 0x5588e97712e0, len: 15728640
13:51:33 0:1.522527 1.570u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
clCreateBuffer return: 0x5588e774f5c0, refcnt: 1
13:51:33 0:1.522589 1.570u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
clCreateBuffer -- req: 2, pixles: 0x5588e88fc130, len: 1536000
13:51:33 0:1.524313 1.590u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
clCreateBuffer return: 0x5588e77064b0, refcnt: 1
13:51:33 0:1.528273 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e774f5c0) refcnt: 2
13:51:33 0:1.528351 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e774f5c0) refcnt: 1
13:51:33 0:1.528410 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e77064b0) refcnt: 2
13:51:33 0:1.528465 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e77064b0) refcnt: 1
13:51:33 0:1.528520 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:33 0:1.528582 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:34 0:1.740225 2.900u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
clCreateBuffer -- req: 2, pixles: 0x5588eb571300, len: 15728640
13:51:34 0:1.742399 2.910u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
clCreateBuffer return: 0x5588e87fead0, refcnt: 1
13:51:34 0:1.742615 2.910u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
clCreateBuffer -- req: 3, pixles: 0x5588e9a841e0, len: 1536000
13:51:34 0:1.744057 2.930u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
clCreateBuffer return: 0x5588e87feda0, refcnt: 1
13:51:34 0:1.745895 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 2
13:51:34 0:1.745964 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:34 0:1.746004 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e87feda0) refcnt: 2
13:51:34 0:1.746081 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e87feda0) refcnt: 1
13:51:34 0:1.746126 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e87756a0) refcnt: 1
13:51:34 0:1.746185 2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e87756a0) refcnt: 1
13:51:34 0:1.946106 4.270u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
clCreateBuffer -- req: 3, pixles: 0x5588ea9841f0, len: 15728640
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)
测试地址重叠:
(> (+ #x5588ea9841f0 15728640) #x5588eb571300)
(- #x5588eb571300 #x5588ea9841f0)
解析下:最后一行0x5588ea9841f0
调用clCreateBuffer()
时崩溃,它的地址与0x5588eb571300
重叠,而0x5588eb571300
申请的cl_mem
地址为:0x5588e87fead0
,最后一次调用ReleaseOpenCLMemObject()
后它的引用计数为1
,这说明0x5588eb571300
还没被释放而0x5588ea9841f0
又开始申请造成内存重叠。
发现一处问题,上面输出中有如下:
13:51:33 0:1.528520 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:33 0:1.528582 1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
ReleaseOpenCLMemObject(0x5588e87fead0)
调用前后引用计数没有减少。难道clReleaseMemObject()
调用失败了?
原来是因为当对象已经销毁后再调用clGetMemObjectInfo()
将会返回-38
的错误,即CL_INVALID_MEM_OBJECT
。
我可能解决了这个问题,将问题定位在了RunOpenCLBenchmark()
的结尾DestroyImage(resizedImage);
处,即在DestroyCacheInfo()
中应该有清除OpenCL
相关内存的代码。
在IM
中number_channels
成员出现频率有点高,经调试发现IM
中图片对象初始化时通过调用OpenPixelCache()
然后在InitializePixelChannelMap()
中设置number_channels
的值。这个函数的内部大量使用了GM
中没有类型PixelChannel
和PixelTrait
,不太好把它给搬到GM
中。
查看PixelChannel
的定义发现了它的一个特点是:虽然它是enum
类型,但每个成员都被指派了具体的值,且发现有多个成员共用一个值的情况。以此参照仍然没有在GM
中找到类似定义,PixelChannel
的定义:
typedef enum
{
UndefinedPixelChannel = 0,
RedPixelChannel = 0,
CyanPixelChannel = 0,
GrayPixelChannel = 0,
LPixelChannel = 0,
LabelPixelChannel = 0,
YPixelChannel = 0,
aPixelChannel = 1,
GreenPixelChannel = 1,
MagentaPixelChannel = 1,
CbPixelChannel = 1,
bPixelChannel = 2,
BluePixelChannel = 2,
YellowPixelChannel = 2,
CrPixelChannel = 2,
BlackPixelChannel = 3,
AlphaPixelChannel = 4,
IndexPixelChannel = 5,
ReadMaskPixelChannel = 6,
WriteMaskPixelChannel = 7,
MetaPixelChannel = 8,
CompositeMaskPixelChannel = 9,
IntensityPixelChannel = MaxPixelChannels, /* ???? */
CompositePixelChannel = MaxPixelChannels, /* ???? */
SyncPixelChannel = MaxPixelChannels+1 /* not a real channel */
} PixelChannel; /* must correspond to ChannelType */
我模仿IM
的InitializePixelChannelMap()
函数写了calc_image_number_channels()
,虽然number_channels
的值对于同一张测试图片bg1a.jpg
来说均为3
,但是在IM
中值3
显示正确,而在GM
中3 + 1
才能正确,所以我在ComputeResizeImage()
中将calc_image_number_channels()
的返回值加上了1
:
number_channels=(cl_uint) calc_image_number_channels(image)+1;
这只是临时方案,估计下面要更改抄过来的kernel
函数。
在AccelerateResizeImage()
中有这样的一段代码被注释掉了:
// if ((gpuSupportedResizeWeighting(GetResizeFilterWeightingType(
// resizeFilter)) == MagickFalse) ||
// (gpuSupportedResizeWeighting(GetResizeFilterWindowWeightingType(
// resizeFilter)) == MagickFalse))
// return((Image *) NULL);
我认为这段代码可以省略,因为它的目的就是为了检查IM
的ResizeFilter
类型中的ResizeWeightingFunctionType
类型的成员值:
struct _ResizeFilter
{
double
(*filter)(const double,const ResizeFilter *),
(*window)(const double,const ResizeFilter *),
support, /* filter region of support - the filter support limit */
window_support, /* window support, usally equal to support (expert only) */
scale, /* dimension scaling to fit window support (usally 1.0) */
blur, /* x-scale (blur-sharpen) */
coefficient[7]; /* cubic coefficents for BC-cubic filters */
ResizeWeightingFunctionType
filterWeightingType,
windowWeightingType;
size_t
signature;
};
是否在supportedResizeWeighting
数组中:
static const ResizeWeightingFunctionType supportedResizeWeighting[] =
{
BoxWeightingFunction,
TriangleWeightingFunction,
HannWeightingFunction,
HammingWeightingFunction,
BlackmanWeightingFunction,
CubicBCWeightingFunction,
SincWeightingFunction,
SincFastWeightingFunction,
LastWeightingFunction
};
很明显这个数组就是IM
支持GPU
的窗函数集合。在GM
中我只在ResizeImage()
函数中找到相近的定义:
static const FilterInfo
filters[SincFilter+1] =
{
{ Box, 0.0 },
{ Box, 0.0 },
{ Box, 0.5 },
{ Triangle, 1.0 },
{ Hermite, 1.0 },
{ Hanning, 1.0 },
{ Hamming, 1.0 },
{ Blackman, 1.0 },
{ Gaussian, 1.25 },
{ Quadratic, 1.5 },
{ Cubic, 2.0 },
{ Catrom, 2.0 },
{ Mitchell, 2.0 },
{ Lanczos, 3.0 },
{ BlackmanBessel, 3.2383 },
{ BlackmanSinc, 4.0 }
};
明显可见,GM
中处理相对于IM
简单多了,所以上面的代码我仍然保持它被注释状态。
有一个星期没碰了,今天突然发现:
[ysouyno@arch ~]$ export MAGICK_OCL_DEVICE=true
[ysouyno@arch ~]$ gm display ~/temp/bg1a.jpg
Abort was called at 39 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.11.22682/shared/source/built_ins/built_ins.cpp
gm display: abort due to signal 6 (SIGABRT) "Abort"...
Aborted (core dumped)
[ysouyno@arch ~]$ clinfo
Abort was called at 39 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.11.22682/shared/source/built_ins/built_ins.cpp
Aborted (core dumped)
连clinfo
都运行不了了,看来肯定不是我代码的问题。发现intel-compute-runtime
的版本也已经更新了。
这个标题有点长,可能文章的内容也有点长,但是思路越来越清晰。先来看PixelChannelMap
的结构体定义:
typedef struct _PixelChannelMap
{
PixelChannel
channel;
PixelTrait
traits;
ssize_t
offset;
} PixelChannelMap;
PixelChannelMap
在IM
的_Image
结构体中对应成员channel_map
:
PixelChannelMap
*channel_map;
首先要说的是,channel_map
在逐个计算像素的过程中非常重要,拿IM
的resize.c:HorizontalFilter()
函数为例:
for (i=0; i < (ssize_t) GetPixelChannels(image); i++)
{
double
alpha,
gamma,
pixel;
PixelChannel
channel;
PixelTrait
resize_traits,
traits;
ssize_t
j;
ssize_t
k;
channel=GetPixelChannelChannel(image,i);
traits=GetPixelChannelTraits(image,channel);
resize_traits=GetPixelChannelTraits(resize_image,channel);
if ((traits == UndefinedPixelTrait) ||
(resize_traits == UndefinedPixelTrait))
continue;
if (((resize_traits & CopyPixelTrait) != 0) ||
(GetPixelWriteMask(resize_image,q) <= (QuantumRange/2)))
{
j=(ssize_t) (MagickMin(MagickMax(bisect,(double) start),(double)
stop-1.0)+0.5);
k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[j-start].pixel-contribution[0].pixel);
SetPixelChannel(resize_image,channel,p[k*GetPixelChannels(image)+i],
q);
continue;
}
pixel=0.0;
if ((resize_traits & BlendPixelTrait) == 0)
{
/*
No alpha blending.
*/
for (j=0; j < n; j++)
{
k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[j].pixel-contribution[0].pixel);
alpha=contribution[j].weight;
pixel+=alpha*p[k*GetPixelChannels(image)+i];
}
SetPixelChannel(resize_image,channel,ClampToQuantum(pixel),q);
continue;
}
/*
Alpha blending.
*/
gamma=0.0;
for (j=0; j < n; j++)
{
k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[j].pixel-contribution[0].pixel);
alpha=contribution[j].weight*QuantumScale*
GetPixelAlpha(image,p+k*GetPixelChannels(image));
pixel+=alpha*p[k*GetPixelChannels(image)+i];
gamma+=alpha;
}
gamma=PerceptibleReciprocal(gamma);
SetPixelChannel(resize_image,channel,ClampToQuantum(gamma*pixel),q);
}
这个循环中的GetPixelChannels()
函数就是返回number_channels
的值:
static inline size_t GetPixelChannels(const Image *magick_restrict image)
{
return(image->number_channels);
}
即处理每个像素的所有channel
,通过GetPixelChannelChannel()
函数以通道的offset
成员获得对应的channel
:
static inline PixelChannel GetPixelChannelChannel(
const Image *magick_restrict image,const ssize_t offset)
{
return(image->channel_map[offset].channel);
}
返回的channel
是PixelChannel
类型,这个定义在上面的文章中已经给出了,见:“PixelChannel”,看定义我之前还在奇怪为什么enum
类型指定了好多相同的0
值,1
值,现在终于明白了。即比如RGB
和CMYK
两种形式R
和C
都是第0
个通道,G
和M
都是第1
个通道,依次类推。CMYK
中的K
就是black
,在PixelChannel
中对应的是BlackPixelChannel
。
重点就是SetPixelChannel()
这个函数:
static inline void SetPixelChannel(const Image *magick_restrict image,
const PixelChannel channel,const Quantum quantum,
Quantum *magick_restrict pixel)
{
if (image->channel_map[channel].traits != UndefinedPixelTrait)
pixel[image->channel_map[channel].offset]=quantum;
}
这里忽略理解traits
,最后一个参数pixel
就是处理像素后的目标地址,代码中的channel
是通过循环i
获取的,offset
是通过channel
获取的,最终计算出了pixel
的真正地址,然后将计算好的quantum
赋值进去。
最后,channel_map
中的channel
和offset
是在哪里初始化的?我对比了代码发现只在:
static inline void SetPixelChannelAttributes(
const Image *magick_restrict image,const PixelChannel channel,
const PixelTrait traits,const ssize_t offset)
{
if ((ssize_t) channel >= MaxPixelChannels)
return;
if (offset >= MaxPixelChannels)
return;
image->channel_map[offset].channel=channel;
image->channel_map[channel].offset=offset;
image->channel_map[channel].traits=traits;
}
而SetPixelChannelAttributes()
只在InitializePixelChannelMap()
函数中会被调用,InitializePixelChannelMap()
这个函数有点熟悉,在之前了解number_channels
的文章中做过了介绍,这个函数内部计算并初始化了number_channels
的值。
我对GM
和IM
进行了比较,GM
对IM
进行了精简,代码:
for (y=0; y < (long) destination->rows; y++)
{
double
weight;
DoublePixelPacket
pixel;
long
j;
register long
i;
pixel=zero;
if ((destination->matte) || (destination->colorspace == CMYKColorspace))
{
double
transparency_coeff,
normalize;
normalize=0.0;
for (i=0; i < n; i++)
{
j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[i].pixel-contribution[0].pixel);
weight=contribution[i].weight;
transparency_coeff = weight * (1 - ((double) p[j].opacity/TransparentOpacity));
pixel.red+=transparency_coeff*p[j].red;
pixel.green+=transparency_coeff*p[j].green;
pixel.blue+=transparency_coeff*p[j].blue;
pixel.opacity+=weight*p[j].opacity;
normalize += transparency_coeff;
}
normalize = 1.0 / (AbsoluteValue(normalize) <= MagickEpsilon ? 1.0 : normalize);
pixel.red *= normalize;
pixel.green *= normalize;
pixel.blue *= normalize;
q[y].red=RoundDoubleToQuantum(pixel.red);
q[y].green=RoundDoubleToQuantum(pixel.green);
q[y].blue=RoundDoubleToQuantum(pixel.blue);
q[y].opacity=RoundDoubleToQuantum(pixel.opacity);
}
else
{
for (i=0; i < n; i++)
{
j=(long) (y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[i].pixel-contribution[0].pixel));
weight=contribution[i].weight;
pixel.red+=weight*p[j].red;
pixel.green+=weight*p[j].green;
pixel.blue+=weight*p[j].blue;
}
q[y].red=RoundDoubleToQuantum(pixel.red);
q[y].green=RoundDoubleToQuantum(pixel.green);
q[y].blue=RoundDoubleToQuantum(pixel.blue);
q[y].opacity=OpaqueOpacity;
}
if ((indexes != (IndexPacket *) NULL) &&
(source_indexes != (IndexPacket *) NULL))
{
i=Min(Max((long) (center+0.5),start),stop-1);
j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[i-start].pixel-contribution[0].pixel);
indexes[y]=source_indexes[j];
}
}
从上面的代码可看出,在GM
中只处理了matte
通道或CMYKColorsapce
,它们都是四通道的。但是看到代码中的indexes
,不知道这个在GM
中的意义及是否在IM
中有相关对应。
可以在GM
中通过搜索indexes_valid
来找到一些有用的信息:
/*
Indexes are valid if the image storage class is PseudoClass or the
colorspace is CMYK.
*/
cache_info->indexes_valid=((image->storage_class == PseudoClass) ||
(image->colorspace == CMYKColorspace));
原来是这样,这里的PseudoClass
就是pseudocolor
,可以在wiki
的Indexed color
中找到介绍,之所以叫做索引颜色,是因为为了节省内存或磁盘空间,颜色信息不是直接由图片的像素所携带,而是存放在一个单独的颜色表中或者调色板中。
从上面贴出来的IM
和GM
的代码对比发现,下面两段代码类似:
// IM
if (((resize_traits & CopyPixelTrait) != 0) ||
(GetPixelWriteMask(resize_image,q) <= (QuantumRange/2)))
{
j=(ssize_t) (MagickMin(MagickMax(bisect,(double) start),(double)
stop-1.0)+0.5);
k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[j-start].pixel-contribution[0].pixel);
SetPixelChannel(resize_image,channel,p[k*GetPixelChannels(image)+i],
q);
continue;
}
// GM
if ((indexes != (IndexPacket *) NULL) &&
(source_indexes != (IndexPacket *) NULL))
{
i=Min(Max((long) (center+0.5),start),stop-1);
j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
(contribution[i-start].pixel-contribution[0].pixel);
indexes[y]=source_indexes[j];
}
依然参考InitializePixelChannelMap()
的代码,结合刚刚知道的GM
对于PseudoClass
和CMYKColorspace
时indexes
有效,在IM
中则CopyPixelTrait
对应IndexPixelChannel
:
if (image->colorspace == CMYKColorspace)
SetPixelChannelAttributes(image,BlackPixelChannel,trait,n++);
if (image->alpha_trait != UndefinedPixelTrait)
SetPixelChannelAttributes(image,AlphaPixelChannel,CopyPixelTrait,n++);
if (image->storage_class == PseudoClass)
SetPixelChannelAttributes(image,IndexPixelChannel,CopyPixelTrait,n++);
if ((image->channels & ReadMaskChannel) != 0)
SetPixelChannelAttributes(image,ReadMaskPixelChannel,CopyPixelTrait,n++);
if ((image->channels & WriteMaskChannel) != 0)
SetPixelChannelAttributes(image,WriteMaskPixelChannel,CopyPixelTrait,n++);
if ((image->channels & CompositeMaskChannel) != 0)
SetPixelChannelAttributes(image,CompositeMaskPixelChannel,CopyPixelTrait,
n++);
不同的是IM
中CMYKColorspace
没有CopyPixelTrait
特性。
小结一下:以目前的开发状态,将IM
中的CopyPixelTrait
与GM
中的indexes
对应起来。
在“关于IM
中的number_channels
成员(一)”的结尾提到将计算出来的number_channels
值加1
才能显示正确的图形,之前说它是临时方案,看来这次要将它变成永久的了。
number_channels=(cl_uint) calc_image_number_channels(image)+1;
通过详细阅读GM
和IM
的HorizontalFilter()
函数发现IM
的最内层循环是通过:
static inline size_t GetPixelChannels(const Image *magick_restrict image)
{
return(image->number_channels);
}
来获得的,而GM
中没有这么做,它比IM
少了刚刚提到的这一层循环,改为固定设置四个通道的值。代码在前面的笔记中已给出,见:“对IM
的number_channels
及PixelChannelMap
结构体中的channel
和offset
成员的理解”。
我觉得这样的话反而简单了,我需要修改kernel
函数和GM
的自身处理相匹配,或者去掉number_channels
成员,用固定值4
代替?
kernel
函数我相信现在修改它没什么难度,这两天翻来覆去的看accelerate-kernels-private.h:ResizeVerticalFilter()
函数并和HorizontalFilter()
进行对比理解,现在已经胸有成竹了。
今天开始尝试对number_channels
进行处理,我的想法是既然number_channels
在GM
中已经失去了它的意义,那倒不如直接把它删除掉,在它曾经出现过的地方用数字4
代替。当然也不是所有number_channels
的地方都要修改,要看具体情况而定。
首先,我从IM
中重新拷贝了accelerate-kernels-private.h
文件,因为之前那些的修改是基于错误的number_channels
逻辑的。其次number_channels
的传参部分不能删除,因为原GM
中对于图片是否有透明通道或者是否是CMYK
的格式有做判断,因此将它改为matte_or_cmyk
来表明是否是四通道,1
表示有,0
表示没有,这样的话在accelerate-kernels-private.h
的ResizeHorizontalFilter()
函数中alpha_index
就可以删除了。最后在WriteAllChannels()
处的调用,必须以4
传入,因为即使原图不是四通道,它的第四个通道也要赋值以保证图片计算的正确性。
经过这些修改之后测试图片显示正常。
当将图片不断缩小到宽高为1x1
时会出现如下问题:
gm: magick/image.c:1407: DestroyImage: Assertion `image->signature == MagickSignature' failed.
Aborted (core dumped)
这是因为在ComputeResizeImage()
函数中当缩小到1x1
时失败,outputReady
为0
导致DestroyImage(filteredImage);
的调用,但是在销毁filteredImage
后并没有将其赋0
导致。
看了一下GM
的源码:
/*
Free memory and set pointer to NULL
*/
#define MagickFreeMemory(memory) \
{ \
void *_magick_mp=memory; \
MagickFree(_magick_mp); \
memory=0; \
}
这里明明将传入的指针赋0
了。难道这段代码不起作用?
其实这个宏是有作用的,但要看怎么使用它。因为DestroyImage()
是函数调用,实际上传入的指针是一个副本,将副本赋0
并不影响原来的值,同时也要理解宏和函数调用的不同,这里有两种情况需要考虑:
有如下测试代码:
#include <stdio.h>
#include <stdlib.h>
typedef void (*MagickFreeFunc)(void *ptr);
static MagickFreeFunc FreeFunc = free;
void MagickFree(void *memory) {
if (memory != (void *)NULL)
(FreeFunc)(memory);
}
#define MagickFreeMemory(memory) \
{ \
printf("&memory: %p\n", &memory); \
printf(" memory: %p\n", memory); \
void *_magick_mp = memory; \
MagickFree(_magick_mp); \
memory = 0; \
}
void destroy_image(char *image) { MagickFreeMemory(image); }
int main() {
char *image = (char *)malloc(1024);
printf("&image : %p\n", &image);
printf(" image : %p\n", image);
destroy_image(image);
printf("&image : %p\n", &image);
printf(" image : %p\n", image);
return 0;
}
这是GM
中的代码使用方式,输出如下:
% ./a.out
&image : 0x7ffc8a46dfb0
image : 0x55711df782a0
&memory: 0x7ffc8a46df88
memory: 0x55711df782a0
&image : 0x7ffc8a46dfb0
image : 0x55711df782a0
这里指针并没有变化,因为是函数调用,如果将代码中的destroy_image()
改为宏MagickFreeFunc
,则是想要的效果:
#include <stdio.h>
#include <stdlib.h>
typedef void (*MagickFreeFunc)(void *ptr);
static MagickFreeFunc FreeFunc = free;
void MagickFree(void *memory) {
if (memory != (void *)NULL)
(FreeFunc)(memory);
}
#define MagickFreeMemory(memory) \
{ \
printf("&memory: %p\n", &memory); \
printf(" memory: %p\n", memory); \
void *_magick_mp = memory; \
MagickFree(_magick_mp); \
memory = 0; \
}
void destroy_image(char *image) { MagickFreeMemory(image); }
int main() {
char *image = (char *)malloc(1024);
printf("&image : %p\n", &image);
printf(" image : %p\n", image);
MagickFreeMemory(image);
printf("&image : %p\n", &image);
printf(" image : %p\n", image);
return 0;
}
% ./a.out
&image : 0x7ffd7247ae08
image : 0x5572b59cf2a0
&memory: 0x7ffd7247ae08
memory: 0x5572b59cf2a0
&image : 0x7ffd7247ae08
image : (nil)
所以要想解决这个core dumped
的问题,就老老实实地按照GM
的代码风格,调用完DestroyImage()
后再紧接着赋一次0
。
今天安装了最新的intel-compute-runtime
,看来已经修复了core dumped
问题,见:“又一个闪退问题”中提到的问题。
% sudo pacman -Ss intel-compute-runtime
[sudo] password for ysouyno:
community/intel-compute-runtime 22.12.22749-1 [installed]
Intel(R) Graphics Compute Runtime for oneAPI Level Zero and OpenCL(TM) Driver
试运行了一下我的最新代码,发现有opencl
编译错误:
error: use of type 'double' requires cl_khr_fp64 support
先只是简单的将double
换成float
来解决这个问题。
分析IM
的accelerate.c:resizeHorizontalFilter()
的源代码发现它的scale
变量计算后只停留在此函数内,并没有往下传递进kernel
函数,关于scale
的计算代码是不是多余的?从目前我理解到的IM
的逻辑来看,我认为它是多余的。因为向下传递给kernel
函数的是resizeFilterScale
变量,这个变量的值不依赖scale
变量,而是通过传参获取现有的结构体中的值,且它进入kernel
函数ResizeHorizontalFilter()
后通过调用getResizeFilterWeight()
函数再以filterType
获得计算函数来进一步计算scale
值,进而最终返回weight
值。
另外发现在kernel
函数ResizeHorizontalFilter()
的开始部分scale
又被计算了一次,因此我觉得可以确认accelerate.c:resizeHorizontalFilter()
中的scale
变量是多余的。
我在GM
中应该怎么处理呢?考虑到GPU
并行运行的影响,scale
的值不依赖各个work-group
或work-item
。因此我认为将scale
赋值给resizeFilterScale
传进kernel
函数不会影响计算结果,那这样的话kernel
函数中的scale
计算就显得有点多余了。
备注:代码写着写着,发现个严重问题,OpenCL
不支持函数指针,那怎么把过滤函数传进kernel
函数呢?
因为OpenCL
不支持传递函数指针,所以增加了过滤函数的类型参数进行传参,涉及了一系列函数调用的参数修改。
在resizeHorizontalFilter()
内部计算好scale
的值,采用GM
的计算方法,虽然它和IM
的计算方法差不多。将kernel
函数中的scale
计算代码移除,同时核函数ResizeHorizontalFilter()
的support
也通过参数传入,它和scale
一样,计算放在了resizeHorizontalFilter()
中,另发现核函数ResizeHorizontalFilter()
中的resizeFilterBlur
变量已经不再使用。
所有修改见此次commit
的上个commit
,修改代码比较多,但愿没引出新的问题。
不太好给AcquireCriticalMemory()
添加异常处理,GM
中定义好的内存分配失败的异常就那么几个,查找所有调用AcquireCriticalMemory()
的地方,发现有给StringInfo
,有给ImageInfo
,还有给MagickCLCacheInfo
等等分配内存的,在每个调用AcquireCriticalMemory()
的地方抛出异常是可行的,可以使用GM
中已定义好的异常类型,比如StringInfo
可以用UnableToAllocateString
来代替,ImageInfo
可以用UnableToAllocateImage
,MagickCLCacheInfo
可能需要增加一个异常类型;或者在AcquireCriticalMemory()
函数内部处理异常,这正是IM
的处理方式,但是这样的话在AcquireCriticalMemory()
内部不能明确表达出是哪种类型操作产生的异常。当然可以通过增加参数来解决,但是处理起来同样很麻烦。
目前我修改的函数是这样的,固定了它的类型为UnableToAllocateModuleInfo
,没有把此修改放到源代码里,目前仅存在笔记里:
MagickExport void *AcquireCriticalMemory(const size_t len)
{
void
*memory;
// Fail if memory request cannot be fulfilled.
memory=MagickMalloc(len);
if (memory == (void *) NULL)
MagickFatalError3(ResourceLimitFatalError,MemoryAllocationFailed,
UnableToAllocateModuleInfo);
return(memory);
}
了解了一下GM
的异常处理,可以这么来用:
MagickExport void *AcquireCriticalMemory(const size_t len)
{
void
*memory;
ExceptionInfo
exception;
GetExceptionInfo(&exception);
// Fail if memory request cannot be fulfilled.
memory=MagickMalloc(len);
if (memory == (void *) NULL)
ThrowException(&exception,ResourceLimitError,MemoryAllocationFailed,
"AcquireCriticalMemory");
return(memory);
}
但是发现如果memory
为空时ThrowException()
并不能结束掉程序,它最终调用的是ThrowLoggedException()
函数将其记录下来。也可以这么使用:
MagickExport void *AcquireCriticalMemory(const size_t len)
{
void
*memory;
// Fail if memory request cannot be fulfilled.
memory=MagickMalloc(len);
if (memory == (void *) NULL)
MagickFatalError(ResourceLimitFatalError,MemoryAllocationFailed,
"ocl: AcquireCriticalMemory");
return(memory);
}
这是我当前使用的方案,当指针为0
时它结束掉程序,打印出如下的信息:
[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
gm display: Memory allocation failed (ocl: AcquireCriticalMemory) [Resource temporarily unavailable].
如果能把输出的信息弄得再详细点儿就更好了。
值得注意的是在GM
中有MagickFatalError()
,MagickFatalError2()
和MagickFatalError3()
三个功能相似的函数,MagickFatalError()
可以使用字符串做为参数,MagickFatalError2()
也可以使用字符串做为参数,但是它与MagickFatalError()
的具体应用场景还不太了解,MagickFatalError3()
只能使用预定义的异常类型。
我正在处理所有标注了TODO(ocl)
的代码,在DestroyMagickCLCacheInfoAndPixels()
函数里的代码:
// RelinquishMagickResource(MemoryResource,info->length); // TODO(ocl)
DestroyMagickCLCacheInfo(info);
// (void) RelinquishAlignedMemory(pixels); // TODO(ocl)
我这样处理之后:
// RelinquishMagickResource(MemoryResource,info->length);
LiberateMagickResource(MemoryResource,info->length);
DestroyMagickCLCacheInfo(info);
// (void) RelinquishAlignedMemory(pixels);
MagickFreeAlignedMemory(pixels);
程序闪退,打印的信息如下:
$ gm display ~/temp/1.png
gm display: abort due to signal 11 (SIGSEGV) "Segmentation Fault"...
Aborted (core dumped)
确认起因是因为使用了LiberateMagickResource(MemoryResource,info->length);
这行代码。经过调试发现在GetAuthenticOpenCLBuffer()
函数返回NULL
后程序闪退。具体代码是:
if ((cache_info->type != MemoryCache)/* || (cache_info->mapped != MagickFalse) */)
return((cl_mem) NULL);
我这样修改好像不闪退了:
if ((cache_info->type != MemoryCache) || (cache_info->type != MapCache))
return((cl_mem) NULL);
目前尚未理解LiberateMagickResource(MemoryResource,info->length);
的用意,及像上面这样修改会不会引发什么新的问题。
注:必须清除.cache/ImageMagick
里的所有文件才能出现闪退问题,即在跑opencl
的benchmark
时会出现。
理解了一下GM
的AcquireMagickResource()
,LiberateMagickResource()
函数,它们实际上起到监视的作用,没有分配和释放资源的功能,包括InitializeMagickResources()
函数,初始化内存分配的上限,磁盘上限等等,可以通过比如MAGICK_LIMIT_MEMORY
,MAGICK_LIMIT_DISK
等环境变量来设置。
调试发现当在DestroyMagickCLCacheInfoAndPixels()
函数中使用LiberateMagickResource()
后:
// LiberateMagickResource()
case SummationLimit:
{
/*
Limit depends on sum of previous allocations as well as
the currently requested size.
*/
LockSemaphoreInfo(info->semaphore);
info->value-=size;
value=info->value;
UnlockSemaphoreInfo(info->semaphore);
break;
}
中的info->value-=size;
可能会变成负值,这样的话,再次调用AcquireMagickResource()
时可能返回失败,即:
// AcquireMagickResource()
case SummationLimit:
{
/*
Limit depends on sum of previous allocations as well as
the currently requested size.
*/
LockSemaphoreInfo(info->semaphore);
value=info->value+size;
if ((info->maximum != ResourceInfinity) &&
(value > (magick_uint64_t) info->maximum))
{
value=info->value;
status=MagickFail;
}
else
{
info->value=value;
}
UnlockSemaphoreInfo(info->semaphore);
break;
}
这里的if
分支,这样的话,需要找到为什么LiberateMagickResource()
会将info->value
的值搞成负数?
info->value
的值之所以为负数,原因其实很简单,不是AcquireMagickResource()
调少了,就是LiberateMagickResource()
调多了。
最终还是解决了这个问题,这是搬代码过程中自己给自己挖的一个坑,原IM
中的代码是:
// IM's RelinquishPixelCachePixels()
#if defined(MAGICKCORE_OPENCL_SUPPORT)
if (cache_info->opencl != (MagickCLCacheInfo) NULL)
{
cache_info->opencl=RelinquishMagickCLCacheInfo(cache_info->opencl,
MagickTrue);
cache_info->pixels=(Quantum *) NULL;
break;
}
#endif
这里的break;
是关键。
仅运行一次缩放图片的话gm-ocl
的速度远小于gm
,而迭代100
次的话,gm-ocl
速度高于gm
,见:
启用了硬件加速:
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 6.35s user 4.997407s total 20.010 iter/s 15.748 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 5.99s user 4.873903s total 20.517 iter/s 16.694 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.35s user 0.830804s total 1.204 iter/s 2.857 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.30s user 0.136360s total 7.334 iter/s 3.333 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.29s user 0.814550s total 1.228 iter/s 3.448 iter/cpu
[ysouyno@arch gm-ocl]$ echo $MAGICK_OCL_DEVICE
true
[ysouyno@arch gm-ocl]$
没有启用硬件加速:
[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 40.57s user 5.829435s total 17.154 iter/s 2.465 iter/cpu
[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 42.74s user 6.115149s total 16.353 iter/s 2.340 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.31s user 0.057625s total 17.354 iter/s 3.226 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.32s user 0.057751s total 17.316 iter/s 3.125 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.31s user 0.057476s total 17.399 iter/s 3.226 iter/cpu
[ysouyno@arch ~]$ echo $MAGICK_OCL_DEVICE
[ysouyno@arch ~]$
分析:在启用了硬件加速后,gm-ocl
每次都将加载~/.cache/ImageMagick/
中的镜像,读取磁盘文件属于慢操作;而gm
则没有这种加载时间的影响。当迭代100
次时gm-ocl
的加载时间比重就缩小了。
如果改成1000
次的话,似乎gm-ocl
的优势更加明显。
启用了硬件加速:
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 6.02s user 4.814306s total 20.771 iter/s 16.611 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 1000 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1000 iter 59.54s user 43.377261s total 23.054 iter/s 16.795 iter/cpu
[ysouyno@arch gm-ocl]$ echo $MAGICK_OCL_DEVICE
true
[ysouyno@arch gm-ocl]$
没有启用硬件加速:
[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 41.49s user 5.985783s total 16.706 iter/s 2.410 iter/cpu
[ysouyno@arch ~]$ gm benchmark -iterations 1000 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1000 iter 536.90s user 77.881720s total 12.840 iter/s 1.863 iter/cpu
[ysouyno@arch ~]$ echo $MAGICK_OCL_DEVICE
[ysouyno@arch ~]$
刚刚我为GM
增加了AccelerateEvent
,因为我在opencl.c:CompileOpenCLKernel()
中没有找到合适的方法替代IM
的ThrowMagickException()
,目前用日志代替异常(其实这个异常本身也没有结束程序的功能,况且如果这里发生错误的话也没必要结束程序):
status=openCL_library->clBuildProgram(device->program,1,&device->deviceID,
options,NULL,NULL);
if (status != CL_SUCCESS)
{
// (void) ThrowMagickException(exception,GetMagickModule(),DelegateWarning,
// "clBuildProgram failed.","(%d)",(int)status);
(void) LogMagickEvent(AccelerateEvent,GetMagickModule(), // TODO(ocl)
"clBuildProgram failed: %d",(int)status);
LogOpenCLBuildFailure(device,kernel,exception);
return(MagickFalse);
}
如果程序走进这个if
分支中,说明使用硬件加速失败,期望的行为是走原来流程,这样的话就不会因为初始化硬件加速而耽误太长时间。而实际上程序确实花费了一定的时间且走了原有流程。原IM
中也是如此,我觉得这是一个问题。
从IM
中拷贝过来的open_utf8()
,fopen_utf8()
,stat_utf8()
及remove_utf8()
函数直接用非_utf8
的函数代替。IM
在windows
下使用的是宽字符,所以有那样的处理。
之前将lt_dlclose()
函数改成了dlclose()
函数,真是多此一举。因为在windows
下lt_dlclose()
是一个宏,它最终调用FreeLibrary()
。
在linux
下使用lt_dlclose()
需要添加-lltdl
链接选项,发现在IM
中只要使用了--enable-opencl
后运行./configure
就自动添加上了-lltdl
,我想在GM
中也要实现它。
从早上到现在我一直在尝试--enable-opencl
时自动添加上-lltdl
链接选项。我参考了IM
中的configure.ac
中的实现,修改后GM
的configure.ac
的片断:
#
# Optionally check for libltdl if using it is still enabled
#
# Only use/depend on libtdl if we are building modules. This is a
# change from previous releases (prior to 1.3.17) which supported
# loaded modules via libtdl if shared libraries were built. of
# whether modules are built or not.
have_ltdl='no'
LIB_LTDL=''
if test "$build_modules" != 'no' || test "X$no_cl" != 'Xyes'
then
AC_MSG_CHECKING([for libltdl ])
AC_MSG_RESULT()
failed=0
passed=0
AC_CHECK_HEADER([ltdl.h],[passed=`expr $passed + 1`],[failed=`expr $failed + 1`])
AC_CHECK_LIB([ltdl],[lt_dlinit],[passed=`expr $passed + 1`],[failed=`expr $failed + 1`],)
AC_MSG_CHECKING([if libltdl package is complete])
if test $passed -gt 0
then
if test $failed -gt 0
then
AC_MSG_RESULT([no -- some components failed test])
have_ltdl='no (failed tests)'
else
LIB_LTDL='-lltdl'
LIBS="$LIB_LTDL $LIBS"
AC_DEFINE(HasLTDL,1,[Define if using libltdl to support dynamically loadable modules])
AC_MSG_RESULT([yes])
have_ltdl='yes'
fi
else
AC_MSG_RESULT([no])
fi
if test "$have_ltdl" != 'yes'
then
AC_MSG_FAILURE([libltdl is required by modules and OpenCL builds],[1])
fi
fi
AM_CONDITIONAL(WITH_LTDL, test "$have_ltdl" != 'no')
然后在设置MAGICK_DEP_LIBS
值的if
和else
分支中保证都含有$LIB_LTDL
,同时注意no_cl
的变量位置问题,否则上面代码段的no_cl
值为空,导致上面代码段中的if
分支始终能进入。
虽然经过这样的处理可以实现当使用--enable-opencl
时自动加上-lltdl
链接选项,但是引出了一个新的问题,当运行gm
时:
[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
gm display: No decode delegate for this image format (/home/ysouyno/temp/bg1a.jpg).
gm display: Unable to open file (Untitled) [No such file or directory].
[ysouyno@arch gm-ocl]$
经过调查发现,这是由于HasLTDL
宏被启用的缘故。
这里发现另外一个问题,在IM
中也存在这个问题。
虚拟机环境中,有存在cl.h
头文件,但没有libOpenCL.so
的情况,这种情况下安装各种intel
或者mesa
的runtime
均不能配置成功可运行的opencl
的环境(可以从clinfo
的运行结果来看),这样的话,按“关于-lltdl
链接选项(一)”的修改使用--enable-opencl
选项编译GM
和IM
的话,均编译失败:
undefined reference to `lt_dlclose'
因为如果有cl.h
头文件的话,那么HAVE_CL_CL_H
宏将启用,则HAVE_OPENCL
宏也被启用,这样的话lt_dlclose()
就可见了,但是没有opencl
的链接环境,导致no_cl
变量为yes
,则-lltdl
被忽略,从而链接失败,出现上述问题。
#if defined(HAVE_CL_CL_H)
# include <CL/cl.h>
# define HAVE_OPENCL 1
#endif
#if defined(HAVE_OPENCL_CL_H)
# include <OpenCL/cl.h>
# define HAVE_OPENCL 1
#endif
阅读了一下GM
的configure.ac
中关于build_modules
的代码,了解到要在原生的GM
中启用-lltdl
,需要使用如下命令:
$ ./configure --enable-shared --with-modules
这样在lib/GraphicsMagick-1.3.35/module-Q8/coders
目录中生成大量的.la
文件。
我在想,我的要求只是简单的在--enable-opencl
时添加一个链接选项,有必要大动干戈的修改原GM
的libltdl
的编译逻辑吗?我可以在configure.ac
中额外处理no_cl
,而不去启用HasLTDL
宏?这样处理好不好?
赶紧结束吧,这个链接选项不能搞两天呀!既然IM
也有同样的问题,那么就不考虑上面所说的,存在cl.h
头文件,但没有libOpenCL.so
的情况,造成链接失败。比如出现如下提示:
undefined reference to symbol 'dlsym@@GLIBC_2.2.5'
其它的测试看起来一切正常。
另因为lt_dlclose()
也适用于windows
平台,因此得尽快支持该平台,此平台还有好多开发,宏调整等等,得尽快完善起来。
新创建了一个临时分支gm-1.3.35_ocl_win
。
- 为
configure.exe
增加“Enable OpenCL”多选框 - 从“VisualMagick”拷贝
OpenCL/CL
头文件。 vs
报错:error C2004: expected 'defined(id)'
,因为它不支持这样的语法:
#if defined(/*MAGICKCORE_OPENMP_SUPPORT*/HAVE_OPENMP)
- 一些函数的
MagickExport
得去掉,因为重定义。 MAGICKCORE_WINDOWS_SUPPORT
替换为MSWINDOWS
。
问题出在CacheOpenCLKernel()
函数中:
static void CacheOpenCLKernel(MagickCLDevice device,char *filename,
ExceptionInfo *exception)
{
cl_uint
status;
size_t
binaryProgramSize;
unsigned char
*binaryProgram;
status=openCL_library->clGetProgramInfo(device->program,
CL_PROGRAM_BINARY_SIZES,sizeof(size_t),&binaryProgramSize,NULL);
if (status != CL_SUCCESS)
return;
binaryProgram=(unsigned char*) AcquireQuantumMemory(1,binaryProgramSize);
if (binaryProgram == (unsigned char *) NULL)
{
(void) ThrowException(exception,
ResourceLimitError,MemoryAllocationFailed,"CacheOpenCLKernel");
return;
}
status=openCL_library->clGetProgramInfo(device->program,
CL_PROGRAM_BINARIES,sizeof(unsigned char*),&binaryProgram,NULL);
if (status == CL_SUCCESS)
{
(void) LogMagickEvent(AccelerateEvent,GetMagickModule(),
"Creating cache file: \"%s\"",filename);
(void) BlobToFile(filename,binaryProgram,binaryProgramSize,exception);
}
binaryProgram=(unsigned char *) RelinquishMagickMemory(binaryProgram);
}
因为clGetProgramInfo()
的返回值是-30
对应CL_INVALID_VALUE
错误,修改之后可以临时解决:
static void CacheOpenCLKernel(MagickCLDevice device, char* filename,
ExceptionInfo* exception)
{
cl_uint
status;
size_t
* binaryProgramSize,
num_binaries;
unsigned char
** binaryProgram;
status = openCL_library->clGetProgramInfo(device->program, CL_PROGRAM_BINARY_SIZES,
0, 0, &num_binaries);
if (status != CL_SUCCESS)
return;
num_binaries = num_binaries / sizeof(size_t);
binaryProgramSize = (size_t*)malloc(num_binaries * sizeof(size_t));
binaryProgram = (const unsigned char**)calloc(num_binaries, sizeof(unsigned char*));
status = openCL_library->clGetProgramInfo(device->program,
CL_PROGRAM_BINARY_SIZES, num_binaries * sizeof(size_t), binaryProgramSize, NULL);
LogMagickEvent(AccelerateEvent, GetMagickModule(), "clGetProgramInfo return: %d", status);
if (status != CL_SUCCESS)
return;
for (int i = 0; i < num_binaries; ++i) {
if (binaryProgramSize[i]) {
binaryProgram[i] = (unsigned char*)AcquireQuantumMemory(1, binaryProgramSize[i]);
if (binaryProgram[i] == (unsigned char*)NULL)
{
(void)ThrowException(exception,
ResourceLimitError, MemoryAllocationFailed, "CacheOpenCLKernel");
return;
}
status = openCL_library->clGetProgramInfo(device->program,
CL_PROGRAM_BINARIES, num_binaries * sizeof(unsigned char*), binaryProgram, NULL);
if (status == CL_SUCCESS)
{
(void)LogMagickEvent(AccelerateEvent, GetMagickModule(),
"Creating cache file: \"%s\"", filename);
(void)BlobToFile(filename, binaryProgram[i], binaryProgramSize[i], exception);
}
binaryProgram = (unsigned char*)RelinquishMagickMemory(binaryProgram);
}
}
}
在windows
平台下这个问题出现了,之前相同代码在linux
上运行的很好。
经调查发现是因为在DestroyMagickCLCacheInfoAndPixels()
函数中使用的是:
MagickFreeAlignedMemory(pixels);
来清除在pixel_cache.c:OpenCache()
中由:
MagickReallocMemory(PixelPacket *,cache_info->pixels,(size_t) offset);
申请的内存,造成不匹配。
因此DestroyMagickCLCacheInfoAndPixels()
中改为MagickFreeMemory()
来释放内存。这里的做法不同于IM
,在GM
中未使用对齐的内存,就像pixel_cache.c:DestroyCacheInfo()
那样释放内存一样,才不会出问题。
/*
Release Cache Pixel Resources
*/
if (MemoryCache == cache_info->type)
{
#if defined(HAVE_OPENCL)
if (cache_info->opencl != (MagickCLCacheInfo) NULL)
{
cache_info->opencl=RelinquishMagickCLCacheInfo(cache_info->opencl,
MagickTrue);
cache_info->pixels=(Quantum *) NULL;
}
#else
MagickFreeMemory(cache_info->pixels);
LiberateMagickResource(MemoryResource,cache_info->length);
#endif
}
OpenCL
中不使用对齐内存对性能影响很大,这可以作为一个性能优化点。
此外MAGICKCORE_HAVE__ALIGNED_MALLOC
宏应该被替换为HAVE__ALIGNED_MALLOC
。
调试状态下关闭程序可以看到异常出在:
static MagickCLDevice RelinquishMagickCLDevice(MagickCLDevice device)
{
if (device == (MagickCLDevice) NULL)
return((MagickCLDevice) NULL);
device->platform_name=RelinquishMagickMemory(device->platform_name);
device->vendor_name=RelinquishMagickMemory(device->vendor_name);
device->name=RelinquishMagickMemory(device->name);
device->version=RelinquishMagickMemory(device->version);
if (device->program != (cl_program) NULL)
(void) openCL_library->clReleaseProgram(device->program);
while (device->command_queues_index >= 0)
(void) openCL_library->clReleaseCommandQueue(
device->command_queues[device->command_queues_index--]);
RelinquishSemaphoreInfo(&device->lock);
return((MagickCLDevice) RelinquishMagickMemory(device));
}
这里:
if (device->program != (cl_program) NULL)
(void) openCL_library->clReleaseProgram(device->program);
问题很诡异,有时一直出现,有时一直正常。
R6025
- pure virtual function call
Unhandled exception at 0x05E88292 (intelocl32.dll) in IMDisplay.exe: Fatal program exit requested.
Type | IM | GM |
---|---|---|
FUNC | SyncImagePixelCache | SyncImagePixelsEx |
FUNC | OpenPixelCache | OpenCache |
MACRO | MAGICKCORE_QUANTUM_DEPTH | QuantumDepth |
FUNC | GetImagePixelCache | ModifyCache |
FUNC | GetVirtualPixelCacheNexus | AcquireCacheNexus |
FUNC | PersistPixelCache | PersistCache |
MACRO | MAGICKCORE_OPENCL_SUPPORT | HAVE_OPENCL |
MACRO | MAGICKCORE_HAVE__ALIGNED_MALLOC | HAVE__ALIGNED_MALLOC |
MACRO | MAGICKCORE_WINDOWS_SUPPORT | MSWINDOWS |
我在调查R6025
的问题,调试发现LoadOpenCLDevices()
函数中:
number_platforms=0;
if (openCL_library->clGetPlatformIDs(0,NULL,&number_platforms) != CL_SUCCESS)
return;
number_platforms
的返回值在IM
中为1
,而在GM
中却为2
,同样的代码在同一台电脑上为什么返回的值不同呢?排除了编译选项的问题,查看了clGetPlatformIDs()
的函数说明都没有找到问题,最后发现是IM
是vs2017
编译环境,而GM
是vs2022
编译环境,将它们都改用vs2017
环境后调试状态下number_platforms
的值都为1
了。
但是如果number_platforms
为1
的话,在本机电脑上实际上硬件加速是失败的,可是直接运行的话确实是走的硬件加速流程,难道代码中clGetPlatformIDs()
在调试状态和非调试状态下的值不一样?添加日志输出后发现确实是这样,那么总结一下clGetPlatformIDs()
的怪现象:
vs2017
,调试,number_platforms
值为1
。vs2022
,调试,number_platforms
值为2
。vs2017
,运行,number_platforms
值为2
。
运气好,上午修复了这个问题。
过程中也经历了尝试vs2010
编译,尝试更换电脑环境等,均没有找到原因,同时我也在怀疑会不会是IMDisplay.exe
这个外部的测试程序问题引起的?毕竟IM
和GM
的这个IMDisplay.exe
程序相差也是很大的。
同时了解调用OpenCLTerminus()
的所有地方,在nt_base.c:DllMain()
的DLL_PROCESS_DETACH
里面调用DestroyMagick();
是不是不好?
case DLL_PROCESS_DETACH:
DestroyMagick();
break;
另IMDisplay.cpp
中只有InitInstance()
,却没有ExitInstance()
,是不是应该显示调用MagickLib::DestroyMagick();
比交给系统处理DLL_PROCESS_DETACH
更好呢?
值得注意的是如果将IM
的如下代码修改成这样:
BOOL CIMDisplayApp::ExitInstance()
{
// Magick::TerminateMagick();
return CWinApp::ExitInstance();
}
同时保证IM
的magick-baseconfig.h
中的ProvideDllMain
宏启用:
/*
When building ImageMagick using DLLs, include a DllMain()
function which automatically invokes MagickCoreGenesis(NULL), and
MagickCoreTerminus() so that the user doesn't need to. This is disabled
by default.
*/
#define ProvideDllMain
则IM
也同样会出现R6025
的错误。
分析ScaleImage()
函数及考虑参数如何传递:
- 函数中最大的循环是按
Y
垂直方向的,这样每次内循环以X
水平方向进行 - 最大的循环中以两个
if-else
分支为主,分别比较目标宽高是否等于原始宽高,所以可以考虑以两个无符号整形代表(因为kernel
函数不支持布尔类型,1
表示相等,0
表示不等)或者也可以直接传入原始宽高和目标高宽,在kernel
函数中比较是否相等 - 选择传入目标宽高和原始宽高比较好,这样
x_scale
和y_scale
也可以在kernel
函数中计算 - 该函数中有申请可变长的动态数组,按照原始图片的宽度为长度申请堆内存。参考
ResizeImage()
和AccelerateResizeImage()
函数,它们没有用到可变长动态数组,在kernel
函数中申请动态数组明显不合适,因为opencl
不支持动态数组 - 继续参考
ResizeImage()
和AccelerateResizeImage()
函数,用到了三个图片内存传入kernel
函数,分别是imageBuffer
,tempImageBuffer
和filteredImageBuffer
,原以为参数传递错了,为什么在resizeHorizontalFilter()
和resizeVerticalFilter()
函数中只有一个函数用到了filteredImageBuffer
变量? - 仔细分析
resizeHorizontalFilter()
和resizeVerticalFilter()
函数发现它们的第五个参数是输入参数,第九个参数是输出参数,tempImageBuffer
是临时内存,做为第一个函数的输出参数和第二个函数的输入参数,代码如下:
outputReady=resizeHorizontalFilter(device,queue,image,filteredImage,
imageBuffer,matte_or_cmyk,(cl_uint) image->columns,
(cl_uint) image->rows,tempImageBuffer,(cl_uint) resizedColumns,
(cl_uint) image->rows,filter_type,filter_info,blur,
cubicCoefficientsBuffer,xFactor,exception);
if (outputReady == MagickFalse)
goto cleanup;
outputReady=resizeVerticalFilter(device,queue,image,filteredImage,
tempImageBuffer,matte_or_cmyk,(cl_uint) resizedColumns,
(cl_uint) image->rows,filteredImageBuffer,(cl_uint) resizedColumns,
(cl_uint) resizedRows,filter_type,filter_info,blur,
cubicCoefficientsBuffer,yFactor,exception);
if (outputReady == MagickFalse)
goto cleanup;
- 那似乎
ScaleImage()
里用不到这个临时内存,因为函数结构有区别 - 以原图宽高为长度的动态数组是不是可以在调用
kernel
函数之前申请好,用于临时内存?
昨天搞了一天也没有搞出来kernel
函数怎么写,还得仔细分析一下ScaleImage()
函数流程:
- 从
GM
的ScaleImage()
入手,它比IM
好懂 - 大循环的第一个
if-else
分支处理Y
方向,即垂直方向,它用到两个动态数组x_vector
和y_vector
,它们的长度相等,都是原图的宽度,y_scale
小于1
是缩小,y_scale
大于1
是放大,不管是放大还是缩小,都会先读一行原图像素放到x_vector
中 - 如果是缩小的话,各像素乘以
y_scale
的结果存放到y_vector
中,可能会继续读取下一行原图进行累积计算 - 如果是放大的话,会将
y_vector+y_span*x_vector
的结果放到一个临时变量pixel
中,之所以要放到pixel
中是因为要处理计算结果大于255.0
的情况,且可能y_vector
在这里首次被使用,所以它申请内存时必须初始始化为0
,所以它用的是MagickAllocateClearedArray()
函数
y_vector=MagickAllocateClearedArray(DoublePixelPacket *,
image->columns,sizeof(DoublePixelPacket));
pixel
的结果是存到s
中,而s=scanline;
,且scanline=x_vector;
,所以到这里x_vector
存放的是Y
方向的处理结果- 然后这里到第二个
if-else
分支,即处理X
方向,代码同第一个if-else
分支大同小异,但要注意else
,它有一个稍大的循环。最终结果存在t
即scale_scanline
中 scale_scanline
是以一个以目标宽度为长度的动态数组
我尝试写的kernel
函数模仿了ScaleImage()
的很多代码,实际上不能工作,以试着重新理解opencl
的方式,理解work-group
和work-item
,仅有的收获在:
STRINGIFY(
__kernel // __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum *inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4 *outputPixelCache, __local float *densityCache, __local float *gammaCache)
{
const int x=get_global_id(0);
const int y=get_global_id(1);
const unsigned int columns=get_global_size(0);
int cy=y;
float4 pixel=ReadAllChannels(inputImage,4,columns,x,cy);
pixel/=4.5;
WriteAllChannels(filteredImage,4,filteredColumns,
x*filteredColumns/inputColumns,y*filteredRows/inputRows,pixel);
}
)
似乎生成的图片没有变形,我加了pixel/=4.5;
这行代码是为了调试方便,它的效果是使图片变暗。仅此简单的代码也能完成缩放功能(备注:缩小没问题,放大不行),但是WriteAllChannels()
的x
和y
坐标要从work-item
的视角看work-group
,目前只能以x*filteredColumns/inputColumns
和y*filteredRows/inputRows
来代替,以验证我对work-group
和work-item
的理解,从上面的代码看,我似乎理解了一些。
参照下面的IM
代码理解:
__kernel void Contrast(__global CLQuantum *image,
const unsigned int number_channels,const int sign)
{
const int x=get_global_id(0);
const int y=get_global_id(1);
const unsigned int columns=get_global_size(0);
float4 pixel=ReadAllChannels(image,number_channels,columns,x,y);
if (number_channels < 3)
pixel.y=pixel.z=pixel.x;
pixel=ConvertRGBToHSB(pixel);
float brightness=pixel.z;
brightness+=0.5f*sign*(0.5f*(sinpi(brightness-0.5f)+1.0f)-brightness);
brightness=clamp(brightness,0.0f,1.0f);
pixel.z=brightness;
pixel=ConvertHSBToRGB(pixel);
WriteAllChannels(image,number_channels,columns,x,y,pixel);
}
)
此外我觉得没有必要学AccelerateResizeImage()
函数去增加filteredImageBuffer
变量,可以学IM
的AccelerateContrastImage()
函数,在ComputeContrastImage()
中直接调用kernel
函数,这样可以少一层函数调用。
在“如何写ScaleImage()
的硬件加速函数(二)”中介绍的kernel
函数的写法可能会产生如下现象:
ScaleFilter()
不是总能被成功调用- 每次修改过
ScaleFilter()
后,有时在~.cache/GraphicsMagick
目录中不会生成新的.bin
文件 - 这种情况下,调试发现在
getOpenCLEnvironment(exception);
处就失败返回
MagickPrivate Image *AccelerateScaleImage(const Image *image,
const size_t scaledColumns,const size_t scaledRows,
ExceptionInfo *exception)
{
Image
*filteredImage;
MagickCLEnv
clEnv;
assert(image != NULL);
assert(exception != (ExceptionInfo *) NULL);
if (checkAccelerateCondition(image) == MagickFalse)
return((Image *) NULL);
clEnv=getOpenCLEnvironment(exception);
if (clEnv == (MagickCLEnv) NULL)
return((Image *) NULL);
filteredImage=ComputeScaleImage(image,clEnv,scaledColumns,scaledRows,
exception);
return(filteredImage);
}
- 重启电脑似乎不能校正这种问题,但第二天开机这个问题就没有了,难道我的
ScaleFilter()
函数让CPU
或者GPU
内部错乱了? - 没添加额外调试输出前,没有找到任何异常日志
- 忘说了一个关键问题,这两天电脑已经发现死机两次,包括今天早上这次,刚输入完密码回车后就死机
经过这两天的尝试,越来越对ScaleImage()
用硬件加速实现这件事感到怀疑,因为似乎没有发现这个函数的硬件加速版本能带来很好的性能,当然我这个OpenCL
新手写的代码连我自己也不敢恭维,这也是一方面的原因,甚至可能占比很高。
正如前面日志所说的能参考的代码只有ResizeHorizontalFilter()
和ResizeVerticalFilter()
,但是这要修改accelerate.c:scaleFilter()
函数,在调用EnqueueOpenCLKernel()
的地方要传入lsize
参数,而不是现在的NULL
,且可能要增加:
// for ResizeHorizontalFilter()
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
或者:
// for ResizeVerticalFilter()
__kernel __attribute__((reqd_work_group_size(1, 256, 1)))
否则程序将卡死在ScaleImage()
的硬件加速函数ScaleFilter()
中。
此外在“如何写ScaleImage()
的硬件加速函数(二)”中ScaleFilter()
的简单实现中,对于缩小操作可以达到效果,但放大操作的图片像个筛子,所以我这个基础上我“完善”了一下,缩放后的效果还能勉强看出原图的轮廓,虽然效果不好但至少不是筛子了,尝试的代码如下:
// 大方块
{
const int x = get_global_id(0);
const int y = get_global_id(1);
const unsigned int columns = get_global_size(0);
float4 pixel = ReadAllChannels(inputImage, 4, columns, x, y);
int num_x = get_global_size(0) / get_local_size(0);
int num_y = get_global_size(1) / get_local_size(1);
int dst_local_w = (filteredColumns + num_x - 1) / num_x;
int dst_local_h = (filteredRows + num_y - 1) / num_y;
for (int j = 0; j < dst_local_h; ++j)
for (int i = 0; i < dst_local_w; ++i) {
// pixel_res/=4.5;
int filtered_x = x / get_local_size(0) * dst_local_w + i;
int filtered_y = y / get_local_size(1) * dst_local_h + j;
WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, pixel);
}
}
或者可以用它的“升级”版本:
// 比大方块好点儿
{
const int x = get_global_id(0);
const int y = get_global_id(1);
const unsigned int columns = get_global_size(0);
int cx = x;
int cy = y;
float4 pixel = ReadAllChannels(inputImage, 4, columns, cx, cy);
int num_x = get_global_size(0) / get_local_size(0);
int num_y = get_global_size(1) / get_local_size(1);
int dst_local_w = (filteredColumns + num_x - 1) / num_x;
int dst_local_h = (filteredRows + num_y - 1) / num_y;
for (int j = 0; j < dst_local_h; ++j) {
int filtered_y = y / get_local_size(1) * dst_local_h + j;
if (fabs((float)filtered_y / filteredRows - (float)y / inputRows) < 0.1) {
cy++;
}
for (int i = 0; i < dst_local_w; ++i) {
int filtered_x = x / get_local_size(0) * dst_local_w + i;
if (fabs((float)filtered_x / filteredColumns - (float)x / inputColumns) < 0.1) {
pixel = ReadAllChannels(inputImage, 4, columns, cx++, cy);
}
WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, pixel);
}
}
}
在我心里,对参考ResizeHorizontalFilter()
和ResizeVerticalFilter()
来实现ScaleFilter()
有一种执着,花了许多时间,也总结了遇到的一些坑:
- 如果不修改
EnqueueOpenCLKernel()
函数的调用方式的话,可能在某个时刻电脑就死机了 - 只使用
ResizeHorizontalFilter()
和ResizeVerticalFilter()
其中一个函数的代码来改写,似乎达不到效果,最好的效果是水平方向已缩小,垂直方向只显示了原图的上半部 - 好像我是这么修改的可以达到上面第二点提到的最好效果:
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements * 2, 0);
wait_group_events(1, &e);
for (unsigned int i = startStep; i < stopStep * 2; i++, cacheIndex++)
如果要参考其中之一的话,估计渐渐改着改着会发现我需要用到两个函数,会是ScaleHorizontalFilter()
和ScaleVerticalFilter()
,然后到最后会是一版不一样的ResizeImage()
的硬件加速版本,这样的话,意义在哪里?
所以我昨天下午又换回了自己来写,目前放大缩小操作可以实现,但效果差,会像好多小方格拼成的图片那样,且缩放速度相比原函数慢好多。另我知道的还有一个小问题,即缩小后的图片宽度比显示部分要宽,代码先贴在这里,因为有新任务要处理。
修改了accelerate.c
的scaleFilter()
函数:
static MagickBooleanType scaleFilter(MagickCLDevice device,
cl_command_queue queue,const Image *image,Image *filteredImage,
cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
ExceptionInfo *exception)
{
cl_kernel
scaleKernel;
cl_int
status;
const unsigned int
workgroupSize = 256;
float
scale=1.0;
int
numCachedPixels;
MagickBooleanType
outputReady;
size_t
gsize[2],
i,
imageCacheLocalMemorySize,
lsize[2],
totalLocalMemorySize,
x_vector,
y_vector,
y_volumes;
unsigned int
chunkSize,
pixelPerWorkgroup;
scaleKernel=NULL;
outputReady=MagickFalse;
scale=1.0/scale; // TODO(ocl)
if (scaledColumns < workgroupSize)
{
chunkSize=32;
pixelPerWorkgroup=32;
}
else
{
chunkSize=workgroupSize;
pixelPerWorkgroup=workgroupSize;
}
DisableMSCWarning(4127)
while(1)
RestoreMSCWarning
{
/* calculate the local memory size needed per workgroup */
numCachedPixels=(int) pixelPerWorkgroup;
imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
totalLocalMemorySize=imageCacheLocalMemorySize;
/* local size for the pixel accumulator */
x_vector=chunkSize*sizeof(cl_float4);
totalLocalMemorySize+=x_vector;
/* local memory size for the weight accumulator */
y_vector=chunkSize*sizeof(cl_float4);
totalLocalMemorySize+=y_vector;
/* local memory size for the gamma accumulator */
y_volumes =chunkSize*sizeof(float);
totalLocalMemorySize+=y_volumes;
if (totalLocalMemorySize <= device->local_memory_size)
break;
else
{
pixelPerWorkgroup=pixelPerWorkgroup/2;
chunkSize=chunkSize/2;
if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
{
/* quit, fallback to CPU */
goto cleanup;
}
}
}
scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
if (scaleKernel == (cl_kernel) NULL)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
goto cleanup;
}
i=0;
status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
status|=SetOpenCLKernelArg(scaleKernel,i++,x_vector,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,y_vector,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,y_volumes,NULL);
if (status != CL_SUCCESS)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
goto cleanup;
}
gsize[0]=image->columns;
gsize[1]=image->rows;
outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
(const size_t *) NULL,gsize,(Image *)NULL,image,filteredImage,MagickFalse,
exception);
cleanup:
if (scaleKernel != (cl_kernel) NULL)
ReleaseOpenCLKernel(scaleKernel);
return(outputReady);
}
和accelerate-kernels-private.h
的ScaleFilter()
函数:
STRINGIFY(
__kernel // __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum *inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4 *x_vector, __local float4 *y_vector, __local float *y_volumes)
{
if (get_local_size(0) > pixelChunkSize)
return;
const int x = get_global_id(0);
const int y = get_global_id(1);
const unsigned int columns = get_global_size(0);
int cx = x;
int cy = y;
for (int i = 0; i < pixelChunkSize; ++i) {
x_vector[i] = (float4)0.0;
y_vector[i] = (float4)0.0;
y_volumes[i] = 0.0;
}
float4 pixel1 = ReadAllChannels(inputImage, 4, columns, x, y);
int num_x = get_global_size(0) / get_local_size(0);
int num_y = get_global_size(1) / get_local_size(1);
int dst_local_w = (filteredColumns + num_x - 1) / num_x;
int dst_local_h = (filteredRows + num_y - 1) / num_y;
int startx = x;
int stopx = MagickMin(startx + pixelChunkSize, inputColumns);
float y_scale = (float)filteredRows / inputRows;
float y_span = 1.0;
int next_row = 1;
__local float4* s = x_vector;
float4 pixel = 0.0;
float factor = 0.0;
int next_col = 0;
float x_scale = 0.0;
float x_span = 1.0;
float x_volume = 0.0;
float4 result[256];
float4* t = result;
for (int j = 0; j < dst_local_h; ++j) {
if (filteredRows == inputRows) {
for (int i = 0; i < get_local_size(0); ++i) {
x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
}
}
else {
while (y_scale < y_span) {
if (next_row) {
for (int i = 0; i < get_local_size(0); ++i) {
x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
}
cy++;
}
for (int i = 0; i < get_local_size(0); ++i) {
if (x_vector[i].w < 255.0)
y_volumes[i] += y_scale;
y_vector[i] += y_scale * x_vector[i];
}
y_span -= y_scale;
y_scale = (float)filteredRows / inputRows;
next_row = 1;
}
if (next_row) {
for (int i = 0; i < get_local_size(0); ++i) {
x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
}
cy++;
next_row = 0;
}
for (int i = 0; i < get_local_size(0); ++i) {
if (x_vector[i].w < 255.0)
y_volumes[x] += y_span;
pixel = y_vector[i] + y_span * x_vector[i];
if (y_volumes[i] > 0.0 && y_volumes[i] < 1.0) {
factor = 1 / y_volumes[i];
pixel *= factor;
}
s->x = pixel.x > 255.0 ? 255.0 : pixel.x;
s->y = pixel.y > 255.0 ? 255.0 : pixel.y;
s->z = pixel.z > 255.0 ? 255.0 : pixel.z;
s->w = pixel.w > 255.0 ? 255.0 : pixel.w;
s++;
y_vector[i] = 0.0;
y_volumes[i] = 0.0;
}
y_scale -= y_span;
if (y_scale < 0) {
y_scale = (float)filteredRows / inputRows;
next_row = 1;
}
y_span = 1.0;
}
if (filteredColumns == inputColumns) {
//
}
else {
pixel = 0.0;
s = x_vector;
for (int i = 0; i < get_local_size(0); ++i) {
x_scale = (float)filteredColumns / inputColumns;
while (x_scale >= x_span) {
if (next_col) {
if (x_volume < 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
*t *= factor;
}
x_volume = 0.0;
pixel = 0.0;
t++;
}
if (s->w < 255.0)
x_volume += x_span;
pixel += x_span * *s;
t->x = pixel.x > 255.0 ? 255.0 : pixel.x;
t->y = pixel.y > 255.0 ? 255.0 : pixel.y;
t->z = pixel.z > 255.0 ? 255.0 : pixel.z;
t->w = pixel.w > 255.0 ? 255.0 : pixel.w;
x_scale -= x_span;
x_span = 1.0;
next_col = 1;
}
if (x_scale > 0.0) {
if (next_col) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
*t *= factor;
}
x_volume = 0.0;
pixel = 0.0;
next_col = 0;
t++;
}
if (s->w < 255.0)
x_volume += x_scale;
pixel += x_scale * *s;
x_span -= x_scale;
}
s++;
}
if (x_span > 0.0) {
s--;
if (s->w < 255.0)
x_volume += x_scale;
pixel += x_span * *s;
}
if (!next_col && ((t - result) < filteredColumns)) {
t->x = pixel.x > 255.0 ? 255.0 : pixel.x;
t->y = pixel.y > 255.0 ? 255.0 : pixel.y;
t->z = pixel.z > 255.0 ? 255.0 : pixel.z;
t->w = pixel.w > 255.0 ? 255.0 : pixel.w;
}
t = result;
}
for (int i = 0; i < dst_local_w; ++i) {
int filtered_x = x / get_local_size(0) * dst_local_w + i;
int filtered_y = y / get_local_size(1) * dst_local_h + j;
WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, *(t + i));
}
}
}
)
晚上做梦都在一直想这事儿,早上花了一个多小时,小有成果。
这里是参考ResizeHorizontalFilter()
,居然把之前没有想明白的一些代码整清楚了:
accelerate.c:resizeHorizontalFilter()
中传参gsize
和lsize
的地方,是拿目标宽高进行计算的,我的脑海中却一直用原始宽高去理解
gsize[0]=(resizedColumns+pixelPerWorkgroup-1)/pixelPerWorkgroup*
workgroupSize;
gsize[1]=resizedRows;
lsize[0]=workgroupSize;
lsize[1]=1;
outputReady=EnqueueOpenCLKernel(queue,horizontalKernel,2,
(const size_t *) NULL,gsize,lsize,image,filteredImage,MagickFalse,
exception);
- 这里的用意是在
kernel
函数ResizeHorizontalFilter()
中可以方便的使用get_global_id(0)
和get_local_id(0)
,不用像我在“如何写ScaleImage()
的硬件加速函数(四)”中那样去计算dst_local_w
和filtered_x
了 - 但是似乎这里引出了另外一个问题,那原始宽高怎么用于计算呢?答案就是
get_group_id(0)
贴上目前开发一半的代码,效果是:
- 原图缩小一倍,水平方向显示原图全部,但被压缩一半;垂直方向显示原图上半部
- 原图放大一倍,垂直方向显示原图全部,但被压缩一半;水平方向显示原图全部,但被拉长一倍
附scaleFilter()
和ScaleFilter()
代码:
static MagickBooleanType scaleFilter(MagickCLDevice device,
cl_command_queue queue,const Image *image,Image *filteredImage,
cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
ExceptionInfo *exception)
{
cl_kernel
scaleKernel;
cl_int
status;
const unsigned int
workgroupSize = 256;
float
scale=1.0;
int
numCachedPixels;
MagickBooleanType
outputReady;
size_t
gammaAccumulatorLocalMemorySize,
gsize[2],
i,
imageCacheLocalMemorySize,
pixelAccumulatorLocalMemorySize,
lsize[2],
totalLocalMemorySize,
weightAccumulatorLocalMemorySize;
unsigned int
chunkSize,
pixelPerWorkgroup;
int
scale_ratio = 8; // related to the upper limit of zoom in?
scaleKernel=NULL;
outputReady=MagickFalse;
scale=1.0/scale; // TODO(ocl)
if (scaledColumns < workgroupSize)
{
chunkSize=32;
pixelPerWorkgroup=32;
}
else
{
chunkSize=workgroupSize;
pixelPerWorkgroup=workgroupSize;
}
DisableMSCWarning(4127)
while(1)
RestoreMSCWarning
{
/* calculate the local memory size needed per workgroup */
numCachedPixels=(int) pixelPerWorkgroup;
imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
totalLocalMemorySize=imageCacheLocalMemorySize;
/* local size for the pixel accumulator */
pixelAccumulatorLocalMemorySize=chunkSize*sizeof(cl_float4);
totalLocalMemorySize+=pixelAccumulatorLocalMemorySize;
/* local memory size for the weight accumulator */
weightAccumulatorLocalMemorySize=chunkSize*sizeof(float);
totalLocalMemorySize+=weightAccumulatorLocalMemorySize;
/* local memory size for the gamma accumulator */
gammaAccumulatorLocalMemorySize=chunkSize*sizeof(float);
totalLocalMemorySize+=gammaAccumulatorLocalMemorySize;
if (totalLocalMemorySize <= device->local_memory_size)
break;
else
{
pixelPerWorkgroup=pixelPerWorkgroup/2;
chunkSize=chunkSize/2;
if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
{
/* quit, fallback to CPU */
goto cleanup;
}
}
}
scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
if (scaleKernel == (cl_kernel) NULL)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
goto cleanup;
}
i=0;
status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
status|=SetOpenCLKernelArg(scaleKernel,i++,pixelAccumulatorLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,weightAccumulatorLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,gammaAccumulatorLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&scale_ratio);
if (status != CL_SUCCESS)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
goto cleanup;
}
gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
workgroupSize;
gsize[1] = scaledRows;
lsize[0] = workgroupSize;
lsize[1] = 1;
outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
(const size_t *) NULL,gsize,(Image *)NULL,image,filteredImage,MagickFalse,
exception);
cleanup:
if (scaleKernel != (cl_kernel) NULL)
ReleaseOpenCLKernel(scaleKernel);
return(outputReady);
}
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum *inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache,
const int scale_ratio)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4 * scale_ratio;
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
wait_group_events(1, &e);
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
filteredPixel += cp;
}
}
}
if (itemID < actualNumPixelInThisChunk)
{
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
不管什么事儿看来都怕琢磨,如果连做梦都能梦到你正在琢磨的事儿,估计离成功也就不远了。似乎目前已经达到了最好的效果,离目标越来越近了。
- 要理解
clEnqueueNDRangeKernel()
函数的第五第六个参数意义,但目前为止只能说暂时理解了
cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
- 第五第六个参数要结合
__attribute__
,否则无法调用kernel
函数
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
- 来回理解
ResizeImage()
和ScaleImage()
函数的实现,对它们的内部逻辑流程了如指掌了可以说 - 我选择以
ResizeHorizontalFilter()
为模板修改,之所以ResizeImage()
处理速度慢,因为它的处理效果好,且有多种过滤效果可供选择,ResizeHorizontalFilter()
是处理水平方向缩放,所以它将读入一整行原图像素,这正好和ScaleImage()
的最内层循环处理方式相同 ResizeHorizontalFilter()
的最内层循环(如下),因为有累加操作,所以这正是处理水平缩放的操作
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
/* float weight = getResizeFilterWeight(resizeFilterCubicCoefficients,
(ResizeWeightingFunctionType) resizeFilterType,
(ResizeWeightingFunctionType) resizeWindowType,
resizeFilterScale, resizeFilterWindowSupport,
resizeFilterBlur, scale*(start + i - bisect + 0.5)); */
float weight = getResizeFilterWeightForGM(resizeFilterType,
scale*(start + i - bisect + 0.5), support);
float4 cp = (float4)0.0f;
__local CLQuantum *p = inputImageCache + (cacheIndex*4);
cp.x = (float) *(p);
cp.y = (float) *(p + 1);
cp.z = (float) *(p + 2);
if (matte_or_cmyk != 0)
{
cp.w = (float) *(p + 3);
// float alpha = weight * QuantumScale * cp.w;
// error: use of type 'double' requires cl_khr_fp64 support
float alpha = weight * (1 - (float) cp.w / 255);
filteredPixel.x += alpha * cp.x;
filteredPixel.y += alpha * cp.y;
filteredPixel.z += alpha * cp.z;
filteredPixel.w += weight * cp.w;
gamma += alpha;
}
else
filteredPixel += ((float4) weight)*cp;
density += weight;
}
- “如何写
ScaleImage()
的硬件加速函数(五)”的问题在于没有办法处理图片下半部分(如何缩小一半的话),这里主要是因为y
变量的限定(代码如下),因为传入kernel
函数的gsize[1]=resizedRows;
被限定的死死的
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y);
- 目前只考虑按比例缩放,所以这里的
y
需要除以缩放比 - 缩小后图片如果垂直方向相间着黑色宽竖条,那可能是因为
numCachedPixels
参数没有计算正确,这正可以修复“如何写ScaleImage()
的硬件加速函数(五)”中的scale_ratio
变量。
numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/xFactor+2*support);
- 附上目前代码:
static MagickBooleanType scaleFilter(MagickCLDevice device,
cl_command_queue queue,const Image *image,Image *filteredImage,
cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
ExceptionInfo *exception)
{
cl_kernel
scaleKernel;
cl_int
status;
const unsigned int
workgroupSize = 256;
float
scale;
int
numCachedPixels;
MagickBooleanType
outputReady;
size_t
gammaAccumulatorLocalMemorySize,
gsize[2],
i,
imageCacheLocalMemorySize,
pixelAccumulatorLocalMemorySize,
lsize[2],
totalLocalMemorySize,
weightAccumulatorLocalMemorySize;
unsigned int
chunkSize,
pixelPerWorkgroup;
scaleKernel=NULL;
outputReady=MagickFalse;
scale=(float) scaledColumns/columns; // TODO(ocl)
if (scaledColumns < workgroupSize)
{
chunkSize=32;
pixelPerWorkgroup=32;
}
else
{
chunkSize=workgroupSize;
pixelPerWorkgroup=workgroupSize;
}
DisableMSCWarning(4127)
while(1)
RestoreMSCWarning
{
/* calculate the local memory size needed per workgroup */
numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/scale+2*(0.5+MagickEpsilon)); // TODO(ocl)
imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
totalLocalMemorySize=imageCacheLocalMemorySize;
/* local size for the pixel accumulator */
pixelAccumulatorLocalMemorySize=chunkSize*sizeof(cl_float4);
totalLocalMemorySize+=pixelAccumulatorLocalMemorySize;
/* local memory size for the weight accumulator */
weightAccumulatorLocalMemorySize=chunkSize*sizeof(float);
totalLocalMemorySize+=weightAccumulatorLocalMemorySize;
/* local memory size for the gamma accumulator */
gammaAccumulatorLocalMemorySize=chunkSize*sizeof(float);
totalLocalMemorySize+=gammaAccumulatorLocalMemorySize;
if (totalLocalMemorySize <= device->local_memory_size)
break;
else
{
pixelPerWorkgroup=pixelPerWorkgroup/2;
chunkSize=chunkSize/2;
if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
{
/* quit, fallback to CPU */
goto cleanup;
}
}
}
scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
if (scaleKernel == (cl_kernel) NULL)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
goto cleanup;
}
i=0;
status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
status|=SetOpenCLKernelArg(scaleKernel,i++,pixelAccumulatorLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,weightAccumulatorLocalMemorySize,NULL);
status|=SetOpenCLKernelArg(scaleKernel,i++,gammaAccumulatorLocalMemorySize,NULL);
if (status != CL_SUCCESS)
{
(void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
goto cleanup;
}
gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
workgroupSize;
gsize[1] = scaledRows;
lsize[0] = workgroupSize;
lsize[1] = 1;
outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
(const size_t *) NULL,gsize,lsize,image,filteredImage,MagickFalse,
exception);
cleanup:
if (scaleKernel != (cl_kernel) NULL)
ReleaseOpenCLKernel(scaleKernel);
return(outputReady);
}
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum *inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4 *outputPixelCache, __local float *densityCache, __local float *gammaCache)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
wait_group_events(1, &e);
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
filteredPixel += cp;
}
}
}
if (itemID < actualNumPixelInThisChunk)
{
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
其实“如何写ScaleImage()
的硬件加速函数(六)”的实现就是一个ResizeHorizontalFilter()
将y
改成y / xFactor
的精简版,并不是ScaleImage()
的硬件加速函数。虽然它不是,但至少省掉了ResizeVerticalFilter()
的调用,速度上更快了。
但是目前发现的问题还是竖条纹,连续多次缩小一倍,最终图片被黑色竖条纹全部覆盖住,不断缩小或者放大,右侧会出现密集竖条纹,等等等的问题啦。
经过分析,黑色竖纹的产生原因是因为kernel
函数ScaleFilter()
的最内层的循环没有执行,导致将初始值0.0f
赋进了目标地址。
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
filteredPixel += cp;
}
可以这样解决:
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum* inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
wait_group_events(1, &e);
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
filteredPixel += cp;
}
}
}
if (itemID < actualNumPixelInThisChunk) {
outputPixelCache[itemID] = (float4)0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE);
for (unsigned int i = 0; i < numItems; i++) {
if (pixelIndex != -1) {
if (itemID % numItems == i) {
outputPixelCache[pixelIndex] += filteredPixel;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (itemID < actualNumPixelInThisChunk)
{
float4 filteredPixel = outputPixelCache[itemID];
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
测试了一下性能,感觉提升不少(原图缩小一半,共三次操作,原图连续放大一倍两次,共三次操作):
ScaleImage()
加速版本:
20220428104719 0:3.229821 1.672 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104719 0:3.230185 1.672 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 1360
20220428104725 0:9.628057 1.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104725 0:9.628288 1.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104732 0:16.078872 2.234 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104732 0:16.079057 2.234 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104740 0:24.253815 2.484 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104740 0:24.254118 2.484 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104749 0:33.888819 2.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104749 0:33.889007 2.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104752 0:36.173104 3.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104752 0:36.173301 3.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 156
20220428104800 0:44.287153 3.469 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104800 0:44.287372 3.469 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 47
20220428104801 0:45.546271 3.656 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104801 0:45.546588 3.656 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 140
20220428104806 0:49.973027 4.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:49.973217 4.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104806 0:50.640522 4.250 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:50.640730 4.250 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 141
ScaleImage()
原先版本:
20220428104934 0:1.982873 0.266 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104934 0:2.040677 0.328 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104940 0:7.854823 0.578 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104940 0:7.913365 0.625 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 47
20220428104944 0:11.896725 0.875 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104944 0:11.956722 0.938 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104951 0:18.070817 1.219 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104951 0:18.378405 1.516 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 297
20220428104952 0:19.394056 1.531 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104953 0:20.634341 2.781 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1250
20220428104958 0:25.534006 3.063 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104958 0:25.836584 3.375 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 312
20220428104959 0:26.729520 3.406 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105000 0:27.930533 4.609 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1203
20220428105011 0:38.879392 5.438 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105012 0:39.210382 5.766 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 328
20220428105012 0:39.872525 5.797 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105014 0:41.176969 7.094 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1297
我觉得Y
方向的缩放以下面这种ScaleFilter()
的方法是实现不了的,我只能添加进X
方向的处理,缩小正常,放大的话图片变亮。
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum* inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
wait_group_events(1, &e);
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
float x_scale = (float)filteredColumns / inputColumns;
float x_span = 1.0;
float x_volume = 0.0;
float factor = 0.0;
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
while (x_scale >= x_span) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
filteredPixel.x *= factor;
filteredPixel.y *= factor;
filteredPixel.z *= factor;
}
if (cp.w < 255.0) {
x_volume += x_span;
}
filteredPixel += x_span * cp;
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
x_scale -= x_span;
x_span = 1.0;
}
if (x_scale > 0.0) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
filteredPixel.x *= factor;
filteredPixel.y *= factor;
filteredPixel.z *= factor;
}
if (cp.w < 255.0)
x_volume += x_scale;
filteredPixel += x_scale * cp;
x_span -= x_scale;
}
if (x_span > 0.0) {
if (cp.w < 255.0)
x_volume += x_span;
filteredPixel += x_span * cp;
}
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
}
}
}
if (itemID < actualNumPixelInThisChunk) {
outputPixelCache[itemID] = (float4)0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE);
for (unsigned int i = 0; i < numItems; i++) {
if (pixelIndex != -1) {
if (itemID % numItems == i) {
outputPixelCache[pixelIndex] += filteredPixel;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (itemID < actualNumPixelInThisChunk)
{
float4 filteredPixel = outputPixelCache[itemID];
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
这么快在“如何写ScaleImage()
的硬件加速函数(八)”的基础上就解决了放大后图片变亮的问题,目前看效果还可以。
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum* inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
wait_group_events(1, &e);
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
unsigned int local_idx = itemID;
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
if (itemID < actualNumPixelInThisChunk) {
outputPixelCache[itemID] = (float4)0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE);
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
float x_scale = (float)filteredColumns / inputColumns;
float x_span = 1.0;
float x_volume = 0.0;
float factor = 0.0;
unsigned next_column = 0;
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
while (x_scale >= x_span) {
if (next_column) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
outputPixelCache[local_idx].x *= factor;
outputPixelCache[local_idx].y *= factor;
outputPixelCache[local_idx].z *= factor;
}
x_volume = 0.0;
filteredPixel = 0.0;
local_idx++;
}
if (cp.w < 255.0) {
x_volume += x_span;
}
filteredPixel += x_span * cp;
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
x_scale -= x_span;
x_span = 1.0;
next_column = 1;
}
if (x_scale > 0.0) {
if (next_column) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
outputPixelCache[local_idx].x *= factor;
outputPixelCache[local_idx].y *= factor;
outputPixelCache[local_idx].z *= factor;
}
x_volume = 0.0;
filteredPixel = 0.0;
next_column = 0;
local_idx++;
}
if (cp.w < 255.0)
x_volume += x_scale;
filteredPixel += x_scale * cp;
x_span -= x_scale;
}
if (x_span > 0.0) {
if (cp.w < 255.0)
x_volume += x_span;
filteredPixel += x_span * cp;
}
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
}
}
}
for (unsigned int i = 0; i < numItems; i++) {
if (pixelIndex != -1) {
if (itemID % numItems == i) {
outputPixelCache[pixelIndex] += filteredPixel;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (itemID < actualNumPixelInThisChunk)
{
float4 filteredPixel = outputPixelCache[itemID];
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
难道就这么被我轻松实现了?
“如何写ScaleImage()
的硬件加速函数(九)”是在“如何写ScaleImage()
的硬件加速函数(八)”的基础上处理了图片放大变亮的问题,但是他们都只是X
方向的处理,没有实现原始函数ScaleImage()
的Y
方向缩放。
目前先处理Y
方向再处理X
方向的代码都有了,如下:
static MagickBooleanType scaleFilter(MagickCLDevice device,
cl_command_queue queue, const Image* image, Image* filteredImage,
cl_mem imageBuffer, cl_uint matte_or_cmyk, cl_uint columns, cl_uint rows,
cl_mem scaledImageBuffer, cl_uint scaledColumns, cl_uint scaledRows,
ExceptionInfo* exception)
{
cl_kernel
scaleKernel;
cl_int
status;
const unsigned int
workgroupSize = 256;
float
scale;
int
numCachedPixels;
MagickBooleanType
outputReady;
size_t
gammaAccumulatorLocalMemorySize,
gsize[2],
i,
imageCacheLocalMemorySize,
pixelAccumulatorLocalMemorySize,
pixelAccumulatorLocalMemorySize2,
lsize[2],
totalLocalMemorySize,
weightAccumulatorLocalMemorySize;
unsigned int
chunkSize,
pixelPerWorkgroup;
scaleKernel = NULL;
outputReady = MagickFalse;
scale = (float)scaledColumns / columns; // TODO(ocl)
unsigned int stop = 0;
unsigned int next_row = 1;
float y_span = 1.0;
float y_scale = (float)scaledRows / rows;
if (scaledRows == rows)
stop++;
else {
while (y_scale < y_span) {
if (next_row) {
stop++;
}
y_span -= y_scale;
y_scale = (float)scaledRows / rows;
next_row = 1;
}
if (next_row) {
stop++;
next_row = 0;
}
}
if (scaledColumns < workgroupSize)
{
chunkSize = 32;
pixelPerWorkgroup = 32;
}
else
{
chunkSize = workgroupSize;
pixelPerWorkgroup = workgroupSize;
}
DisableMSCWarning(4127)
while (1)
RestoreMSCWarning
{
/* calculate the local memory size needed per workgroup */
numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/scale+2*(0.5+MagickEpsilon)); // TODO(ocl)
imageCacheLocalMemorySize = numCachedPixels * sizeof(CLQuantum) * 4 * stop;
totalLocalMemorySize = imageCacheLocalMemorySize;
/* local size for the pixel accumulator */
pixelAccumulatorLocalMemorySize = chunkSize * sizeof(cl_float4);
totalLocalMemorySize += pixelAccumulatorLocalMemorySize;
pixelAccumulatorLocalMemorySize2 = pixelAccumulatorLocalMemorySize;
totalLocalMemorySize += pixelAccumulatorLocalMemorySize2;
/* local memory size for the weight accumulator */
weightAccumulatorLocalMemorySize = chunkSize * sizeof(float);
totalLocalMemorySize += weightAccumulatorLocalMemorySize;
/* local memory size for the gamma accumulator */
gammaAccumulatorLocalMemorySize = chunkSize * sizeof(float);
totalLocalMemorySize += gammaAccumulatorLocalMemorySize;
if (totalLocalMemorySize <= device->local_memory_size)
break;
else
{
pixelPerWorkgroup = pixelPerWorkgroup / 2;
chunkSize = chunkSize / 2;
if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
{
/* quit, fallback to CPU */
goto cleanup;
}
}
}
scaleKernel = AcquireOpenCLKernel(device, "ScaleFilter");
if (scaleKernel == (cl_kernel)NULL)
{
(void)OpenCLThrowMagickException(device, exception, GetMagickModule(),
ResourceLimitWarning, "AcquireOpenCLKernel failed.", ".");
goto cleanup;
}
i = 0;
status = SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_mem), (void*)&imageBuffer);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&matte_or_cmyk);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&columns);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&rows);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_mem), (void*)&scaledImageBuffer);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&scaledColumns);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&scaledRows);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(float), (void*)&scale);
status |= SetOpenCLKernelArg(scaleKernel, i++, imageCacheLocalMemorySize, NULL);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(int), &numCachedPixels);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &pixelPerWorkgroup);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &chunkSize);
status |= SetOpenCLKernelArg(scaleKernel, i++, pixelAccumulatorLocalMemorySize, NULL);
status |= SetOpenCLKernelArg(scaleKernel, i++, pixelAccumulatorLocalMemorySize2, NULL);
status |= SetOpenCLKernelArg(scaleKernel, i++, weightAccumulatorLocalMemorySize, NULL);
status |= SetOpenCLKernelArg(scaleKernel, i++, gammaAccumulatorLocalMemorySize, NULL);
status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &stop);
if (status != CL_SUCCESS)
{
(void)OpenCLThrowMagickException(device, exception, GetMagickModule(),
ResourceLimitWarning, "SetOpenCLKernelArg failed.", ".");
goto cleanup;
}
gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
workgroupSize;
gsize[1] = scaledRows;
lsize[0] = workgroupSize;
lsize[1] = 1;
outputReady = EnqueueOpenCLKernel(queue, scaleKernel, 2,
(const size_t*)NULL, gsize, lsize, image, filteredImage, MagickFalse,
exception);
cleanup:
if (scaleKernel != (cl_kernel)NULL)
ReleaseOpenCLKernel(scaleKernel);
return(outputReady);
}
STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
const unsigned int filteredColumns, const unsigned int filteredRows,
const float resizeFilterScale,
__local CLQuantum* inputImageCache, const int numCachedPixels,
const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
__local float4* outputPixelCache, __local float* outputPixelCache2, __local float* densityCache, __local float* gammaCache,
const unsigned int stopn)
{
// calculate the range of resized image pixels computed by this workgroup
const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
const unsigned int actualNumPixelToCompute = stopX - startX;
float xFactor = (float)filteredColumns / inputColumns;
// calculate the range of input image pixels to cache
const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);
// cache the input pixels into local memory
const unsigned int y = get_global_id(1);
const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
for (unsigned stopi = 0; stopi < stopn; ++stopi) {
event_t e = async_work_group_copy(inputImageCache + num_elements * stopi, inputImage + pos + num_elements * stopi, num_elements, 0);
wait_group_events(1, &e);
}
for (unsigned t = 0; t < num_elements / 4; ++t) {
outputPixelCache[t] = 0.0;
outputPixelCache2[t] = 0.0;
}
float y_scale = (float)filteredRows / inputRows;
float y_span = 1.0;
unsigned next_row = 1;
unsigned stopi = 0;
float4 y_vector = 0.0;
float y_volume = 0.0;
float factor = 0.0;
while (y_scale < y_span) {
/*if (next_row) {
stopi++;
}*/
for (unsigned ix = 0; ix < num_elements / 4; ++ix) {
unsigned tempi = num_elements / 4 * stopi + ix;
if (((float4)inputImageCache[tempi]).w < 255.0)
outputPixelCache2[tempi] += y_scale;
outputPixelCache[tempi] += y_scale * (float4)inputImageCache[tempi];
}
y_span -= y_scale;
y_scale = (float)filteredRows / inputRows;
next_row = 1;
if (next_row) {
stopi++;
next_row = 0;
}
}
stopi = 0;
for (unsigned t = 0; t < stopi; ++t) {
for (unsigned ix = 0; ix < num_elements / 4; ++ix) {
unsigned tempi = num_elements / 4 * t + ix;
if (((float4)inputImageCache[tempi]).w < 255.0)
outputPixelCache2[tempi] += y_span;
outputPixelCache[tempi] += outputPixelCache[tempi] + y_span * (float4)inputImageCache[tempi];
if (outputPixelCache2[tempi] > 0.0 && outputPixelCache2[tempi] < 1.0) {
factor = 1 / outputPixelCache2[tempi];
outputPixelCache[tempi] *= factor;
}
inputImageCache[tempi] = outputPixelCache[tempi].x > 255.0 ? 255.0 : outputPixelCache[tempi].x;
inputImageCache[tempi + 1] = outputPixelCache[tempi].y > 255.0 ? 255.0 : outputPixelCache[tempi].y;
inputImageCache[tempi + 2] = outputPixelCache[tempi].z > 255.0 ? 255.0 : outputPixelCache[tempi].z;
inputImageCache[tempi + 3] = outputPixelCache[tempi].w > 255.0 ? 255.0 : outputPixelCache[tempi].w;
}
}
unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
{
const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;
// determine which resized pixel computed by this workitem
const unsigned int itemID = get_local_id(0);
unsigned int local_idx = itemID;
const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));
float4 filteredPixel = (float4)0.0f;
if (itemID < actualNumPixelInThisChunk) {
outputPixelCache[itemID] = (float4)0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE);
// -1 means this workitem doesn't participate in the computation
if (pixelIndex != -1)
{
// x coordinated of the resized pixel computed by this workitem
const int x = chunkStartX + pixelIndex;
// calculate how many steps required for this pixel
const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
const unsigned int n = stop - start;
// calculate how many steps this workitem will contribute
unsigned int numStepsPerWorkItem = n / numItems;
numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);
const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
if (startStep < n)
{
float x_scale = (float)filteredColumns / inputColumns;
float x_span = 1.0;
float x_volume = 0.0;
float factor = 0.0;
unsigned next_column = 0;
const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);
unsigned int cacheIndex = start + startStep - cacheRangeStartX;
for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
float4 cp = (float4)0.0f;
__local CLQuantum* p = inputImageCache + (cacheIndex * 4);
cp.x = (float)*(p);
cp.y = (float)*(p + 1);
cp.z = (float)*(p + 2);
cp.w = (float)*(p + 3);
while (x_scale >= x_span) {
if (next_column) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
outputPixelCache[local_idx].x *= factor;
outputPixelCache[local_idx].y *= factor;
outputPixelCache[local_idx].z *= factor;
}
x_volume = 0.0;
filteredPixel = 0.0;
local_idx++;
}
if (cp.w < 255.0) {
x_volume += x_span;
}
filteredPixel += x_span * cp;
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
x_scale -= x_span;
x_span = 1.0;
next_column = 1;
}
if (x_scale > 0.0) {
if (next_column) {
if (x_volume > 0.0 && x_volume < 1.0) {
factor = 1 / x_volume;
outputPixelCache[local_idx].x *= factor;
outputPixelCache[local_idx].y *= factor;
outputPixelCache[local_idx].z *= factor;
}
x_volume = 0.0;
filteredPixel = 0.0;
next_column = 0;
local_idx++;
}
if (cp.w < 255.0)
x_volume += x_scale;
filteredPixel += x_scale * cp;
x_span -= x_scale;
}
if (x_span > 0.0) {
if (cp.w < 255.0)
x_volume += x_span;
filteredPixel += x_span * cp;
}
filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
}
}
}
for (unsigned int i = 0; i < numItems; i++) {
if (pixelIndex != -1) {
if (itemID % numItems == i) {
outputPixelCache[pixelIndex] += filteredPixel;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (itemID < actualNumPixelInThisChunk)
{
float4 filteredPixel = outputPixelCache[itemID];
WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
}
}
}
)
把“如何写ScaleImage()
的硬件加速函数(九)”和“如何写ScaleImage()
的硬件加速函数(十)”放到linux
平台下测试了一下,蛋疼,都有问题!(OpenCL
不是跨平台的嘛,怎么windows
上正确的在linux
上却不对?)前者有黑色竖线;后者有红色雪花覆盖整个图片(如果gm
的预览图显示就不正确,那不用再缩放了,这已经说明肯定有问题了)。
“如何写ScaleImage()
的硬件加速函数(十)”这里的代码写得比较随意,其中stopi = 0;
赋值为0
后,下面的循环根本没有执行,这才使得显示的图片变正确了;且async_work_group_copy()
的参数传得可能也不对,等等等。
- 我原来的想法是在外部计算好需要的行数传入
kernel
函数,并按照此行数申请好一定长度的一维数组,在async_work_group_copy()
时拷贝相应行数的像素 - 然后在
async_work_group_copy()
拷贝好的local
内存里按照ScaleImage()
的算法实现Y
方向的缩放 - 有一个顾虑需不需要考虑?因为
local
内存是对应一个work-group
的,它的各work-item
共享这片local
内存,那每个work-item
是不是都会按照ScaleImage()
的算法处理一次Y
方向的缩放? - 目前看好像不要考虑这个问题,即不影响结果也不影响效率
- 同时也在考虑,可不可以在进入
kernel
函数之前就缩放好Y
方向呢?
迭代100
次,缩小图片50%
,如下:
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 4.92s user 5.146311s total 19.431 iter/s 20.325 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 5.88s user 5.887496s total 16.985 iter/s 17.007 iter/cpu
迭代100
次,放大图片200%
,如下:
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 17.73s user 15.998019s total 6.251 iter/s 5.640 iter/cpu
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 17.61s user 15.812017s total 6.324 iter/s 5.679 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 23.15s user 23.203446s total 4.310 iter/s 4.320 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 23.57s user 23.621860s total 4.233 iter/s 4.243 iter/cpu
主要看total
前面的值,是运行总时间。同样的方法再对比一下AccelerateResizeImage()
。
迭代100
次,缩小图片50%
,如下:
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 11.28s user 8.047808s total 12.426 iter/s 8.865 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 44.43s user 6.364194s total 15.713 iter/s 2.251 iter/cpu
迭代100
次,放大图片200%
,如下:
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 24.67s user 18.713505s total 5.344 iter/s 4.054 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 160.27s user 26.635967s total 3.754 iter/s 0.624 iter/cpu