Table of Contents

gm-ocl开发记录

`gm-ocl`开发记录

<2022-02-24 Thu>

增加`--enable-opencl`参数

拷贝ImageMagick中的m4/ax_have_opencl.m4，在configure.ac中添加：

# Enable support for OpenCL
no_cl=yes
AX_HAVE_OPENCL([C])
if test "X$no_cl" != 'Xyes'; then
  MAGICK_FEATURES="OpenCL $MAGICK_FEATURES"
fi

在根目录下运行autoreconf -vif重新生成configure，但出现错误：

$ autoreconf -vif
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
autoreconf: running: libtoolize --copy --force
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
autoreconf: running: /usr/bin/autoconf --force
configure.ac:1057: error: possibly undefined macro: AC_MSG_RESULT
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

试了比如安装pkg-config，autoconf-archive都不能解决，开发环境archlinux的autoconf的版本是2.71，换到了ubuntu的2.69的环境下也不行。

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal
$ autoconf --version
autoconf (GNU Autoconf) 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+/Autoconf: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>, <http://gnu.org/licenses/exceptions.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David J. MacKenzie and Akim Demaille.

可以将ax_have_opencl.m4的所有AC_MSG_RESULT全部删除（注：注释掉AC_MSG_RESULT的话还会出现其它错误导致不能生成configure），虽然能成功生成configure，但是运行./configure时会出现如下错误：

$ ./configure
...
...
checking for gcc option to support OpenMP... -fopenmp
./configure: line 7812: syntax error near unexpected token `OpenCL,'
./configure: line 7812: `        AX_CHECK_FRAMEWORK(OpenCL,'

临时解决的方法是将configure中的AX_CHECK_FRAMEWORK(...)的代码段删除。

AX_CHECK_FRAMEWORK的答案来自：

$ grep -rn 'AX_CHECK_FRAMEWORK'
ImageMagick/m4/ax_check_framework.m4:6:dnl @synopsis AX_CHECK_FRAMEWORK(framework, [action-if-found], [action-if-not-found])
ImageMagick/m4/ax_check_framework.m4:15:AC_DEFUN([AX_CHECK_FRAMEWORK],[
ImageMagick/m4/ax_have_opencl.m4:94:        AX_CHECK_FRAMEWORK([OpenCL], [

即，将ImageMagick的m4/ax_check_framework.m4拷贝过来即可，惊奇的是AX_CHECK_FRAMEWORK的问题解决，之前AC_MSG_RESULT的问题同时也迎刃而解。

可以删除configure文件后再运行autoreconf来代替autoreconf -vif，这样生成的configure可以最大限度的与原configure保持一致。

<2022-02-25 Fri>

关于`AccelerateResizeImage()`的链接问题

因为增加了两个新文件accelerate-private.h和accelerate.c，AccelerateResizeImage()就位于其中，因为magick/Makefile.am中没有增加这两个文件，所以在autoreconf时生成的Makefile.in就不知道新增了两个文件，因此在configure时生成的Makefile中就不会编译AccelerateResizeImage()函数，因此出现的链接问题。

$ rm configure
$ autoreconf
configure.ac:119: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated.  For more info, see:
configure.ac:119: https://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation

<2022-03-02 Wed>

在`archlinux`上为`Intel`启用`OpenCL`

% lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Plus Graphics G1 (Ice Lake) (rev 07)
% clinfo
Number of platforms                               0
% sudo pacman -S intel-compute-runtime

调试`RunOpenCLBenchmark()`时的崩溃问题

发现直接运行gm display没有崩溃问题，调试时却经常发生，有时SIGABRT，有时SIGSEGV，猜测可能是同步问题：

/*
  We need this to get a proper performance benchmark, the operations
  are executed asynchronous.
*/
if (is_cpu == MagickFalse)
  {
    CacheInfo
      *cache_info;

    MagickCLCacheInfo
      cl_info;

    cache_info=(CacheInfo *) resizedImage->cache;
    cl_info=GetCacheInfoOpenCL(cache_info);
    if (cl_info != (MagickCLCacheInfo) NULL)
      openCL_library->clWaitForEvents(cl_info->event_count,
        cl_info->events);
  }

if (i > 0)
  StopAccelerateTimer(&timer);

if (bluredImage != (Image *) NULL)
  DestroyImage(bluredImage);
if (unsharpedImage != (Image *) NULL)
  DestroyImage(unsharpedImage);
if (resizedImage != (Image *) NULL)
  DestroyImage(resizedImage);

经常遇到的是：resizedImage为0（SIGSEGV），DestroyImage(resizedImage);（SIGABRT）。

<2022-03-07 Mon>

调用`clCreateBuffer()`产生异常问题（一）

在vscode上的堆栈输出如下：

libc.so.6!__pthread_kill_implementation (Unknown Source:0)
libc.so.6!raise (Unknown Source:0)
libc.so.6!abort (Unknown Source:0)
libigdrcl.so![Unknown/Just-In-Time compiled code] (Unknown Source:0)
libOpenCL.so!clCreateBuffer (Unknown Source:0)
AcquireMagickCLCacheInfo(MagickCLDevice device, Quantum * pixels, const magick_int64_t length) (gm-ocl/magick/opencl.c:569)
GetAuthenticOpenCLBuffer(const Image * image, MagickCLDevice device, ExceptionInfo * exception) (gm-ocl/magick/pixel_cache.c:5252)

因为堆栈显示问题最终是出在__pthread_kill_implementation上，所以我的调查方向一直在线程同步上，代码调试了一遍又一遍却始终没有找到问题。心灰意冷并已经产生从头再开始的想法了。

注意到：

libigdrcl.so![Unknown/Just-In-Time compiled code] (Unknown Source:0)

谷歌了下libigdrcl.so，难道是安装intel驱动的问题？目前使用的是intel-compute-runtime，现改为intel-opencl-runtime，参考自：“GPGPU - ArchWiki”。

$ sudo pacman -Rns intel-compute-runtime
$ yay -S intel-opencl-runtime

我对比了intel-compute-runtime和intel-opencl-runtime的clinfo输出差异发现intel-compute-runtime支持GPU，intel-opencl-runtime支持CPU。

那代码到底有没有问题？clCreateBuffer()调用失败的原因要不要深究？

`vscode`环境

// tasks.json
{
    // See https://go.microsoft.com/fwlink/?LinkId=733558
    // for the documentation about the tasks.json format
    "version": "2.0.0",
    "tasks": [
        {
            "label": "build with opencl",
            "type": "shell",
            "command": "make",
            "problemMatcher": [],
            "group": {
                "kind": "build",
                "isDefault": true
            }
        }
    ]
}

// launch.json
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/utilities/gm",
            "args": ["display", "~/temp/bg1a.jpg"],
            "stopAtEntry": false,
            "cwd": "${fileDirname}",
            "environment": [
                {
                    "name": "MAGICK_OCL_DEVICE",
                    "value": "true"
                }
            ],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description":  "Set Disassembly Flavor to Intel",
                    "text": "-gdb-set disassembly-flavor intel",
                    "ignoreFailures": true
                }
            ]
        }

    ]
}

正在用的dev.sh内容如下：

#!/bin/bash

./configure CFLAGS='-g -O0' LDFLAGS='-ldl' --enable-opencl --prefix=$HOME/usr/local

调用`clCreateBuffer()`产生异常问题（二）

使用MAGICK_OCL_DEVICE=GPU且在已经安装了opencl-compute-runtime的情况下会产生两个问题：

gm运行卡死，无法操作，CPU使用率居高不下，或者
gm运行崩溃，产生如下提示：

$ gm display ~/temp/bg1a.jpg
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)

<2022-03-09 Wed>

调用`clCreateBuffer()`产生异常问题（三）

我在这里找到了一些有用的信息：“crash in NEO::DrmAllocation::makeBOsResident or in checkAllocationsForOverlapping when using more than one opencl block in gnuradio gr-clenabled”。

下载了compute-runtime-22.09.22577的源代码：

// compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp

OsHandleStorage HostPtrManager::prepareOsStorageForAllocation(MemoryManager &memoryManager, size_t size, const void *ptr, uint32_t rootDeviceIndex) {
    std::lock_guard<decltype(allocationsMutex)> lock(allocationsMutex);
    auto requirements = HostPtrManager::getAllocationRequirements(rootDeviceIndex, ptr, size);
    UNRECOVERABLE_IF(checkAllocationsForOverlapping(memoryManager, &requirements) == RequirementsStatus::FATAL);
    auto osStorage = populateAlreadyAllocatedFragments(requirements);
    if (osStorage.fragmentCount > 0) {
        if (memoryManager.populateOsHandles(osStorage, rootDeviceIndex) != MemoryManager::AllocationStatus::Success) {
            memoryManager.cleanOsHandles(osStorage, rootDeviceIndex);
            osStorage.fragmentCount = 0;
        }
    }
    return osStorage;
}

host_ptr_manager.cpp:250就是：

UNRECOVERABLE_IF(checkAllocationsForOverlapping(memoryManager, &requirements) == RequirementsStatus::FATAL);

参考链接的：

Is flag CL_USE_HOST_PTR used to create buffers?

Are such buffers mapped (clEnqueueMapBuffer) and returned pointers used to create other buffers? Or passed as a ptr to EnqueueReadWriteBuffer/Image() calls?

我加了pixels指针的输出发现：

$ gm display ~/temp/bg1a.jpg
14:45:53 0:1.237436  0.740u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dc333c70
14:45:53 0:1.244275  0.800u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648db4beac0
14:45:53 0:1.334968  1.370u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dd446b00
14:45:53 0:1.336437  1.390u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648db646ae0
14:45:53 0:1.432968  2.040u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dd446b00
14:45:53 0:1.433129  2.060u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648db646ae0
14:45:53 0:1.544831  2.780u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648de446b50
14:45:53 0:1.544873  2.790u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dc762150
14:45:53 0:1.659341  3.630u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648df446be0
14:45:53 0:1.659420  3.630u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dd935030
14:45:53 0:1.778589  4.520u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648df446be0
14:45:53 0:1.778667  4.530u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dd935030
14:45:54 0:2.242840  8.140u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x7f85b9757010
14:45:54 0:2.247728  8.200u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648dc8faa00
14:45:54 0:2.256726  8.280u 69908 opencl.c/CopyMagickCLCacheInfo/1552/User:
  clEnqueueMapBuffer return pixels: 0x5648dc8faa00
14:46:12 0:20.409439 8.320u 69908 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x5648db646ae0
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)

发现0x5648db646ae0指针在第三次调用clCreateBuffer时崩溃。怎么再次崩溃时的输出不一样：

$ gm display ~/temp/bg1a.jpg
15:09:02 0:1.357361  1.150u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55728b02e720
15:09:02 0:1.363826  1.210u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55728a1b9570
15:09:02 0:1.460255  1.810u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55728ce2e740
15:09:02 0:1.461296  1.830u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55728b341620
15:09:02 0:1.552602  2.460u 71516 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55728c1415d0
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)

这次没有崩溃，但是却发现clEnqueueMapBuffer返回的指针被clCreateBuffer创建缓存却没有崩溃：

$ gm display ~/temp/bg1a.jpg
15:11:13 0:1.283573  1.150u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf5096990
15:11:13 0:1.289888  1.200u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf4161c00
15:11:13 0:1.385760  1.820u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf6296b20
15:11:13 0:1.392014  1.850u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf4161c00
15:11:14 0:1.479689  2.450u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf6296b20
15:11:14 0:1.485056  2.470u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf4161c00
15:11:14 0:1.581975  3.120u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf7296b60
15:11:14 0:1.582017  3.120u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:1.700207  4.010u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf8196bc0
15:11:14 0:1.700267  4.010u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:1.816203  4.880u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf8196bc0
15:11:14 0:1.816260  4.880u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf528a550
15:11:14 0:2.285724  8.520u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x7fe42d322010
15:11:14 0:2.290441  8.550u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf54eb600
15:11:14 0:2.300859  8.630u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
  clEnqueueMapBuffer return pixels: 0x55dcf54eb600
15:11:25 0:13.121552 8.660u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf5570380
15:11:25 0:13.135136 8.670u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
  clEnqueueMapBuffer return pixels: 0x55dcf5570380
15:11:25 0:13.291565 8.680u 71783 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer pixels: 0x55dcf54eb600
15:11:25 0:13.300287 8.690u 71783 opencl.c/CopyMagickCLCacheInfo/1552/User:
  clEnqueueMapBuffer return pixels: 0x55dcf54eb600

这跟上面链接的说法不符呀！难道是因为clEnqueueMapBuffer中没有使用CL_USE_HOST_PTR标记？所以不存在上面所说的这个问题？

怪事儿，我为什么没有在cl.h里找到CL_USE_HOST_PTR的定义？

在IM上添加了相同的输出代码，表现和上面的类似，但IM却能工作的很好，看来得从其它方法再入手了。

注：IM的日志配置文件：usr/local/etc/ImageMagick-7/log.xml。

<2022-03-15 Tue>

调用`clCreateBuffer()`产生异常问题（四）

还是因为内存重叠的原因：我在opencl.c的AcquireMagickCLCacheInfo()函数中调用clCreateBuffer()之前添加了如下的输出代码：

LogMagickEvent(UserEvent, GetMagickModule(),
  "clCreateBuffer - req: %d, pixels: %p, len: %d",
  device->requested, pixels, length);

当出现问题时有如下log：

[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
10:51:31 0:1.368105  1.030u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 1, pixels: 0x55d124b58490, len: 15728640
10:51:31 0:1.374360  1.060u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 2, pixels: 0x55d123c23730, len: 1536000
10:51:31 0:1.469657  1.670u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 2, pixels: 0x55d125c58510, len: 15728640
10:51:31 0:1.475944  1.720u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 3, pixels: 0x55d123c23730, len: 1536000
10:51:31 0:1.566470  2.310u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 3, pixels: 0x55d125c58510, len: 15728640
10:51:31 0:1.571902  2.360u 28955 opencl.c/AcquireMagickCLCacheInfo/569/User:
  clCreateBuffer - req: 4, pixels: 0x55d124b23740, len: 1536000
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)

注意看第一行和最后一行的输出发现，最后一行的地址加偏移已经大于第一行的地址，说明此时内存重叠，所以出现clCreateBuffer()的调用崩溃。这个问题有点难解了，涉及要修改整个GM的内存布局？

(> (+ #x55d124b23740 1536000) #x55d124b58490)

调用`clCreateBuffer()`产生异常问题（五）

在前一篇的基础上继续分析，因为clCreateBuffer()返回的地址即GetAuthenticOpenCLBuffer()的返回值（它在ComputeResizeImage()函数中被调用），当ComputeResizeImage()结束时，调用clReleaseMemObject()将会减少该内存计数，当计数为0时clCreateBuffer()创建的内存才被释放。

为了打印内存的引用计数，增加了clGetMemObjectInfo()函数：

// opencl-private.h

typedef CL_API_ENTRY cl_int
  (CL_API_CALL *MAGICKpfn_clGetMemObjectInfo)(cl_mem memobj,
    cl_mem_info param_name,size_t param_value_size,void *param_value,
    size_t *param_value_size_ret)
    CL_API_SUFFIX__VERSION_1_0;

MAGICKpfn_clGetMemObjectInfo clGetMemObjectInfo;

比如ReleaseOpenCLMemObject()函数被改成了：

MagickPrivate void ReleaseOpenCLMemObject(cl_mem memobj)
{
  cl_uint refcnt=0;
  openCL_library->clGetMemObjectInfo(memobj, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &refcnt, NULL);
  LogMagickEvent(UserEvent, GetMagickModule(),
    "b4 ReleaseOpenCLMemObject(%p) refcnt: %d", memobj, refcnt);
  cl_int ret=openCL_library->clReleaseMemObject(memobj);
  openCL_library->clGetMemObjectInfo(memobj, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &refcnt, NULL);
  LogMagickEvent(UserEvent, GetMagickModule(),
    "af ReleaseOpenCLMemObject(%p) refcnt: %d", memobj, refcnt);
}

我有输出如下：

[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
13:51:33 0:1.510679  1.480u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
  clCreateBuffer -- req: 1, pixles: 0x5588e97712e0, len: 15728640
13:51:33 0:1.522527  1.570u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
  clCreateBuffer return: 0x5588e774f5c0, refcnt: 1
13:51:33 0:1.522589  1.570u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
  clCreateBuffer -- req: 2, pixles: 0x5588e88fc130, len: 1536000
13:51:33 0:1.524313  1.590u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
  clCreateBuffer return: 0x5588e77064b0, refcnt: 1
13:51:33 0:1.528273  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e774f5c0) refcnt: 2
13:51:33 0:1.528351  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e774f5c0) refcnt: 1
13:51:33 0:1.528410  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e77064b0) refcnt: 2
13:51:33 0:1.528465  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e77064b0) refcnt: 1
13:51:33 0:1.528520  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:33 0:1.528582  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:34 0:1.740225  2.900u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
  clCreateBuffer -- req: 2, pixles: 0x5588eb571300, len: 15728640
13:51:34 0:1.742399  2.910u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
  clCreateBuffer return: 0x5588e87fead0, refcnt: 1
13:51:34 0:1.742615  2.910u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
  clCreateBuffer -- req: 3, pixles: 0x5588e9a841e0, len: 1536000
13:51:34 0:1.744057  2.930u 47963 opencl.c/AcquireMagickCLCacheInfo/584/User:
  clCreateBuffer return: 0x5588e87feda0, refcnt: 1
13:51:34 0:1.745895  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 2
13:51:34 0:1.745964  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:34 0:1.746004  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e87feda0) refcnt: 2
13:51:34 0:1.746081  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e87feda0) refcnt: 1
13:51:34 0:1.746126  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e87756a0) refcnt: 1
13:51:34 0:1.746185  2.930u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e87756a0) refcnt: 1
13:51:34 0:1.946106  4.270u 47963 opencl.c/AcquireMagickCLCacheInfo/576/User:
  clCreateBuffer -- req: 3, pixles: 0x5588ea9841f0, len: 15728640
Abort was called at 250 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.09.22577/shared/source/memory_manager/host_ptr_manager.cpp
Aborted (core dumped)

测试地址重叠：

(> (+ #x5588ea9841f0 15728640) #x5588eb571300)
(- #x5588eb571300 #x5588ea9841f0)

解析下：最后一行0x5588ea9841f0调用clCreateBuffer()时崩溃，它的地址与0x5588eb571300重叠，而0x5588eb571300申请的cl_mem地址为：0x5588e87fead0，最后一次调用ReleaseOpenCLMemObject()后它的引用计数为1，这说明0x5588eb571300还没被释放而0x5588ea9841f0又开始申请造成内存重叠。

发现一处问题，上面输出中有如下：

13:51:33 0:1.528520  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/509/User:
  b4 ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1
13:51:33 0:1.528582  1.610u 47963 opencl.c/ReleaseOpenCLMemObject/513/User:
  af ReleaseOpenCLMemObject(0x5588e87fead0) refcnt: 1

ReleaseOpenCLMemObject(0x5588e87fead0)调用前后引用计数没有减少。难道clReleaseMemObject()调用失败了？

原来是因为当对象已经销毁后再调用clGetMemObjectInfo()将会返回-38的错误，即CL_INVALID_MEM_OBJECT。

<2022-03-16 Wed>

调用`clCreateBuffer()`产生异常问题（六）

我可能解决了这个问题，将问题定位在了RunOpenCLBenchmark()的结尾DestroyImage(resizedImage);处，即在DestroyCacheInfo()中应该有清除OpenCL相关内存的代码。

<2022-03-17 Thu>

关于`IM`中的`number_channels`成员（一）

在IM中number_channels成员出现频率有点高，经调试发现IM中图片对象初始化时通过调用OpenPixelCache()然后在InitializePixelChannelMap()中设置number_channels的值。这个函数的内部大量使用了GM中没有类型PixelChannel和PixelTrait，不太好把它给搬到GM中。

查看PixelChannel的定义发现了它的一个特点是：虽然它是enum类型，但每个成员都被指派了具体的值，且发现有多个成员共用一个值的情况。以此参照仍然没有在GM中找到类似定义，PixelChannel的定义：

typedef enum
{
  UndefinedPixelChannel = 0,
  RedPixelChannel = 0,
  CyanPixelChannel = 0,
  GrayPixelChannel = 0,
  LPixelChannel = 0,
  LabelPixelChannel = 0,
  YPixelChannel = 0,
  aPixelChannel = 1,
  GreenPixelChannel = 1,
  MagentaPixelChannel = 1,
  CbPixelChannel = 1,
  bPixelChannel = 2,
  BluePixelChannel = 2,
  YellowPixelChannel = 2,
  CrPixelChannel = 2,
  BlackPixelChannel = 3,
  AlphaPixelChannel = 4,
  IndexPixelChannel = 5,
  ReadMaskPixelChannel = 6,
  WriteMaskPixelChannel = 7,
  MetaPixelChannel = 8,
  CompositeMaskPixelChannel = 9,
  IntensityPixelChannel = MaxPixelChannels,  /* ???? */
  CompositePixelChannel = MaxPixelChannels,  /* ???? */
  SyncPixelChannel = MaxPixelChannels+1      /* not a real channel */
} PixelChannel;  /* must correspond to ChannelType */

我模仿IM的InitializePixelChannelMap()函数写了calc_image_number_channels()，虽然number_channels的值对于同一张测试图片bg1a.jpg来说均为3，但是在IM中值3显示正确，而在GM中3 + 1才能正确，所以我在ComputeResizeImage()中将calc_image_number_channels()的返回值加上了1：

number_channels=(cl_uint) calc_image_number_channels(image)+1;

这只是临时方案，估计下面要更改抄过来的kernel函数。

<2022-03-18 Fri>

关于`gpuSupportedResizeWeighting()`的代码能否省略

在AccelerateResizeImage()中有这样的一段代码被注释掉了：

// if ((gpuSupportedResizeWeighting(GetResizeFilterWeightingType(
//        resizeFilter)) == MagickFalse) ||
//     (gpuSupportedResizeWeighting(GetResizeFilterWindowWeightingType(
//        resizeFilter)) == MagickFalse))
//   return((Image *) NULL);

我认为这段代码可以省略，因为它的目的就是为了检查IM的ResizeFilter类型中的ResizeWeightingFunctionType类型的成员值：

struct _ResizeFilter
{
  double
    (*filter)(const double,const ResizeFilter *),
    (*window)(const double,const ResizeFilter *),
    support,        /* filter region of support - the filter support limit */
    window_support, /* window support, usally equal to support (expert only) */
    scale,          /* dimension scaling to fit window support (usally 1.0) */
    blur,           /* x-scale (blur-sharpen) */
    coefficient[7]; /* cubic coefficents for BC-cubic filters */

  ResizeWeightingFunctionType
    filterWeightingType,
    windowWeightingType;

  size_t
    signature;
};

是否在supportedResizeWeighting数组中：

static const ResizeWeightingFunctionType supportedResizeWeighting[] =
{
  BoxWeightingFunction,
  TriangleWeightingFunction,
  HannWeightingFunction,
  HammingWeightingFunction,
  BlackmanWeightingFunction,
  CubicBCWeightingFunction,
  SincWeightingFunction,
  SincFastWeightingFunction,
  LastWeightingFunction
};

很明显这个数组就是IM支持GPU的窗函数集合。在GM中我只在ResizeImage()函数中找到相近的定义：

static const FilterInfo
  filters[SincFilter+1] =
  {
    { Box, 0.0 },
    { Box, 0.0 },
    { Box, 0.5 },
    { Triangle, 1.0 },
    { Hermite, 1.0 },
    { Hanning, 1.0 },
    { Hamming, 1.0 },
    { Blackman, 1.0 },
    { Gaussian, 1.25 },
    { Quadratic, 1.5 },
    { Cubic, 2.0 },
    { Catrom, 2.0 },
    { Mitchell, 2.0 },
    { Lanczos, 3.0 },
    { BlackmanBessel, 3.2383 },
    { BlackmanSinc, 4.0 }
  };

明显可见，GM中处理相对于IM简单多了，所以上面的代码我仍然保持它被注释状态。

<2022-03-26 Sat>

又一个闪退问题

有一个星期没碰了，今天突然发现：

[ysouyno@arch ~]$ export MAGICK_OCL_DEVICE=true
[ysouyno@arch ~]$ gm display ~/temp/bg1a.jpg
Abort was called at 39 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.11.22682/shared/source/built_ins/built_ins.cpp
gm display: abort due to signal 6 (SIGABRT) "Abort"...
Aborted (core dumped)
[ysouyno@arch ~]$ clinfo
Abort was called at 39 line in file:
/build/intel-compute-runtime/src/compute-runtime-22.11.22682/shared/source/built_ins/built_ins.cpp
Aborted (core dumped)

连clinfo都运行不了了，看来肯定不是我代码的问题。发现intel-compute-runtime的版本也已经更新了。

对`IM`的`number_channels`及`PixelChannelMap`结构体中的`channel`和`offset`成员的理解

这个标题有点长，可能文章的内容也有点长，但是思路越来越清晰。先来看PixelChannelMap的结构体定义：

typedef struct _PixelChannelMap
{
  PixelChannel
    channel;

  PixelTrait
    traits;

  ssize_t
    offset;
} PixelChannelMap;

PixelChannelMap在IM的_Image结构体中对应成员channel_map：

PixelChannelMap
  *channel_map;

首先要说的是，channel_map在逐个计算像素的过程中非常重要，拿IM的resize.c:HorizontalFilter()函数为例：

for (i=0; i < (ssize_t) GetPixelChannels(image); i++)
{
  double
    alpha,
    gamma,
    pixel;

  PixelChannel
    channel;

  PixelTrait
    resize_traits,
    traits;

  ssize_t
    j;

  ssize_t
    k;

  channel=GetPixelChannelChannel(image,i);
  traits=GetPixelChannelTraits(image,channel);
  resize_traits=GetPixelChannelTraits(resize_image,channel);
  if ((traits == UndefinedPixelTrait) ||
      (resize_traits == UndefinedPixelTrait))
    continue;
  if (((resize_traits & CopyPixelTrait) != 0) ||
      (GetPixelWriteMask(resize_image,q) <= (QuantumRange/2)))
    {
      j=(ssize_t) (MagickMin(MagickMax(bisect,(double) start),(double)
        stop-1.0)+0.5);
      k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
        (contribution[j-start].pixel-contribution[0].pixel);
      SetPixelChannel(resize_image,channel,p[k*GetPixelChannels(image)+i],
        q);
      continue;
    }
  pixel=0.0;
  if ((resize_traits & BlendPixelTrait) == 0)
    {
      /*
        No alpha blending.
      */
      for (j=0; j < n; j++)
      {
        k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
          (contribution[j].pixel-contribution[0].pixel);
        alpha=contribution[j].weight;
        pixel+=alpha*p[k*GetPixelChannels(image)+i];
      }
      SetPixelChannel(resize_image,channel,ClampToQuantum(pixel),q);
      continue;
    }
  /*
    Alpha blending.
  */
  gamma=0.0;
  for (j=0; j < n; j++)
  {
    k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
      (contribution[j].pixel-contribution[0].pixel);
    alpha=contribution[j].weight*QuantumScale*
      GetPixelAlpha(image,p+k*GetPixelChannels(image));
    pixel+=alpha*p[k*GetPixelChannels(image)+i];
    gamma+=alpha;
  }
  gamma=PerceptibleReciprocal(gamma);
  SetPixelChannel(resize_image,channel,ClampToQuantum(gamma*pixel),q);
}

这个循环中的GetPixelChannels()函数就是返回number_channels的值：

static inline size_t GetPixelChannels(const Image *magick_restrict image)
{
  return(image->number_channels);
}

即处理每个像素的所有channel，通过GetPixelChannelChannel()函数以通道的offset成员获得对应的channel：

static inline PixelChannel GetPixelChannelChannel(
  const Image *magick_restrict image,const ssize_t offset)
{
  return(image->channel_map[offset].channel);
}

返回的channel是PixelChannel类型，这个定义在上面的文章中已经给出了，见：“PixelChannel”，看定义我之前还在奇怪为什么enum类型指定了好多相同的0值，1值，现在终于明白了。即比如RGB和CMYK两种形式R和C都是第0个通道，G和M都是第1个通道，依次类推。CMYK中的K就是black，在PixelChannel中对应的是BlackPixelChannel。

重点就是SetPixelChannel()这个函数：

static inline void SetPixelChannel(const Image *magick_restrict image,
  const PixelChannel channel,const Quantum quantum,
  Quantum *magick_restrict pixel)
{
  if (image->channel_map[channel].traits != UndefinedPixelTrait)
    pixel[image->channel_map[channel].offset]=quantum;
}

这里忽略理解traits，最后一个参数pixel就是处理像素后的目标地址，代码中的channel是通过循环i获取的，offset是通过channel获取的，最终计算出了pixel的真正地址，然后将计算好的quantum赋值进去。

最后，channel_map中的channel和offset是在哪里初始化的？我对比了代码发现只在：

static inline void SetPixelChannelAttributes(
  const Image *magick_restrict image,const PixelChannel channel,
  const PixelTrait traits,const ssize_t offset)
{
  if ((ssize_t) channel >= MaxPixelChannels)
    return;
  if (offset >= MaxPixelChannels)
    return;
  image->channel_map[offset].channel=channel;
  image->channel_map[channel].offset=offset;
  image->channel_map[channel].traits=traits;
}

而SetPixelChannelAttributes()只在InitializePixelChannelMap()函数中会被调用，InitializePixelChannelMap()这个函数有点熟悉，在之前了解number_channels的文章中做过了介绍，这个函数内部计算并初始化了number_channels的值。

我对GM和IM进行了比较，GM对IM进行了精简，代码：

for (y=0; y < (long) destination->rows; y++)
  {
    double
      weight;

    DoublePixelPacket
      pixel;

    long
      j;

    register long
      i;

    pixel=zero;
    if ((destination->matte) || (destination->colorspace == CMYKColorspace))
      {
        double
          transparency_coeff,
          normalize;

        normalize=0.0;
        for (i=0; i < n; i++)
          {
            j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
              (contribution[i].pixel-contribution[0].pixel);
            weight=contribution[i].weight;
            transparency_coeff = weight * (1 - ((double) p[j].opacity/TransparentOpacity));
            pixel.red+=transparency_coeff*p[j].red;
            pixel.green+=transparency_coeff*p[j].green;
            pixel.blue+=transparency_coeff*p[j].blue;
            pixel.opacity+=weight*p[j].opacity;
            normalize += transparency_coeff;
          }
        normalize = 1.0 / (AbsoluteValue(normalize) <= MagickEpsilon ? 1.0 : normalize);
        pixel.red *= normalize;
        pixel.green *= normalize;
        pixel.blue *= normalize;
        q[y].red=RoundDoubleToQuantum(pixel.red);
        q[y].green=RoundDoubleToQuantum(pixel.green);
        q[y].blue=RoundDoubleToQuantum(pixel.blue);
        q[y].opacity=RoundDoubleToQuantum(pixel.opacity);
      }
    else
      {
        for (i=0; i < n; i++)
          {
            j=(long) (y*(contribution[n-1].pixel-contribution[0].pixel+1)+
                      (contribution[i].pixel-contribution[0].pixel));
            weight=contribution[i].weight;
            pixel.red+=weight*p[j].red;
            pixel.green+=weight*p[j].green;
            pixel.blue+=weight*p[j].blue;
          }
        q[y].red=RoundDoubleToQuantum(pixel.red);
        q[y].green=RoundDoubleToQuantum(pixel.green);
        q[y].blue=RoundDoubleToQuantum(pixel.blue);
        q[y].opacity=OpaqueOpacity;
      }

    if ((indexes != (IndexPacket *) NULL) &&
        (source_indexes != (IndexPacket *) NULL))
      {
        i=Min(Max((long) (center+0.5),start),stop-1);
        j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
          (contribution[i-start].pixel-contribution[0].pixel);
        indexes[y]=source_indexes[j];
      }
  }

从上面的代码可看出，在GM中只处理了matte通道或CMYKColorsapce，它们都是四通道的。但是看到代码中的indexes，不知道这个在GM中的意义及是否在IM中有相关对应。

可以在GM中通过搜索indexes_valid来找到一些有用的信息：

/*
  Indexes are valid if the image storage class is PseudoClass or the
  colorspace is CMYK.
*/
cache_info->indexes_valid=((image->storage_class == PseudoClass) ||
                           (image->colorspace == CMYKColorspace));

原来是这样，这里的PseudoClass就是pseudocolor，可以在wiki的Indexed color中找到介绍，之所以叫做索引颜色，是因为为了节省内存或磁盘空间，颜色信息不是直接由图片的像素所携带，而是存放在一个单独的颜色表中或者调色板中。

从上面贴出来的IM和GM的代码对比发现，下面两段代码类似：

// IM
if (((resize_traits & CopyPixelTrait) != 0) ||
    (GetPixelWriteMask(resize_image,q) <= (QuantumRange/2)))
  {
    j=(ssize_t) (MagickMin(MagickMax(bisect,(double) start),(double)
      stop-1.0)+0.5);
    k=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
      (contribution[j-start].pixel-contribution[0].pixel);
    SetPixelChannel(resize_image,channel,p[k*GetPixelChannels(image)+i],
      q);
    continue;
  }

// GM
if ((indexes != (IndexPacket *) NULL) &&
    (source_indexes != (IndexPacket *) NULL))
  {
    i=Min(Max((long) (center+0.5),start),stop-1);
    j=y*(contribution[n-1].pixel-contribution[0].pixel+1)+
      (contribution[i-start].pixel-contribution[0].pixel);
    indexes[y]=source_indexes[j];
  }

依然参考InitializePixelChannelMap()的代码，结合刚刚知道的GM对于PseudoClass和CMYKColorspace时indexes有效，在IM中则CopyPixelTrait对应IndexPixelChannel：

if (image->colorspace == CMYKColorspace)
  SetPixelChannelAttributes(image,BlackPixelChannel,trait,n++);
if (image->alpha_trait != UndefinedPixelTrait)
  SetPixelChannelAttributes(image,AlphaPixelChannel,CopyPixelTrait,n++);
if (image->storage_class == PseudoClass)
  SetPixelChannelAttributes(image,IndexPixelChannel,CopyPixelTrait,n++);
if ((image->channels & ReadMaskChannel) != 0)
  SetPixelChannelAttributes(image,ReadMaskPixelChannel,CopyPixelTrait,n++);
if ((image->channels & WriteMaskChannel) != 0)
  SetPixelChannelAttributes(image,WriteMaskPixelChannel,CopyPixelTrait,n++);
if ((image->channels & CompositeMaskChannel) != 0)
  SetPixelChannelAttributes(image,CompositeMaskPixelChannel,CopyPixelTrait,
    n++);

不同的是IM中CMYKColorspace没有CopyPixelTrait特性。

小结一下：以目前的开发状态，将IM中的CopyPixelTrait与GM中的indexes对应起来。

<2022-03-28 Mon>

关于`IM`中的`number_channels`成员（二）

在“关于IM中的number_channels成员（一）”的结尾提到将计算出来的number_channels值加1才能显示正确的图形，之前说它是临时方案，看来这次要将它变成永久的了。

number_channels=(cl_uint) calc_image_number_channels(image)+1;

通过详细阅读GM和IM的HorizontalFilter()函数发现IM的最内层循环是通过：

static inline size_t GetPixelChannels(const Image *magick_restrict image)
{
  return(image->number_channels);
}

来获得的，而GM中没有这么做，它比IM少了刚刚提到的这一层循环，改为固定设置四个通道的值。代码在前面的笔记中已给出，见：“对IM的number_channels及PixelChannelMap结构体中的channel和offset成员的理解”。

我觉得这样的话反而简单了，我需要修改kernel函数和GM的自身处理相匹配，或者去掉number_channels成员，用固定值4代替？

kernel函数我相信现在修改它没什么难度，这两天翻来覆去的看accelerate-kernels-private.h:ResizeVerticalFilter()函数并和HorizontalFilter()进行对比理解，现在已经胸有成竹了。

<2022-03-29 Tue>

对`number_channels`的处理

今天开始尝试对number_channels进行处理，我的想法是既然number_channels在GM中已经失去了它的意义，那倒不如直接把它删除掉，在它曾经出现过的地方用数字4代替。当然也不是所有number_channels的地方都要修改，要看具体情况而定。

首先，我从IM中重新拷贝了accelerate-kernels-private.h文件，因为之前那些的修改是基于错误的number_channels逻辑的。其次number_channels的传参部分不能删除，因为原GM中对于图片是否有透明通道或者是否是CMYK的格式有做判断，因此将它改为matte_or_cmyk来表明是否是四通道，1表示有，0表示没有，这样的话在accelerate-kernels-private.h的ResizeHorizontalFilter()函数中alpha_index就可以删除了。最后在WriteAllChannels()处的调用，必须以4传入，因为即使原图不是四通道，它的第四个通道也要赋值以保证图片计算的正确性。

经过这些修改之后测试图片显示正常。

<2022-03-30 Wed>

一个低级错误引发的`core dumped`

当将图片不断缩小到宽高为1x1时会出现如下问题：

gm: magick/image.c:1407: DestroyImage: Assertion `image->signature == MagickSignature' failed.
Aborted (core dumped)

这是因为在ComputeResizeImage()函数中当缩小到1x1时失败，outputReady为0导致DestroyImage(filteredImage);的调用，但是在销毁filteredImage后并没有将其赋0导致。

看了一下GM的源码：

/*
  Free memory and set pointer to NULL
*/
#define MagickFreeMemory(memory) \
{ \
  void *_magick_mp=memory;      \
  MagickFree(_magick_mp);       \
  memory=0;                     \
}

这里明明将传入的指针赋0了。难道这段代码不起作用？

其实这个宏是有作用的，但要看怎么使用它。因为DestroyImage()是函数调用，实际上传入的指针是一个副本，将副本赋0并不影响原来的值，同时也要理解宏和函数调用的不同，这里有两种情况需要考虑：

有如下测试代码：

#include <stdio.h>
#include <stdlib.h>

typedef void (*MagickFreeFunc)(void *ptr);
static MagickFreeFunc FreeFunc = free;

void MagickFree(void *memory) {
  if (memory != (void *)NULL)
    (FreeFunc)(memory);
}

#define MagickFreeMemory(memory)                \
  {                                             \
    printf("&memory: %p\n", &memory);           \
    printf(" memory: %p\n", memory);            \
    void *_magick_mp = memory;                  \
    MagickFree(_magick_mp);                     \
    memory = 0;                                 \
  }

void destroy_image(char *image) { MagickFreeMemory(image); }

int main() {
  char *image = (char *)malloc(1024);
  printf("&image : %p\n", &image);
  printf(" image : %p\n", image);
  destroy_image(image);
  printf("&image : %p\n", &image);
  printf(" image : %p\n", image);
  return 0;
}

这是GM中的代码使用方式，输出如下：

% ./a.out
&image : 0x7ffc8a46dfb0
 image : 0x55711df782a0
&memory: 0x7ffc8a46df88
 memory: 0x55711df782a0
&image : 0x7ffc8a46dfb0
 image : 0x55711df782a0

这里指针并没有变化，因为是函数调用，如果将代码中的destroy_image()改为宏MagickFreeFunc，则是想要的效果：

#include <stdio.h>
#include <stdlib.h>

typedef void (*MagickFreeFunc)(void *ptr);
static MagickFreeFunc FreeFunc = free;

void MagickFree(void *memory) {
  if (memory != (void *)NULL)
    (FreeFunc)(memory);
}

#define MagickFreeMemory(memory)                \
  {                                             \
    printf("&memory: %p\n", &memory);           \
    printf(" memory: %p\n", memory);            \
    void *_magick_mp = memory;                  \
    MagickFree(_magick_mp);                     \
    memory = 0;                                 \
  }

void destroy_image(char *image) { MagickFreeMemory(image); }

int main() {
  char *image = (char *)malloc(1024);
  printf("&image : %p\n", &image);
  printf(" image : %p\n", image);
  MagickFreeMemory(image);
  printf("&image : %p\n", &image);
  printf(" image : %p\n", image);
  return 0;
}

% ./a.out
&image : 0x7ffd7247ae08
 image : 0x5572b59cf2a0
&memory: 0x7ffd7247ae08
 memory: 0x5572b59cf2a0
&image : 0x7ffd7247ae08
 image : (nil)

所以要想解决这个core dumped的问题，就老老实实地按照GM的代码风格，调用完DestroyImage()后再紧接着赋一次0。

关于`error: use of type 'double' requires cl_khr_fp64 support`错误

今天安装了最新的intel-compute-runtime，看来已经修复了core dumped问题，见：“又一个闪退问题”中提到的问题。

% sudo pacman -Ss intel-compute-runtime
[sudo] password for ysouyno:
community/intel-compute-runtime 22.12.22749-1 [installed]
    Intel(R) Graphics Compute Runtime for oneAPI Level Zero and OpenCL(TM) Driver

试运行了一下我的最新代码，发现有opencl编译错误：

error: use of type 'double' requires cl_khr_fp64 support

先只是简单的将double换成float来解决这个问题。

关于`IM`中`resizeHorizontalFilter()`中的`scale`变量

分析IM的accelerate.c:resizeHorizontalFilter()的源代码发现它的scale变量计算后只停留在此函数内，并没有往下传递进kernel函数，关于scale的计算代码是不是多余的？从目前我理解到的IM的逻辑来看，我认为它是多余的。因为向下传递给kernel函数的是resizeFilterScale变量，这个变量的值不依赖scale变量，而是通过传参获取现有的结构体中的值，且它进入kernel函数ResizeHorizontalFilter()后通过调用getResizeFilterWeight()函数再以filterType获得计算函数来进一步计算scale值，进而最终返回weight值。

另外发现在kernel函数ResizeHorizontalFilter()的开始部分scale又被计算了一次，因此我觉得可以确认accelerate.c:resizeHorizontalFilter()中的scale变量是多余的。

我在GM中应该怎么处理呢？考虑到GPU并行运行的影响，scale的值不依赖各个work-group或work-item。因此我认为将scale赋值给resizeFilterScale传进kernel函数不会影响计算结果，那这样的话kernel函数中的scale计算就显得有点多余了。

备注：代码写着写着，发现个严重问题，OpenCL不支持函数指针，那怎么把过滤函数传进kernel函数呢？

<2022-03-31 Thu>

在核函数中使用`GM`的计算代码

因为OpenCL不支持传递函数指针，所以增加了过滤函数的类型参数进行传参，涉及了一系列函数调用的参数修改。

在resizeHorizontalFilter()内部计算好scale的值，采用GM的计算方法，虽然它和IM的计算方法差不多。将kernel函数中的scale计算代码移除，同时核函数ResizeHorizontalFilter()的support也通过参数传入，它和scale一样，计算放在了resizeHorizontalFilter()中，另发现核函数ResizeHorizontalFilter()中的resizeFilterBlur变量已经不再使用。

所有修改见此次commit的上个commit，修改代码比较多，但愿没引出新的问题。

<2022-04-01 Fri>

关于`AcquireCriticalMemory()`函数的异常处理（一）

不太好给AcquireCriticalMemory()添加异常处理，GM中定义好的内存分配失败的异常就那么几个，查找所有调用AcquireCriticalMemory()的地方，发现有给StringInfo，有给ImageInfo，还有给MagickCLCacheInfo等等分配内存的，在每个调用AcquireCriticalMemory()的地方抛出异常是可行的，可以使用GM中已定义好的异常类型，比如StringInfo可以用UnableToAllocateString来代替，ImageInfo可以用UnableToAllocateImage，MagickCLCacheInfo可能需要增加一个异常类型；或者在AcquireCriticalMemory()函数内部处理异常，这正是IM的处理方式，但是这样的话在AcquireCriticalMemory()内部不能明确表达出是哪种类型操作产生的异常。当然可以通过增加参数来解决，但是处理起来同样很麻烦。

目前我修改的函数是这样的，固定了它的类型为UnableToAllocateModuleInfo，没有把此修改放到源代码里，目前仅存在笔记里：

MagickExport void *AcquireCriticalMemory(const size_t len)
{
  void
    *memory;

  // Fail if memory request cannot be fulfilled.
  memory=MagickMalloc(len);
  if (memory == (void *) NULL)
    MagickFatalError3(ResourceLimitFatalError,MemoryAllocationFailed,
      UnableToAllocateModuleInfo);
  return(memory);
}

<2022-04-02 Sat>

关于`AcquireCriticalMemory()`函数的异常处理（二）

了解了一下GM的异常处理，可以这么来用：

MagickExport void *AcquireCriticalMemory(const size_t len)
{
  void
    *memory;

  ExceptionInfo
    exception;

  GetExceptionInfo(&exception);

  // Fail if memory request cannot be fulfilled.
  memory=MagickMalloc(len);
  if (memory == (void *) NULL)
    ThrowException(&exception,ResourceLimitError,MemoryAllocationFailed,
      "AcquireCriticalMemory");
  return(memory);
}

但是发现如果memory为空时ThrowException()并不能结束掉程序，它最终调用的是ThrowLoggedException()函数将其记录下来。也可以这么使用：

MagickExport void *AcquireCriticalMemory(const size_t len)
{
  void
    *memory;

  // Fail if memory request cannot be fulfilled.
  memory=MagickMalloc(len);
  if (memory == (void *) NULL)
    MagickFatalError(ResourceLimitFatalError,MemoryAllocationFailed,
      "ocl: AcquireCriticalMemory");
  return(memory);
}

这是我当前使用的方案，当指针为0时它结束掉程序，打印出如下的信息：

[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
gm display: Memory allocation failed (ocl: AcquireCriticalMemory) [Resource temporarily unavailable].

如果能把输出的信息弄得再详细点儿就更好了。

值得注意的是在GM中有MagickFatalError()，MagickFatalError2()和MagickFatalError3()三个功能相似的函数，MagickFatalError()可以使用字符串做为参数，MagickFatalError2()也可以使用字符串做为参数，但是它与MagickFatalError()的具体应用场景还不太了解，MagickFatalError3()只能使用预定义的异常类型。

关于`LiberateMagickResource()`的闪退问题（一）

我正在处理所有标注了TODO(ocl)的代码，在DestroyMagickCLCacheInfoAndPixels()函数里的代码：

// RelinquishMagickResource(MemoryResource,info->length); // TODO(ocl)
DestroyMagickCLCacheInfo(info);
// (void) RelinquishAlignedMemory(pixels); // TODO(ocl)

我这样处理之后：

// RelinquishMagickResource(MemoryResource,info->length);
LiberateMagickResource(MemoryResource,info->length);
DestroyMagickCLCacheInfo(info);
// (void) RelinquishAlignedMemory(pixels);
MagickFreeAlignedMemory(pixels);

程序闪退，打印的信息如下：

$ gm display ~/temp/1.png
gm display: abort due to signal 11 (SIGSEGV) "Segmentation Fault"...
Aborted (core dumped)

确认起因是因为使用了LiberateMagickResource(MemoryResource,info->length);这行代码。经过调试发现在GetAuthenticOpenCLBuffer()函数返回NULL后程序闪退。具体代码是：

if ((cache_info->type != MemoryCache)/*  || (cache_info->mapped != MagickFalse) */)
  return((cl_mem) NULL);

我这样修改好像不闪退了：

if ((cache_info->type != MemoryCache) || (cache_info->type != MapCache))
  return((cl_mem) NULL);

目前尚未理解LiberateMagickResource(MemoryResource,info->length);的用意，及像上面这样修改会不会引发什么新的问题。

注：必须清除.cache/ImageMagick里的所有文件才能出现闪退问题，即在跑opencl的benchmark时会出现。

<2022-04-03 Sun>

关于`LiberateMagickResource()`的闪退问题（二）

理解了一下GM的AcquireMagickResource()，LiberateMagickResource()函数，它们实际上起到监视的作用，没有分配和释放资源的功能，包括InitializeMagickResources()函数，初始化内存分配的上限，磁盘上限等等，可以通过比如MAGICK_LIMIT_MEMORY，MAGICK_LIMIT_DISK等环境变量来设置。

调试发现当在DestroyMagickCLCacheInfoAndPixels()函数中使用LiberateMagickResource()后：

// LiberateMagickResource()

case SummationLimit:
  {
    /*
      Limit depends on sum of previous allocations as well as
      the currently requested size.
    */
    LockSemaphoreInfo(info->semaphore);
    info->value-=size;
    value=info->value;
    UnlockSemaphoreInfo(info->semaphore);
    break;
  }

中的info->value-=size;可能会变成负值，这样的话，再次调用AcquireMagickResource()时可能返回失败，即：

// AcquireMagickResource()

case SummationLimit:
  {
    /*
      Limit depends on sum of previous allocations as well as
      the currently requested size.
    */
    LockSemaphoreInfo(info->semaphore);
    value=info->value+size;
    if ((info->maximum != ResourceInfinity) &&
        (value > (magick_uint64_t) info->maximum))
      {
        value=info->value;
        status=MagickFail;
      }
    else
      {
        info->value=value;
      }
    UnlockSemaphoreInfo(info->semaphore);
    break;
  }

这里的if分支，这样的话，需要找到为什么LiberateMagickResource()会将info->value的值搞成负数？

<2022-04-06 Wed>

关于`LiberateMagickResource()`的闪退问题（三）

info->value的值之所以为负数，原因其实很简单，不是AcquireMagickResource()调少了，就是LiberateMagickResource()调多了。

最终还是解决了这个问题，这是搬代码过程中自己给自己挖的一个坑，原IM中的代码是：

// IM's RelinquishPixelCachePixels()

#if defined(MAGICKCORE_OPENCL_SUPPORT)
      if (cache_info->opencl != (MagickCLCacheInfo) NULL)
        {
          cache_info->opencl=RelinquishMagickCLCacheInfo(cache_info->opencl,
            MagickTrue);
          cache_info->pixels=(Quantum *) NULL;
          break;
        }
#endif

这里的break;是关键。

<2022-04-07 Thu>

`gm benchmark`比较

仅运行一次缩放图片的话gm-ocl的速度远小于gm，而迭代100次的话，gm-ocl速度高于gm，见：

启用了硬件加速：

[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 6.35s user 4.997407s total 20.010 iter/s 15.748 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 5.99s user 4.873903s total 20.517 iter/s 16.694 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.35s user 0.830804s total 1.204 iter/s 2.857 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.30s user 0.136360s total 7.334 iter/s 3.333 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.29s user 0.814550s total 1.228 iter/s 3.448 iter/cpu
[ysouyno@arch gm-ocl]$ echo $MAGICK_OCL_DEVICE
true
[ysouyno@arch gm-ocl]$

没有启用硬件加速：

[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 40.57s user 5.829435s total 17.154 iter/s 2.465 iter/cpu
[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 42.74s user 6.115149s total 16.353 iter/s 2.340 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.31s user 0.057625s total 17.354 iter/s 3.226 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.32s user 0.057751s total 17.316 iter/s 3.125 iter/cpu
[ysouyno@arch ~]$ gm benchmark convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1 iter 0.31s user 0.057476s total 17.399 iter/s 3.226 iter/cpu
[ysouyno@arch ~]$ echo $MAGICK_OCL_DEVICE

[ysouyno@arch ~]$

分析：在启用了硬件加速后，gm-ocl每次都将加载~/.cache/ImageMagick/中的镜像，读取磁盘文件属于慢操作；而gm则没有这种加载时间的影响。当迭代100次时gm-ocl的加载时间比重就缩小了。

如果改成1000次的话，似乎gm-ocl的优势更加明显。

启用了硬件加速：

[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 6.02s user 4.814306s total 20.771 iter/s 16.611 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 1000 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1000 iter 59.54s user 43.377261s total 23.054 iter/s 16.795 iter/cpu
[ysouyno@arch gm-ocl]$ echo $MAGICK_OCL_DEVICE
true
[ysouyno@arch gm-ocl]$

没有启用硬件加速：

[ysouyno@arch ~]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 100 iter 41.49s user 5.985783s total 16.706 iter/s 2.410 iter/cpu
[ysouyno@arch ~]$ gm benchmark -iterations 1000 convert ~/temp/bg1a.jpg -resize 960x540 ~/temp/out.jpg
Results: 8 threads 1000 iter 536.90s user 77.881720s total 12.840 iter/s 1.863 iter/cpu
[ysouyno@arch ~]$ echo $MAGICK_OCL_DEVICE

[ysouyno@arch ~]$

<2022-04-10 Sun>

关于`ThrowMagickException()`的替代

刚刚我为GM增加了AccelerateEvent，因为我在opencl.c:CompileOpenCLKernel()中没有找到合适的方法替代IM的ThrowMagickException()，目前用日志代替异常（其实这个异常本身也没有结束程序的功能，况且如果这里发生错误的话也没必要结束程序）：

status=openCL_library->clBuildProgram(device->program,1,&device->deviceID,
  options,NULL,NULL);
if (status != CL_SUCCESS)
{
  // (void) ThrowMagickException(exception,GetMagickModule(),DelegateWarning,
  //   "clBuildProgram failed.","(%d)",(int)status);
  (void) LogMagickEvent(AccelerateEvent,GetMagickModule(), // TODO(ocl)
    "clBuildProgram failed: %d",(int)status);
  LogOpenCLBuildFailure(device,kernel,exception);
  return(MagickFalse);
}

如果程序走进这个if分支中，说明使用硬件加速失败，期望的行为是走原来流程，这样的话就不会因为初始化硬件加速而耽误太长时间。而实际上程序确实花费了一定的时间且走了原有流程。原IM中也是如此，我觉得这是一个问题。

<2022-04-11 Mon>

关于`*_utf8`系列函数

从IM中拷贝过来的open_utf8()，fopen_utf8()，stat_utf8()及remove_utf8()函数直接用非_utf8的函数代替。IM在windows下使用的是宽字符，所以有那样的处理。

关于`lt_dlclose()`函数

之前将lt_dlclose()函数改成了dlclose()函数，真是多此一举。因为在windows下lt_dlclose()是一个宏，它最终调用FreeLibrary()。

在linux下使用lt_dlclose()需要添加-lltdl链接选项，发现在IM中只要使用了--enable-opencl后运行./configure就自动添加上了-lltdl，我想在GM中也要实现它。

<2022-04-12 Tue>

关于`-lltdl`链接选项（一）

从早上到现在我一直在尝试--enable-opencl时自动添加上-lltdl链接选项。我参考了IM中的configure.ac中的实现，修改后GM的configure.ac的片断：

#
# Optionally check for libltdl if using it is still enabled
#
# Only use/depend on libtdl if we are building modules.  This is a
# change from previous releases (prior to 1.3.17) which supported
# loaded modules via libtdl if shared libraries were built.  of
# whether modules are built or not.
have_ltdl='no'
LIB_LTDL=''
if test "$build_modules" != 'no' || test "X$no_cl" != 'Xyes'
then
  AC_MSG_CHECKING([for libltdl ])
  AC_MSG_RESULT()
  failed=0
  passed=0
  AC_CHECK_HEADER([ltdl.h],[passed=`expr $passed + 1`],[failed=`expr $failed + 1`])
  AC_CHECK_LIB([ltdl],[lt_dlinit],[passed=`expr $passed + 1`],[failed=`expr $failed + 1`],)
  AC_MSG_CHECKING([if libltdl package is complete])
  if test $passed -gt 0
  then
    if test $failed -gt 0
    then
      AC_MSG_RESULT([no -- some components failed test])
      have_ltdl='no (failed tests)'
    else
      LIB_LTDL='-lltdl'
      LIBS="$LIB_LTDL $LIBS"
      AC_DEFINE(HasLTDL,1,[Define if using libltdl to support dynamically loadable modules])
      AC_MSG_RESULT([yes])
      have_ltdl='yes'
    fi
  else
    AC_MSG_RESULT([no])
  fi
  if test "$have_ltdl" != 'yes'
  then
    AC_MSG_FAILURE([libltdl is required by modules and OpenCL builds],[1])
  fi
fi
AM_CONDITIONAL(WITH_LTDL, test "$have_ltdl" != 'no')

然后在设置MAGICK_DEP_LIBS值的if和else分支中保证都含有$LIB_LTDL，同时注意no_cl的变量位置问题，否则上面代码段的no_cl值为空，导致上面代码段中的if分支始终能进入。

虽然经过这样的处理可以实现当使用--enable-opencl时自动加上-lltdl链接选项，但是引出了一个新的问题，当运行gm时：

[ysouyno@arch gm-ocl]$ gm display ~/temp/bg1a.jpg
gm display: No decode delegate for this image format (/home/ysouyno/temp/bg1a.jpg).
gm display: Unable to open file (Untitled) [No such file or directory].
[ysouyno@arch gm-ocl]$

经过调查发现，这是由于HasLTDL宏被启用的缘故。

关于`-lltdl`链接选项（二）

这里发现另外一个问题，在IM中也存在这个问题。

虚拟机环境中，有存在cl.h头文件，但没有libOpenCL.so的情况，这种情况下安装各种intel或者mesa的runtime均不能配置成功可运行的opencl的环境（可以从clinfo的运行结果来看），这样的话，按“关于-lltdl链接选项（一）”的修改使用--enable-opencl选项编译GM和IM的话，均编译失败：

undefined reference to `lt_dlclose'

因为如果有cl.h头文件的话，那么HAVE_CL_CL_H宏将启用，则HAVE_OPENCL宏也被启用，这样的话lt_dlclose()就可见了，但是没有opencl的链接环境，导致no_cl变量为yes，则-lltdl被忽略，从而链接失败，出现上述问题。

#if defined(HAVE_CL_CL_H)
#  include <CL/cl.h>
#  define HAVE_OPENCL 1
#endif
#if defined(HAVE_OPENCL_CL_H)
#  include <OpenCL/cl.h>
#  define HAVE_OPENCL 1
#endif

阅读了一下GM的configure.ac中关于build_modules的代码，了解到要在原生的GM中启用-lltdl，需要使用如下命令：

$ ./configure --enable-shared --with-modules

这样在lib/GraphicsMagick-1.3.35/module-Q8/coders目录中生成大量的.la文件。

我在想，我的要求只是简单的在--enable-opencl时添加一个链接选项，有必要大动干戈的修改原GM的libltdl的编译逻辑吗？我可以在configure.ac中额外处理no_cl，而不去启用HasLTDL宏？这样处理好不好？

<2022-04-13 Wed>

关于`-lltdl`链接选项（三）

赶紧结束吧，这个链接选项不能搞两天呀！既然IM也有同样的问题，那么就不考虑上面所说的，存在cl.h头文件，但没有libOpenCL.so的情况，造成链接失败。比如出现如下提示：

undefined reference to symbol 'dlsym@@GLIBC_2.2.5'

其它的测试看起来一切正常。

另因为lt_dlclose()也适用于windows平台，因此得尽快支持该平台，此平台还有好多开发，宏调整等等，得尽快完善起来。

支持`windows`

新创建了一个临时分支gm-1.3.35_ocl_win。

为configure.exe增加“Enable OpenCL”多选框
从“VisualMagick”拷贝OpenCL/CL头文件。
vs报错：error C2004: expected 'defined(id)'，因为它不支持这样的语法：

#if defined(/*MAGICKCORE_OPENMP_SUPPORT*/HAVE_OPENMP)

一些函数的MagickExport得去掉，因为重定义。
MAGICKCORE_WINDOWS_SUPPORT替换为MSWINDOWS。

<2022-04-14 周四>

`windows`平台不能生成内核的`.bin`文件

问题出在CacheOpenCLKernel()函数中：

static void CacheOpenCLKernel(MagickCLDevice device,char *filename,
  ExceptionInfo *exception)
{
  cl_uint
    status;

  size_t
    binaryProgramSize;

  unsigned char
    *binaryProgram;

  status=openCL_library->clGetProgramInfo(device->program,
    CL_PROGRAM_BINARY_SIZES,sizeof(size_t),&binaryProgramSize,NULL);
  if (status != CL_SUCCESS)
    return;
  binaryProgram=(unsigned char*) AcquireQuantumMemory(1,binaryProgramSize);
  if (binaryProgram == (unsigned char *) NULL)
    {
      (void) ThrowException(exception,
        ResourceLimitError,MemoryAllocationFailed,"CacheOpenCLKernel");
      return;
    }
  status=openCL_library->clGetProgramInfo(device->program,
    CL_PROGRAM_BINARIES,sizeof(unsigned char*),&binaryProgram,NULL);
  if (status == CL_SUCCESS)
    {
      (void) LogMagickEvent(AccelerateEvent,GetMagickModule(),
        "Creating cache file: \"%s\"",filename);
      (void) BlobToFile(filename,binaryProgram,binaryProgramSize,exception);
    }
  binaryProgram=(unsigned char *) RelinquishMagickMemory(binaryProgram);
}

因为clGetProgramInfo()的返回值是-30对应CL_INVALID_VALUE错误，修改之后可以临时解决：

static void CacheOpenCLKernel(MagickCLDevice device, char* filename,
  ExceptionInfo* exception)
{
  cl_uint
    status;

  size_t
    * binaryProgramSize,
    num_binaries;

  unsigned char
    ** binaryProgram;

  status = openCL_library->clGetProgramInfo(device->program, CL_PROGRAM_BINARY_SIZES,
    0, 0, &num_binaries);
  if (status != CL_SUCCESS)
    return;
  num_binaries = num_binaries / sizeof(size_t);
  binaryProgramSize = (size_t*)malloc(num_binaries * sizeof(size_t));
  binaryProgram = (const unsigned char**)calloc(num_binaries, sizeof(unsigned char*));
  status = openCL_library->clGetProgramInfo(device->program,
    CL_PROGRAM_BINARY_SIZES, num_binaries * sizeof(size_t), binaryProgramSize, NULL);
  LogMagickEvent(AccelerateEvent, GetMagickModule(), "clGetProgramInfo return: %d", status);
  if (status != CL_SUCCESS)
    return;
  for (int i = 0; i < num_binaries; ++i) {
    if (binaryProgramSize[i]) {
      binaryProgram[i] = (unsigned char*)AcquireQuantumMemory(1, binaryProgramSize[i]);
      if (binaryProgram[i] == (unsigned char*)NULL)
      {
        (void)ThrowException(exception,
          ResourceLimitError, MemoryAllocationFailed, "CacheOpenCLKernel");
        return;
      }
      status = openCL_library->clGetProgramInfo(device->program,
        CL_PROGRAM_BINARIES, num_binaries * sizeof(unsigned char*), binaryProgram, NULL);
      if (status == CL_SUCCESS)
      {
        (void)LogMagickEvent(AccelerateEvent, GetMagickModule(),
          "Creating cache file: \"%s\"", filename);
        (void)BlobToFile(filename, binaryProgram[i], binaryProgramSize[i], exception);
      }
      binaryProgram = (unsigned char*)RelinquishMagickMemory(binaryProgram);
    }
  }
}

`_aligned_free()`和`free()`不匹配的问题

在windows平台下这个问题出现了，之前相同代码在linux上运行的很好。

经调查发现是因为在DestroyMagickCLCacheInfoAndPixels()函数中使用的是：

MagickFreeAlignedMemory(pixels);

来清除在pixel_cache.c:OpenCache()中由：

MagickReallocMemory(PixelPacket *,cache_info->pixels,(size_t) offset);

申请的内存，造成不匹配。

因此DestroyMagickCLCacheInfoAndPixels()中改为MagickFreeMemory()来释放内存。这里的做法不同于IM，在GM中未使用对齐的内存，就像pixel_cache.c:DestroyCacheInfo()那样释放内存一样，才不会出问题。

  /*
    Release Cache Pixel Resources
  */
  if (MemoryCache == cache_info->type)
    {
#if defined(HAVE_OPENCL)
      if (cache_info->opencl != (MagickCLCacheInfo) NULL)
        {
          cache_info->opencl=RelinquishMagickCLCacheInfo(cache_info->opencl,
            MagickTrue);
          cache_info->pixels=(Quantum *) NULL;
        }
#else
      MagickFreeMemory(cache_info->pixels);
      LiberateMagickResource(MemoryResource,cache_info->length);
#endif
    }

OpenCL中不使用对齐内存对性能影响很大，这可以作为一个性能优化点。

此外MAGICKCORE_HAVE__ALIGNED_MALLOC宏应该被替换为HAVE__ALIGNED_MALLOC。

关于`R6025 pure virtual function call`的问题

调试状态下关闭程序可以看到异常出在：

static MagickCLDevice RelinquishMagickCLDevice(MagickCLDevice device)
{
  if (device == (MagickCLDevice) NULL)
    return((MagickCLDevice) NULL);

  device->platform_name=RelinquishMagickMemory(device->platform_name);
  device->vendor_name=RelinquishMagickMemory(device->vendor_name);
  device->name=RelinquishMagickMemory(device->name);
  device->version=RelinquishMagickMemory(device->version);
  if (device->program != (cl_program) NULL)
    (void) openCL_library->clReleaseProgram(device->program);
  while (device->command_queues_index >= 0)
    (void) openCL_library->clReleaseCommandQueue(
      device->command_queues[device->command_queues_index--]);
  RelinquishSemaphoreInfo(&device->lock);
  return((MagickCLDevice) RelinquishMagickMemory(device));
}

这里：

if (device->program != (cl_program) NULL)
  (void) openCL_library->clReleaseProgram(device->program);

问题很诡异，有时一直出现，有时一直正常。

R6025
- pure virtual function call
Unhandled exception at 0x05E88292 (intelocl32.dll) in IMDisplay.exe: Fatal program exit requested.

`IM`和`GM`函数及宏对照

Type	IM	GM
FUNC	SyncImagePixelCache	SyncImagePixelsEx
FUNC	OpenPixelCache	OpenCache
MACRO	MAGICKCORE_QUANTUM_DEPTH	QuantumDepth
FUNC	GetImagePixelCache	ModifyCache
FUNC	GetVirtualPixelCacheNexus	AcquireCacheNexus
FUNC	PersistPixelCache	PersistCache
MACRO	MAGICKCORE_OPENCL_SUPPORT	HAVE_OPENCL
MACRO	MAGICKCORE_HAVE__ALIGNED_MALLOC	HAVE__ALIGNED_MALLOC
MACRO	MAGICKCORE_WINDOWS_SUPPORT	MSWINDOWS

<2022-04-18 周一>

关于`clGetPlatformIDs()`在`windows`下的怪现象

我在调查R6025的问题，调试发现LoadOpenCLDevices()函数中：

number_platforms=0;
if (openCL_library->clGetPlatformIDs(0,NULL,&number_platforms) != CL_SUCCESS)
  return;

number_platforms的返回值在IM中为1，而在GM中却为2，同样的代码在同一台电脑上为什么返回的值不同呢？排除了编译选项的问题，查看了clGetPlatformIDs()的函数说明都没有找到问题，最后发现是IM是vs2017编译环境，而GM是vs2022编译环境，将它们都改用vs2017环境后调试状态下number_platforms的值都为1了。

但是如果number_platforms为1的话，在本机电脑上实际上硬件加速是失败的，可是直接运行的话确实是走的硬件加速流程，难道代码中clGetPlatformIDs()在调试状态和非调试状态下的值不一样？添加日志输出后发现确实是这样，那么总结一下clGetPlatformIDs()的怪现象：

vs2017，调试，number_platforms值为1。
vs2022，调试，number_platforms值为2。
vs2017，运行，number_platforms值为2。

<2022-04-19 周二>

如何修复`R6025 pure virtual function call`问题

运气好，上午修复了这个问题。

过程中也经历了尝试vs2010编译，尝试更换电脑环境等，均没有找到原因，同时我也在怀疑会不会是IMDisplay.exe这个外部的测试程序问题引起的？毕竟IM和GM的这个IMDisplay.exe程序相差也是很大的。

同时了解调用OpenCLTerminus()的所有地方，在nt_base.c:DllMain()的DLL_PROCESS_DETACH里面调用DestroyMagick();是不是不好？

case DLL_PROCESS_DETACH:
  DestroyMagick();
  break;

另IMDisplay.cpp中只有InitInstance()，却没有ExitInstance()，是不是应该显示调用MagickLib::DestroyMagick();比交给系统处理DLL_PROCESS_DETACH更好呢？

值得注意的是如果将IM的如下代码修改成这样：

BOOL CIMDisplayApp::ExitInstance()
{
  // Magick::TerminateMagick();
  return CWinApp::ExitInstance();
}

同时保证IM的magick-baseconfig.h中的ProvideDllMain宏启用：

/*
  When building ImageMagick using DLLs, include a DllMain()
  function which automatically invokes MagickCoreGenesis(NULL), and
  MagickCoreTerminus() so that the user doesn't need to. This is disabled
  by default.
*/
#define ProvideDllMain

则IM也同样会出现R6025的错误。

<2022-04-21 周四>

如何写`ScaleImage()`的硬件加速函数（一）

分析ScaleImage()函数及考虑参数如何传递：

函数中最大的循环是按Y垂直方向的，这样每次内循环以X水平方向进行
最大的循环中以两个if-else分支为主，分别比较目标宽高是否等于原始宽高，所以可以考虑以两个无符号整形代表（因为kernel函数不支持布尔类型，1表示相等，0表示不等）或者也可以直接传入原始宽高和目标高宽，在kernel函数中比较是否相等
选择传入目标宽高和原始宽高比较好，这样x_scale和y_scale也可以在kernel函数中计算
该函数中有申请可变长的动态数组，按照原始图片的宽度为长度申请堆内存。参考ResizeImage()和AccelerateResizeImage()函数，它们没有用到可变长动态数组，在kernel函数中申请动态数组明显不合适，因为opencl不支持动态数组
继续参考ResizeImage()和AccelerateResizeImage()函数，用到了三个图片内存传入kernel函数，分别是imageBuffer，tempImageBuffer和filteredImageBuffer，原以为参数传递错了，为什么在resizeHorizontalFilter()和resizeVerticalFilter()函数中只有一个函数用到了filteredImageBuffer变量？
仔细分析resizeHorizontalFilter()和resizeVerticalFilter()函数发现它们的第五个参数是输入参数，第九个参数是输出参数，tempImageBuffer是临时内存，做为第一个函数的输出参数和第二个函数的输入参数，代码如下：

outputReady=resizeHorizontalFilter(device,queue,image,filteredImage,
  imageBuffer,matte_or_cmyk,(cl_uint) image->columns,
  (cl_uint) image->rows,tempImageBuffer,(cl_uint) resizedColumns,
  (cl_uint) image->rows,filter_type,filter_info,blur,
  cubicCoefficientsBuffer,xFactor,exception);
if (outputReady == MagickFalse)
  goto cleanup;

outputReady=resizeVerticalFilter(device,queue,image,filteredImage,
  tempImageBuffer,matte_or_cmyk,(cl_uint) resizedColumns,
  (cl_uint) image->rows,filteredImageBuffer,(cl_uint) resizedColumns,
  (cl_uint) resizedRows,filter_type,filter_info,blur,
  cubicCoefficientsBuffer,yFactor,exception);
if (outputReady == MagickFalse)
  goto cleanup;

那似乎ScaleImage()里用不到这个临时内存，因为函数结构有区别
以原图宽高为长度的动态数组是不是可以在调用kernel函数之前申请好，用于临时内存？

如何写`ScaleImage()`的硬件加速函数（二）

昨天搞了一天也没有搞出来kernel函数怎么写，还得仔细分析一下ScaleImage()函数流程：

从GM的ScaleImage()入手，它比IM好懂
大循环的第一个if-else分支处理Y方向，即垂直方向，它用到两个动态数组x_vector和y_vector，它们的长度相等，都是原图的宽度，y_scale小于1是缩小，y_scale大于1是放大，不管是放大还是缩小，都会先读一行原图像素放到x_vector中
如果是缩小的话，各像素乘以y_scale的结果存放到y_vector中，可能会继续读取下一行原图进行累积计算
如果是放大的话，会将y_vector+y_span*x_vector的结果放到一个临时变量pixel中，之所以要放到pixel中是因为要处理计算结果大于255.0的情况，且可能y_vector在这里首次被使用，所以它申请内存时必须初始始化为0，所以它用的是MagickAllocateClearedArray()函数

y_vector=MagickAllocateClearedArray(DoublePixelPacket *,
                                    image->columns,sizeof(DoublePixelPacket));

pixel的结果是存到s中，而s=scanline;，且scanline=x_vector;，所以到这里x_vector存放的是Y方向的处理结果
然后这里到第二个if-else分支，即处理X方向，代码同第一个if-else分支大同小异，但要注意else，它有一个稍大的循环。最终结果存在t即scale_scanline中
scale_scanline是以一个以目标宽度为长度的动态数组

我尝试写的kernel函数模仿了ScaleImage()的很多代码，实际上不能工作，以试着重新理解opencl的方式，理解work-group和work-item，仅有的收获在：

STRINGIFY(
__kernel // __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum *inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4 *outputPixelCache, __local float *densityCache, __local float *gammaCache)
{
  const int x=get_global_id(0);
  const int y=get_global_id(1);
  const unsigned int columns=get_global_size(0);
  int cy=y;

  float4 pixel=ReadAllChannels(inputImage,4,columns,x,cy);

  pixel/=4.5;
  WriteAllChannels(filteredImage,4,filteredColumns,
    x*filteredColumns/inputColumns,y*filteredRows/inputRows,pixel);
}
)

似乎生成的图片没有变形，我加了pixel/=4.5;这行代码是为了调试方便，它的效果是使图片变暗。仅此简单的代码也能完成缩放功能（备注：缩小没问题，放大不行），但是WriteAllChannels()的x和y坐标要从work-item的视角看work-group，目前只能以x*filteredColumns/inputColumns和y*filteredRows/inputRows来代替，以验证我对work-group和work-item的理解，从上面的代码看，我似乎理解了一些。

参照下面的IM代码理解：

__kernel void Contrast(__global CLQuantum *image,
  const unsigned int number_channels,const int sign)
{
  const int x=get_global_id(0);
  const int y=get_global_id(1);
  const unsigned int columns=get_global_size(0);

  float4 pixel=ReadAllChannels(image,number_channels,columns,x,y);
  if (number_channels < 3)
    pixel.y=pixel.z=pixel.x;

  pixel=ConvertRGBToHSB(pixel);
  float brightness=pixel.z;
  brightness+=0.5f*sign*(0.5f*(sinpi(brightness-0.5f)+1.0f)-brightness);
  brightness=clamp(brightness,0.0f,1.0f);
  pixel.z=brightness;
  pixel=ConvertHSBToRGB(pixel);

  WriteAllChannels(image,number_channels,columns,x,y,pixel);
}
)

此外我觉得没有必要学AccelerateResizeImage()函数去增加filteredImageBuffer变量，可以学IM的AccelerateContrastImage()函数，在ComputeContrastImage()中直接调用kernel函数，这样可以少一层函数调用。

<2022-04-22 Fri>

如何写`ScaleImage()`的硬件加速函数（三）

在“如何写ScaleImage()的硬件加速函数（二）”中介绍的kernel函数的写法可能会产生如下现象：

ScaleFilter()不是总能被成功调用
每次修改过ScaleFilter()后，有时在~.cache/GraphicsMagick目录中不会生成新的.bin文件
这种情况下，调试发现在getOpenCLEnvironment(exception);处就失败返回

MagickPrivate Image *AccelerateScaleImage(const Image *image,
  const size_t scaledColumns,const size_t scaledRows,
  ExceptionInfo *exception)
{
  Image
    *filteredImage;

  MagickCLEnv
    clEnv;

  assert(image != NULL);
  assert(exception != (ExceptionInfo *) NULL);

  if (checkAccelerateCondition(image) == MagickFalse)
    return((Image *) NULL);

  clEnv=getOpenCLEnvironment(exception);
  if (clEnv == (MagickCLEnv) NULL)
    return((Image *) NULL);

  filteredImage=ComputeScaleImage(image,clEnv,scaledColumns,scaledRows,
    exception);
  return(filteredImage);
}

重启电脑似乎不能校正这种问题，但第二天开机这个问题就没有了，难道我的ScaleFilter()函数让CPU或者GPU内部错乱了？
没添加额外调试输出前，没有找到任何异常日志
忘说了一个关键问题，这两天电脑已经发现死机两次，包括今天早上这次，刚输入完密码回车后就死机

<2022-04-26 周二>

如何写`ScaleImage()`的硬件加速函数（四）

经过这两天的尝试，越来越对ScaleImage()用硬件加速实现这件事感到怀疑，因为似乎没有发现这个函数的硬件加速版本能带来很好的性能，当然我这个OpenCL新手写的代码连我自己也不敢恭维，这也是一方面的原因，甚至可能占比很高。

正如前面日志所说的能参考的代码只有ResizeHorizontalFilter()和ResizeVerticalFilter()，但是这要修改accelerate.c:scaleFilter()函数，在调用EnqueueOpenCLKernel()的地方要传入lsize参数，而不是现在的NULL，且可能要增加：

// for ResizeHorizontalFilter()
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))

或者：

// for ResizeVerticalFilter()
__kernel __attribute__((reqd_work_group_size(1, 256, 1)))

否则程序将卡死在ScaleImage()的硬件加速函数ScaleFilter()中。

此外在“如何写ScaleImage()的硬件加速函数（二）”中ScaleFilter()的简单实现中，对于缩小操作可以达到效果，但放大操作的图片像个筛子，所以我这个基础上我“完善”了一下，缩放后的效果还能勉强看出原图的轮廓，虽然效果不好但至少不是筛子了，尝试的代码如下：

// 大方块
{
  const int x = get_global_id(0);
  const int y = get_global_id(1);
  const unsigned int columns = get_global_size(0);

  float4 pixel = ReadAllChannels(inputImage, 4, columns, x, y);
  int num_x = get_global_size(0) / get_local_size(0);
  int num_y = get_global_size(1) / get_local_size(1);

  int dst_local_w = (filteredColumns + num_x - 1) / num_x;
  int dst_local_h = (filteredRows + num_y - 1) / num_y;

  for (int j = 0; j < dst_local_h; ++j)
    for (int i = 0; i < dst_local_w; ++i) {
      // pixel_res/=4.5;
      int filtered_x = x / get_local_size(0) * dst_local_w + i;
      int filtered_y = y / get_local_size(1) * dst_local_h + j;
      WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, pixel);
    }
}

或者可以用它的“升级”版本：

// 比大方块好点儿
{
  const int x = get_global_id(0);
  const int y = get_global_id(1);
  const unsigned int columns = get_global_size(0);
  int cx = x;
  int cy = y;

  float4 pixel = ReadAllChannels(inputImage, 4, columns, cx, cy);
  int num_x = get_global_size(0) / get_local_size(0);
  int num_y = get_global_size(1) / get_local_size(1);

  int dst_local_w = (filteredColumns + num_x - 1) / num_x;
  int dst_local_h = (filteredRows + num_y - 1) / num_y;

  for (int j = 0; j < dst_local_h; ++j) {
    int filtered_y = y / get_local_size(1) * dst_local_h + j;
    if (fabs((float)filtered_y / filteredRows - (float)y / inputRows) < 0.1) {
      cy++;
    }

    for (int i = 0; i < dst_local_w; ++i) {
      int filtered_x = x / get_local_size(0) * dst_local_w + i;

      if (fabs((float)filtered_x / filteredColumns - (float)x / inputColumns) < 0.1) {
        pixel = ReadAllChannels(inputImage, 4, columns, cx++, cy);
      }
      WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, pixel);
    }
  }
}

在我心里，对参考ResizeHorizontalFilter()和ResizeVerticalFilter()来实现ScaleFilter()有一种执着，花了许多时间，也总结了遇到的一些坑：

如果不修改EnqueueOpenCLKernel()函数的调用方式的话，可能在某个时刻电脑就死机了
只使用ResizeHorizontalFilter()和ResizeVerticalFilter()其中一个函数的代码来改写，似乎达不到效果，最好的效果是水平方向已缩小，垂直方向只显示了原图的上半部
好像我是这么修改的可以达到上面第二点提到的最好效果：

event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements * 2, 0);
wait_group_events(1, &e);

for (unsigned int i = startStep; i < stopStep * 2; i++, cacheIndex++)

如果要参考其中之一的话，估计渐渐改着改着会发现我需要用到两个函数，会是ScaleHorizontalFilter()和ScaleVerticalFilter()，然后到最后会是一版不一样的ResizeImage()的硬件加速版本，这样的话，意义在哪里？

所以我昨天下午又换回了自己来写，目前放大缩小操作可以实现，但效果差，会像好多小方格拼成的图片那样，且缩放速度相比原函数慢好多。另我知道的还有一个小问题，即缩小后的图片宽度比显示部分要宽，代码先贴在这里，因为有新任务要处理。

修改了accelerate.c的scaleFilter()函数：

static MagickBooleanType scaleFilter(MagickCLDevice device,
  cl_command_queue queue,const Image *image,Image *filteredImage,
  cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
  cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
  ExceptionInfo *exception)
{
  cl_kernel
    scaleKernel;

  cl_int
    status;

  const unsigned int
    workgroupSize = 256;

  float
    scale=1.0;

  int
    numCachedPixels;

  MagickBooleanType
    outputReady;

  size_t
    gsize[2],
    i,
    imageCacheLocalMemorySize,
    lsize[2],
    totalLocalMemorySize,
    x_vector,
    y_vector,
    y_volumes;

  unsigned int
    chunkSize,
    pixelPerWorkgroup;

  scaleKernel=NULL;
  outputReady=MagickFalse;

  scale=1.0/scale; // TODO(ocl)

  if (scaledColumns < workgroupSize)
  {
    chunkSize=32;
    pixelPerWorkgroup=32;
  }
  else
  {
    chunkSize=workgroupSize;
    pixelPerWorkgroup=workgroupSize;
  }

DisableMSCWarning(4127)
  while(1)
RestoreMSCWarning
  {
    /* calculate the local memory size needed per workgroup */
    numCachedPixels=(int) pixelPerWorkgroup;
    imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
    totalLocalMemorySize=imageCacheLocalMemorySize;

    /* local size for the pixel accumulator */
    x_vector=chunkSize*sizeof(cl_float4);
    totalLocalMemorySize+=x_vector;

    /* local memory size for the weight accumulator */
    y_vector=chunkSize*sizeof(cl_float4);
    totalLocalMemorySize+=y_vector;

    /* local memory size for the gamma accumulator */
    y_volumes =chunkSize*sizeof(float);
    totalLocalMemorySize+=y_volumes;

    if (totalLocalMemorySize <= device->local_memory_size)
      break;
    else
    {
      pixelPerWorkgroup=pixelPerWorkgroup/2;
      chunkSize=chunkSize/2;
      if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
      {
        /* quit, fallback to CPU */
        goto cleanup;
      }
    }
  }

  scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
  if (scaleKernel == (cl_kernel) NULL)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
    goto cleanup;
  }

  i=0;
  status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
  status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
  status|=SetOpenCLKernelArg(scaleKernel,i++,x_vector,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,y_vector,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,y_volumes,NULL);

  if (status != CL_SUCCESS)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
    goto cleanup;
  }

  gsize[0]=image->columns;
  gsize[1]=image->rows;
  outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
    (const size_t *) NULL,gsize,(Image *)NULL,image,filteredImage,MagickFalse,
    exception);

cleanup:

  if (scaleKernel != (cl_kernel) NULL)
    ReleaseOpenCLKernel(scaleKernel);

  return(outputReady);
}

和accelerate-kernels-private.h的ScaleFilter()函数：

STRINGIFY(
__kernel // __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum *inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4 *x_vector, __local float4 *y_vector, __local float *y_volumes)
{
  if (get_local_size(0) > pixelChunkSize)
    return;

  const int x = get_global_id(0);
  const int y = get_global_id(1);
  const unsigned int columns = get_global_size(0);
  int cx = x;
  int cy = y;

  for (int i = 0; i < pixelChunkSize; ++i) {
    x_vector[i] = (float4)0.0;
    y_vector[i] = (float4)0.0;
    y_volumes[i] = 0.0;
  }

  float4 pixel1 = ReadAllChannels(inputImage, 4, columns, x, y);
  int num_x = get_global_size(0) / get_local_size(0);
  int num_y = get_global_size(1) / get_local_size(1);

  int dst_local_w = (filteredColumns + num_x - 1) / num_x;
  int dst_local_h = (filteredRows + num_y - 1) / num_y;

  int startx = x;
  int stopx = MagickMin(startx + pixelChunkSize, inputColumns);
  float y_scale = (float)filteredRows / inputRows;
  float y_span = 1.0;
  int next_row = 1;
  __local float4* s = x_vector;
  float4 pixel = 0.0;
  float factor = 0.0;
  int next_col = 0;
  float x_scale = 0.0;
  float x_span = 1.0;
  float x_volume = 0.0;
  float4 result[256];
  float4* t = result;

  for (int j = 0; j < dst_local_h; ++j) {
    if (filteredRows == inputRows) {
      for (int i = 0; i < get_local_size(0); ++i) {
        x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
      }
    }
    else {
      while (y_scale < y_span) {
        if (next_row) {
          for (int i = 0; i < get_local_size(0); ++i) {
            x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
          }
          cy++;
        }
        for (int i = 0; i < get_local_size(0); ++i) {
          if (x_vector[i].w < 255.0)
            y_volumes[i] += y_scale;
          y_vector[i] += y_scale * x_vector[i];
        }
        y_span -= y_scale;
        y_scale = (float)filteredRows / inputRows;
        next_row = 1;
      }
      if (next_row) {
        for (int i = 0; i < get_local_size(0); ++i) {
          x_vector[i] = ReadAllChannels(inputImage, 4, columns, x + i, cy);
        }
        cy++;
        next_row = 0;
      }
      for (int i = 0; i < get_local_size(0); ++i) {
        if (x_vector[i].w < 255.0)
          y_volumes[x] += y_span;
        pixel = y_vector[i] + y_span * x_vector[i];
        if (y_volumes[i] > 0.0 && y_volumes[i] < 1.0) {
          factor = 1 / y_volumes[i];
          pixel *= factor;
        }
        s->x = pixel.x > 255.0 ? 255.0 : pixel.x;
        s->y = pixel.y > 255.0 ? 255.0 : pixel.y;
        s->z = pixel.z > 255.0 ? 255.0 : pixel.z;
        s->w = pixel.w > 255.0 ? 255.0 : pixel.w;
        s++;
        y_vector[i] = 0.0;
        y_volumes[i] = 0.0;
      }
      y_scale -= y_span;
      if (y_scale < 0) {
        y_scale = (float)filteredRows / inputRows;
        next_row = 1;
      }
      y_span = 1.0;
    }
    if (filteredColumns == inputColumns) {
      //
    }
    else {
      pixel = 0.0;
      s = x_vector;
      for (int i = 0; i < get_local_size(0); ++i) {
        x_scale = (float)filteredColumns / inputColumns;
        while (x_scale >= x_span) {
          if (next_col) {
            if (x_volume < 0.0 && x_volume < 1.0) {
              factor = 1 / x_volume;
              *t *= factor;
            }
            x_volume = 0.0;
            pixel = 0.0;
            t++;
          }
          if (s->w < 255.0)
            x_volume += x_span;
          pixel += x_span * *s;
          t->x = pixel.x > 255.0 ? 255.0 : pixel.x;
          t->y = pixel.y > 255.0 ? 255.0 : pixel.y;
          t->z = pixel.z > 255.0 ? 255.0 : pixel.z;
          t->w = pixel.w > 255.0 ? 255.0 : pixel.w;
          x_scale -= x_span;
          x_span = 1.0;
          next_col = 1;
        }
        if (x_scale > 0.0) {
          if (next_col) {
            if (x_volume > 0.0 && x_volume < 1.0) {
              factor = 1 / x_volume;
              *t *= factor;
            }
            x_volume = 0.0;
            pixel = 0.0;
            next_col = 0;
            t++;
          }
          if (s->w < 255.0)
            x_volume += x_scale;
          pixel += x_scale * *s;
          x_span -= x_scale;
        }
        s++;
      }
      if (x_span > 0.0) {
        s--;
        if (s->w < 255.0)
          x_volume += x_scale;
        pixel += x_span * *s;
      }
      if (!next_col && ((t - result) < filteredColumns)) {
        t->x = pixel.x > 255.0 ? 255.0 : pixel.x;
        t->y = pixel.y > 255.0 ? 255.0 : pixel.y;
        t->z = pixel.z > 255.0 ? 255.0 : pixel.z;
        t->w = pixel.w > 255.0 ? 255.0 : pixel.w;
      }
      t = result;
    }

    for (int i = 0; i < dst_local_w; ++i) {
      int filtered_x = x / get_local_size(0) * dst_local_w + i;
      int filtered_y = y / get_local_size(1) * dst_local_h + j;
      WriteAllChannels(filteredImage, 4, filteredColumns, filtered_x, filtered_y, *(t + i));
    }
  }
}
)

<2022-04-27 周三>

如何写`ScaleImage()`的硬件加速函数（五）

晚上做梦都在一直想这事儿，早上花了一个多小时，小有成果。

这里是参考ResizeHorizontalFilter()，居然把之前没有想明白的一些代码整清楚了：

accelerate.c:resizeHorizontalFilter()中传参gsize和lsize的地方，是拿目标宽高进行计算的，我的脑海中却一直用原始宽高去理解

gsize[0]=(resizedColumns+pixelPerWorkgroup-1)/pixelPerWorkgroup*
  workgroupSize;
gsize[1]=resizedRows;
lsize[0]=workgroupSize;
lsize[1]=1;
outputReady=EnqueueOpenCLKernel(queue,horizontalKernel,2,
  (const size_t *) NULL,gsize,lsize,image,filteredImage,MagickFalse,
  exception);

这里的用意是在kernel函数ResizeHorizontalFilter()中可以方便的使用get_global_id(0)和get_local_id(0)，不用像我在“如何写ScaleImage()的硬件加速函数（四）”中那样去计算dst_local_w和filtered_x了
但是似乎这里引出了另外一个问题，那原始宽高怎么用于计算呢？答案就是get_group_id(0)

贴上目前开发一半的代码，效果是：

原图缩小一倍，水平方向显示原图全部，但被压缩一半；垂直方向显示原图上半部
原图放大一倍，垂直方向显示原图全部，但被压缩一半；水平方向显示原图全部，但被拉长一倍

附scaleFilter()和ScaleFilter()代码：

static MagickBooleanType scaleFilter(MagickCLDevice device,
  cl_command_queue queue,const Image *image,Image *filteredImage,
  cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
  cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
  ExceptionInfo *exception)
{
  cl_kernel
    scaleKernel;

  cl_int
    status;

  const unsigned int
    workgroupSize = 256;

  float
    scale=1.0;

  int
    numCachedPixels;

  MagickBooleanType
    outputReady;

  size_t
    gammaAccumulatorLocalMemorySize,
    gsize[2],
    i,
    imageCacheLocalMemorySize,
    pixelAccumulatorLocalMemorySize,
    lsize[2],
    totalLocalMemorySize,
    weightAccumulatorLocalMemorySize;

  unsigned int
    chunkSize,
    pixelPerWorkgroup;

  int
    scale_ratio = 8; // related to the upper limit of zoom in?

  scaleKernel=NULL;
  outputReady=MagickFalse;

  scale=1.0/scale; // TODO(ocl)

  if (scaledColumns < workgroupSize)
  {
    chunkSize=32;
    pixelPerWorkgroup=32;
  }
  else
  {
    chunkSize=workgroupSize;
    pixelPerWorkgroup=workgroupSize;
  }

DisableMSCWarning(4127)
  while(1)
RestoreMSCWarning
  {
    /* calculate the local memory size needed per workgroup */
    numCachedPixels=(int) pixelPerWorkgroup;
    imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
    totalLocalMemorySize=imageCacheLocalMemorySize;

    /* local size for the pixel accumulator */
    pixelAccumulatorLocalMemorySize=chunkSize*sizeof(cl_float4);
    totalLocalMemorySize+=pixelAccumulatorLocalMemorySize;

    /* local memory size for the weight accumulator */
    weightAccumulatorLocalMemorySize=chunkSize*sizeof(float);
    totalLocalMemorySize+=weightAccumulatorLocalMemorySize;

    /* local memory size for the gamma accumulator */
    gammaAccumulatorLocalMemorySize=chunkSize*sizeof(float);
    totalLocalMemorySize+=gammaAccumulatorLocalMemorySize;

    if (totalLocalMemorySize <= device->local_memory_size)
      break;
    else
    {
      pixelPerWorkgroup=pixelPerWorkgroup/2;
      chunkSize=chunkSize/2;
      if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
      {
        /* quit, fallback to CPU */
        goto cleanup;
      }
    }
  }

  scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
  if (scaleKernel == (cl_kernel) NULL)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
    goto cleanup;
  }

  i=0;
  status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
  status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
  status|=SetOpenCLKernelArg(scaleKernel,i++,pixelAccumulatorLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,weightAccumulatorLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,gammaAccumulatorLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&scale_ratio);

  if (status != CL_SUCCESS)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
    goto cleanup;
  }

  gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
    workgroupSize;
  gsize[1] = scaledRows;
  lsize[0] = workgroupSize;
  lsize[1] = 1;
  outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
    (const size_t *) NULL,gsize,(Image *)NULL,image,filteredImage,MagickFalse,
    exception);

cleanup:

  if (scaleKernel != (cl_kernel) NULL)
    ReleaseOpenCLKernel(scaleKernel);

  return(outputReady);
}

STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum *inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache,
    const int scale_ratio)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4 * scale_ratio;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          filteredPixel += cp;
        }
      }
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

如何写`ScaleImage()`的硬件加速函数（六）

不管什么事儿看来都怕琢磨，如果连做梦都能梦到你正在琢磨的事儿，估计离成功也就不远了。似乎目前已经达到了最好的效果，离目标越来越近了。

要理解clEnqueueNDRangeKernel()函数的第五第六个参数意义，但目前为止只能说暂时理解了

cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
  cl_kernel kernel,
  cl_uint work_dim,
  const size_t *global_work_offset,
  const size_t *global_work_size,
  const size_t *local_work_size,
  cl_uint num_events_in_wait_list,
  const cl_event *event_wait_list,
  cl_event *event)

第五第六个参数要结合__attribute__，否则无法调用kernel函数

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))

来回理解ResizeImage()和ScaleImage()函数的实现，对它们的内部逻辑流程了如指掌了可以说
我选择以ResizeHorizontalFilter()为模板修改，之所以ResizeImage()处理速度慢，因为它的处理效果好，且有多种过滤效果可供选择，ResizeHorizontalFilter()是处理水平方向缩放，所以它将读入一整行原图像素，这正好和ScaleImage()的最内层循环处理方式相同
ResizeHorizontalFilter()的最内层循环（如下），因为有累加操作，所以这正是处理水平缩放的操作

for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
  /* float weight = getResizeFilterWeight(resizeFilterCubicCoefficients,
    (ResizeWeightingFunctionType) resizeFilterType,
    (ResizeWeightingFunctionType) resizeWindowType,
    resizeFilterScale, resizeFilterWindowSupport,
    resizeFilterBlur, scale*(start + i - bisect + 0.5)); */

  float weight = getResizeFilterWeightForGM(resizeFilterType,
    scale*(start + i - bisect + 0.5), support);

  float4 cp = (float4)0.0f;

  __local CLQuantum *p = inputImageCache + (cacheIndex*4);
  cp.x = (float) *(p);
  cp.y = (float) *(p + 1);
  cp.z = (float) *(p + 2);

  if (matte_or_cmyk != 0)
  {
    cp.w = (float) *(p + 3);

    // float alpha = weight * QuantumScale * cp.w;
    // error: use of type 'double' requires cl_khr_fp64 support
    float alpha = weight * (1 - (float) cp.w / 255);

    filteredPixel.x += alpha * cp.x;
    filteredPixel.y += alpha * cp.y;
    filteredPixel.z += alpha * cp.z;
    filteredPixel.w += weight * cp.w;
    gamma += alpha;
  }
  else
    filteredPixel += ((float4) weight)*cp;

  density += weight;
}

“如何写ScaleImage()的硬件加速函数（五）”的问题在于没有办法处理图片下半部分（如何缩小一半的话），这里主要是因为y变量的限定（代码如下），因为传入kernel函数的gsize[1]=resizedRows;被限定的死死的

const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y);

目前只考虑按比例缩放，所以这里的y需要除以缩放比
缩小后图片如果垂直方向相间着黑色宽竖条，那可能是因为numCachedPixels参数没有计算正确，这正可以修复“如何写ScaleImage()的硬件加速函数（五）”中的scale_ratio变量。

numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/xFactor+2*support);

附上目前代码：

static MagickBooleanType scaleFilter(MagickCLDevice device,
  cl_command_queue queue,const Image *image,Image *filteredImage,
  cl_mem imageBuffer,cl_uint matte_or_cmyk,cl_uint columns,cl_uint rows,
  cl_mem scaledImageBuffer,cl_uint scaledColumns,cl_uint scaledRows,
  ExceptionInfo *exception)
{
  cl_kernel
    scaleKernel;

  cl_int
    status;

  const unsigned int
    workgroupSize = 256;

  float
    scale;

  int
    numCachedPixels;

  MagickBooleanType
    outputReady;

  size_t
    gammaAccumulatorLocalMemorySize,
    gsize[2],
    i,
    imageCacheLocalMemorySize,
    pixelAccumulatorLocalMemorySize,
    lsize[2],
    totalLocalMemorySize,
    weightAccumulatorLocalMemorySize;

  unsigned int
    chunkSize,
    pixelPerWorkgroup;

  scaleKernel=NULL;
  outputReady=MagickFalse;

  scale=(float) scaledColumns/columns; // TODO(ocl)

  if (scaledColumns < workgroupSize)
  {
    chunkSize=32;
    pixelPerWorkgroup=32;
  }
  else
  {
    chunkSize=workgroupSize;
    pixelPerWorkgroup=workgroupSize;
  }

DisableMSCWarning(4127)
  while(1)
RestoreMSCWarning
  {
    /* calculate the local memory size needed per workgroup */
    numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/scale+2*(0.5+MagickEpsilon)); // TODO(ocl)
    imageCacheLocalMemorySize=numCachedPixels*sizeof(CLQuantum)*4;
    totalLocalMemorySize=imageCacheLocalMemorySize;

    /* local size for the pixel accumulator */
    pixelAccumulatorLocalMemorySize=chunkSize*sizeof(cl_float4);
    totalLocalMemorySize+=pixelAccumulatorLocalMemorySize;

    /* local memory size for the weight accumulator */
    weightAccumulatorLocalMemorySize=chunkSize*sizeof(float);
    totalLocalMemorySize+=weightAccumulatorLocalMemorySize;

    /* local memory size for the gamma accumulator */
    gammaAccumulatorLocalMemorySize=chunkSize*sizeof(float);
    totalLocalMemorySize+=gammaAccumulatorLocalMemorySize;

    if (totalLocalMemorySize <= device->local_memory_size)
      break;
    else
    {
      pixelPerWorkgroup=pixelPerWorkgroup/2;
      chunkSize=chunkSize/2;
      if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
      {
        /* quit, fallback to CPU */
        goto cleanup;
      }
    }
  }

  scaleKernel=AcquireOpenCLKernel(device,"ScaleFilter");
  if (scaleKernel == (cl_kernel) NULL)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"AcquireOpenCLKernel failed.", ".");
    goto cleanup;
  }

  i=0;
  status =SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&imageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&matte_or_cmyk);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&columns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&rows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_mem),(void*)&scaledImageBuffer);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledColumns);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(cl_uint),(void*)&scaledRows);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(float),(void*)&scale);
  status|=SetOpenCLKernelArg(scaleKernel,i++,imageCacheLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(int),&numCachedPixels);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&pixelPerWorkgroup);
  status|=SetOpenCLKernelArg(scaleKernel,i++,sizeof(unsigned int),&chunkSize);
  status|=SetOpenCLKernelArg(scaleKernel,i++,pixelAccumulatorLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,weightAccumulatorLocalMemorySize,NULL);
  status|=SetOpenCLKernelArg(scaleKernel,i++,gammaAccumulatorLocalMemorySize,NULL);

  if (status != CL_SUCCESS)
  {
    (void) OpenCLThrowMagickException(device,exception,GetMagickModule(),
      ResourceLimitWarning,"SetOpenCLKernelArg failed.",".");
    goto cleanup;
  }

  gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
    workgroupSize;
  gsize[1] = scaledRows;
  lsize[0] = workgroupSize;
  lsize[1] = 1;
  outputReady=EnqueueOpenCLKernel(queue,scaleKernel,2,
    (const size_t *) NULL,gsize,lsize,image,filteredImage,MagickFalse,
    exception);

cleanup:

  if (scaleKernel != (cl_kernel) NULL)
    ReleaseOpenCLKernel(scaleKernel);

  return(outputReady);
}

STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum *inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum *filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum *inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4 *outputPixelCache, __local float *densityCache, __local float *gammaCache)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          filteredPixel += cp;
        }
      }
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

<2022-04-28 周四>

如何写`ScaleImage()`的硬件加速函数（七）

其实“如何写ScaleImage()的硬件加速函数（六）”的实现就是一个ResizeHorizontalFilter()将y改成y / xFactor的精简版，并不是ScaleImage()的硬件加速函数。虽然它不是，但至少省掉了ResizeVerticalFilter()的调用，速度上更快了。

但是目前发现的问题还是竖条纹，连续多次缩小一倍，最终图片被黑色竖条纹全部覆盖住，不断缩小或者放大，右侧会出现密集竖条纹，等等等的问题啦。

经过分析，黑色竖纹的产生原因是因为kernel函数ScaleFilter()的最内层的循环没有执行，导致将初始值0.0f赋进了目标地址。

for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
  float4 cp = (float4)0.0f;

  __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
  cp.x = (float)*(p);
  cp.y = (float)*(p + 1);
  cp.z = (float)*(p + 2);
  cp.w = (float)*(p + 3);

  filteredPixel += cp;
}

可以这样解决：

STRINGIFY(
  __kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum* inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          filteredPixel += cp;
        }
      }
    }

    if (itemID < actualNumPixelInThisChunk) {
      outputPixelCache[itemID] = (float4)0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < numItems; i++) {
      if (pixelIndex != -1) {
        if (itemID % numItems == i) {
          outputPixelCache[pixelIndex] += filteredPixel;
        }
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      float4 filteredPixel = outputPixelCache[itemID];
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

测试了一下性能，感觉提升不少（原图缩小一半，共三次操作，原图连续放大一倍两次，共三次操作）：

ScaleImage()加速版本：

20220428104719 0:3.229821  1.672 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104719 0:3.230185  1.672 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 1360
20220428104725 0:9.628057  1.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104725 0:9.628288  1.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104732 0:16.078872 2.234 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104732 0:16.079057 2.234 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104740 0:24.253815 2.484 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104740 0:24.254118 2.484 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104749 0:33.888819 2.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104749 0:33.889007 2.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104752 0:36.173104 3.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104752 0:36.173301 3.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 156
20220428104800 0:44.287153 3.469 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104800 0:44.287372 3.469 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 47
20220428104801 0:45.546271 3.656 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104801 0:45.546588 3.656 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 140
20220428104806 0:49.973027 4.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:49.973217 4.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104806 0:50.640522 4.250 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:50.640730 4.250 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 141

ScaleImage()原先版本：

20220428104934 0:1.982873  0.266 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104934 0:2.040677  0.328 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104940 0:7.854823  0.578 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104940 0:7.913365  0.625 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 47
20220428104944 0:11.896725 0.875 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104944 0:11.956722 0.938 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104951 0:18.070817 1.219 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104951 0:18.378405 1.516 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 297
20220428104952 0:19.394056 1.531 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104953 0:20.634341 2.781 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1250
20220428104958 0:25.534006 3.063 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104958 0:25.836584 3.375 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 312
20220428104959 0:26.729520 3.406 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105000 0:27.930533 4.609 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1203
20220428105011 0:38.879392 5.438 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105012 0:39.210382 5.766 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 328
20220428105012 0:39.872525 5.797 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105014 0:41.176969 7.094 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1297

<2022-05-05 周四>

如何写`ScaleImage()`的硬件加速函数（八）

我觉得Y方向的缩放以下面这种ScaleFilter()的方法是实现不了的，我只能添加进X方向的处理，缩小正常，放大的话图片变亮。

STRINGIFY(
  __kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum* inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        float x_scale = (float)filteredColumns / inputColumns;
        float x_span = 1.0;
        float x_volume = 0.0;
        float factor = 0.0;

        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          while (x_scale >= x_span) {
            if (x_volume > 0.0 && x_volume < 1.0) {
              factor = 1 / x_volume;
              filteredPixel.x *= factor;
              filteredPixel.y *= factor;
              filteredPixel.z *= factor;
            }
            if (cp.w < 255.0) {
              x_volume += x_span;
            }
            filteredPixel += x_span * cp;
            filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
            filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
            filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
            filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
            x_scale -= x_span;
            x_span = 1.0;
          }

          if (x_scale > 0.0) {
            if (x_volume > 0.0 && x_volume < 1.0) {
              factor = 1 / x_volume;
              filteredPixel.x *= factor;
              filteredPixel.y *= factor;
              filteredPixel.z *= factor;
            }
            if (cp.w < 255.0)
              x_volume += x_scale;
            filteredPixel += x_scale * cp;
            x_span -= x_scale;
          }

          if (x_span > 0.0) {
            if (cp.w < 255.0)
              x_volume += x_span;
            filteredPixel += x_span * cp;
          }

          filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
          filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
          filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
          filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
        }
      }
    }

    if (itemID < actualNumPixelInThisChunk) {
      outputPixelCache[itemID] = (float4)0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < numItems; i++) {
      if (pixelIndex != -1) {
        if (itemID % numItems == i) {
          outputPixelCache[pixelIndex] += filteredPixel;
        }
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      float4 filteredPixel = outputPixelCache[itemID];
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

如何写`ScaleImage()`的硬件加速函数（九）

这么快在“如何写ScaleImage()的硬件加速函数（八）”的基础上就解决了放大后图片变亮的问题，目前看效果还可以。

STRINGIFY(
  __kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum* inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    unsigned int local_idx = itemID;
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    if (itemID < actualNumPixelInThisChunk) {
      outputPixelCache[itemID] = (float4)0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        float x_scale = (float)filteredColumns / inputColumns;
        float x_span = 1.0;
        float x_volume = 0.0;
        float factor = 0.0;
        unsigned next_column = 0;

        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          while (x_scale >= x_span) {
            if (next_column) {
              if (x_volume > 0.0 && x_volume < 1.0) {
                factor = 1 / x_volume;
                outputPixelCache[local_idx].x *= factor;
                outputPixelCache[local_idx].y *= factor;
                outputPixelCache[local_idx].z *= factor;
              }
              x_volume = 0.0;
              filteredPixel = 0.0;
              local_idx++;
            }
            if (cp.w < 255.0) {
              x_volume += x_span;
            }
            filteredPixel += x_span * cp;
            filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
            filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
            filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
            filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
            x_scale -= x_span;
            x_span = 1.0;
            next_column = 1;
          }

          if (x_scale > 0.0) {
            if (next_column) {
              if (x_volume > 0.0 && x_volume < 1.0) {
                factor = 1 / x_volume;
                outputPixelCache[local_idx].x *= factor;
                outputPixelCache[local_idx].y *= factor;
                outputPixelCache[local_idx].z *= factor;
              }
              x_volume = 0.0;
              filteredPixel = 0.0;
              next_column = 0;
              local_idx++;
            }
            if (cp.w < 255.0)
              x_volume += x_scale;
            filteredPixel += x_scale * cp;
            x_span -= x_scale;
          }

          if (x_span > 0.0) {
            if (cp.w < 255.0)
              x_volume += x_span;
            filteredPixel += x_span * cp;
          }

          filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
          filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
          filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
          filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
        }
      }
    }

    for (unsigned int i = 0; i < numItems; i++) {
      if (pixelIndex != -1) {
        if (itemID % numItems == i) {
          outputPixelCache[pixelIndex] += filteredPixel;
        }
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      float4 filteredPixel = outputPixelCache[itemID];
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

如何写`ScaleImage()`的硬件加速函数（十）

难道就这么被我轻松实现了？

“如何写ScaleImage()的硬件加速函数（九）”是在“如何写ScaleImage()的硬件加速函数（八）”的基础上处理了图片放大变亮的问题，但是他们都只是X方向的处理，没有实现原始函数ScaleImage()的Y方向缩放。

目前先处理Y方向再处理X方向的代码都有了，如下：

static MagickBooleanType scaleFilter(MagickCLDevice device,
  cl_command_queue queue, const Image* image, Image* filteredImage,
  cl_mem imageBuffer, cl_uint matte_or_cmyk, cl_uint columns, cl_uint rows,
  cl_mem scaledImageBuffer, cl_uint scaledColumns, cl_uint scaledRows,
  ExceptionInfo* exception)
{
  cl_kernel
    scaleKernel;

  cl_int
    status;

  const unsigned int
    workgroupSize = 256;

  float
    scale;

  int
    numCachedPixels;

  MagickBooleanType
    outputReady;

  size_t
    gammaAccumulatorLocalMemorySize,
    gsize[2],
    i,
    imageCacheLocalMemorySize,
    pixelAccumulatorLocalMemorySize,
    pixelAccumulatorLocalMemorySize2,
    lsize[2],
    totalLocalMemorySize,
    weightAccumulatorLocalMemorySize;

  unsigned int
    chunkSize,
    pixelPerWorkgroup;

  scaleKernel = NULL;
  outputReady = MagickFalse;

  scale = (float)scaledColumns / columns; // TODO(ocl)

  unsigned int stop = 0;
  unsigned int next_row = 1;
  float y_span = 1.0;
  float y_scale = (float)scaledRows / rows;
  if (scaledRows == rows)
    stop++;
  else {
    while (y_scale < y_span) {
      if (next_row) {
        stop++;
      }
      y_span -= y_scale;
      y_scale = (float)scaledRows / rows;
      next_row = 1;
    }

    if (next_row) {
      stop++;
      next_row = 0;
    }
  }

  if (scaledColumns < workgroupSize)
  {
    chunkSize = 32;
    pixelPerWorkgroup = 32;
  }
  else
  {
    chunkSize = workgroupSize;
    pixelPerWorkgroup = workgroupSize;
  }

  DisableMSCWarning(4127)
    while (1)
      RestoreMSCWarning
    {
      /* calculate the local memory size needed per workgroup */
      numCachedPixels=(int) ceil((pixelPerWorkgroup-1)/scale+2*(0.5+MagickEpsilon)); // TODO(ocl)
      imageCacheLocalMemorySize = numCachedPixels * sizeof(CLQuantum) * 4 * stop;
      totalLocalMemorySize = imageCacheLocalMemorySize;

      /* local size for the pixel accumulator */
      pixelAccumulatorLocalMemorySize = chunkSize * sizeof(cl_float4);
      totalLocalMemorySize += pixelAccumulatorLocalMemorySize;

      pixelAccumulatorLocalMemorySize2 = pixelAccumulatorLocalMemorySize;
      totalLocalMemorySize += pixelAccumulatorLocalMemorySize2;

      /* local memory size for the weight accumulator */
      weightAccumulatorLocalMemorySize = chunkSize * sizeof(float);
      totalLocalMemorySize += weightAccumulatorLocalMemorySize;

      /* local memory size for the gamma accumulator */
      gammaAccumulatorLocalMemorySize = chunkSize * sizeof(float);
      totalLocalMemorySize += gammaAccumulatorLocalMemorySize;

      if (totalLocalMemorySize <= device->local_memory_size)
        break;
      else
      {
        pixelPerWorkgroup = pixelPerWorkgroup / 2;
        chunkSize = chunkSize / 2;
        if ((pixelPerWorkgroup == 0) || (chunkSize == 0))
        {
          /* quit, fallback to CPU */
          goto cleanup;
        }
      }
    }

  scaleKernel = AcquireOpenCLKernel(device, "ScaleFilter");
  if (scaleKernel == (cl_kernel)NULL)
  {
    (void)OpenCLThrowMagickException(device, exception, GetMagickModule(),
      ResourceLimitWarning, "AcquireOpenCLKernel failed.", ".");
    goto cleanup;
  }

  i = 0;
  status = SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_mem), (void*)&imageBuffer);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&matte_or_cmyk);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&columns);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&rows);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_mem), (void*)&scaledImageBuffer);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&scaledColumns);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(cl_uint), (void*)&scaledRows);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(float), (void*)&scale);
  status |= SetOpenCLKernelArg(scaleKernel, i++, imageCacheLocalMemorySize, NULL);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(int), &numCachedPixels);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &pixelPerWorkgroup);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &chunkSize);
  status |= SetOpenCLKernelArg(scaleKernel, i++, pixelAccumulatorLocalMemorySize, NULL);
  status |= SetOpenCLKernelArg(scaleKernel, i++, pixelAccumulatorLocalMemorySize2, NULL);
  status |= SetOpenCLKernelArg(scaleKernel, i++, weightAccumulatorLocalMemorySize, NULL);
  status |= SetOpenCLKernelArg(scaleKernel, i++, gammaAccumulatorLocalMemorySize, NULL);
  status |= SetOpenCLKernelArg(scaleKernel, i++, sizeof(unsigned int), &stop);

  if (status != CL_SUCCESS)
  {
    (void)OpenCLThrowMagickException(device, exception, GetMagickModule(),
      ResourceLimitWarning, "SetOpenCLKernelArg failed.", ".");
    goto cleanup;
  }

  gsize[0] = (scaledColumns + pixelPerWorkgroup - 1) / pixelPerWorkgroup *
    workgroupSize;
  gsize[1] = scaledRows;
  lsize[0] = workgroupSize;
  lsize[1] = 1;
  outputReady = EnqueueOpenCLKernel(queue, scaleKernel, 2,
    (const size_t*)NULL, gsize, lsize, image, filteredImage, MagickFalse,
    exception);

cleanup:

  if (scaleKernel != (cl_kernel)NULL)
    ReleaseOpenCLKernel(scaleKernel);

  return(outputReady);
}

STRINGIFY(
  __kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
    const unsigned int inputColumns, const unsigned int inputRows, __global CLQuantum* filteredImage,
    const unsigned int filteredColumns, const unsigned int filteredRows,
    const float resizeFilterScale,
    __local CLQuantum* inputImageCache, const int numCachedPixels,
    const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
    __local float4* outputPixelCache, __local float* outputPixelCache2, __local float* densityCache, __local float* gammaCache,
    const unsigned int stopn)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  for (unsigned stopi = 0; stopi < stopn; ++stopi) {
    event_t e = async_work_group_copy(inputImageCache + num_elements * stopi, inputImage + pos + num_elements * stopi, num_elements, 0);
    wait_group_events(1, &e);
  }

  for (unsigned t = 0; t < num_elements / 4; ++t) {
    outputPixelCache[t] = 0.0;
    outputPixelCache2[t] = 0.0;
  }

  float y_scale = (float)filteredRows / inputRows;
  float y_span = 1.0;
  unsigned next_row = 1;
  unsigned stopi = 0;
  float4 y_vector = 0.0;
  float y_volume = 0.0;
  float factor = 0.0;
  while (y_scale < y_span) {
    /*if (next_row) {
      stopi++;
    }*/
    for (unsigned ix = 0; ix < num_elements / 4; ++ix) {
      unsigned tempi = num_elements / 4 * stopi + ix;
      if (((float4)inputImageCache[tempi]).w < 255.0)
        outputPixelCache2[tempi] += y_scale;
      outputPixelCache[tempi] += y_scale * (float4)inputImageCache[tempi];
    }
    y_span -= y_scale;
    y_scale = (float)filteredRows / inputRows;
    next_row = 1;
    if (next_row) {
      stopi++;
      next_row = 0;
    }
  }

  stopi = 0;
  for (unsigned t = 0; t < stopi; ++t) {
    for (unsigned ix = 0; ix < num_elements / 4; ++ix) {
      unsigned tempi = num_elements / 4 * t + ix;
      if (((float4)inputImageCache[tempi]).w < 255.0)
        outputPixelCache2[tempi] += y_span;
      outputPixelCache[tempi] += outputPixelCache[tempi] + y_span * (float4)inputImageCache[tempi];

      if (outputPixelCache2[tempi] > 0.0 && outputPixelCache2[tempi] < 1.0) {
        factor = 1 / outputPixelCache2[tempi];
        outputPixelCache[tempi] *= factor;
      }

      inputImageCache[tempi] = outputPixelCache[tempi].x > 255.0 ? 255.0 : outputPixelCache[tempi].x;
      inputImageCache[tempi + 1] = outputPixelCache[tempi].y > 255.0 ? 255.0 : outputPixelCache[tempi].y;
      inputImageCache[tempi + 2] = outputPixelCache[tempi].z > 255.0 ? 255.0 : outputPixelCache[tempi].z;
      inputImageCache[tempi + 3] = outputPixelCache[tempi].w > 255.0 ? 255.0 : outputPixelCache[tempi].w;
    }
  }

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel computed by this workitem
    const unsigned int itemID = get_local_id(0);
    unsigned int local_idx = itemID;
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));

    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    if (itemID < actualNumPixelInThisChunk) {
      outputPixelCache[itemID] = (float4)0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinated of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        float x_scale = (float)filteredColumns / inputColumns;
        float x_span = 1.0;
        float x_volume = 0.0;
        float factor = 0.0;
        unsigned next_column = 0;

        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;

          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);

          while (x_scale >= x_span) {
            if (next_column) {
              if (x_volume > 0.0 && x_volume < 1.0) {
                factor = 1 / x_volume;
                outputPixelCache[local_idx].x *= factor;
                outputPixelCache[local_idx].y *= factor;
                outputPixelCache[local_idx].z *= factor;
              }
              x_volume = 0.0;
              filteredPixel = 0.0;
              local_idx++;
            }
            if (cp.w < 255.0) {
              x_volume += x_span;
            }
            filteredPixel += x_span * cp;
            filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
            filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
            filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
            filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
            x_scale -= x_span;
            x_span = 1.0;
            next_column = 1;
          }

          if (x_scale > 0.0) {
            if (next_column) {
              if (x_volume > 0.0 && x_volume < 1.0) {
                factor = 1 / x_volume;
                outputPixelCache[local_idx].x *= factor;
                outputPixelCache[local_idx].y *= factor;
                outputPixelCache[local_idx].z *= factor;
              }
              x_volume = 0.0;
              filteredPixel = 0.0;
              next_column = 0;
              local_idx++;
            }
            if (cp.w < 255.0)
              x_volume += x_scale;
            filteredPixel += x_scale * cp;
            x_span -= x_scale;
          }

          if (x_span > 0.0) {
            if (cp.w < 255.0)
              x_volume += x_span;
            filteredPixel += x_span * cp;
          }

          filteredPixel.x = filteredPixel.x > 255.0 ? 255.0 : filteredPixel.x;
          filteredPixel.y = filteredPixel.y > 255.0 ? 255.0 : filteredPixel.y;
          filteredPixel.z = filteredPixel.z > 255.0 ? 255.0 : filteredPixel.z;
          filteredPixel.w = filteredPixel.w > 255.0 ? 255.0 : filteredPixel.w;
        }
      }
    }

    for (unsigned int i = 0; i < numItems; i++) {
      if (pixelIndex != -1) {
        if (itemID % numItems == i) {
          outputPixelCache[pixelIndex] += filteredPixel;
        }
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      float4 filteredPixel = outputPixelCache[itemID];
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)

把“如何写ScaleImage()的硬件加速函数（九）”和“如何写ScaleImage()的硬件加速函数（十）”放到linux平台下测试了一下，蛋疼，都有问题！（OpenCL不是跨平台的嘛，怎么windows上正确的在linux上却不对？）前者有黑色竖线；后者有红色雪花覆盖整个图片（如果gm的预览图显示就不正确，那不用再缩放了，这已经说明肯定有问题了）。

<2022-05-06 周五>

如何写`ScaleImage()`的硬件加速函数（十一）

“如何写ScaleImage()的硬件加速函数（十）”这里的代码写得比较随意，其中stopi = 0;赋值为0后，下面的循环根本没有执行，这才使得显示的图片变正确了；且async_work_group_copy()的参数传得可能也不对，等等等。

我原来的想法是在外部计算好需要的行数传入kernel函数，并按照此行数申请好一定长度的一维数组，在async_work_group_copy()时拷贝相应行数的像素
然后在async_work_group_copy()拷贝好的local内存里按照ScaleImage()的算法实现Y方向的缩放
有一个顾虑需不需要考虑？因为local内存是对应一个work-group的，它的各work-item共享这片local内存，那每个work-item是不是都会按照ScaleImage()的算法处理一次Y方向的缩放？
目前看好像不要考虑这个问题，即不影响结果也不影响效率
同时也在考虑，可不可以在进入kernel函数之前就缩放好Y方向呢？

<2022-05-18 Wed>

`AccelerateScaleImage()`和`AccelerateResizeImage()`的性能测试

迭代100次，缩小图片50%，如下：

[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 4.92s user 5.146311s total 19.431 iter/s 20.325 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 5.88s user 5.887496s total 16.985 iter/s 17.007 iter/cpu

迭代100次，放大图片200%，如下：

[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 17.73s user 15.998019s total 6.251 iter/s 5.640 iter/cpu
[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 17.61s user 15.812017s total 6.324 iter/s 5.679 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 23.15s user 23.203446s total 4.310 iter/s 4.320 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -scale 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 23.57s user 23.621860s total 4.233 iter/s 4.243 iter/cpu

主要看total前面的值，是运行总时间。同样的方法再对比一下AccelerateResizeImage()。

迭代100次，缩小图片50%，如下：

[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 11.28s user 8.047808s total 12.426 iter/s 8.865 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 50x50% ~/temp/out.jpg
Results: 8 threads 100 iter 44.43s user 6.364194s total 15.713 iter/s 2.251 iter/cpu

迭代100次，放大图片200%，如下：

[ysouyno@arch gm-ocl]$ MAGICK_OCL_DEVICE=true gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 24.67s user 18.713505s total 5.344 iter/s 4.054 iter/cpu
[ysouyno@arch gm-ocl]$ gm benchmark -iterations 100 convert ~/temp/bg1a.jpg -resize 200x200% ~/temp/out.jpg
Results: 8 threads 100 iter 160.27s user 26.635967s total 3.754 iter/s 0.624 iter/cpu

Files

DEV.md

Latest commit

History

DEV.md

File metadata and controls

gm-ocl开发记录

<2022-02-24 Thu>

增加--enable-opencl参数

<2022-02-25 Fri>

关于AccelerateResizeImage()的链接问题

<2022-03-02 Wed>

在archlinux上为Intel启用OpenCL

调试RunOpenCLBenchmark()时的崩溃问题

<2022-03-07 Mon>

调用clCreateBuffer()产生异常问题（一）

vscode环境

调用clCreateBuffer()产生异常问题（二）

<2022-03-09 Wed>

调用clCreateBuffer()产生异常问题（三）

<2022-03-15 Tue>

调用clCreateBuffer()产生异常问题（四）

调用clCreateBuffer()产生异常问题（五）

<2022-03-16 Wed>

调用clCreateBuffer()产生异常问题（六）