diff --git a/samples/core/multi-device/README.md b/samples/core/multi-device/README.md index 7bccdd12..f259482b 100644 --- a/samples/core/multi-device/README.md +++ b/samples/core/multi-device/README.md @@ -1,7 +1,7 @@ # Multi-device Convolution Example ## Sample purpose -This example showcases how to set up a multi-device execution of a given kernel, being the latter a convolution kernel in this particular case. +This example showcases how to set up a multi-device execution of a given kernel using two OpenCL-compatible devices. ## Key APIs and Concepts The main idea behind this example is that a given kernel can be run simultaneously by two (or potentially more) devices, therefore reducing its execution time. One can essentially think of two strategies for this workflow: @@ -13,28 +13,73 @@ This example implements the first approach. ### Kernel logic The kernel is a simple $3 \times 3$ convolution, meaning that the convolution over the input matrix is performed using a $3 \times 3$ mask matrix. -In this implementation of the convolution kernel we assume that the input matrix is padded with 0s, so no extra conditional logic is necessary to ensure that the mask is applied to out-of-bounds elements (e.g. when processing element $(0,0)$ of the output matrix). +In this implementation of the convolution kernel we assume that the input matrix is padded with 0s, so no extra conditional logic is necessary to ensure that the mask is not applied to out-of-bounds elements (e.g. when processing element $(0,0)$ of the output matrix). + +### Device fission +In order to simplify the conditions under which the example can be executed, we make use of OpenCL's device fission. This feature allows the user to partition a device into *sub-devices*. These sub-devices correspond physically to a certain region of the original device, but are virtually perceived as whole new devices. This partitioning can be done in several ways.
+- Partition equally by number of compute units. After specifying the number of compute units that each sub-device should have, OpenCL creates as many sub-devices as possible under that restriction. If the number of compute units specified does not divide the total number of compute units available, the leftovers do not get assigned to any sub-device. This option may be used when we want to enable task parallelism in our program, as tasks can be evenly distributed among the sub-devices. + +- Partition by counts (of compute units). With this option we can specify the exact number of compute units that we want for each sub-device. This approach may be used when we want to isolate some part of the device for high-priority tasks while preventing the lower-priority ones from interrupting/interfering with them. + +- Partition by affinity domain. The device is split into sub-devices containing compute units that share part of a cache hierarchy. For instance, when executing high-throughput jobs with little shared memory in a NUMA multiprocessor, it could be beneficial for maximizing throughput to partition the device so that compute units from the same NUMA node are grouped together. That way each job can run on a sub-device (NUMA node) and get all of its resources without competing with the other jobs. On the other hand, if the program requires a great amount of shared memory, creating sub-devices that group compute units sharing the same cache can be the best option. + +This sample tries to exploit task parallelism, so the first approach is the one used: from one device we create two sub-devices, each with half of the available compute units. + +_Note: A device can be fissioned at more than one level, meaning that a sub-device of a device can also be partitioned into multiple (sub-)sub-devices._ + +#### Sub-buffers +Global buffer objects in OpenCL are one-dimensional collections of elements.
New buffer objects, known as sub-buffers, can be created from these global buffers. The main use-cases of these sub-buffers are the following: +- When we need to access a buffer with different access flags than those specified at buffer creation. E.g. if we create a global buffer with `READ_WRITE` permissions, we can then create two sub-buffers from it, one with `READ_ONLY` permissions and the other with `WRITE_ONLY` permissions. This makes it possible to pass the same information with different permissions to different kernels, which can come in handy when one of the kernels accessing the buffer never writes to (or never reads from) it, as some internal coherence routines can then be omitted when launching the kernels. +- When it's necessary to pass subsets of the same buffer to different kernel calls. E.g. in this sample we need to enqueue one kernel call to one of the sub-devices that convolves the left half of the matrix and another one to the other sub-device, which convolves the right half of the matrix. + +_Note: Unlike sub-devices, a sub-buffer of a global buffer cannot be partitioned again into (sub-)sub-buffers._ ## Application flow ### Overview -By default the application will select whichever two devices are first found in the first platform available. A command-line option is added though, so the user can specify which type of device is preferable to be used (e.g. "cpu" or "gpu"). +1. Select a device. By default the application will select the first device available, but we provide a command-line option to let the user specify which type of device they prefer to use (e.g. "cpu" or "gpu"). +2. Query the compute units available on the device and create two sub-devices from it with half of the compute units each. +3. Compile the kernel. +4. Initialize host-side input and output matrices. Pad the input matrix with 0s so the convolution kernel does not access out-of-bounds elements. +5. Initialize device-side global buffers. +6. Set up OpenCL objects for the sub-devices.
In particular, create sub-buffers for the input and output matrices. +7. Enqueue kernel calls on each device with the corresponding arguments and wait until they finish. +8. Run the host-side convolution algorithm. +9. Fetch and combine the results from the devices. Compare the combined solution with the host's and print the result of this validation to the standard output. +10. Free memory and OpenCL resources. -A random input matrix and mask are generated and the workload is equally divided between both devices: one of them performs the convolution over the left half of the matrix and the other one does the same with the right half. When both devices finish the execution of the kernel, the results are fetched and combined by the host. +### Device fission +Before creating sub-devices from a given device we must think about which partitioning approach is the most appropriate for the kernel(s) at hand. In our case, we would like to exploit task parallelism, as the objective is to perform the convolution using two devices at the same time to speed it up. Therefore, the best approach is to create two sub-devices with an equal number of compute units. -### Device selection -As mentioned above, by default the first two devices of any type found will be the ones to be used. If there is only one device available, a single-device convolution will be performed. +As we don't need to perform any other task, we can use all the compute units, so we query how many compute units are available in total using the `clGetDeviceInfo`/`cl::Device::getInfo` function with the `CL_DEVICE_MAX_COMPUTE_UNITS` parameter. -If the user specifies a type of device to be used, the program iterates over all the devices available from any platform and selects the first two devices of that type found. Note that they do not necessarily belong to the same platform. +With this information we can then create an array of `cl_device_partition_property` containing the properties of the partition of the device.
In our case, we must specify that we want to partition the device equally by using the macro `CL_DEVICE_PARTITION_EQUALLY`, and we must indicate how many compute units each sub-device will get, which is half of the maximum available. + +Lastly, we use the `clCreateSubDevices`/`cl::Device::createSubDevices` function to fission the device. + +### Sub-buffers creation +To create a sub-buffer from a global buffer object we first need to determine two important parameters: +- Which permissions it will have. It cannot have more permissions than the original buffer, e.g. if the global buffer was declared as read-only, the subsequent sub-buffers created from it cannot be write-only or read-write. + +- Which range of the global buffer will be mapped onto the sub-buffer. We need to consider which kernel is going to take the sub-buffer as input and/or output and determine which range of the global buffer must be mapped. + +In our case we use two read-only input buffers, one for the input matrix and one for the mask, and one write-only output buffer. However, we only need sub-buffers for the input and output matrices, as the mask is the same for both kernel calls. Thus, we only create two sub-buffers from the input global buffer and two more from the output one. +The flags that we set when creating them are `CL_MEM_READ_ONLY` for the input sub-buffers and `CL_MEM_WRITE_ONLY` for the output ones. + +As for the mapped ranges, each input sub-buffer maps half of the input matrix[^1] and each output sub-buffer maps half of the output buffer. + +[^1]:_The input buffers actually overlap, as we need one extra row after/before the middle row when enqueuing the first/second kernel call for the convolution to be computed correctly._ ### Kernel launch -The rest of the program does not differ much from the usual single-device kernel launch.
The only difference is that each device will need a separate set of runtime objects to be created: context, program, command queue, kernel functor, input/mask/output buffers and so on. Note that the input buffers will also contain different data in them, because the first device selected performs the convolution over the left half of the matrix while the second device calculates the convolution over the right half of it. +The rest of the program does not differ much from the usual single-device kernel launch. The only difference is that each sub-device will need a separate set of runtime objects to be created: device objects, kernel functors, command queues and events. -The kernel is then enqueued to the command queues of each device, and two different events are used to wait for them to be finished. When the devices finish the computations the results are combined in a single host matrix and compared to the host-side results. +Once everything is set up, a kernel call is enqueued to the command queue of each device with the corresponding input and output parameters, and two different events are used to wait for them to finish. When the devices finish the computations, the results are combined in a single host matrix and compared to the host-side results.
## Used API surface ### C ```c CL_BLOCKING +CL_DEVICE_MAX_COMPUTE_UNITS +CL_DEVICE_PARTITION_EQUALLY CL_DEVICE_PLATFORM CL_DEVICE_TYPE_ALL CL_HPP_TARGET_OPENCL_VERSION @@ -49,22 +94,26 @@ CL_PROFILING_COMMAND_START CL_QUEUE_PROFILING_ENABLE CL_QUEUE_PROPERTIES CL_SUCCESS +cl_buffer_create_type cl_command_queue cl_command_queue_properties cl_context +cl_device_partition_property cl_device_type cl_event cl_float cl_int cl_kernel cl_mem +cl_mem_flags cl_platform_id cl_program cl_sdk_fill_with_random_ints_range(pcg32_random_t*, cl_int*, size_t, cl_int, cl_int) cl_sdk_options_DeviceTriplet cl_sdk_options_Diagnostic -cl_sdk_options_MultiDevice +cl_sdk_options_SingleDevice cl_uint +cl_uint2 cl_ulong cl_util_build_program(cl_program, cl_device_id, char*) cl_util_get_device(cl_uint, cl_uint, cl_device_type, cl_int*) @@ -74,6 +123,7 @@ cl_util_print_error(cl_int) cl_util_read_text_file(char*const, size_t*const, cl_int*) get_dev_type(char*) clCreateBuffer(cl_context, cl_mem_flags, size_t, void*, cl_int*) +clCreateSubBuffer(cl_mem, cl_mem_flags, cl_buffer_create_type, const void*, cl_int*) clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*) clCreateCommandQueueWithProperties(cl_context, cl_device_id, cl_queue_properties*, cl_int*) -> OpenCL >= 2.0 clCreateContext(cl_context_properties*, cl_uint, cl_device_id*, void *(char*, void*,size_t, void*), void*, cl_int*) @@ -96,11 +146,14 @@ clWaitForEvents(cl_uint, cl_event*) ### C++ ```c++ -cl::Buffer::Buffer(const Context&, IteratorType, IteratorType, bool, bool=false, cl_int*=NULL) +cl::Buffer::Buffer(const Context&, cl_mem_flags, size_type, void*, cl_int*=NULL) +cl::Buffer::createSubBuffer(cl_mem_flags, cl_buffer_create_type, const void*, cl_int*=NULL) cl::BuildError cl::CommandQueue::CommandQueue(const cl::Context&, cl::QueueProperties, cl_int*=NULL) +cl::CommandQueue::enqueueReadBuffer(const Buffer&, cl_bool, size_type, size_type, void*, const std::vector*=nullptr, cl::Event*=nullptr) cl::Context 
cl::Device::Device() +cl::Device::createSubDevices(const cl_device_partition_property*, std::vector*) cl::EnqueueArgs::EnqueueArgs(cl::CommandQueue&, cl::NDRange, cl::NDRange) cl::Error cl::Event @@ -118,7 +171,7 @@ cl::copy(const CommandQueue&, const cl::Buffer&, IteratorType, IteratorType) cl::sdk::comprehend() cl::sdk::fill_with_random() cl::sdk::get_context(cl_uint, cl_uint, cl_device_type, cl_int*) -cl::sdk::options::MultiDevice +cl::sdk::options::SingleDevice cl::sdk::parse() cl::sdk::parse_cli() cl::sdk::options::DeviceTriplet diff --git a/samples/core/multi-device/convolution.cl b/samples/core/multi-device/convolution.cl index ca923f08..90b229a5 100644 --- a/samples/core/multi-device/convolution.cl +++ b/samples/core/multi-device/convolution.cl @@ -15,21 +15,17 @@ */ kernel void convolution_3x3(const global float* in, global float* out, - const global float* mask, const unsigned long x_dim, - const unsigned long y_dim) + const global float* mask, const uint2 out_dim) { + const uint2 gid = (uint2)(get_global_id(0), get_global_id(1)); const uint mask_dim = 3; const uint pad_width = mask_dim / 2; - const size_t x = get_global_id(0); - const size_t y = get_global_id(1); - // Padded constants. - const size_t pad_x_dim = x_dim + 2 * pad_width; - const size_t pad_y_dim = y_dim + 2 * pad_width; + const uint2 in_dim = out_dim + pad_width * 2; // Check possible out of bounds. 
- if (!(x < x_dim && y < y_dim)) + if (!(gid.x < out_dim.x && gid.y < out_dim.y)) { return; } @@ -40,20 +36,17 @@ kernel void convolution_3x3(const global float* in, global float* out, #if __OPENCL_C_VERSION__ >= 200 __attribute__((opencl_unroll_hint)) #endif - for (size_t grid_column = x, mask_column = 0; mask_column < mask_dim; - ++grid_column, ++mask_column) + for(uint y = 0; y < mask_dim; ++y) { #if __OPENCL_C_VERSION__ >= 200 __attribute__((opencl_unroll_hint)) #endif - for (size_t grid_row = y, mask_row = 0; mask_row < mask_dim; - ++grid_row, ++mask_row) + for(uint x = 0; x < mask_dim; ++x) { - result += mask[mask_column + mask_row * mask_dim] - * in[grid_column + grid_row * pad_x_dim]; + result += mask[y * mask_dim + x] * in[(gid.y + y) * in_dim.x + (gid.x + x)]; } } // Write result to correspoding output cell. - out[x + y * x_dim] = result; + out[gid.y * out_dim.x + gid.x] = result; } diff --git a/samples/core/multi-device/main.c b/samples/core/multi-device/main.c index 9292b57b..6d6adf63 100644 --- a/samples/core/multi-device/main.c +++ b/samples/core/multi-device/main.c @@ -79,103 +79,9 @@ ParseState parse_ConvolutionOptions(const char identifier, return NotParsed; } -// Add option to CLI parsing SDK utility for multi-device type. -cag_option MultiDeviceOptions[] = { { .identifier = 't', - .access_letters = "t", - .access_name = "type", - .value_name = "(all|cpu|gpu|acc|cus|def)", - .description = - "Type of device to use" } }; - -ParseState parse_MultiDeviceOptions(const char identifier, - cag_option_context* cag_context, - struct cl_sdk_options_MultiDevice* opts) -{ - const char* value; - - switch (identifier) - { - case 't': - if ((value = cag_option_get_value(cag_context))) - { - // If the user selects a device type, query all devices and - // filter the ones of that type. - struct cl_sdk_options_DeviceTriplet* triplets; - size_t number = 0; - cl_device_type dev_type = get_dev_type(value); - - // Get platforms IDs. 
- cl_uint num_platforms = 0; - clGetPlatformIDs(0, NULL, &num_platforms); - - cl_platform_id* platforms = (cl_platform_id*)malloc( - num_platforms * sizeof(cl_platform_id)); - clGetPlatformIDs(num_platforms, platforms, NULL); - - // Calculate total number of triplets to add. - for (cl_uint platform_id = 0; platform_id < num_platforms; - ++platform_id) - { - // Get devices IDs. - cl_uint num_devices = 0; - clGetDeviceIDs(platforms[platform_id], dev_type, 0, NULL, - &num_devices); - number += num_devices; - } - - if (!number) - { - fprintf(stderr, - "Error: No OpenCL devices of type %s available", - value); - return ParseError; - } - else if (number < 2) - { - printf("Not enough OpenCL devices of type %s available for " - "multi-device. Using only one device.", - value); - } - - // Register triplets {platform, device, device_type} for each - // device in each platform. - triplets = malloc( - number * sizeof(struct cl_sdk_options_DeviceTriplet)); - number = 0; - for (cl_uint platform_id = 0; platform_id < num_platforms; - ++platform_id) - { - cl_uint num_devices; - clGetDeviceIDs(platforms[platform_id], dev_type, 0, NULL, - &num_devices); - - // Register triplets. 
- for (cl_uint device_id = 0; device_id < num_devices; - ++device_id, ++number) - { - struct cl_sdk_options_DeviceTriplet triplet = { - platform_id, device_id, dev_type - }; - triplets[number] = triplet; - } - } - - opts->triplets = triplets; - opts->number = number; - - return ParsedOK; - } - else - { - return ParseError; - } - } - return NotParsed; -} - cl_int parse_options(int argc, char* argv[], struct cl_sdk_options_Diagnostic* diag_opts, - struct cl_sdk_options_MultiDevice* devs_opts, + struct cl_sdk_options_SingleDevice* dev_opts, struct convolution_options* convolution_opts) { cl_int error = CL_SUCCESS; @@ -186,8 +92,8 @@ cl_int parse_options(int argc, char* argv[], MEM_CHECK(opts = add_CLI_options(opts, &n, DiagnosticOptions, CAG_ARRAY_SIZE(DiagnosticOptions)), error, end); - MEM_CHECK(tmp = add_CLI_options(opts, &n, MultiDeviceOptions, - CAG_ARRAY_SIZE(MultiDeviceOptions)), + MEM_CHECK(tmp = add_CLI_options(opts, &n, SingleDeviceOptions, + CAG_ARRAY_SIZE(SingleDeviceOptions)), error, end); opts = tmp; MEM_CHECK(tmp = add_CLI_options(opts, &n, ConvolutionOptions, @@ -207,7 +113,7 @@ cl_int parse_options(int argc, char* argv[], PARS_OPTIONS(parse_DiagnosticOptions(identifier, diag_opts), state); PARS_OPTIONS( - parse_MultiDeviceOptions(identifier, &cag_context, devs_opts), + parse_SingleDeviceOptions(identifier, &cag_context, dev_opts), state); PARS_OPTIONS(parse_ConvolutionOptions(identifier, &cag_context, convolution_opts), @@ -215,11 +121,11 @@ cl_int parse_options(int argc, char* argv[], if (identifier == 'h') { - printf("Usage: multidevice [OPTION]...\n"); + printf("Usage: multidevice [OPTION]...\n"); printf("Option name and value should be separated by '=' or a " "space\n"); - printf( - "Demonstrates convolution calculation with two devices.\n\n"); + printf("Demonstrates convolution calculation with two " + "(sub)devices.\n\n"); cag_option_print(opts, n, stdout); exit((state == ParseError) ?
CL_INVALID_ARG_VALUE : CL_SUCCESS); } @@ -232,28 +138,27 @@ cl_int parse_options(int argc, char* argv[], // Host-side implementation of the convolution for verification. Padded input // assumed. -void host_convolution(cl_float* in, cl_float* out, cl_float* mask, size_t x_dim, - size_t y_dim) +void host_convolution(const cl_float* in, cl_float* out, const cl_float* mask, + const size_t x_dim, const size_t y_dim) { const cl_uint mask_dim = 3; const cl_uint pad_width = mask_dim / 2; - const size_t pad_x_dim = x_dim + 2 * pad_width; - for (size_t x = 0; x < x_dim; ++x) + const cl_uint in_dim_x = x_dim + pad_width * 2; + + for (cl_uint gid_x = 0; gid_x < x_dim; ++gid_x) { - for (size_t y = 0; y < y_dim; ++y) + for (cl_uint gid_y = 0; gid_y < y_dim; ++gid_y) { float result = 0.f; - for (size_t grid_column = x, mask_column = 0; - mask_column < mask_dim; ++grid_column, ++mask_column) + for (cl_uint y = 0; y < mask_dim; ++y) { - for (size_t grid_row = y, mask_row = 0; mask_row < mask_dim; - ++grid_row, ++mask_row) + for (cl_uint x = 0; x < mask_dim; ++x) { - result += mask[mask_column + mask_row * mask_dim] - * in[grid_column + grid_row * pad_x_dim]; + result += mask[y * mask_dim + x] + * in[(gid_y + y) * in_dim_x + (gid_x + x)]; } } - out[x + y * x_dim] = result; + out[gid_y * x_dim + gid_x] = result; } } } @@ -262,12 +167,16 @@ int main(int argc, char* argv[]) { cl_int error = CL_SUCCESS; cl_int end_error = CL_SUCCESS; - cl_platform_id platform1, platform2; - cl_device_id dev1, dev2; - cl_context context1, context2; - cl_command_queue queue1, queue2 = NULL; + cl_device_id dev; + cl_context context; + cl_program program; + cl_mem dev_input_grid, dev_output_grid, dev_mask; - cl_program program1, program2; + cl_kernel convolutions[2] = { 0 }; + cl_command_queue sub_queues[2] = { 0 }; + cl_mem sub_input_grids[2] = { 0 }; + cl_mem sub_output_grids[2] = { 0 }; + cl_event events[2] = { 0 }; // Parse command-line options. 
struct cl_sdk_options_Diagnostic diag_opts = { .quiet = false, @@ -276,62 +185,86 @@ int main(int argc, char* argv[]) // By default assume that there is only one device available. // dev_opts->number is set to 1 so that when calling to cl_util_get_device // for the second device there is no index out of range. - struct cl_sdk_options_MultiDevice devs_opts = { - .triplets = - (struct cl_sdk_options_DeviceTriplet[]){ - { 0, 0, CL_DEVICE_TYPE_ALL }, { 0, 0, CL_DEVICE_TYPE_ALL } }, - .number = 1 + struct cl_sdk_options_SingleDevice dev_opts = { + .triplet = { 0, 0, CL_DEVICE_TYPE_ALL } }; struct convolution_options convolution_opts = { .x_dim = 4000, .y_dim = 4000 }; OCLERROR_RET( - parse_options(argc, argv, &diag_opts, &devs_opts, &convolution_opts), + parse_options(argc, argv, &diag_opts, &dev_opts, &convolution_opts), error, end); // Create runtime objects based on user preference or default. - OCLERROR_PAR(dev1 = - cl_util_get_device(devs_opts.triplets[0].plat_index, - devs_opts.triplets[0].dev_index, - devs_opts.triplets[0].dev_type, &error), - error, end); - OCLERROR_PAR(dev2 = cl_util_get_device( - devs_opts.triplets[(devs_opts.number > 1)].plat_index, - devs_opts.triplets[(devs_opts.number > 1)].dev_index, - devs_opts.triplets[(devs_opts.number > 1)].dev_type, - &error), - error, end); - OCLERROR_PAR(context1 = clCreateContext(NULL, 1, &dev1, NULL, NULL, &error), - error, end); - OCLERROR_PAR(context2 = clCreateContext(NULL, 1, &dev2, NULL, NULL, &error), - error, end); - OCLERROR_RET(clGetDeviceInfo(dev1, CL_DEVICE_PLATFORM, - sizeof(cl_platform_id), &platform1, NULL), - error, cont); - OCLERROR_RET(clGetDeviceInfo(dev2, CL_DEVICE_PLATFORM, - sizeof(cl_platform_id), &platform2, NULL), - error, cont); + OCLERROR_PAR(dev = cl_util_get_device(dev_opts.triplet.plat_index, + dev_opts.triplet.dev_index, + dev_opts.triplet.dev_type, &error), + error, dev); if (!diag_opts.quiet) { - cl_util_print_device_info(dev1); - cl_util_print_device_info(dev2); + 
cl_util_print_device_info(dev); + } + + if (diag_opts.verbose) + { + printf("Creating sub-devices..."); + fflush(stdout); + } + +#if CL_HPP_TARGET_OPENCL_VERSION < 120 + fprintf(stderr, + "Error: OpenCL subdevices not supported before version 1.2 "); + exit(EXIT_FAILURE); +#endif + + // Create subdevices, each with half of the compute units available. + cl_uint subdev_count = 0, max_compute_units = 0; + OCLERROR_RET(clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, + sizeof(cl_uint), &max_compute_units, NULL), + error, dev); + cl_device_partition_property subdevices_properties[] = { + (cl_device_partition_property)CL_DEVICE_PARTITION_EQUALLY, + (cl_device_partition_property)max_compute_units / 2, 0 + }; + OCLERROR_RET(clCreateSubDevices(dev, subdevices_properties, + max_compute_units, NULL, &subdev_count), + error, dev); + + if (subdev_count < 2) + { + fprintf(stderr, "Error: OpenCL cannot create subdevices"); + exit(EXIT_FAILURE); } - // Compile kernels. + cl_device_id* subdevices = + (cl_device_id*)malloc(subdev_count * sizeof(cl_device_id)); + OCLERROR_RET(clCreateSubDevices(dev, subdevices_properties, subdev_count, + subdevices, NULL), + error, subdevs); + + OCLERROR_PAR(context = clCreateContext(NULL, subdev_count, subdevices, NULL, + NULL, &error), + error, subdevs); + + // Read kernel file. const char* kernel_location = "./convolution.cl"; char *kernel = NULL, *tmp = NULL; size_t program_size = 0; OCLERROR_PAR( kernel = cl_util_read_text_file(kernel_location, &program_size, &error), - error, cont); + error, contx); MEM_CHECK(tmp = (char*)realloc(kernel, program_size), error, ker); kernel = tmp; - OCLERROR_PAR(program1 = clCreateProgramWithSource( - context1, 1, (const char**)&kernel, &program_size, &error), - error, ker); - OCLERROR_PAR(program2 = clCreateProgramWithSource( - context2, 1, (const char**)&kernel, &program_size, &error), + + // Compile kernel. 
+ if (diag_opts.verbose) + { + printf("done.\nCompiling kernel..."); + fflush(stdout); + } + OCLERROR_PAR(program = clCreateProgramWithSource( + context, 1, (const char**)&kernel, &program_size, &error), error, ker); // If no -cl-std option is specified then the highest 1.x version @@ -345,29 +278,9 @@ int main(int argc, char* argv[]) strcat(compiler_options, "-cl-std=CL2.0 "); #endif - OCLERROR_RET(cl_util_build_program(program1, dev1, compiler_options), error, - prg); - OCLERROR_RET(cl_util_build_program(program2, dev2, compiler_options), error, - prg); - - // Query maximum workgroup size (WGS) of kernel supported on each device - // based on private mem (registers) constraints. - size_t wgs1, wgs2; - cl_kernel convolution1, convolution2; - OCLERROR_PAR(convolution1 = - clCreateKernel(program1, "convolution_3x3", &error), - error, prg); - OCLERROR_PAR(convolution2 = - clCreateKernel(program2, "convolution_3x3", &error), - error, prg); - OCLERROR_RET(clGetKernelWorkGroupInfo(convolution1, dev1, - CL_KERNEL_WORK_GROUP_SIZE, - sizeof(size_t), &wgs1, NULL), - error, conv); - OCLERROR_RET(clGetKernelWorkGroupInfo(convolution2, dev2, - CL_KERNEL_WORK_GROUP_SIZE, - sizeof(size_t), &wgs2, NULL), - error, conv); + OCLERROR_RET( + clBuildProgram(program, 2, subdevices, compiler_options, NULL, NULL), + error, prg); // Initialize host-side storage. const cl_uint mask_dim = 3; @@ -377,236 +290,239 @@ int main(int argc, char* argv[]) const size_t pad_x_dim = x_dim + 2 * pad_width; const size_t pad_y_dim = y_dim + 2 * pad_width; - // Check that the WGSs can divide the global size (MacOS reports - // CL_INVALID_WORK_GROUP_SIZE otherwise). If WGS is smaller than the x - // dimension, then a NULL pointer will be used when calling - // clEnqueueNDRangeKernel for enqueuing the kernels. 
- if (pad_x_dim % wgs1 && pad_x_dim > wgs1) - { - size_t div = pad_x_dim / wgs1; - wgs1 = sqrt(div * wgs1); - } + const size_t input_bytes = sizeof(cl_float) * pad_x_dim * pad_y_dim; + const size_t output_bytes = sizeof(cl_float) * x_dim * y_dim; + const size_t mask_bytes = sizeof(cl_float) * mask_dim * mask_dim; - if (pad_x_dim % wgs2 && pad_x_dim > wgs2) + if (diag_opts.verbose) { - size_t div = pad_x_dim / wgs2; - wgs2 = sqrt(div * wgs2); + printf("done.\nInitializing host-side storage...\n"); + fflush(stdout); } // Random number generator. pcg32_random_t rng; pcg32_srandom_r(&rng, 11111, -2222); + cl_float* h_input_grid; + cl_float* h_output_grid; + cl_float* h_mask; + // Initialize input matrix. The input will be padded to remove // conditional branches from the convolution kernel for determining // out-of-bounds. - cl_float* h_input_grid; - MEM_CHECK(h_input_grid = - (cl_float*)malloc(sizeof(cl_float) * pad_x_dim * pad_y_dim), - error, conv); + MEM_CHECK(h_input_grid = (cl_float*)malloc(input_bytes), error, prg); if (diag_opts.verbose) { - printf("Generating %zu random numbers for convolution input grid.\n", + printf(" Generating %zu random numbers for convolution input grid...", x_dim * y_dim); + fflush(stdout); } cl_sdk_fill_with_random_ints_range(&rng, (cl_int*)h_input_grid, pad_x_dim * pad_y_dim, -1000, 1000); - // Fill with 0s the extra rows and columns added for padding. - for (size_t j = 0; j < pad_x_dim; ++j) + for (size_t y = 0; y < pad_y_dim; ++y) { - for (size_t i = 0; i < pad_y_dim; ++i) + for (size_t x = 0; x < pad_x_dim; ++x) { - if (i == 0 || j == 0 || i == (pad_y_dim - 1) - || j == (pad_x_dim - 1)) + if (x == 0 || y == 0 || x == (pad_x_dim - 1) + || y == (pad_y_dim - 1)) { - h_input_grid[j + i * pad_x_dim] = 0; + h_input_grid[y * pad_x_dim + x] = 0; } } } // Declare output matrix. Output will not be padded. 
- cl_float* h_output_grid; - MEM_CHECK(h_output_grid = - (cl_float*)malloc(sizeof(cl_float) * x_dim * y_dim), - error, hinput); + MEM_CHECK(h_output_grid = (cl_float*)malloc(output_bytes), error, hinput); // Initialize convolution mask. - cl_float* h_mask; - MEM_CHECK(h_mask = - (cl_float*)malloc(sizeof(cl_float) * mask_dim * mask_dim), - error, hinput); + MEM_CHECK(h_mask = (cl_float*)malloc(mask_bytes), error, houtput); if (diag_opts.verbose) { - printf("Generating %u random numbers for convolution mask.\n", + printf("done.\n Generating %u random numbers for convolution mask...", mask_dim * mask_dim); + fflush(stdout); } cl_sdk_fill_with_random_ints_range(&rng, (cl_int*)h_mask, mask_dim * mask_dim, -1000, 1000); - /// Initialize device-side storage. + // Create device buffers, from which we will create the subbuffers for the + // subdevices. const size_t grid_midpoint = y_dim / 2; const size_t pad_grid_midpoint = pad_y_dim / 2; if (diag_opts.verbose) { - printf("Initializing device-side storage..."); + printf("done.\nInitializing device-side storage..."); + fflush(stdout); + } + + OCLERROR_PAR(dev_input_grid = + clCreateBuffer(context, + CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR + | CL_MEM_HOST_NO_ACCESS, + input_bytes, h_input_grid, &error), + error, hmask); + OCLERROR_PAR(dev_output_grid = + clCreateBuffer(context, + CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR + | CL_MEM_HOST_READ_ONLY, + output_bytes, NULL, &error), + error, bufin); + OCLERROR_PAR(dev_mask = + clCreateBuffer(context, + CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR + | CL_MEM_HOST_NO_ACCESS, + mask_bytes, h_mask, &error), + error, bufout); + + if (diag_opts.verbose) + { + printf("done.\nSetting up sub-devices..."); + fflush(stdout); } - // Initialize queues for command execution on each device. + // Set up subdevices for kernel execution. 
+ const size_t half_input_bytes = + sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint + 1); + const size_t input_offset = + sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint - 1); + const size_t half_output_bytes = sizeof(cl_float) * x_dim * grid_midpoint; + + cl_uint subdevice = 0; + for (; subdevice < subdev_count; ++subdevice) + { + // Create kernel. + if (diag_opts.verbose) + { + printf("\n Creating kernel and command queue of sub-device %d...", + subdevice); + fflush(stdout); + } + + OCLERROR_PAR(convolutions[subdevice] = + clCreateKernel(program, "convolution_3x3", &error), + error, bufmask); + + // Initialize queues for command execution on each device. #if CL_HPP_TARGET_OPENCL_VERSION >= 200 - cl_command_queue_properties props[] = { CL_QUEUE_PROPERTIES, - CL_QUEUE_PROFILING_ENABLE, 0 }; - OCLERROR_PAR(queue1 = clCreateCommandQueueWithProperties(context1, dev1, - props, &error), - error, hinput); - OCLERROR_PAR(queue2 = clCreateCommandQueueWithProperties(context2, dev2, - props, &error), - error, que1); + cl_command_queue_properties props[] = { CL_QUEUE_PROPERTIES, + CL_QUEUE_PROFILING_ENABLE, 0 }; + OCLERROR_PAR(sub_queues[subdevice] = clCreateCommandQueueWithProperties( + context, subdevices[subdevice], props, &error), + error, conv); #else - OCLERROR_PAR(queue1 = clCreateCommandQueue( - context1, dev1, CL_QUEUE_PROFILING_ENABLE, &error), - error, hinput); - OCLERROR_PAR(queue2 = clCreateCommandQueue( - context2, dev2, CL_QUEUE_PROFILING_ENABLE, &error), - error, que1); + OCLERROR_PAR(sub_queues[subdevice] = clCreateCommandQueue( + context, subdevices[subdevice], + CL_QUEUE_PROFILING_ENABLE, &error), + error, conv); #endif - // First device performs the convolution in the upper half and second device - // in the lower half. 
- cl_mem dev1_input_grid; - OCLERROR_PAR(dev1_input_grid = clCreateBuffer( - context1, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, - sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint + 1), - h_input_grid, &error), - error, que2); - // Second device performs the convolution in the lower half (middle - // border included). - cl_mem dev2_input_grid; - OCLERROR_PAR(dev2_input_grid = clCreateBuffer( - context2, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, - sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint + 1), - h_input_grid + pad_x_dim * (pad_grid_midpoint - 1), - &error), - error, bufin1); - - cl_mem dev1_output_grid; - OCLERROR_PAR(dev1_output_grid = clCreateBuffer( - context1, CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY, - sizeof(cl_float) * x_dim * y_dim, NULL, &error), - error, bufin2); - cl_mem dev2_output_grid; - OCLERROR_PAR(dev2_output_grid = clCreateBuffer( - context2, CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY, - sizeof(cl_float) * x_dim * y_dim, NULL, &error), - error, bufout1); - - cl_mem dev1_mask; - OCLERROR_PAR(dev1_mask = clCreateBuffer( - context1, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, - sizeof(cl_float) * mask_dim * mask_dim, h_mask, &error), - error, bufout2); - cl_mem dev2_mask; - OCLERROR_PAR(dev2_mask = clCreateBuffer( - context2, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, - sizeof(cl_float) * mask_dim * mask_dim, h_mask, &error), - error, bufmask1); - - // Set kernels arguments. - OCLERROR_RET( - clSetKernelArg(convolution1, 0, sizeof(cl_mem), &dev1_input_grid), - error, bufmask2); - OCLERROR_RET( - clSetKernelArg(convolution1, 1, sizeof(cl_mem), &dev1_output_grid), - error, bufmask2); - OCLERROR_RET(clSetKernelArg(convolution1, 2, sizeof(cl_mem), &dev1_mask), - error, bufmask2); - OCLERROR_RET(clSetKernelArg(convolution1, 3, sizeof(size_t), &x_dim), error, - bufmask2); - OCLERROR_RET( - clSetKernelArg(convolution1, 4, sizeof(size_t), &grid_midpoint), error, - bufmask2); + // Initialize device-side storage. 
+        // First device performs the convolution in the upper half and second
+        // device in the lower half (middle borders included).
+        if (diag_opts.verbose)
+        {
+            printf("done.\n Initializing device-side storage of sub-device "
+                   "%d...",
+                   subdevice);
+            fflush(stdout);
+        }
-    OCLERROR_RET(
-        clSetKernelArg(convolution2, 0, sizeof(cl_mem), &dev2_input_grid),
-        error, bufmask2);
-    OCLERROR_RET(
-        clSetKernelArg(convolution2, 1, sizeof(cl_mem), &dev2_output_grid),
-        error, bufmask2);
-    OCLERROR_RET(clSetKernelArg(convolution2, 2, sizeof(cl_mem), &dev2_mask),
-                 error, bufmask2);
-    OCLERROR_RET(clSetKernelArg(convolution2, 3, sizeof(size_t), &x_dim), error,
-                 bufmask2);
-    OCLERROR_RET(
-        clSetKernelArg(convolution2, 4, sizeof(size_t), &grid_midpoint), error,
-        bufmask2);
+        cl_buffer_region input_region = { subdevice * input_offset,
+                                          half_input_bytes },
+                         output_region = { subdevice * half_output_bytes,
+                                           half_output_bytes },
+                         mask_region = { 0, mask_bytes };
+        OCLERROR_PAR(sub_input_grids[subdevice] = clCreateSubBuffer(
+                         dev_input_grid, CL_MEM_READ_ONLY,
+                         CL_BUFFER_CREATE_TYPE_REGION, &input_region, &error),
+                     error, subqueue);
+        OCLERROR_PAR(sub_output_grids[subdevice] = clCreateSubBuffer(
+                         dev_output_grid, CL_MEM_WRITE_ONLY,
+                         CL_BUFFER_CREATE_TYPE_REGION, &output_region, &error),
+                     error, subbufin);
+
+        if (diag_opts.verbose)
+        {
+            printf("done.");
+            fflush(stdout);
+        }
+
+        // Set kernel arguments. The output dimensions handled by this
+        // sub-device are passed as a single cl_uint2.
+        const cl_uint2 out_dims = { { (cl_uint)x_dim, (cl_uint)grid_midpoint } };
+        OCLERROR_RET(clSetKernelArg(convolutions[subdevice], 0, sizeof(cl_mem),
+                                    &sub_input_grids[subdevice]),
+                     error, subbufout);
+        OCLERROR_RET(clSetKernelArg(convolutions[subdevice], 1, sizeof(cl_mem),
+                                    &sub_output_grids[subdevice]),
+                     error, subbufout);
+        OCLERROR_RET(clSetKernelArg(convolutions[subdevice], 2, sizeof(cl_mem),
+                                    &dev_mask),
+                     error, subbufout);
+        OCLERROR_RET(clSetKernelArg(convolutions[subdevice], 3,
+                                    sizeof(cl_uint2), &out_dims),
+                     error, subbufout);
+    }

     // Launch kernels.
     if (diag_opts.verbose)
     {
-        printf("done.\nExecuting on device... 
"); + printf("\nExecuting on device... "); fflush(stdout); } - // Initialize global and local buffers for device execution. + // Enqueue kernel calls and wait for them to finish. const size_t* global = (size_t[]){ pad_x_dim, pad_y_dim }; - const size_t* local1 = (size_t[]){ wgs1, 1 }; - const size_t* local2 = (size_t[]){ wgs2, 1 }; - // Enqueue kernel calls and wait for them to finish. - cl_event *dev1_kernel_run = NULL, *dev2_kernel_run = NULL; - MEM_CHECK(dev1_kernel_run = (cl_event*)malloc(sizeof(cl_event)), error, - bufmask2); - MEM_CHECK(dev2_kernel_run = (cl_event*)malloc(sizeof(cl_event)), error, - run1); GET_CURRENT_TIMER(dev_start) - OCLERROR_RET(clEnqueueNDRangeKernel(queue1, convolution1, 2, NULL, global, - (pad_x_dim < wgs1) ? NULL : local1, 0, - NULL, dev1_kernel_run), - error, run2); - OCLERROR_RET(clEnqueueNDRangeKernel(queue2, convolution2, 2, NULL, global, - (pad_x_dim < wgs2) ? NULL : local2, 0, - NULL, dev2_kernel_run), - error, run2); - - OCLERROR_RET(clWaitForEvents(1, dev1_kernel_run), error, run2); - OCLERROR_RET(clWaitForEvents(1, dev2_kernel_run), error, run2); + OCLERROR_RET(clEnqueueNDRangeKernel(sub_queues[0], convolutions[0], 2, NULL, + global, NULL, 0, NULL, &events[0]), + error, subbufout); + OCLERROR_RET(clEnqueueNDRangeKernel(sub_queues[1], convolutions[1], 2, NULL, + global, NULL, 0, NULL, &events[1]), + error, event1); + + OCLERROR_RET(clWaitForEvents(1, &events[0]), error, event1); + OCLERROR_RET(clWaitForEvents(1, &events[1]), error, event2); GET_CURRENT_TIMER(dev_end) - cl_ulong dev_time; + size_t dev_time; TIMER_DIFFERENCE(dev_time, dev_start, dev_end) // Compute reference host-side convolution. if (diag_opts.verbose) { printf("done.\nExecuting on host... 
"); + fflush(stdout); } GET_CURRENT_TIMER(host_start) host_convolution(h_input_grid, h_output_grid, h_mask, x_dim, y_dim); GET_CURRENT_TIMER(host_end) - cl_ulong host_time; + size_t host_time; TIMER_DIFFERENCE(host_time, host_start, host_end) if (diag_opts.verbose) { printf("done.\n"); + fflush(stdout); } // Fetch and combine results from devices. cl_float* concatenated_results; - MEM_CHECK(concatenated_results = - (cl_float*)malloc(sizeof(cl_float) * x_dim * y_dim), - error, run2); - OCLERROR_RET(clEnqueueReadBuffer(queue1, dev1_output_grid, CL_BLOCKING, 0, - sizeof(cl_float) * x_dim * grid_midpoint, - concatenated_results, 0, NULL, NULL), - error, run2); - OCLERROR_RET( - clEnqueueReadBuffer(queue2, dev2_output_grid, CL_BLOCKING, 0, - sizeof(cl_float) * x_dim * grid_midpoint, - concatenated_results + x_dim * grid_midpoint, 0, - NULL, NULL), - error, run2); + const size_t mid_output_count = x_dim * grid_midpoint; + const size_t mid_output_bytes = sizeof(cl_float) * mid_output_count; + MEM_CHECK(concatenated_results = (cl_float*)malloc(output_bytes), error, + event2); + for (cl_uint i = 0; i < subdev_count; ++i) + { + OCLERROR_RET( + clEnqueueReadBuffer(sub_queues[i], sub_output_grids[i], CL_BLOCKING, + 0, mid_output_bytes, + &concatenated_results[i * mid_output_count], 0, + NULL, NULL), + error, result); + } // Validate device-side solution. cl_float deviation = 0.f; @@ -623,10 +539,12 @@ int main(int argc, char* argv[]) printf("Failed convolution! 
Normalized deviation %.6f between host and "
               "device exceeds tolerance %.6f\n",
               deviation, tolerance);
+        fflush(stdout);
     }
     else
     {
         printf("Successful convolution!\n");
+        fflush(stdout);
     }

     if (!diag_opts.quiet)
@@ -637,56 +555,81 @@ int main(int argc, char* argv[])
         printf("Kernels execution time as measured by devices :\n");
         printf("\t%llu us.\n",
                (unsigned long long)(cl_util_get_event_duration(
-                                        *dev1_kernel_run,
-                                        CL_PROFILING_COMMAND_START,
+                                        events[0], CL_PROFILING_COMMAND_START,
                                         CL_PROFILING_COMMAND_END, &error)
                                     + 500)
                    / 1000);
         printf("\t%llu us.\n",
                (unsigned long long)(cl_util_get_event_duration(
-                                        *dev2_kernel_run,
-                                        CL_PROFILING_COMMAND_START,
+                                        events[1], CL_PROFILING_COMMAND_START,
                                         CL_PROFILING_COMMAND_END, &error)
                                     + 500)
                    / 1000);
         printf("Reference execution as seen by host: %llu us.\n",
                (unsigned long long)(host_time + 500) / 1000);
+        fflush(stdout);
     }

-run2:
-    free(dev2_kernel_run);
-run1:
-    free(dev1_kernel_run);
-bufmask2:
-    OCLERROR_RET(clReleaseMemObject(dev2_mask), end_error, bufmask1);
-bufmask1:
-    OCLERROR_RET(clReleaseMemObject(dev1_mask), end_error, bufout2);
-bufout2:
-    OCLERROR_RET(clReleaseMemObject(dev2_output_grid), end_error, bufout1);
-bufout1:
-    OCLERROR_RET(clReleaseMemObject(dev1_output_grid), end_error, bufin2);
-bufin2:
-    OCLERROR_RET(clReleaseMemObject(dev2_input_grid), end_error, bufin1);
-bufin1:
-    OCLERROR_RET(clReleaseMemObject(dev1_input_grid), end_error, hinput);
-que2:
-    OCLERROR_RET(clReleaseCommandQueue(queue2), end_error, que1);
-que1:
-    OCLERROR_RET(clReleaseCommandQueue(queue1), end_error, cont);
+result:
+    free(concatenated_results);
+event2:
+    OCLERROR_RET(clReleaseEvent(events[1]), end_error, event1);
+event1:
+    OCLERROR_RET(clReleaseEvent(events[0]), end_error, subbufout);
+subbufout:
+    if (subdevice == 1)
+    {
+        OCLERROR_RET(clReleaseMemObject(sub_output_grids[1]), end_error,
+                     subbufout0);
+    }
+subbufout0:
+    OCLERROR_RET(clReleaseMemObject(sub_output_grids[0]), end_error, subbufin);
+subbufin:
+    if (subdevice == 1)
+    {
+        OCLERROR_RET(clReleaseMemObject(sub_input_grids[1]), end_error,
+                     subbufin0);
+    }
+subbufin0:
+    OCLERROR_RET(clReleaseMemObject(sub_input_grids[0]), end_error, subqueue);
+subqueue:
+    if (subdevice == 1)
+    {
+        OCLERROR_RET(clReleaseCommandQueue(sub_queues[1]), end_error,
+                     subqueue0);
+    }
+subqueue0:
+    OCLERROR_RET(clReleaseCommandQueue(sub_queues[0]), end_error, conv);
+conv:
+    if (subdevice == 1)
+    {
+        OCLERROR_RET(clReleaseKernel(convolutions[1]), end_error, conv0);
+    }
+conv0:
+    OCLERROR_RET(clReleaseKernel(convolutions[0]), end_error, bufmask);
+bufmask:
+    OCLERROR_RET(clReleaseMemObject(dev_mask), end_error, bufout);
+bufout:
+    OCLERROR_RET(clReleaseMemObject(dev_output_grid), end_error, bufin);
+bufin:
+    OCLERROR_RET(clReleaseMemObject(dev_input_grid), end_error, hmask);
+hmask:
+    free(h_mask);
+houtput:
+    free(h_output_grid);
 hinput:
     free(h_input_grid);
-conv:
-    OCLERROR_RET(clReleaseKernel(convolution1), end_error, prg);
-    OCLERROR_RET(clReleaseKernel(convolution2), end_error, prg);
 prg:
-    OCLERROR_RET(clReleaseProgram(program1), end_error, ker);
-    OCLERROR_RET(clReleaseProgram(program2), end_error, ker);
+    OCLERROR_RET(clReleaseProgram(program), end_error, ker);
 ker:
     free(kernel);
-cont:
-    OCLERROR_RET(clReleaseContext(context1), end_error, end);
-    OCLERROR_RET(clReleaseContext(context2), end_error, end);
+contx:
+    OCLERROR_RET(clReleaseContext(context), end_error, subdevs);
+subdevs:
+    free(subdevices);
+dev:
+    OCLERROR_RET(clReleaseDevice(dev), end_error, end);
 end:
     if (error) cl_util_print_error(error);
     return error;
diff --git a/samples/core/multi-device/main.cpp b/samples/core/multi-device/main.cpp
index 8f06c41b..00ef559c 100644
--- a/samples/core/multi-device/main.cpp
+++ b/samples/core/multi-device/main.cpp
@@ -46,116 +46,53 @@
 // Sample-specific options.
 struct ConvolutionOptions
 {
-    size_t x_dim;
-    size_t y_dim;
+    cl_uint x_dim;
+    cl_uint y_dim;
 };

 // Add option to CLI-parsing SDK utility for input dimensions.
 template <> auto cl::sdk::parse<ConvolutionOptions>()
 {
-    return std::make_tuple(std::make_shared<TCLAP::ValueArg<size_t>>(
+    return std::make_tuple(std::make_shared<TCLAP::ValueArg<cl_uint>>(
                                "x", "x_dim", "x dimension of input", false,
-                               4'000, "positive integral"),
-                           std::make_shared<TCLAP::ValueArg<size_t>>(
+                               4'096, "positive integral"),
+                           std::make_shared<TCLAP::ValueArg<cl_uint>>(
                                "y", "y_dim", "y dimension of input", false,
-                               4'000, "positive integral"));
+                               4'096, "positive integral"));
 }
 template <>
 ConvolutionOptions cl::sdk::comprehend<ConvolutionOptions>(
-    std::shared_ptr<TCLAP::ValueArg<size_t>> x_dim_arg,
-    std::shared_ptr<TCLAP::ValueArg<size_t>> y_dim_arg)
+    std::shared_ptr<TCLAP::ValueArg<cl_uint>> x_dim_arg,
+    std::shared_ptr<TCLAP::ValueArg<cl_uint>> y_dim_arg)
 {
     return ConvolutionOptions{ x_dim_arg->getValue(), y_dim_arg->getValue() };
 }

-// Add option to CLI parsing SDK utility for multi-device type.
-template <> auto cl::sdk::parse<cl::sdk::options::MultiDevice>()
-{
-    std::vector<std::string> valid_dev_strings{ "all", "cpu", "gpu",
-                                                "acc", "cus", "def" };
-    valid_dev_constraint =
-        std::make_unique<TCLAP::ValuesConstraint<std::string>>(
-            valid_dev_strings);
-
-    return std::make_tuple(std::make_shared<TCLAP::ValueArg<std::string>>(
-        "t", "type", "Type of device to use", false, "all",
-        valid_dev_constraint.get()));
-}
-template <>
-cl::sdk::options::MultiDevice
-cl::sdk::comprehend<cl::sdk::options::MultiDevice>(
-    std::shared_ptr<TCLAP::ValueArg<std::string>> type_arg)
-{
-    cl_device_type device_type = [](std::string in) -> cl_device_type {
-        if (in == "all")
-            return CL_DEVICE_TYPE_ALL;
-        else if (in == "cpu")
-            return CL_DEVICE_TYPE_CPU;
-        else if (in == "gpu")
-            return CL_DEVICE_TYPE_GPU;
-        else if (in == "acc")
-            return CL_DEVICE_TYPE_ACCELERATOR;
-        else if (in == "cus")
-            return CL_DEVICE_TYPE_CUSTOM;
-        else if (in == "def")
-            return CL_DEVICE_TYPE_DEFAULT;
-        else
-            throw std::logic_error{ "Unkown device type after CLI parse." };
-    }(type_arg->getValue());
-
-    std::vector<cl::Platform> platforms;
-    cl::Platform::get(&platforms);
-    cl::sdk::options::MultiDevice devices;
-
-    // Get number of platforms.
-    cl_uint num_platforms = 0;
-    clGetPlatformIDs(0, nullptr, &num_platforms);
-
-    // Register triplets {platform, device, device_type} for each device in each
-    // platform.
-    for (cl_uint platform_id = 0; platform_id < num_platforms; ++platform_id)
-    {
-        // Get devices IDs.
-        cl::Platform platform = platforms[platform_id];
-        std::vector<cl::Device> platform_devices;
-        platform.getDevices(device_type, &platform_devices);
-
-        // Register triplets.
-        for (cl_uint device_id = 0; device_id < platform_devices.size();
-             ++device_id)
-        {
-            devices.triplets.push_back(cl::sdk::options::DeviceTriplet{
-                platform_id, device_id, device_type });
-        }
-    }
-
-    return devices;
-}
-
 // Host-side implementation of the convolution for verification. Padded input
 // assumed.
-void host_convolution(std::vector<cl_float> in, std::vector<cl_float>& out,
-                      std::vector<cl_float> mask, size_t x_dim, size_t y_dim)
+void host_convolution(const std::vector<cl_float> in,
+                      std::vector<cl_float>& out,
+                      const std::vector<cl_float> mask, const cl_uint x_dim,
+                      const cl_uint y_dim)
 {
     constexpr cl_uint mask_dim = 3;
     constexpr cl_uint pad_width = mask_dim / 2;
-    const size_t pad_x_dim = x_dim + 2 * pad_width;
-    for (size_t x = 0; x < x_dim; ++x)
+    const cl_uint in_dim_x = x_dim + pad_width * 2;
+
+    for (cl_uint gid_x = 0; gid_x < x_dim; ++gid_x)
     {
-        for (size_t y = 0; y < y_dim; ++y)
+        for (cl_uint gid_y = 0; gid_y < y_dim; ++gid_y)
         {
             float result = 0.f;
-            for (size_t grid_column = x, mask_column = 0;
-                 mask_column < mask_dim; ++grid_column, ++mask_column)
+            for (cl_uint y = 0; y < mask_dim; ++y)
             {
-                for (size_t grid_row = y, mask_row = 0; mask_row < mask_dim;
-                     ++grid_row, ++mask_row)
+                for (cl_uint x = 0; x < mask_dim; ++x)
                 {
-                    result += mask[mask_column + mask_row * mask_dim]
-                              * in[grid_column + grid_row * pad_x_dim];
+                    result += mask[y * mask_dim + x]
+                              * in[(gid_y + y) * in_dim_x + (gid_x + x)];
                 }
             }
-            out[x + y * x_dim] = result;
+            out[gid_y * x_dim + gid_x] = result;
         }
     }
 }
@@ -166,60 +103,60 @@ int main(int argc, char* argv[])
 {
     // Parse command-line options.
     auto opts = cl::sdk::parse_cli<cl::sdk::options::Diagnostic,
-                                   cl::sdk::options::MultiDevice,
+                                   cl::sdk::options::SingleDevice,
                                    ConvolutionOptions>(argc, argv);
     const auto& diag_opts = std::get<0>(opts);
-    const auto& devs_opts = std::get<1>(opts);
+    const auto& dev_opts = std::get<1>(opts);
     const auto& conv_opts = std::get<2>(opts);

-    // Check availability of OpenCL devices.
-    std::vector<cl::sdk::options::DeviceTriplet> triplets =
-        devs_opts.triplets;
-
-    if (!triplets.size())
-    {
-        std::cerr << "Error: No OpenCL devices available" << std::endl;
-        return -1;
-    }
-    else if (triplets.size() < 2)
-    {
-        std::cout << "Not enough OpenCL devices available for "
-                     "multi-device. Using only one device."
-                  << std::endl;
-    }
-
     // Create runtime objects based on user preference or default.
-    cl::Context context1 = cl::sdk::get_context(triplets.at(0));
-    cl::Context context2 =
-        cl::sdk::get_context(triplets.at((triplets.size() >= 2)));
-    cl::Device dev1 = context1.getInfo<CL_CONTEXT_DEVICES>().at(0);
-    cl::Device dev2 = context2.getInfo<CL_CONTEXT_DEVICES>().at(0);
-    cl::Platform platform1{
-        dev1.getInfo<CL_DEVICE_PLATFORM>()
+    cl::Device dev = cl::sdk::get_context(dev_opts.triplet)
+                         .getInfo<CL_CONTEXT_DEVICES>()
+                         .at(0);
+
+    cl::Platform platform{
+        dev.getInfo<CL_DEVICE_PLATFORM>()
     }; // https://github.com/KhronosGroup/OpenCL-CLHPP/issues/150
-    cl::Platform platform2{ dev2.getInfo<CL_DEVICE_PLATFORM>() };

     if (!diag_opts.quiet)
     {
-        std::cout << "First selected device: "
-                  << dev1.getInfo<CL_DEVICE_NAME>() << "\n"
-                  << "Using " << platform1.getInfo<CL_PLATFORM_VENDOR>()
-                  << " platform\n"
-                  << std::endl;
-        std::cout << "\nSecond selected device: "
-                  << dev2.getInfo<CL_DEVICE_NAME>() << "\n"
-                  << "Using " << platform2.getInfo<CL_PLATFORM_VENDOR>()
+        std::cout << "Selected device: " << dev.getInfo<CL_DEVICE_NAME>()
+                  << "\n"
+                  << "from " << platform.getInfo<CL_PLATFORM_VENDOR>()
                   << " platform\n"
                   << std::endl;
     }

-    // Query device and runtime capabilities.
-    auto d1_highest_device_opencl_c_is_2_x =
-        cl::util::opencl_c_version_contains(dev1, "2.");
-    auto d1_highest_device_opencl_c_is_3_x =
-        cl::util::opencl_c_version_contains(dev1, "3.");
+    if (diag_opts.verbose)
+    {
+        std::cout << "Creating sub-devices...";
+        std::cout.flush();
+    }

-    // Compile kernel.
+#if CL_HPP_TARGET_OPENCL_VERSION < 120
+    std::cerr
+        << "Error: OpenCL subdevices not supported before version 1.2 "
+        << std::endl;
+    exit(EXIT_FAILURE);
+#endif
+
+    // Create subdevices, each with half of the compute units available.
+    cl_uint max_compute_units = dev.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
+    cl_device_partition_property subdevices_properties[] = {
+        CL_DEVICE_PARTITION_EQUALLY, max_compute_units / 2, 0
+    };
+    std::vector<cl::Device> subdevices{};
+    dev.createSubDevices(subdevices_properties, &subdevices);
+
+    if (subdevices.size() < 2)
+    {
+        std::cerr << "Error: OpenCL cannot create subdevices" << std::endl;
+        exit(EXIT_FAILURE);
+    }
+
+    cl::Context context(subdevices);
+
+    // Read kernel file.
     const char* kernel_location = "./convolution.cl";
     std::ifstream kernel_stream{ kernel_location };
     if (!kernel_stream.is_open())
@@ -227,66 +164,56 @@ int main(int argc, char* argv[])
             std::string{ "Cannot open kernel source: " } + kernel_location
         };

-    cl::Program program1{ context1,
-                          std::string{ std::istreambuf_iterator<char>{
-                                           kernel_stream },
-                                       std::istreambuf_iterator<char>{} } };
-    // Clear error state flags and reset read position to the beginning of
-    // the .cl file before creating the second program.
-    kernel_stream.clear();
-    kernel_stream.seekg(0, kernel_stream.beg);
-    cl::Program program2{ context2,
-                          std::string{ std::istreambuf_iterator<char>{
-                                           kernel_stream },
-                                       std::istreambuf_iterator<char>{} } };
+    // Compile kernel.
+    if (diag_opts.verbose)
+    {
+        std::cout << "done.\nCompiling kernel...";
+        std::cout.flush();
+    }
+    cl::Program program(
+        context,
+        std::string{ std::istreambuf_iterator<char>{ kernel_stream },
+                     std::istreambuf_iterator<char>{} });
+
+    // Query device and runtime capabilities.
     // If no -cl-std option is specified then the highest 1.x version
     // supported by each device is used to compile the program. Therefore,
     // it's only necessary to add the -cl-std option for 2.0 and 3.0 OpenCL
     // versions.
-    cl::string compiler_options =
-        cl::string{ d1_highest_device_opencl_c_is_2_x ? 
"-cl-std=CL2.0 " - : "" } - + cl::string{ d1_highest_device_opencl_c_is_3_x ? "-cl-std=CL3.0 " - : "" }; - program1.build(dev1, compiler_options.c_str()); - program2.build(dev2, compiler_options.c_str()); - - // Query maximum workgroup size (WGS) of kernel supported on each device - // based on private mem (registers) constraints. - auto convolution1 = - cl::KernelFunctor(program1, "convolution_3x3"); - auto convolution2 = - cl::KernelFunctor(program2, "convolution_3x3"); - auto wgs1 = convolution1.getKernel() - .getWorkGroupInfo(dev1); - auto wgs2 = convolution2.getKernel() - .getWorkGroupInfo(dev2); + cl::string compiler_options; + constexpr int max_major_version = 3; + for (auto i = 2; i <= max_major_version; ++i) + { + std::string version_str = std::to_string(i) + "."; // "i." + std::string compiler_opt_str = + "-cl-std=CL" + std::to_string(i) + ".0 "; // -cl-std=CLi.0 + + compiler_options += cl::string{ cl::util::opencl_c_version_contains( + dev, version_str) + ? compiler_opt_str + : "" }; + } + program.build(subdevices, compiler_options.c_str()); // Initialize host-side storage. constexpr cl_uint mask_dim = 3; constexpr cl_uint pad_width = mask_dim / 2; - const size_t x_dim = conv_opts.x_dim; - const size_t y_dim = conv_opts.y_dim; - const size_t pad_x_dim = x_dim + 2 * pad_width; - const size_t pad_y_dim = y_dim + 2 * pad_width; - - // Check that the WGSs can divide the global size (MacOS reports - // CL_INVALID_WORK_GROUP_SIZE otherwise). If WGS is smaller than the x - // dimension, then a NULL pointer will be used when initialising - // cl::EnqueueArgs for enqueuing the kernels. 
- if (pad_x_dim % wgs1 && pad_x_dim > wgs1) - { - size_t div = pad_x_dim / wgs1; - wgs1 = sqrt(div * wgs1); - } + const cl_uint x_dim = conv_opts.x_dim; + const cl_uint y_dim = conv_opts.y_dim; + const cl_uint pad_x_dim = x_dim + 2 * pad_width; + const cl_uint pad_y_dim = y_dim + 2 * pad_width; + + const size_t input_size = pad_x_dim * pad_y_dim; + const size_t output_size = x_dim * y_dim; + const size_t mask_size = mask_dim * mask_dim; + const size_t input_bytes = sizeof(cl_float) * input_size; + const size_t output_bytes = sizeof(cl_float) * output_size; + const size_t mask_bytes = sizeof(cl_float) * mask_size; - if (pad_x_dim % wgs2 && pad_x_dim > wgs2) + if (diag_opts.verbose) { - size_t div = pad_x_dim / wgs2; - wgs2 = sqrt(div * wgs2); + std::cout << "done.\nInitializing host-side storage..."; + std::cout.flush(); } // Random number generator. @@ -297,107 +224,179 @@ int main(int argc, char* argv[]) // Initialize input matrix. The input will be padded to remove // conditional branches from the convolution kernel for determining // out-of-bounds. - std::vector h_input_grid(pad_x_dim * pad_y_dim); + std::vector h_input_grid(input_size); if (diag_opts.verbose) { - std::cout << "Generating " << x_dim * y_dim - << " random numbers for convolution input grid." - << std::endl; + std::cout << "\n Generating " << output_size + << " random numbers for convolution input grid..."; + std::cout.flush(); } cl::sdk::fill_with_random(prng, h_input_grid); // Fill with 0s the extra rows and columns added for padding. - for (size_t j = 0; j < pad_x_dim; ++j) + for (cl_uint y = 0; y < pad_y_dim; ++y) { - for (size_t i = 0; i < pad_y_dim; ++i) + for (cl_uint x = 0; x < pad_x_dim; ++x) { - if (i == 0 || j == 0 || i == (pad_y_dim - 1) - || j == (pad_x_dim - 1)) + if (x == 0 || y == 0 || x == (pad_x_dim - 1) + || y == (pad_y_dim - 1)) { - h_input_grid[j + i * pad_x_dim] = 0; + h_input_grid[y * pad_x_dim + x] = 0; } } } // Declare output matrix. Output will not be padded. 
-    std::vector<cl_float> h_output_grid(x_dim * y_dim);
+    std::vector<cl_float> h_output_grid(output_size, 0);

     // Initialize convolution mask.
-    std::vector<cl_float> h_mask(mask_dim * mask_dim);
+    std::vector<cl_float> h_mask(mask_size);
     if (diag_opts.verbose)
     {
-        std::cout << "Generating " << mask_dim * mask_dim
-                  << " random numbers for convolution mask." << std::endl;
+        std::cout << "done.\nGenerating " << mask_size
+                  << " random numbers for convolution mask...";
+        std::cout.flush();
     }
     cl::sdk::fill_with_random(prng, h_mask);

-    // Initialize device-side storage.
-    const size_t grid_midpoint = y_dim / 2;
-    const size_t pad_grid_midpoint = pad_y_dim / 2;
+    // Create device buffers, from which we will create the subbuffers for
+    // the subdevices.
+    const cl_uint grid_midpoint = y_dim / 2;
+    const cl_uint pad_grid_midpoint = pad_y_dim / 2;

     if (diag_opts.verbose)
     {
-        std::cout << "Initializing device-side storage...";
+        std::cout << "done.\nInitializing device-side storage...";
         std::cout.flush();
     }

-    // Initialize queues for command execution on each device.
-    cl::CommandQueue queue1{ context1, dev1,
-                             cl::QueueProperties::Profiling };
-    cl::CommandQueue queue2{ context2, dev2,
-                             cl::QueueProperties::Profiling };
+    cl::Buffer dev_input_grid(context,
+                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR
+                                  | CL_MEM_HOST_NO_ACCESS,
+                              input_bytes, h_input_grid.data());
+    cl::Buffer dev_output_grid(context,
+                               CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR
+                                   | CL_MEM_HOST_READ_ONLY,
+                               output_bytes);
+    cl::Buffer dev_mask(context,
+                        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR
+                            | CL_MEM_HOST_NO_ACCESS,
+                        mask_bytes, h_mask.data());
+
+    if (diag_opts.verbose)
+    {
+        std::cout << "done.\nSetting up sub-devices...";
+        std::cout.flush();
+    }
+
+    // Set up subdevices for kernel execution.
+    const size_t half_input_bytes =
+        sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint + 1);
+    const size_t input_offset =
+        sizeof(cl_float) * pad_x_dim * (pad_grid_midpoint - 1);
+    const size_t half_output_bytes =
+        sizeof(cl_float) * x_dim * grid_midpoint;
+
+    std::vector<
+        cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer, cl_uint2>>
+        convolutions{};
+    std::vector<cl::CommandQueue> sub_queues{};
+    std::vector<cl::Buffer> sub_input_grids{}, sub_output_grids{};
+
+    for (size_t i = 0; i < subdevices.size(); ++i)
+    {
+        auto subdevice = subdevices[i];
+
+        if (diag_opts.verbose)
+        {
+            std::cout
+                << "\n Creating kernel and command queue of sub-device "
+                << i << "...";
+            std::cout.flush();
+        }
+
+        auto convolution =
+            cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer, cl_uint2>(
+                program, "convolution_3x3")
+                .getKernel();
+
+        cl::CommandQueue queue(context, subdevice,
+                               cl::QueueProperties::Profiling);

-    // First device performs the convolution in the upper half (middle
-    // border included).
-    cl::Buffer dev1_input_grid(
-        queue1, h_input_grid.begin(),
-        h_input_grid.begin() + pad_x_dim * (pad_grid_midpoint + 1), false);
+        // Initialize device-side storage.
+        // First device performs the convolution in the upper half and
+        // second device in the lower half (middle borders included).
+        if (diag_opts.verbose)
+        {
+            std::cout << "done.\n Initializing device-side storage of "
+                         "sub-device "
+                      << i << "...";
+            std::cout.flush();
+        }

-    // Second device performs the convolution in the lower half (middle
-    // border included).
-    cl::Buffer dev2_input_grid(
-        queue2, h_input_grid.begin() + pad_x_dim * (pad_grid_midpoint - 1),
-        h_input_grid.end(), false);
+        cl_buffer_region input_region = { i * input_offset,
+                                          half_input_bytes },
+                         output_region = { i * half_output_bytes,
+                                           half_output_bytes },
+                         mask_region = { 0, mask_bytes };

-    cl::Buffer dev1_output_grid(queue1, h_output_grid.begin(),
-                                h_output_grid.end(), false);
-    cl::Buffer dev2_output_grid(queue2, h_output_grid.begin(),
-                                h_output_grid.end(), false);
+        const cl_uint align =
+            subdevice.getInfo<CL_DEVICE_MEM_BASE_ADDR_ALIGN>();
+        if (input_region.origin % align || output_region.origin % align)
+        {
+            std::cerr << "Error: Memory should be aligned to "
+                      << subdevice.getInfo<CL_DEVICE_MEM_BASE_ADDR_ALIGN>()
+                      << std::endl;
+            exit(EXIT_FAILURE);
+        }

-    cl::Buffer dev1_mask{ queue1, h_mask.begin(), h_mask.end(), false };
-    cl::Buffer dev2_mask{ queue2, h_mask.begin(), h_mask.end(), false };
+        cl::Buffer sub_input_grid = dev_input_grid.createSubBuffer(
+            CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &input_region);
+        cl::Buffer sub_output_grid = dev_output_grid.createSubBuffer(
+            CL_MEM_WRITE_ONLY, CL_BUFFER_CREATE_TYPE_REGION,
+            &output_region);
+
+        if (diag_opts.verbose)
+        {
+            std::cout << "done.";
+            std::cout.flush();
+        }
+
+        convolutions.push_back(convolution.clone());
+        sub_queues.push_back(queue);
+        sub_input_grids.push_back(sub_input_grid);
+        sub_output_grids.push_back(sub_output_grid);
+    }

     // Launch kernels.
     if (diag_opts.verbose)
     {
-        std::cout << " done.\nExecuting on device... ";
+        std::cout << "\nExecuting on device... ";
         std::cout.flush();
     }

     // Initialize global and local buffers for device execution.
     const cl::NDRange global{ pad_x_dim, pad_y_dim };
-    const cl::NDRange local1{ wgs1, 1 };
-    const cl::NDRange local2{ wgs2, 1 };

     // Enqueue kernel calls and wait for them to finish.
     std::vector<cl::Event> dev1_kernel_runs;
+    dev1_kernel_runs.reserve(1);
     std::vector<cl::Event> dev2_kernel_runs;
+    dev2_kernel_runs.reserve(1);
     auto dev_start = std::chrono::high_resolution_clock::now();

-    dev1_kernel_runs.push_back(convolution1(
-        cl::EnqueueArgs{ queue1, global,
-                         (pad_x_dim < wgs1) ? cl::NullRange : local1 },
-        dev1_input_grid, dev1_output_grid, dev1_mask, x_dim,
-        grid_midpoint));
-    dev2_kernel_runs.push_back(convolution2(
-        cl::EnqueueArgs{ queue2, global,
-                         (pad_x_dim < wgs2) ? cl::NullRange : local2 },
-        dev2_input_grid, dev2_output_grid, dev2_mask, x_dim,
-        grid_midpoint));
+    dev1_kernel_runs.push_back(convolutions[0](
+        cl::EnqueueArgs{ sub_queues[0], global }, sub_input_grids[0],
+        sub_output_grids[0], dev_mask,
+        cl_uint2{ { x_dim, grid_midpoint } }));
+
+    dev2_kernel_runs.push_back(convolutions[1](
+        cl::EnqueueArgs{ sub_queues[1], global }, sub_input_grids[1],
+        sub_output_grids[1], dev_mask,
+        cl_uint2{ { x_dim, grid_midpoint } }));

     cl::WaitForEvents(dev1_kernel_runs);
     cl::WaitForEvents(dev2_kernel_runs);
-
     auto dev_end = std::chrono::high_resolution_clock::now();

     // Compute reference host-side convolution.
@@ -412,15 +411,15 @@ int main(int argc, char* argv[])

     auto host_end = std::chrono::high_resolution_clock::now();

-    if (diag_opts.verbose) std::cout << "done." << std::endl;
+    if (diag_opts.verbose)
+    {
+        std::cout << "done." << std::endl;
+    }

     // Fetch and combine results from devices.
-    std::vector<cl_float> concatenated_results(x_dim * y_dim);
-    cl::copy(queue1, dev1_output_grid, concatenated_results.begin(),
-             concatenated_results.begin() + x_dim * grid_midpoint);
-    cl::copy(queue2, dev2_output_grid,
-             concatenated_results.begin() + x_dim * grid_midpoint,
-             concatenated_results.end());
+    std::vector<cl_float> concatenated_results(output_size);
+    cl::copy(sub_queues.front(), dev_output_grid,
+             concatenated_results.begin(), concatenated_results.end());

     // Validate device-side solution.
     cl_float deviation = 0.f;