Merge pull request #484 from Pennycook/synchronization

Clarify sections previously using "synchronize"
KhronosGroup · Nov 10, 2023 · 463a26c · 463a26c
2 parents e8d6bb6 + fb391b3
commit 463a26c
Show file tree

Hide file tree

Showing 6 changed files with 149 additions and 163 deletions.
diff --git a/adoc/chapters/architecture.adoc b/adoc/chapters/architecture.adoc
@@ -461,7 +461,7 @@ each other without side effects (i.e., given two requirements _r~j~_ and _r~k~_,
 _(r~j~, r~k~)_ {equiv} _(r~k~, r~j~)_). The intersection of two
 actions is not necessarily empty.
 *Actions* can include (but are not limited to): memory copy operations,
-mapping operations, host side synchronization, or implementation-specific
+memory mapping operations, coordination with the host, or implementation-specific
 behavior.
 
 Finally, _Performing an action_ ({SYCLperform}(_a~i~_)) executes the
@@ -562,10 +562,10 @@ image::{images}/three-cg-three-queue.svg[align="center",opts="{imageopts}"]
 ==== Controlling execution order with events
 
 Submitting an action for execution returns an [code]#event# object. Programmers
-may use these events to explicitly synchronize programs. Host code can wait for an
-event to complete, which will block execution on the host until the action represented
-by the event has completed.  The [code]#event# class is described in greater detail
-in <<sec:interface.event>>.
+may use these events to explicitly coordinate host and device execution. Host
+code can wait for an event to complete, which will block execution on the host
+until the action(s) represented by the event have completed.  The [code]#event#
+class is described in greater detail in <<sec:interface.event>>.
 
 Events may also be used to explicitly order the execution of kernels. Host code may
 wait for the completion of specific event, which blocks execution on the host until
@@ -627,7 +627,7 @@ Each work-item in the ND-range is identified by a value of type
 [code]#nd_item<N>#.  The type [code]#nd_item<N># encapsulates a
 global id, local id and work-group id, all of type [code]#id<N>#
 (the iteration space offset also of type [code]#id<N>#, but this is deprecated in SYCL 2020), as well as
-global and local ranges and synchronization operations necessary to
+global and local ranges and coordination mechanisms necessary to
 make work-groups useful. Work-groups are assigned ids using a similar
 approach to that used for work-item global ids.  Work-items are
 assigned to a work-group and given a local id with components in the
@@ -872,7 +872,7 @@ execute concurrently with the barrier.
 Finally, _CG~c~_ will be enqueued after _H(b1~RW~)_ is finished,
 but still has to wait for _CG~b~_ to conclude for all its requirements to
 be satisfied.
-See <<sec:synchronization>> for details on synchronization rules.
+See <<sec:coordination>> for details on host-device coordination.
 
 // Formerly in host_acc.tex
 // Uses same definitions for cga, cgb, cgc, and hostA in code and picture,
@@ -1011,11 +1011,11 @@ made consistent across the work-items in that set through the use of
 <<mem-fence>> and atomic operations.
 
 Memory consistency between the host and SYCL device(s), or different SYCL
-devices in the same context, can be guaranteed through synchronization in
-the host application as defined in <<sec:synchronization>>.  On SYCL devices
-supporting concurrent atomic accesses to USM allocations and acquire-release or sequentially
-consistent memory orderings, cross-device memory consistency can be
-enforced through the use of <<mem-fence>> and atomic operations.
+devices in the same context, can be guaranteed through library calls in the
+host application, as defined in <<sec:coordination>>. On SYCL devices
+supporting concurrent atomic accesses to USM allocations and acquire-release or
+sequentially consistent memory orderings, cross-device memory consistency can
+be enforced through the use of <<mem-fence>> and atomic operations.
 
 [[sec:memory-ordering]]
 ==== Memory ordering
@@ -1207,10 +1207,10 @@ sections of this specification.
 === Basic data parallel kernels
 
 Data-parallel <<kernel,kernels>> that execute as
-multiple <<work-item,work-items>> and where no local synchronization is required are enqueued
-with the [code]#sycl::parallel_for# function parameterized by a
-[code]#sycl::range# parameter. These kernels will execute the kernel
-function body once for each work-item in the specified <<range>>.
+multiple <<work-item,work-items>> and where no work-group-local coordination is
+required are enqueued with the [code]#sycl::parallel_for# function
+parameterized by a [code]#sycl::range# parameter. These kernels will execute
+the kernel function body once for each work-item in the specified <<range>>.
 
 Functionality tied to <<group, groups>> of work-items, including
 <<group-barrier, group barriers>> and <<local-memory>>, must not be used
@@ -1228,10 +1228,10 @@ work-group size as parameters to the [code]#sycl::parallel_for# function with a
 [code]#sycl::nd_range# parameter. In this mode of execution,
 kernels execute over the <<nd-range>> in work-groups of the specified
 size. It is possible to share data among work-items within the same
-work-group in <<local-memory,local>> or <<global-memory>> and to synchronize between
-work-items in the same work-group by calling the
-[code]#group_barrier# function. All work-groups in a given
-[code]#parallel_for# will be the same size, and the global size
+work-group in <<local-memory,local>> or <<global-memory>>, and the
+[code]#group_barrier# function can be used to block a work-item until all
+work-items in the same work-group arrive at the barrier. All work-groups in a
+given [code]#parallel_for# will be the same size, and the global size
 defined in the nd-range must either be a multiple of the work-group size in
 each dimension, or the global size must be zero. When the global size
 is zero, the kernel function is not executed, the local size is ignored, and
@@ -1246,8 +1246,8 @@ work-items in any sub-group in a kernel is based on a combination of the kernel
 and its dispatch dimensions.  The size of any sub-group in the dispatch is
 between 1 and this maximum sub-group size, and the size of an individual
 sub-group is invariant for the duration of a kernel's execution.  Similarly
-to work-groups, the work-items within the same sub-group can be synchronized
-by calling the [code]#group_barrier# function.
+to work-groups, the [code]#group_barrier# function can be used to block a
+work-item until all work-items in the same sub-group arrive at the barrier.
 
 Portable device code must not assume that work-items within a sub-group execute
 in any particular order, that work-groups are subdivided into sub-groups in a
@@ -1319,55 +1319,61 @@ These kernels are not programmable, hence they are not bound by the SYCL
 implementation-defined.
 
 
-[[sec:synchronization]]
-=== Synchronization
+[[sec:coordination]]
+=== Coordination and Synchronization
 
-Synchronization of processing elements executing inside a device is handled
-by the SYCL device kernel following the SYCL kernel execution model.
-The synchronization of the different SYCL device kernels executing with
-the host memory is handled by the SYCL application via the SYCL runtime.
+Coordination between the host and any devices can be expressed in the host SYCL
+application using calls into the SYCL runtime. Coordination between work-items
+executing inside of device code can be expressed using group barriers.
 
-==== Synchronization in the SYCL application
+Some function calls synchronize with other function calls performed by another
+thread (potentially on another device). Other functions are defined in terms of
+their synchronization operations. Such functions can be used to ensure that
+the host and any devices do not access data concurrently, and/or to reason
+about the ordering of operations across the host and any devices.
 
-Synchronization points between host and device(s) are exposed through
-the following operations:
+==== Host-Device Coordination
+
+The following operations can be used to coordinate host and device(s):
 
   * _Buffer destruction_: The destructors for
     [code]#sycl::buffer#, [code]#sycl::unsampled_image# and
-    [code]#sycl::sampled_image# objects wait for all submitted work on
-    those objects to complete and to copy the data back to host memory
-    before returning. These destructors only wait if the object was
-    constructed with attached host memory and if data needs to be copied
-    back to the host.
+    [code]#sycl::sampled_image# objects block until all submitted work on
+    those objects completes and copy the data back to host memory before
+    returning. These destructors only block if the object was constructed with
+    attached host memory and if data needs to be copied back to the host.
 +
 --
-More complex forms of synchronization on buffer destruction
-can be specified by the user by constructing buffers with other kinds of
-references to memory, such as [code]#shared_ptr# and [code]#unique_ptr#.
+More complex forms of buffer destruction can be specified by the user by
+constructing buffers with other kinds of references to memory, such as
+[code]#shared_ptr# and [code]#unique_ptr#.
 --
-  * _Host Accessors_: The constructor for a host accessor waits for all
-    kernels that modify the same buffer (or image) in any queues to complete
+  * _Host Accessors_: The constructor for a host accessor blocks until all
+    kernels that modify the same buffer (or image) in any queues complete
     and then copies data back to host memory before the constructor returns.
     Any command groups with requirements to the same memory object cannot
     execute until the host accessor is destroyed as shown on <<fig:host-acc>>.
   * _Command group enqueue_: The <<sycl-runtime>> internally ensures
     that any command groups added to queues have the correct event
     dependencies added to those queues to ensure correct operation. Adding
-    command groups to queues never blocks. Instead any required
-    synchronization is added to the queue and events of type
-    [code]#sycl::event# are returned by the queue's submit function
-    that contain event information related to the specific command group.
+    command groups to queues never blocks, and the [code]#sycl::event# returned
+    by the queue's submit function contains event information related to the
+    specific command group.
   * _Queue operations_: The user can manually use queue operations,
     such as [code]#sycl::queue::wait()# to block execution of the calling thread until all
     the command groups submitted to the queue have finished execution. Note
     that this will also affect the dependencies of those command groups in
     other queues.
   * _SYCL event objects_: SYCL provides [code]#sycl::event#
-    objects which can be used for synchronization. If synchronization is
-    required across SYCL contexts from different <<backend, SYCL backends>>, then the
-    <<sycl-runtime>> ensures that extra host-based synchronization is
-    added to enable the SYCL event objects to operate between contexts
-    correctly.
+    objects which can be used to track and specify dependencies. The
+    <<sycl-runtime>> must ensure that these objects can be used to
+    enforce dependencies that span SYCL contexts from different <<backend, SYCL
+    backends>>.
+
+The specification for each of these blocking functions defines some set of
+operations that cause the function to unblock. These operations always happen
+before the blocking function returns (using the definition of "happens before"
+from the C++ specification).
 
 Note that the destructors of other SYCL objects
 ([code]#sycl::queue#, [code]#sycl::context#,{ldots}) do
@@ -1380,22 +1386,18 @@ cases if this does not fit the goal.
 See <<sec:managing-object-lifetimes>> for more information
 on object life time.
 
-==== Synchronization in SYCL kernels
-
-In SYCL, synchronization can be either global or local within a
-group of work-items.  Synchronization between work-items in a single
-group is achieved using a <<group-barrier>>.
+==== Work-item Coordination
 
-All the work-items of a group must execute the barrier before any are
-allowed to continue execution beyond the barrier. Note that the
-group barrier must be encountered by all work-items of a
-group executing the kernel or by none at all. In SYCL,
-<<work-group-barrier>> and <<sub-group-barrier>> functionality is
+A <<group-barrier>> provides a mechanism to coordinate all work-items in the
+same group. All work-items in a group must execute the barrier before any are
+allowed to continue execution beyond the barrier. Note that the group barrier
+must be encountered by all work-items of a group executing the kernel or by
+none at all. <<work-group-barrier>> and <<sub-group-barrier>> functionality is
 exposed via the [code]#group_barrier# function.
 
-Synchronization between work-items in different work-groups via atomic
-operations is possible only on SYCL devices with certain capabilities,
-as described in <<sec:memoryconsistency>>.
+Coordination between work-items in different work-groups must take place via
+atomic operations, and is possible only on SYCL device with certain
+capabilities, as described in <<sec:memoryconsistency>>.
 
 === Error handling
 
@@ -1501,7 +1503,7 @@ provided, which uses move semantics for initializing and using the
 associated host memory. In this case, the behavior of the buffer in
 relation to the user application will be non-blocking on destruction.
 
-As said in <<sec:synchronization>>, the only blocking
+As said in <<sec:coordination>>, the only blocking
 operations in SYCL (apart from explicit wait operations) are:
 
   * host accessor constructor, which waits for any kernels enqueued before
@@ -1624,19 +1626,18 @@ the appropriate image format specified such as four 8-bit
 elements, for example.
 --
 
-Users may want fine-grained control of the synchronization, memory management
-and storage semantics of SYCL image or buffer objects. For example, a user may
-wish to specify the host memory for a memory object to use, but may not want the
-memory object to block on destruction.
+Users may want fine-grained control of the memory management and storage
+semantics of SYCL image or buffer objects. For example, a user may wish to
+specify the host memory for a memory object to use, but may not want the memory
+object to block on destruction.
 
 Depending on the control and the use cases of the SYCL applications,
 well established {cpp} classes and patterns can be used for reference counting and
 sharing data between user applications and the <<sycl-runtime>>. For control over
 memory allocation on the host and mapping between host and device memory,
 pre-defined or user-defined {cpp} [code]#std::allocator# classes are
-used. For better control of synchronization between a SYCL and a non SYCL
-application that share data, [code]#std::shared_ptr# and
-[code]#std::mutex# classes are used.
+used. To avoid data races when sharing data between SYCL and non-SYCL
+applications, [code]#std::shared_ptr# and [code]#std::mutex# classes are used.
 
 
 == Multi-dimensional objects and linearization

diff --git a/adoc/chapters/glossary.adoc b/adoc/chapters/glossary.adoc
@@ -49,9 +49,9 @@
     <<subsec:exception.class>>.
 
 [[barrier]]barrier::
-    A barrier is either a <<queue-barrier>>, or a kernel execution
-    <<group-barrier>> depending on whether it is a synchronization point on
-    the command queue or on a group of work-items in a kernel execution.
+    A barrier may refer to either a <<queue-barrier>> used for host-device
+    coordination, or a <<group-barrier>> used to coordinate work-items in
+    a kernel.
 
 [[blocking-accessor]]blocking accessor::
     A blocking accessor is an <<accessor>> which provides immediate access
@@ -112,13 +112,9 @@ object. For the full description please refer to <<subsec:buffers>>.
     details see <<sec:command.group.scope>>.
 
 [[queue-barrier]]command queue barrier::
-    The SYCL API provides two variants for functions that force
-    synchronization on a SYCL command queue. The
-    [code]#sycl::queue::wait()# and
-    [code]#sycl::queue::wait_and_throw()# functions force the SYCL
-    command queue to wait for the execution of the
-    <<command-group-function-object>> before it is able to continue
-    executing.
+    The [code]#sycl::queue::wait()# and [code]#sycl::queue::wait_and_throw()#
+    functions block the calling thread until the execution of a
+    <<command-group-function-object>> completes.
 
 [[constant-memory]]constant memory::
     A region of memory that remains constant during the execution of
@@ -212,14 +208,8 @@ object. For the full description please refer to <<subsec:buffers>>.
     such as a <<work-group>> or <<sub-group>>.
 
 [[group-barrier]]group barrier::
-    A synchronization function within a group of <<work-item,work-items>>. All the
-    <<work-item,work-items>> of a group must execute the barrier construct before any
-    <<work-item>> continues execution beyond the barrier. Additionally all work-items
-    in the group execute a release <<mem-fence>> prior to synchronizing at the
-    barrier, all work-items in the group execute an acquire <<mem-fence>> after
-    synchronizing at the barrier, and there is an implicit synchronization between
-    these acquire and release fences as if through an atomic operation on an
-    atomic object internal to the barrier implementation.
+    A coordination mechanism for all <<work-item,work-items>> of a group.
+    See the definition of the [code]#group_barrier# function.
 
 [[h-item]]h-item::
     A unique identifier representing a single <<work-item>> within the
@@ -368,13 +358,8 @@ object. For the full description please refer to <<subsec:buffers>>.
 
 [[mem-fence]]mem-fence::
     A memory fence provides control over re-ordering of memory load
-    and store operations when coupled with an atomic operation that
-    synchronizes two fences with each other (or when the fences are part of
-    a <<group-barrier>> in which case there is implicit synchronization
-    as if an atomic operation has synchronized the fences). The
-    [code]#sycl::atomic_fence# function acts as a fence across all
-    work-items and devices specified by a [code]#memory_scope#
-    argument.
+    and store operations when coupled with an atomic operation.
+    See the definition of the [code]#sycl::atomic_fence# function.
 
 [[object]]object::
     A state which a <<kernel-bundle>> can be in, representing

diff --git a/adoc/chapters/introduction.adoc b/adoc/chapters/introduction.adoc
@@ -25,7 +25,7 @@ features:
   * execution of parallel kernels on a heterogeneous device is made
     simultaneously convenient and flexible. Common parallel patterns are
     prioritized with simple syntax, which through a series {cpp} types allow
-    the programmer to express additional requirements, such as synchronization,
+    the programmer to express additional requirements, such as dependencies,
     if needed;
   * when using buffers and accessors, data access in SYCL is separated from
     data storage. By relying on the {cpp}-style resource acquisition is
@@ -43,9 +43,8 @@ features:
   * the hierarchical parallelism syntax offers a way of expressing
     data parallelism similar to the OpenCL device or OpenMP target
     device execution model in an easy-to-understand modern {cpp} form.  It
-    more cleanly layers parallel loops and synchronization points to
-    avoid fragmentation of code and to more efficiently map to CPU-style
-    architectures.
+    more cleanly layers parallel loops to avoid fragmentation of code and to
+    more efficiently map to CPU-style architectures.
 
 SYCL retains the execution model, runtime feature set and device
 capabilities inspired by the OpenCL standard. This standard imposes