diff --git a/adoc/chapters/architecture.adoc b/adoc/chapters/architecture.adoc index 193ed972..12b80e82 100644 --- a/adoc/chapters/architecture.adoc +++ b/adoc/chapters/architecture.adoc @@ -461,7 +461,7 @@ each other without side effects (i.e., given two requirements _r~j~_ and _r~k~_, _(r~j~, r~k~)_ {equiv} _(r~k~, r~j~)_). The intersection of two actions is not necessarily empty. *Actions* can include (but are not limited to): memory copy operations, -mapping operations, host side synchronization, or implementation-specific +memory mapping operations, coordination with the host, or implementation-specific behavior. Finally, _Performing an action_ ({SYCLperform}(_a~i~_)) executes the @@ -562,10 +562,10 @@ image::{images}/three-cg-three-queue.svg[align="center",opts="{imageopts}"] ==== Controlling execution order with events Submitting an action for execution returns an [code]#event# object. Programmers -may use these events to explicitly synchronize programs. Host code can wait for an -event to complete, which will block execution on the host until the action represented -by the event has completed. The [code]#event# class is described in greater detail -in <>. +may use these events to explicitly coordinate host and device execution. Host +code can wait for an event to complete, which will block execution on the host +until the action(s) represented by the event have completed. The [code]#event# +class is described in greater detail in <>. Events may also be used to explicitly order the execution of kernels. Host code may wait for the completion of specific event, which blocks execution on the host until @@ -627,7 +627,7 @@ Each work-item in the ND-range is identified by a value of type [code]#nd_item#. The type [code]#nd_item# encapsulates a global id, local id and work-group id, all of type [code]#id# (the iteration space offset also of type [code]#id#, but this is deprecated in SYCL 2020), as well as -global and local ranges and synchronization operations necessary to +global and local ranges and coordination mechanisms necessary to make work-groups useful. Work-groups are assigned ids using a similar approach to that used for work-item global ids. Work-items are assigned to a work-group and given a local id with components in the @@ -872,7 +872,7 @@ execute concurrently with the barrier. Finally, _CG~c~_ will be enqueued after _H(b1~RW~)_ is finished, but still has to wait for _CG~b~_ to conclude for all its requirements to be satisfied. -See <> for details on synchronization rules. +See <> for details on host-device coordination. // Formerly in host_acc.tex // Uses same definitions for cga, cgb, cgc, and hostA in code and picture, @@ -1011,11 +1011,11 @@ made consistent across the work-items in that set through the use of <> and atomic operations. Memory consistency between the host and SYCL device(s), or different SYCL -devices in the same context, can be guaranteed through synchronization in -the host application as defined in <>. On SYCL devices -supporting concurrent atomic accesses to USM allocations and acquire-release or sequentially -consistent memory orderings, cross-device memory consistency can be -enforced through the use of <> and atomic operations. +devices in the same context, can be guaranteed through library calls in the +host application, as defined in <>. 
On SYCL devices
+supporting concurrent atomic accesses to USM allocations and acquire-release or
+sequentially consistent memory orderings, cross-device memory consistency can
+be enforced through the use of <> and atomic operations.

[[sec:memory-ordering]]
==== Memory ordering
@@ -1207,10 +1207,10 @@ sections of this specification.

=== Basic data parallel kernels

Data-parallel <> that execute as
-multiple <> and where no local synchronization is required are enqueued
-with the [code]#sycl::parallel_for# function parameterized by a
-[code]#sycl::range# parameter. These kernels will execute the kernel
-function body once for each work-item in the specified <>.
+multiple <> and where no work-group-local coordination is
+required are enqueued with the [code]#sycl::parallel_for# function
+parameterized by a [code]#sycl::range# parameter. These kernels will execute
+the kernel function body once for each work-item in the specified <>.

Functionality tied to <> of work-items, including
<> and <>, must not be used
@@ -1228,10 +1228,10 @@ work-group size as parameters to the [code]#sycl::parallel_for# function
with a [code]#sycl::nd_range# parameter. In this mode of execution,
kernels execute over the <> in work-groups of the
specified size. It is possible to share data among work-items within the same
-work-group in <> or <> and to synchronize between
-work-items in the same work-group by calling the
-[code]#group_barrier# function. All work-groups in a given
-[code]#parallel_for# will be the same size, and the global size
+work-group in <> or <>, and the
+[code]#group_barrier# function can be used to block a work-item until all
+work-items in the same work-group arrive at the barrier. All work-groups in a
+given [code]#parallel_for# will be the same size, and the global size
defined in the nd-range must either be a multiple of the work-group size in
each dimension, or the global size must be zero. When the global size is
zero, the kernel function is not executed, the local size is ignored, and
@@ -1246,8 +1246,8 @@ work-items in any sub-group in a kernel is based on a combination of the kernel
and its dispatch dimensions. The size of any sub-group in the dispatch is
between 1 and this maximum sub-group size, and the size of an individual
sub-group is invariant for the duration of a kernel's execution. Similarly
-to work-groups, the work-items within the same sub-group can be synchronized
-by calling the [code]#group_barrier# function.
+to work-groups, the [code]#group_barrier# function can be used to block a
+work-item until all work-items in the same sub-group arrive at the barrier.

Portable device code must not assume that work-items within a sub-group execute
in any particular order, that work-groups are subdivided into sub-groups in a
@@ -1319,55 +1319,61 @@ These kernels are not programmable, hence they are not bound by the SYCL
implementation-defined.

-[[sec:synchronization]]
-=== Synchronization
+[[sec:coordination]]
+=== Coordination and synchronization

-Synchronization of processing elements executing inside a device is handled
-by the SYCL device kernel following the SYCL kernel execution model.
-The synchronization of the different SYCL device kernels executing with
-the host memory is handled by the SYCL application via the SYCL runtime.
+Coordination between the host and any devices can be expressed in the host SYCL
+application using calls into the SYCL runtime. Coordination between work-items
+executing inside device code can be expressed using group barriers.
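+
+As a non-normative illustration, both forms of coordination might be expressed
+as follows ([code]#myQueue#, the buffer and the kernel body are placeholders,
+not part of the interface being specified):
+
+[source,,linenums]
+----
+#include <sycl/sycl.hpp>
+
+int main() {
+  sycl::queue myQueue;
+  sycl::buffer<int, 1> buf{sycl::range<1>{128}};
+
+  myQueue.submit([&](sycl::handler& cgh) {
+    sycl::accessor acc{buf, cgh, sycl::read_write};
+    cgh.parallel_for(sycl::nd_range<1>{128, 32}, [=](sycl::nd_item<1> it) {
+      acc[it.get_global_id()] = static_cast<int>(it.get_local_id(0));
+      // Coordination inside device code: no work-item in the work-group
+      // continues past this point until all of them have arrived.
+      sycl::group_barrier(it.get_group());
+      // ... code that may read values written by other work-items in the
+      // same work-group ...
+    });
+  });
+
+  // Coordination with the host: block the calling thread until all command
+  // groups submitted to myQueue have completed.
+  myQueue.wait();
+}
+----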
-==== Synchronization in the SYCL application
+Some function calls synchronize with other function calls performed by another
+thread (potentially on another device). Other functions are defined in terms
+of the synchronization operations that they perform. Such functions can be
+used to ensure that the host and any devices do not access data concurrently,
+and/or to reason about the ordering of operations across the host and any
+devices.

-Synchronization points between host and device(s) are exposed through
-the following operations:
+==== Host-device coordination
+
+The following operations can be used to coordinate host and device(s):

 * _Buffer destruction_: The destructors for [code]#sycl::buffer#,
   [code]#sycl::unsampled_image# and
-  [code]#sycl::sampled_image# objects wait for all submitted work on
-  those objects to complete and to copy the data back to host memory
-  before returning. These destructors only wait if the object was
-  constructed with attached host memory and if data needs to be copied
-  back to the host.
+  [code]#sycl::sampled_image# objects block until all submitted work on
+  those objects completes, then copy the data back to host memory before
+  returning. These destructors only block if the object was constructed with
+  attached host memory and if data needs to be copied back to the host.
+
--
-More complex forms of synchronization on buffer destruction
-can be specified by the user by constructing buffers with other kinds of
-references to memory, such as [code]#shared_ptr# and [code]#unique_ptr#.
+More complex forms of buffer destruction can be specified by the user by
+constructing buffers with other kinds of references to memory, such as
+[code]#shared_ptr# and [code]#unique_ptr#.
--
- * _Host Accessors_: The constructor for a host accessor waits for all
-   kernels that modify the same buffer (or image) in any queues to complete
+ * _Host Accessors_: The constructor for a host accessor blocks until all
+   kernels that modify the same buffer (or image) in any queue complete
   and then copies data back to host memory before the constructor returns.
   Any command groups with requirements to the same memory object cannot
   execute until the host accessor is destroyed as shown on
   <>.
 * _Command group enqueue_: The <> internally ensures that any
   command groups added to queues have the correct event dependencies added to
   those queues to ensure correct operation. Adding
-  command groups to queues never blocks. Instead any required
-  synchronization is added to the queue and events of type
-  [code]#sycl::event# are returned by the queue's submit function
-  that contain event information related to the specific command group.
+  command groups to queues never blocks, and the [code]#sycl::event# returned
+  by the queue's submit function contains event information related to the
+  specific command group.
 * _Queue operations_: The user can manually use queue operations, such as
   [code]#sycl::queue::wait()# to block execution of the calling thread until
   all the command groups submitted to the queue have finished execution. Note
   that this will also affect the dependencies of those command groups in
   other queues.
 * _SYCL event objects_: SYCL provides [code]#sycl::event#
-   objects which can be used for synchronization. If synchronization is
-   required across SYCL contexts from different <>, then the
-   <> ensures that extra host-based synchronization is
-   added to enable the SYCL event objects to operate between contexts
-   correctly.
+   objects which can be used to track and specify dependencies. The
+   <> must ensure that these objects can be used to enforce
+   dependencies that span SYCL contexts from different <>.
+
+The specification for each of these blocking functions defines a set of
+operations that cause the function to unblock. These operations always happen
+before the blocking function returns (using the definition of "happens before"
+from the {cpp} specification).
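+
+As a non-normative sketch (the queue, the allocation and the kernels are
+illustrative, and a device supporting shared USM allocations is assumed),
+events might be used as follows:
+
+[source,,linenums]
+----
+#include <sycl/sycl.hpp>
+
+int main() {
+  sycl::queue q;
+  int* data = sycl::malloc_shared<int>(1024, q);
+
+  // The event returned by submit tracks the status of the command group.
+  sycl::event e1 = q.submit([&](sycl::handler& cgh) {
+    cgh.parallel_for(sycl::range<1>{1024},
+                     [=](sycl::id<1> i) { data[i] = static_cast<int>(i[0]); });
+  });
+
+  // Specify a dependency: the second command group may not begin executing
+  // until e1 has completed.
+  sycl::event e2 = q.submit([&](sycl::handler& cgh) {
+    cgh.depends_on(e1);
+    cgh.parallel_for(sycl::range<1>{1024},
+                     [=](sycl::id<1> i) { data[i] *= 2; });
+  });
+
+  // Block the calling thread until the work tracked by e2 has completed;
+  // that completion happens before wait() returns.
+  e2.wait();
+
+  sycl::free(data, q);
+}
+----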
Note that the destructors of other SYCL
objects ([code]#sycl::queue#,
[code]#sycl::context#,{ldots}) do
@@ -1380,22 +1386,18 @@ cases if this does not fit the goal.
See <> for more information
on object life time.

-==== Synchronization in SYCL kernels
-
-In SYCL, synchronization can be either global or local within a
-group of work-items. Synchronization between work-items in a single
-group is achieved using a <>.
+==== Work-item coordination

-All the work-items of a group must execute the barrier before any are
-allowed to continue execution beyond the barrier. Note that the
-group barrier must be encountered by all work-items of a
-group executing the kernel or by none at all. In SYCL,
-<> and <> functionality is
+A <> provides a mechanism to coordinate all work-items in the
+same group. All work-items in a group must execute the barrier before any are
+allowed to continue execution beyond the barrier. Note that the group barrier
+must be encountered by all work-items of a group executing the kernel or by
+none at all. <> and <> functionality is
exposed via the [code]#group_barrier# function.

-Synchronization between work-items in different work-groups via atomic
-operations is possible only on SYCL devices with certain capabilities,
-as described in <>.
+Coordination between work-items in different work-groups must take place via
+atomic operations, and is possible only on SYCL devices with certain
+capabilities, as described in <>.

=== Error handling

@@ -1501,7 +1503,7 @@ provided, which uses move semantics for initializing and using the
associated host memory. In this case, the behavior of the buffer in
relation to the user application will be non-blocking on destruction.

-As said in <>, the only blocking
+As said in <>, the only blocking
operations in SYCL (apart from explicit wait operations) are:

* host accessor constructor, which waits for any kernels enqueued before
@@ -1624,19 +1626,18 @@ the appropriate image format specified such as four 8-bit elements, for
example.
--

-Users may want fine-grained control of the synchronization, memory management
-and storage semantics of SYCL image or buffer objects. For example, a user may
-wish to specify the host memory for a memory object to use, but may not want the
-memory object to block on destruction.
+Users may want fine-grained control of the memory management and storage
+semantics of SYCL image or buffer objects. For example, a user may wish to
+specify the host memory for a memory object to use, but may not want the memory
+object to block on destruction.

Depending on the control and the use cases of the SYCL applications,
well established {cpp} classes and patterns can be used for reference
counting and sharing data between user applications and the
<>. For control over memory allocation on the host and
mapping between host and device memory, pre-defined or user-defined
{cpp} [code]#std::allocator# classes are
-used. For better control of synchronization between a SYCL and a non SYCL
-application that share data, [code]#std::shared_ptr# and
-[code]#std::mutex# classes are used.
+used.
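+
+As a non-normative sketch, a user-defined allocator might be written as
+follows (the name [code]#my_host_allocator# is a placeholder, and only the
+members required by the {cpp} Allocator requirements are shown):
+
+[source,,linenums]
+----
+#include <sycl/sycl.hpp>
+#include <cstddef>
+
+// A minimal C++ allocator; host allocations made on behalf of the buffer go
+// through allocate() and deallocate().
+template <typename T>
+struct my_host_allocator {
+  using value_type = T;
+  my_host_allocator() = default;
+  template <typename U>
+  my_host_allocator(const my_host_allocator<U>&) {}
+
+  T* allocate(std::size_t n) {
+    return static_cast<T*>(::operator new(n * sizeof(T)));
+  }
+  void deallocate(T* p, std::size_t) { ::operator delete(p); }
+};
+
+template <typename T, typename U>
+bool operator==(const my_host_allocator<T>&, const my_host_allocator<U>&) {
+  return true;
+}
+template <typename T, typename U>
+bool operator!=(const my_host_allocator<T>&, const my_host_allocator<U>&) {
+  return false;
+}
+
+int main() {
+  // Host allocations for this buffer are made through the user's allocator.
+  sycl::buffer<int, 1, my_host_allocator<int>> buf{sycl::range<1>{1024}};
+}
+----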
To avoid data races when sharing data between SYCL and non-SYCL +applications, [code]#std::shared_ptr# and [code]#std::mutex# classes are used. == Multi-dimensional objects and linearization diff --git a/adoc/chapters/glossary.adoc b/adoc/chapters/glossary.adoc index 658ce23b..a83050a9 100644 --- a/adoc/chapters/glossary.adoc +++ b/adoc/chapters/glossary.adoc @@ -49,9 +49,9 @@ <>. [[barrier]]barrier:: - A barrier is either a <>, or a kernel execution - <> depending on whether it is a synchronization point on - the command queue or on a group of work-items in a kernel execution. + A barrier may refer to either a <> used for host-device + coordination, or a <> used to coordinate work-items in + a kernel. [[blocking-accessor]]blocking accessor:: A blocking accessor is an <> which provides immediate access @@ -112,13 +112,9 @@ object. For the full description please refer to <>. details see <>. [[queue-barrier]]command queue barrier:: - The SYCL API provides two variants for functions that force - synchronization on a SYCL command queue. The - [code]#sycl::queue::wait()# and - [code]#sycl::queue::wait_and_throw()# functions force the SYCL - command queue to wait for the execution of the - <> before it is able to continue - executing. + The [code]#sycl::queue::wait()# and [code]#sycl::queue::wait_and_throw()# + functions block the calling thread until the execution of a + <> completes. [[constant-memory]]constant memory:: A region of memory that remains constant during the execution of @@ -212,14 +208,8 @@ object. For the full description please refer to <>. such as a <> or <>. [[group-barrier]]group barrier:: - A synchronization function within a group of <>. All the - <> of a group must execute the barrier construct before any - <> continues execution beyond the barrier. Additionally all work-items - in the group execute a release <> prior to synchronizing at the - barrier, all work-items in the group execute an acquire <> after - synchronizing at the barrier, and there is an implicit synchronization between - these acquire and release fences as if through an atomic operation on an - atomic object internal to the barrier implementation. + A coordination mechanism for all <> of a group. + See the definition of the [code]#group_barrier# function. [[h-item]]h-item:: A unique identifier representing a single <> within the @@ -368,13 +358,8 @@ object. For the full description please refer to <>. [[mem-fence]]mem-fence:: A memory fence provides control over re-ordering of memory load - and store operations when coupled with an atomic operation that - synchronizes two fences with each other (or when the fences are part of - a <> in which case there is implicit synchronization - as if an atomic operation has synchronized the fences). The - [code]#sycl::atomic_fence# function acts as a fence across all - work-items and devices specified by a [code]#memory_scope# - argument. + and store operations when coupled with an atomic operation. + See the definition of the [code]#sycl::atomic_fence# function. [[object]]object:: A state which a <> can be in, representing diff --git a/adoc/chapters/introduction.adoc b/adoc/chapters/introduction.adoc index 35a36d41..f75fb1b1 100644 --- a/adoc/chapters/introduction.adoc +++ b/adoc/chapters/introduction.adoc @@ -25,7 +25,7 @@ features: * execution of parallel kernels on a heterogeneous device is made simultaneously convenient and flexible. 
Common parallel patterns are prioritized with simple syntax, which through a series {cpp} types allow - the programmer to express additional requirements, such as synchronization, + the programmer to express additional requirements, such as dependencies, if needed; * when using buffers and accessors, data access in SYCL is separated from data storage. By relying on the {cpp}-style resource acquisition is @@ -43,9 +43,8 @@ features: * the hierarchical parallelism syntax offers a way of expressing data parallelism similar to the OpenCL device or OpenMP target device execution model in an easy-to-understand modern {cpp} form. It - more cleanly layers parallel loops and synchronization points to - avoid fragmentation of code and to more efficiently map to CPU-style - architectures. + more cleanly layers parallel loops to avoid fragmentation of code and to + more efficiently map to CPU-style architectures. SYCL retains the execution model, runtime feature set and device capabilities inspired by the OpenCL standard. This standard imposes diff --git a/adoc/chapters/programming_interface.adoc b/adoc/chapters/programming_interface.adoc index 5792aa17..e0e9c20c 100644 --- a/adoc/chapters/programming_interface.adoc +++ b/adoc/chapters/programming_interface.adoc @@ -3823,7 +3823,7 @@ of the execution, or want to retrieve properties of a command that is running. Note that, although an event represents the status of a particular operation, the dependencies of a certain event can be used to keep track of multiple steps -required to synchronize said operation. +required to block on the results of said operation. A SYCL event is returned by the submission of a <>. The dependencies of the event returned via the submission of the command group @@ -4199,9 +4199,9 @@ new memory positions every call. The default allocator for const buffers will remove the const-ness of the type (therefore, the default allocator for a buffer of type [code]#const int# will be an [code]#Allocator)#. -This implies that host <> will not synchronize with the pointer given -by the user in the buffer/image constructor, but will use the memory -returned by the [code]#Allocator# itself for that purpose. +This implies that host <> will not share memory with the +pointer given by the user in the buffer/image constructor, but will use the +memory returned by the [code]#Allocator# itself for that purpose. The user can implement an allocator that returns the same address as the one passed in the buffer constructor, but it is the responsibility of the user to handle the potential race conditions. @@ -4233,7 +4233,8 @@ image_allocator |==== -See <> for details on manual host-device synchronization. +See <> for details of using manual synchronization to avoid +data races between host and device. [[subsec:buffers]] @@ -4948,8 +4949,9 @@ property::buffer::use_mutex [code]#buffer# can be shared with the application via a [code]#std::mutex# provided to the property. The mutex [code]#m# is locked by the runtime whenever the data is in - use and unlocked otherwise. Data is synchronized with - [code]#hostData#, when the mutex is unlocked by the runtime. + use and unlocked otherwise. The contents of [code]#hostData# are + guaranteed to reflect the contents of the buffer when the + [code]#std::mutex# is unlocked by the runtime. a@ [source] @@ -5023,7 +5025,7 @@ context property::buffer::context_bound::get_context() const [[sec:buf-sync-rules]] -==== Buffer synchronization rules +==== Buffer destruction rules Buffers are reference-counted. 
When a buffer value is constructed from another buffer, the two values reference the same buffer and a @@ -5961,8 +5963,9 @@ property::image::use_mutex application via a [code]#std::mutex# provided to the property. The [code]#std::mutex# is locked by the runtime whenever the data is in use and unlocked - otherwise. Data is synchronized with [code]#hostData#, when - the [code]#std::mutex# is unlocked by the runtime. + otherwise. The contents of [code]#hostData# are + guaranteed to reflect the contents of the image when the + [code]#std::mutex# is unlocked by the runtime. a@ [source] @@ -6035,7 +6038,7 @@ context property::image::context_bound::get_context() const [[sec:image-sync-rules]] -==== Image synchronization rules +==== Image destruction rules The rules are similar to those described in <>. @@ -6054,7 +6057,7 @@ will be copied back, if necessary, to the associated host memory. Any errors occurring during destruction are reported to any associated context's asynchronous error handler. If an image object is constructed with a storage object, then the storage object defines what -synchronization or copying behavior occurs on image object destruction. +blocking or copying behavior occurs on image object destruction. [[sec:sharing-host-memory-with-dm]] @@ -6078,8 +6081,8 @@ of the class is, by default, passed to <>, and that pointer cannot on the host side until the buffer or image is destroyed. A SYCL application can access the contents of the memory managed by a SYCL buffer by using a [code]#host_accessor# as defined in <>. -However, there is no guarantee that the host accessor synchronizes with the -original host address used in its constructor. +However, there is no guarantee that the host accessor will copy data back to +the original host address used in its constructor. The pointer passed in is the one used to copy data back to the host, if needed, before buffer destruction. The memory pointed by <> @@ -6253,7 +6256,7 @@ There are several different {cpp} classes that implement accessors: Accessor objects must always be constructed in host code, either in <> or in <>. Whether the constructor -blocks waiting for data to synchronize depends on the type of accessor. Those +blocks until data is available depends on the type of accessor. Those accessors which provide access to data within a <> do not block. Instead, these accessors define a requirement which influences the scheduling of the <>. Those accessors which provide access to data from host @@ -13836,7 +13839,7 @@ include::{code_dir}/parallelfor.cpp[lines=4..-1] The last case of a [code]#parallel_for# invocation enables low-level functionality of work-items and work-groups. This becomes valuable when an execution -requires groups of work-items that communicate and synchronize. These are +requires groups of work-items to coordinate with one another. These are exposed in SYCL through [code]#+parallel_for (nd_range,...)+# and the [code]#nd_item# class. In this case, the developer needs to define the [code]#nd_range# that the kernel will execute on in order to have fine @@ -13852,8 +13855,7 @@ from SYCL [code]#nd_item#, representing the currently executing work-item within the range specified by the [code]#nd_range# parameter. The [code]#nd_item# parameter makes all information about the work-item and its position in the range available, and provides access to functions -enabling the use of a <> to synchronize between the -<> in the <>. +enabling the use of a <>. 
[NOTE] ==== @@ -13863,8 +13865,8 @@ Case 3) above includes user-defined types that can be constructed from The following example shows how sixty-four work-items may be launched in a three-dimensional grid with four in each dimension, and divided -into eight work-groups. Each group of work-items synchronizes with a -<>. +into eight work-groups. Each group of work-items uses a +<> for coordination. [source,,linenums] ---- @@ -13988,11 +13990,12 @@ understanding of which instances of a underlying variable. Also within the lambda body can be a sequence of calls to -[code]#parallel_for_work_item#. At the edges of these inner parallel -executions the work-group synchronizes. As a result the pair of -[code]#parallel_for_work_item# calls in the code below is equivalent to -the parallel execution with a <> in the earlier -example. +[code]#parallel_for_work_item#. No work-item can begin executing a +[code]#parallel_for_work_item# until all work-items in the group have +completed executing the previous [code]#parallel_for_work_item#. +As a result the pair of [code]#parallel_for_work_item# calls in the code below +is equivalent to the parallel execution with a <> in the +earlier example. [source,,linenums] ---- @@ -14080,8 +14083,8 @@ of 10 elements of [code]#int# type, the host pointer must at least have A special case is the [code]#update_host# member function. This member function only requires an accessor, and instructs the runtime to update the internal copy of the data in the host, if any. This is particularly -useful when users use manual synchronization with host pointers, e.g. -via mutex objects on the [code]#buffer# constructors. +useful when used in conjunction with the [code]#buffer# constructor overloads +which accept mutex objects. <> describes the interface for the explicit copy operations. @@ -18227,22 +18230,21 @@ stored in [code]#std::array#. The available features are: * Accessor classes: Accessor classes specify acquisition and release of - buffer and image data structures to provide points at which underlying - queue synchronization primitives must be generated. + buffer and image data structures to provide points at which a SYCL runtime + must guarantee memory consistency. * Atomic operations: SYCL devices support a restricted subset of {cpp} atomics and SYCL uses the library syntax from the next {cpp} specification to make this available. * Fences: Fence primitives are made available to order loads and stores. They are exposed through the [code]#atomic_fence# function. Fences can have acquire semantics, release semantics or both. - * Barriers: Barrier primitives are made available to synchronize sets of - work-items within individual <>. They are exposed through the - [code]#group_barrier# function. + * Barriers: Barrier primitives are made available as a coordination mechanism + for work-items within individual <>. They are exposed through + the [code]#group_barrier# function. * Hierarchical parallel dispatch: In the hierarchical parallelism model of - describing computations, synchronization within the work-group is made - explicit through multiple instances of the - [code]#parallel_for_work_item# function call, rather than through - the use of explicit <> operations. + describing computations, work-items within a work-group may coordinate + via multiple instances of the [code]#parallel_for_work_item# function call, + rather than through the use of explicit <> operations. 
* Device event: they are used inside SYCL kernel functions to wait for asynchronous operations within a SYCL kernel function to complete. @@ -18273,7 +18275,7 @@ the [code]#order# parameter: and release fence A <> acts as both an acquire fence and a release fence: all -work-items in the group execute a release fence prior to synchronizing at +work-items in the group execute a release fence prior to signaling arrival at the barrier, and all work-items in the group execute an acquire fence afterwards. A <> provides implicit atomic synchronization as if through an internal atomic object, such that the acquire and release fences @@ -19433,9 +19435,9 @@ a@ ---- flush ---- - a@ Triggers a flush operation, which synchronizes the work-item stream buffer - with the global stream buffer, and then empties the work-item stream - buffer. After a flush, the full [code]#workItemBufferSize# is + a@ Triggers a flush operation, which copies the contents of the work-item + stream buffer to the global stream buffer, and then empties the work-item + stream buffer. After a flush, the full [code]#workItemBufferSize# is available again for subsequent streaming within the work-item. a@ @@ -19614,15 +19616,15 @@ template const stream& operator<<(const stream& os, const T& rhs) -=== Synchronization +=== Output -An instance of the SYCL [code]#stream# class is required to synchronize with the host, and must output -everything that is streamed to it via the [code]#operator<<()# operator before a flush operation (that -doesn't exceed the [code]#workItemBufferSize# or [code]#totalBufferSize# limits) within a SYCL -kernel function by the time that the event associated with a command group submission enters the completed -state. The point at which this synchronization occurs and the member function by which this synchronization is -performed are implementation-defined. For example it is valid for an implementation to use -[code]#printf()#. +An instance of the SYCL [code]#stream# class is required to output everything +that is streamed to it via the [code]#operator<<()# operator before a flush +operation (that doesn't exceed the [code]#workItemBufferSize# or +[code]#totalBufferSize# limits) within a SYCL kernel function by the time that +the event associated with a command group submission enters the completed +state. The point at which the flush operation is performed is +implementation-defined. The SYCL [code]#stream# class is required to output the content of each stream, between flushes (up to [code]#workItemBufferSize)#, without mixing with content from the same stream in other work-items. @@ -19817,7 +19819,7 @@ T operator()(const T& x, const T& y) const SYCL provides a number of functions that expose functionality tied to groups of work-items (such as <> and collective operations). -These group functions act as synchronization points and must be encountered in +These group functions act as _synchronization points_ and must be encountered in converged <> by all work-items in the group. If one work-item in a group calls a group function, then all work-items in that group must call exactly the same function under the same set of conditions --- calling the same @@ -19910,8 +19912,8 @@ within group [code]#g#. ==== [code]#group_barrier# -The [code]#group_barrier# function synchronizes all work-items in a group, -using a <>. +The [code]#group_barrier# function is a coordination mechanism for all +work-items in a group. 
[source,,linenums] ---- @@ -19922,15 +19924,14 @@ include::{header_dir}/groups/barrier.h[lines=4..-1] [code]#sycl::is_group_v># is true. + -- -_Effects:_ Synchronizes all work-items in group [code]#g#. The current -work-item will wait at the barrier until all work-items in group [code]#g# have -reached the barrier. In addition, the barrier performs <> operations ensuring that -memory accesses issued before the barrier are not re-ordered with those issued -after the barrier: all work-items in group [code]#g# execute a release fence -prior to synchronizing at the barrier, all work-items in group [code]#g# -execute an acquire fence afterwards, and there is an implicit synchronization -of these fences as if provided by an explicit atomic operation on an atomic -object. +_Effects:_ The current work-item will wait at the barrier until all work-items +in group [code]#g# have reached the barrier. In addition, the barrier performs +<> operations ensuring that memory accesses issued before the +barrier are not re-ordered with those issued after the barrier: all work-items +in group [code]#g# execute a release fence prior to arriving at the +barrier, all work-items in group [code]#g# execute an acquire fence afterwards, +and there is an implicit synchronization of these fences as if provided by an +explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in group [code]#g# (as reported by diff --git a/adoc/chapters/what_changed.adoc b/adoc/chapters/what_changed.adoc index 090d23f9..a1954808 100644 --- a/adoc/chapters/what_changed.adoc +++ b/adoc/chapters/what_changed.adoc @@ -252,8 +252,9 @@ member functions were updated to encapsulate all functionality tied to The [code]#barrier# and [code]#mem_fence# member functions of the [code]#nd_item# class have been removed. The [code]#barrier# member function has been replaced by the [code]#group_barrier()# function, which -can be used to synchronize either <> or <>. The -[code]#mem_fence# member function has been replaced by the +can be used to block work-items in either <> or +<> until all work-items in the group arrive at the +barrier. The [code]#mem_fence# member function has been replaced by the [code]#atomic_fence# function, which is more closely aligned with [code]#std::atomic_thread_fence# and offers control over memory ordering and scope. diff --git a/adoc/code/parallelforbarrier.cpp b/adoc/code/parallelforbarrier.cpp index 67df9f0b..ce51a332 100644 --- a/adoc/code/parallelforbarrier.cpp +++ b/adoc/code/parallelforbarrier.cpp @@ -5,7 +5,6 @@ myQueue.submit([&](handler& cgh) { cgh.parallel_for(nd_range<3>(range<3>(4, 4, 4), range<3>(2, 2, 2)), [=](nd_item<3> item) { //[kernel code] - // Internal synchronization group_barrier(item.get_group()); //[kernel code] });