Allow n-body without Vc and update documentation #273

Merged · 2 commits · Jun 1, 2021
2 changes: 1 addition & 1 deletion README.md
@@ -13,7 +13,7 @@ arbitrary depths. It is not limited to struct of array and array of struct
data layouts but also capable of explicitly defining memory layouts with padding, blocking,
striding or any other run time or compile time access pattern.

To archieve this goal LLAMA is split into mostly independent, orthogonal parts
To achieve this goal LLAMA is split into mostly independent, orthogonal parts
completely written in modern C++17 to run on as many architectures and with as
many compilers as possible, while still supporting extensions needed e.g. to run
on GPUs or other many-core hardware.
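
The core idea, decoupling what is accessed from where it is stored, can be sketched in plain C++ (a hypothetical illustration, not LLAMA's actual API): AoS and SoA layouts differ only in how a (record index, field index) pair is mapped to a byte offset.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record with three float fields (e.g. x, y, z of a particle).
// AoS keeps the fields of one record adjacent; SoA keeps one contiguous
// array per field. Only the offset function changes, not the accessing code.
constexpr std::size_t fieldCount = 3;

// AoS: records are contiguous, fields of one record sit next to each other.
constexpr std::size_t aosOffset(std::size_t i, std::size_t field)
{
    return (i * fieldCount + field) * sizeof(float);
}

// SoA: each field forms one contiguous array of length n.
constexpr std::size_t soaOffset(std::size_t n, std::size_t i, std::size_t field)
{
    return (field * n + i) * sizeof(float);
}
```

Swapping the offset function changes the memory layout without touching the code that performs the access; LLAMA generalizes this idea to arbitrary, user-definable mappings.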
7 changes: 1 addition & 6 deletions docs/index.rst
@@ -1,8 +1,3 @@
.. LLAMA documentation master file, created by
sphinx-quickstart on Wed Sep 26 13:28:02 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

.. image:: images/logo.svg

Low Level Abstraction of Memory Access
@@ -18,7 +13,7 @@ arbitrary depths. It is not limited to struct of array and array of struct
data layouts but also capable of explicitly defining memory layouts with padding, blocking,
striding or any other run time or compile time access pattern.

To archieve this goal LLAMA is split into mostly independent, orthogonal parts
To achieve this goal LLAMA is split into mostly independent, orthogonal parts
completely written in modern C++17 to run on as many architectures and with as
many compilers as possible, while still supporting extensions needed e.g. to run
on GPUs or other many-core hardware.
14 changes: 9 additions & 5 deletions docs/pages/api.rst
@@ -106,10 +106,13 @@ Mappings
:members:
.. doxygenstruct:: llama::mapping::AoSoA
:members:
.. doxygenvariable:: llama::mapping::maxLanes
.. doxygenstruct:: llama::mapping::Split
:members:
.. doxygenstruct:: llama::mapping::Trace
:members:
.. doxygenstruct:: llama::mapping::Heatmap
:members:

Common utilities
^^^^^^^^^^^^^^^^
@@ -121,8 +124,8 @@ Common utilities
.. doxygenstruct:: llama::mapping::LinearizeArrayDimsMorton
:members:

Tree mapping
^^^^^^^^^^^^
Tree mapping (deprecated)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenstruct:: llama::mapping::tree::Mapping
:members:
@@ -137,9 +140,9 @@ For a detailed description of the tree mapping concept have a look at
.. doxygenstruct:: llama::mapping::tree::functor::MoveRTDown

.. FIXME: doxygen fails to parse the source code ...
Dumping
^^^^^^^
Dumping
^^^^^^^

.. doxygenfunction:: llama::toSvg
.. doxygenfunction:: llama::toHtml

@@ -158,5 +161,6 @@ Macros

.. doxygendefine:: LLAMA_INDEPENDENT_DATA
.. doxygendefine:: LLAMA_FN_HOST_ACC_INLINE
.. doxygendefine:: LLAMA_LAMBDA_INLINE
.. doxygendefine:: LLAMA_FORCE_INLINE_RECURSIVE
.. doxygendefine:: LLAMA_COPY
6 changes: 3 additions & 3 deletions docs/pages/blobs.rst
@@ -67,7 +67,7 @@ Creating a small view of :math:`4 \times 4` may look like this:

using Mapping = /* some simple mapping */;
using BlobAllocator = llama::bloballoc::Stack<
miniSize[0] * miniSize[1] * llama::sizeOf</* some record dimension */>::value
miniSize[0] * miniSize[1] * llama::sizeOf<RecordDim>::value
>;

auto miniView = llama::allocView(Mapping{miniSize}, BlobAllocator{});
@@ -77,14 +77,14 @@ with just one element without any padding, alignment, or the like on the stack:

.. code-block:: C++

auto tempView = llama::allocViewStack< N, /* some record dimension */ >();
auto tempView = llama::allocViewStack<N, RecordDim>();


Non-owning blobs
----------------

If a view is needed based on already allocated memory, the view can also be directly constructed with an array of blobs,
e.g. an array of :cpp:`std::byte*` pointers or :cpp:`std::span<std::byte> to the existing memory regions.
e.g. an array of :cpp:`std::byte*` pointers or :cpp:`std::span<std::byte>` to the existing memory regions.
Any blob type works here as long as the view can subscript it like :cpp:`blob[offset]`.
One needs to be careful though, since the ownership of the blobs is now decoupled from the view.
It is the user's responsibility to ensure that the blobs outlive the views based on them.
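
What "subscriptable like :cpp:`blob[offset]`" means can be sketched with a hypothetical minimal blob type (an illustration, not part of LLAMA's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A minimal non-owning blob: it wraps caller-provided bytes and only
// provides operator[], which is all a view needs. Ownership of the
// memory stays entirely with the caller.
struct ByteSpanBlob
{
    std::byte* data;
    std::size_t size;

    std::byte& operator[](std::size_t offset) { return data[offset]; }
};

// A view-like consumer touches memory exclusively through blob[offset].
inline float loadFloat(ByteSpanBlob& blob, std::size_t offset)
{
    float value;
    std::memcpy(&value, &blob[offset], sizeof(float));
    return value;
}
```

Because the blob does not own its bytes, the buffer must stay alive as long as any view built on top of it is used.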
26 changes: 15 additions & 11 deletions docs/pages/copying.rst
@@ -5,35 +5,39 @@
Copying between views
=====================

Especially when working with hardware accelerators such as GPUs or offloading to many-core processors, explicit copy operations call for memory chunks as big as possible to reach good throughput.
Especially when working with hardware accelerators such as GPUs, or offloading to many-core processors, explicit copy operations call for large, contiguous memory chunks to reach good throughput.

Copying the contents of a view from one memory region to another is trivial if mapping and size are identical.
However, if the mapping differs, a direct copy of the underlying memory is wrong.
In most cases only field-wise copy operations are possible.
In many cases only field-wise copy operations are possible.

There is a small class of remaining cases where the mapping is the same, but the size or shape of the view differs, or the record dimensions differ slightly, or the mappings are closely related to each other.
E.g. when both mappings use SoA, but one with and one without padding, or a specific field is missing on one side.
Or two AoSoA mappings with a different inner array length.
In those cases an optimized copy procedure is possible, copying larger chunks than individual fields.

In practice, it is hard to determine the largest possible memory chunks to copy at compile time, since the mappings can always depend on run time parameters.
E.g. a mapping could implement SoA if the view holds more than 255 records, but use AoS for smaller sizes.
.. For the moment, LLAMA implements a generic, field-wise copy with specializations for combinations of SoA and AoSoA mappings, reflect the properties of these.
.. This is sub-optimal, because for every new mapping new specializations are needed.

Three solutions exist for this problem:
.. One thus needs new approaches on how to improve copying because LLAMA can provide the necessary infrastructure:
Four solutions exist for this problem:

1. Implement specializations for specific combinations of mappings, which reflect the properties of these mappings.
This is relevant if an application uses a set of similar mappings and the copy operation between them is the bottleneck.
However, for every new mapping a new specialization is needed.

2. A run time analysis of the two views to find contiguous memory chunks.
The overhead is likely large, especially if no contiguous memory chunks can be identified.

3. A compile time analysis of the mapping function.
This requires the mapping to be formulated in a way which is fully consumable via constexpr and template meta programming, probably at the cost of read- and maintainability.
3. A black box compile time analysis of the mapping function.
All current LLAMA mappings are :cpp:`constexpr` and can thus be run at compile time.
This would allow observing a mapping's behavior by exhaustively sampling the mapping function at compile time.

An additional challenge comes from copies between different address spaces where elementary copy operations require calls to external APIs which profit especially from large chunk sizes.
In that case it may make sense to use a smaller intermediate view to shuffle a chunk from one mapping to the other inside the same address space and then perform a copy of that chunk into the other address space.
This shuffle could be performed in the source or destination address space and potentially overlap with shuffles and copies of other chunks in an asynchronous workflow.
4. A white box compile time analysis of the mapping function.
This requires the mapping to be formulated transparently, in a way which is fully consumable via meta-programming, probably at the cost of readability and maintainability.
Potentially upcoming C++ features in the area of statement reflection could improve this considerably.

Copies between different address spaces, where elementary copy operations require calls to external APIs, pose an additional challenge and profit especially from large chunk sizes.
A good approach could use smaller intermediate views to shuffle a chunk from one mapping to the other and then perform a copy of that chunk into the other address space, potentially overlapping shuffles and copies in an asynchronous workflow.
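
Why a direct byte copy is wrong between differing mappings, and what a field-wise copy does instead, can be sketched in plain C++ (a hypothetical two-field record, not LLAMA's copy implementation):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical record with two float fields, stored for n records once as
// AoS (x0 y0 x1 y1 ...) and once as SoA (x0 x1 ... y0 y1 ...).
// A raw byte copy between the two buffers would scramble the fields;
// a field-wise copy permutes every element to its destination offset.
constexpr std::size_t n = 4;

void copyAosToSoa(const std::array<float, 2 * n>& aos, std::array<float, 2 * n>& soa)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t field = 0; field < 2; ++field)
            soa[field * n + i] = aos[i * 2 + field];
}
```

For this pair of layouts, each field's sub-array is contiguous on the SoA side, so a specialized copy could move n-element chunks per field instead of single elements — the kind of optimization the specializations in option 1 exploit.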

The `async copy example <https://github.com/alpaka-group/llama/blob/master/examples/asynccopy/asynccopy.cpp>`_ tries to show an asynchronous copy/shuffle/compute workflow.
This example applies a blurring kernel to an RGB image, but may also work on only one or two of the three channels.
13 changes: 7 additions & 6 deletions docs/pages/dimensions.rst
@@ -79,16 +79,17 @@ A record dimension itself is just a :cpp:`llama::Record` (or a fundamental type)
struct g {};
struct b {};

using RGB = llama::Record<
llama::Field<r, float>,
llama::Field<g, float>,
llama::Field<b, float>
>;
using Pixel = llama::Record<
llama::Field<color, llama::Record<
llama::Field<r, float>,
llama::Field<g, float>,
llama::Field<b, float>
>>,
llama::Field<color, RGB>,
llama::Field<alpha, char>
>;

Arrays of compile-time extent are also supported as arguments to :cpp:`llama::Field`, but not to :cpp:`llama::Field`.
Arrays of compile-time extent are also supported as arguments to :cpp:`llama::Field`.
Such arrays are expanded into a :cpp:`llama::Record` with multiple :cpp:`llama::Field`\ s of the same type.
E.g. :cpp:`llama::Field<Tag, float[4]>` is expanded into

117 changes: 32 additions & 85 deletions docs/pages/iteration.rst
@@ -15,10 +15,9 @@ offers the :cpp:`begin()` and :cpp:`end()` member functions with corresponding

.. code-block:: C++

llama::ArrayDims<2> ad{3, 3};
llama::ArrayDimsIndexRange range{ad};
llama::ArrayDimsIndexRange range{llama::ArrayDims{3, 3}};

std::for_each(range.begin(), range.end(), [](auto coord) {
std::for_each(range.begin(), range.end(), [](llama::ArrayDims<2> coord) {
// coord is {0, 0}, {0, 1}, {0, 2}, {1, 0}, {1, 1}, {1, 2}, {2, 0}, {2, 1}, {2, 2}
});
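
The row-major order in which the coordinates are produced can be sketched independently of LLAMA (a toy illustration, not the actual :cpp:`llama::ArrayDimsIndexRange` implementation):

```cpp
#include <array>
#include <cassert>
#include <vector>

// Enumerate a 2D index space in row-major order: the last index runs
// fastest, mirroring the coordinate sequence shown above.
std::vector<std::array<int, 2>> rowMajorIndices(int rows, int cols)
{
    std::vector<std::array<int, 2>> coords;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            coords.push_back({i, j});
    return coords;
}
```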

@@ -32,71 +31,27 @@ Record dimension iteration

The record dimension is iterated using :cpp:`llama::forEachLeaf`.
It takes a record dimension as template argument and a callable with a generic parameter as argument.
This function is then called for each leaf of the record dimension tree with a record coord as argument:

.. code-block:: C++

using RecordDim = llama::Record<
llama::Field<x, float>,
llama::Field<y, float>,
llama::Field<z, llama::Record<
llama::Field< low, short>,
llama::Field<high, short>
> >
>;

MyFunctor functor;
llama::forEachLeaf<RecordDim>(functor);

// functor will be called with an instance of
// * RecordCoord<0> for x
// * RecordCoord<1> for y
// * RecordCoord<2, 0> for z.low
// * RecordCoord<2, 1> for z.high

Optionally, a subtree of the RecordDim can be chosen for iteration.
The subtree is selected either via a `RecordCoord` or a series of tags.

.. code-block:: C++

// "functor" will be called for
// * z.low
// * z.high
llama::forEachLeaf<RecordDim>(functor, z{});

// "functor" will be called for
// * z.low
llama::forEachLeaf<RecordDim>(functor, z{}, low{});

// "functor" will be called for
// * z.high
llama::forEachLeaf<RecordDim>(functor, llama::RecordCoord<2, 1>{});

The functor type itself needs to provide the :cpp:`operator()` with one templated parameter, to which
the coordinate of the leaf in the record dimension tree is passed.
The callable's :cpp:`operator()` is then called for each leaf of the record dimension tree with a record coord as argument.
A generic lambda is recommended as the functor.

.. code-block:: C++

auto vd = view(23, 43);
llama::forEachLeaf<RecordDim>([&](auto coord) {
vd(coord) = 1337.0f;
llama::forEachLeaf<Pixel>([&](auto coord) {
// coord is RecordCoord<0, 0>{}, RecordCoord<0, 1>{}, RecordCoord<0, 2>{} and RecordCoord<1>{}
});

// or using a struct:
Optionally, a subtree of the record dimension can be chosen for iteration.
The subtree is selected either via a `RecordCoord` or a series of tags.

template<typename VirtualRecord, typename Value>
struct SetValueFunctor {
template<typename Coord>
void operator()(Coord coord) {
vd(coord) = value;
}
VirtualRecord vd;
const Value value;
};
.. code-block:: C++

llama::forEachLeaf<Pixel>([&](auto coord) {
// coord is RecordCoord<0, 0>{}, RecordCoord<0, 1>{} and RecordCoord<0, 2>{}
}, color{});

SetValueFunctor<decltype(vd), float> functor{1337.0f};
llama::forEachLeaf<RecordDim>(functor);
llama::forEachLeaf<Pixel>([&](auto coord) {
// coord is RecordCoord<0, 1>{}
}, color{}, g{});
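
The mechanism behind such a leaf traversal can be sketched independently of LLAMA, with nested :cpp:`std::tuple`\ s standing in for records (a toy illustration, not LLAMA's implementation, which passes record coords rather than values):

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>
#include <vector>

// Detect nested "records" (tuples here) vs. leaves (everything else).
template<typename T>
struct IsTuple : std::false_type {};
template<typename... Ts>
struct IsTuple<std::tuple<Ts...>> : std::true_type {};

// Visit every leaf of a nested tuple tree, depth first, left to right.
template<typename F, typename T>
void forEachLeafToy(F&& f, const T& node)
{
    if constexpr (IsTuple<T>::value)
        std::apply([&](const auto&... child) { (forEachLeafToy(f, child), ...); }, node);
    else
        f(node);
}
```

For a pixel-like tree `((r, g, b), alpha)` the callable fires once per leaf, in declaration order, just like the record coords enumerated above.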

A more detailed example can be found in the
`simpletest example <https://github.com/alpaka-group/llama/blob/master/examples/simpletest/simpletest.cpp>`_.
@@ -105,40 +60,32 @@ A more detailed example can be found in the
View iterators
--------------

Iterators on views of any dimension are supported.
Higher than 1D iterators however are difficult to get right if we also want to achieve well optimized assembly.
Multiple nested loops seem to be optimized better than a single loop using iterators over multiple dimensions.

Nevertheless, having an iterator to a view opens up the standard library for use in conjunction with LLAMA:
Iterators on views of any dimension are supported and open up the standard library for use in conjunction with LLAMA:

.. code-block:: C++

using Pixel = ...;
using ArrayDims = llama::ArrayDims<1>;
// ...
auto view = llama::allocView(mapping);
// ...

for (auto vd : view) {
vd(x{}) = 1.0f;
vd(y{}) = 2.0f;
vd(z{}, low{}) = 3;
vd(z{}, high{}) = 4;
}
std::transform(begin(view), end(view), begin(view), [](auto vd) { return vd * 2; });
const float sumY = std::accumulate(begin(view), end(view), 0, [](int acc, auto vd) { return acc + vd(y{}); });

// C++20:

for (auto x : view | std::views::transform([](auto vd) { return vd(x{}); }) | std::views::take(2))
// ...

Since virtual records interact with each other based on the tags and not the underlying mappings, we can also use iterators from multiple views together:
// range for
for (auto vd : view)
vd(color{}, r{}) = 1.0f;

.. code-block:: C++
auto view2 = llama::allocView(...); // with a different mapping

auto aosView = llama::allocView(llama::mapping::AoS<ArrayDims, RecordDim>{arrayDimsSize});
auto soaView = llama::allocView(llama::mapping::SoA<ArrayDims, RecordDim>{arrayDimsSize});
// ...
// layout changing copy
std::copy(begin(aosView), end(aosView), begin(soaView));

auto innerProduct = std::transform_reduce(begin(aosView), end(aosView), begin(soaView), llama::One<RecordDim>{});
// transform into other view
std::transform(begin(view), end(view), begin(view2), [](auto vd) { return vd(color{}) * 2; });

// accumulate using llama::One as accumulator and destructure the result
const auto [r, g, b] = std::accumulate(begin(view), end(view), llama::One<RGB>{},
[](auto acc, auto vd) { return acc + vd(color{}); });

// C++20:
for (auto x : view | std::views::transform([](auto vd) { return vd(x{}); }) | std::views::take(2))
// ...
6 changes: 3 additions & 3 deletions docs/pages/mappings.rst
@@ -113,7 +113,7 @@ One mapping
-----------

The One mapping is intended to map all coordinates in the array dimensions onto the same memory location.
This is commonly used in the `llama::One` virtual record, but also offers interesting applications in conjunction with the `llama::mapping::Split` mapping.
This is commonly used in the :cpp:`llama::One` virtual record, but also offers interesting applications in conjunction with the :cpp:`llama::mapping::Split` mapping.


Split mapping
@@ -136,8 +136,8 @@ Split mappings can be nested to map a record dimension into even fancier combinations

.. _label-tree-mapping:

Tree mapping
------------------
Tree mapping (deprecated)
-------------------------

WARNING: The tree mapping is currently not maintained and we are considering its deprecation.

7 changes: 3 additions & 4 deletions docs/pages/virtualrecord.rst
@@ -32,10 +32,10 @@ Supplying the array dimensions coordinate to a view access returns such a :cpp:`
This object can be thought of like a record in the :math:`N`-dimensional array dimensions space,
but as the fields of this record may not be contiguous in memory, it is not a real object in the C++ sense and thus called virtual.

Accessing subparts of a :cpp:`llama::VirtualRecord` is done using `operator()` and the tag types from the record dimension.
Accessing subparts of a :cpp:`llama::VirtualRecord` is done using :cpp:`operator()` and the tag types from the record dimension.

If an access describes a final/leaf element in the record dimension, a reference to a value of the corresponding type is returned.
Such an access is called terminal. If the access is non-termian, i.e. it does not yet reach a leaf in the record dimension tree,
Such an access is called terminal. If the access is non-terminal, i.e. it does not yet reach a leaf in the record dimension tree,
another :cpp:`llama::VirtualRecord` is returned, binding the tags already used for navigating down the record dimension.
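
The idea of a virtual record can be sketched independently of LLAMA (a toy proxy over SoA storage; the tags, field types, and fixed size are hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tags, mirroring the record dimension tags used in the docs.
struct r {};
struct g {};

// A toy stand-in for llama::VirtualRecord over SoA storage: the record's
// fields live in separate arrays, so the "record" is only a bundle of
// references plus an index -- virtual, not a contiguous C++ object.
struct ToyVirtualRecord
{
    std::array<float, 4>& rs;
    std::array<float, 4>& gs;
    std::size_t i;

    // Terminal accesses return references into the underlying storage.
    float& operator()(r) { return rs[i]; }
    float& operator()(g) { return gs[i]; }
};
```

Writing through the proxy updates the scattered per-field arrays, even though no contiguous record object exists anywhere in memory.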

A :cpp:`llama::VirtualRecord` can be used like a real local object in many places. It can be used as a local variable, copied around, passed as an argument to a function (as seen in the
@@ -56,8 +56,7 @@ This is useful when we want to have a single record instance e.g. as a local variable
auto pixel2 = pixel; // independent copy

Technically, :cpp:`llama::One` is a :cpp:`llama::VirtualRecord` which stores a scalar :cpp:`llama::View` inside, using the mapping :cpp:`llama::mapping::One`.
This also has the unfortunate consequence that a :cpp:`llama::One` is now a value type with deep-copy semantic.
We might address this inconsistency at some point.
This also has the consequence that a :cpp:`llama::One` is now a value type with deep-copy semantic.


Arithmetic and logical operators