diff --git a/README.md b/README.md
index e37f350f48..189f989c3b 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ arbitrary depths. It is not limited to struct of array and array of struct
 data layouts but also capable to explicitly define memory layouts with padding,
 blocking, striding or any other run time or compile time access pattern.
 
-To archieve this goal LLAMA is split into mostly independent, orthogonal parts
+To achieve this goal LLAMA is split into mostly independent, orthogonal parts
 completely written in modern C++17 to run on as many architectures and with as
 many compilers as possible while still supporting extensions needed e.g. to
 run on GPU or other many core hardware.
diff --git a/docs/index.rst b/docs/index.rst
index 28a3b6c071..2382a97b1b 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,8 +1,3 @@
-.. LLAMA documentation master file, created by
-   sphinx-quickstart on Wed Sep 26 13:28:02 2018.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
 .. image:: images/logo.svg
 
 Low Level Abstraction of Memory Access
@@ -18,7 +13,7 @@ arbitrary depths. It is not limited to struct of array and array of struct
 data layouts but also capable to explicitly define memory layouts with padding,
 blocking, striding or any other run time or compile time access pattern.
 
-To archieve this goal LLAMA is split into mostly independent, orthogonal parts
+To achieve this goal LLAMA is split into mostly independent, orthogonal parts
 completely written in modern C++17 to run on as many architectures and with as
 many compilers as possible while still supporting extensions needed e.g. to
 run on GPU or other many core hardware.
diff --git a/docs/pages/api.rst b/docs/pages/api.rst
index 784b3e7e99..55a4c30851 100644
--- a/docs/pages/api.rst
+++ b/docs/pages/api.rst
@@ -106,10 +106,13 @@ Mappings
    :members:
 .. doxygenstruct:: llama::mapping::AoSoA
    :members:
+.. doxygenvariable:: llama::mapping::maxLanes
 .. doxygenstruct:: llama::mapping::Split
    :members:
 .. doxygenstruct:: llama::mapping::Trace
    :members:
+.. doxygenstruct:: llama::mapping::Heatmap
+   :members:
 
 Common utilities
 ^^^^^^^^^^^^^^^^
@@ -121,8 +124,8 @@ Common utilities
 .. doxygenstruct:: llama::mapping::LinearizeArrayDimsMorton
    :members:
 
-Tree mapping
-^^^^^^^^^^^^
+Tree mapping (deprecated)
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. doxygenstruct:: llama::mapping::tree::Mapping
    :members:
@@ -137,9 +140,9 @@ For a detailed description of the tree mapping concept have a look at
 .. doxygenstruct:: llama::mapping::tree::functor::MoveRTDown
 
 .. FIXME: doxygen fails to parse the source code ...
-   Dumping
-   ^^^^^^^
-
+Dumping
+^^^^^^^
+
 .. doxygenfunction:: llama::toSvg
 .. doxygenfunction:: llama::toHtml
 
@@ -158,5 +161,6 @@ Macros
 
 .. doxygendefine:: LLAMA_INDEPENDENT_DATA
 .. doxygendefine:: LLAMA_FN_HOST_ACC_INLINE
+.. doxygendefine:: LLAMA_LAMBDA_INLINE
 .. doxygendefine:: LLAMA_FORCE_INLINE_RECURSIVE
 .. doxygendefine:: LLAMA_COPY
diff --git a/docs/pages/blobs.rst b/docs/pages/blobs.rst
index 9ee65401bd..69a9a75dcd 100644
--- a/docs/pages/blobs.rst
+++ b/docs/pages/blobs.rst
@@ -67,7 +67,7 @@ Creating a small view of :math:`4 \times 4` may look like this:
     using Mapping = /* some simple mapping */;
     using BlobAllocator = llama::bloballoc::Stack<
-        miniSize[0] * miniSize[1] * llama::sizeOf::value
+        miniSize[0] * miniSize[1] * llama::sizeOf::value
     >;
     auto miniView = llama::allocView(Mapping{miniSize}, BlobAllocator{});
 
@@ -77,14 +77,14 @@ with just one element without any padding, aligment, or whatever on the stack:
 
 .. code-block:: C++
 
-    auto tempView = llama::allocViewStack< N, /* some record dimension */ >();
+    auto tempView = llama::allocViewStack<N, /* some record dimension */>();
 
 Non-owning blobs
 ----------------
 
 If a view is needed based on already allocated memory,
 the view can also be directly constructed with an array of blobs,
-e.g. an array of :cpp:`std::byte*` pointers or :cpp:`std::span to the existing memory regions.
+e.g. an array of :cpp:`std::byte*` pointers or :cpp:`std::span` to the existing memory regions.
 Everything works here as long as it can be subscripted by the view like :cpp:`blob[offset]`.
 One needs to be careful though, since now the ownership of the blob is decoupled from the view.
 It is the responsibility of the user now to ensure that the blobs outlive the views based on them.
diff --git a/docs/pages/copying.rst b/docs/pages/copying.rst
index 16a6f54ccc..57aaa5ffce 100644
--- a/docs/pages/copying.rst
+++ b/docs/pages/copying.rst
@@ -5,35 +5,39 @@
 Copying between views
 =====================
 
-Especially when working with hardware accelerators such as GPUs or offloading to many-core processors, explicit copy operations call for memory chunks as big as possible to reach good throughput.
+Especially when working with hardware accelerators such as GPUs, or offloading to many-core processors, explicit copy operations call for large, contiguous memory chunks to reach good throughput.
 Copying the contents of a view from one memory region to another if mapping and size are identical is trivial.
 However, if the mapping differs, a direct copy of the underlying memory is wrong.
-In most cases only field-wise copy operations are possible.
+In many cases only field-wise copy operations are possible.
 
 There is a small class of remaining cases where the mapping is the same,
 but the size or shape of the view is different,
 or the record dimension differ slightly,
 or the mappings are very related to each other.
 E.g. when both mappings use SoA, but one time with, one time without padding,
 or a specific field is missing on one side.
 Or two AoSoA mappings with a different inner array length.
 In those cases an optimized copy procedure is possible, copying larger chunks than mere fields.
-Practically, it is hard to figure out the biggest possible memory chunks to copy at compile time, since the mappings can always depend on run time parameters.
-E.g. a mapping could implement SoA if the view is bigger than 255 records, but use AoS for a smaller size.
+.. For the moment, LLAMA implements a generic, field-wise copy with specializations for combinations of SoA and AoSoA mappings, which reflect the properties of these.
+.. This is sub-optimal, because for every new mapping new specializations are needed.
 
-Three solutions exist for this problem:
+.. One thus needs new approaches on how to improve copying because LLAMA can provide the necessary infrastructure:
+
+Four solutions exist for this problem:
 
 1. Implement specializations for specific combinations of mappings, which reflect the properties of these.
-This is relevant if an application uses a set of similar mappings and the copy operation between them is the bottle neck.
 However, for every new mapping a new specialization is needed.
 
 2. A run time analysis of the two views to find contiguous memory chunks.
 The overhead is probably big, especially if no contiguous memory chunks are identified.
 
-3. A compile time analysis of the mapping function.
-This requires the mapping to be formulated in a way which is fully consumable via constexpr and template meta programming, probably at the cost of read- and maintainability.
+3. A black box compile time analysis of the mapping function.
+All current LLAMA mappings are :cpp:`constexpr` and can thus be run at compile time.
+This would allow observing a mapping's behavior by exhaustively sampling the mapping function at compile time.
-An additional challenge comes from copies between different address spaces where elementary copy operations require calls to external APIs which profit especially from large chunk sizes.
-In that case it may make sense to use a smaller intermediate view to shuffle a chunk from one mapping to the other inside the same address space and then perform a copy of that chunk into the other address space.
-This shuffle could be performed in the source or destination address space and potentially overlap with shuffles and copies of other chunks in an asynchronous workflow.
+4. A white box compile time analysis of the mapping function.
+This requires the mapping to be formulated transparently in a way which is fully consumable via meta-programming, probably at the cost of read- and maintainability.
+Potentially upcoming C++ features in the area of statement reflection could improve this approach a lot.
+
+Copies between different address spaces, where elementary copy operations require calls to external APIs, pose an additional challenge and profit especially from large chunk sizes.
+A good approach could use smaller intermediate views to shuffle a chunk from one mapping to the other and then perform a copy of that chunk into the other address space, potentially overlapping shuffles and copies in an asynchronous workflow.
 
 The `async copy example `_ tries to show an asynchronous copy/shuffle/compute workflow.
 This example applies a blurring kernel to an RGB-image, but also may work only on two or one channel instead of all three.
diff --git a/docs/pages/dimensions.rst b/docs/pages/dimensions.rst
index 3ebaad5aa0..b87064e7ad 100644
--- a/docs/pages/dimensions.rst
+++ b/docs/pages/dimensions.rst
@@ -79,16 +79,17 @@ A record dimension itself is just a :cpp:`llama::Record` (or a fundamental type)
     struct g {};
     struct b {};
 
+    using RGB = llama::Record<
+        llama::Field<r, float>,
+        llama::Field<g, float>,
+        llama::Field<b, float>
+    >;
     using Pixel = llama::Record<
-        llama::Field<color, llama::Record<
-            llama::Field<r, float>,
-            llama::Field<g, float>,
-            llama::Field<b, float>
-        >>,
+        llama::Field<color, RGB>,
         llama::Field<alpha, char>
     >;
 
-Arrays of compile-time extent are also supported as arguments to :cpp:`llama::Field`, but not to :cpp:`llama::Field`.
+Arrays of compile-time extent are also supported as arguments to :cpp:`llama::Field`.
 Such arrays are expanded into a :cpp:`llama::Record` with multiple :cpp:`llama::Field`\ s of the same type.
 E.g. :cpp:`llama::Field<Tag, int[3]>` is expanded into
diff --git a/docs/pages/iteration.rst b/docs/pages/iteration.rst
index b0c8d10cdb..f9a22f8d03 100644
--- a/docs/pages/iteration.rst
+++ b/docs/pages/iteration.rst
@@ -15,10 +15,9 @@ offers the :cpp:`begin()` and :cpp:`end()` member functions with corresponding
 
 .. code-block:: C++
 
-    llama::ArrayDims<2> ad{3, 3};
-    llama::ArrayDimsIndexRange range{ad};
+    llama::ArrayDimsIndexRange range{llama::ArrayDims{3, 3}};
 
-    std::for_each(range.begin(), range.end(), [](auto coord) {
+    std::for_each(range.begin(), range.end(), [](llama::ArrayDims<2> coord) {
         // coord is {0, 0}, {0, 1}, {0, 2}, {1, 0}, {1, 1}, {1, 2}, {2, 0}, {2, 1}, {2, 2}
     });
 
@@ -32,71 +31,27 @@ Record dimension iteration
 
 The record dimension is iterated using :cpp:`llama::forEachLeaf`.
 It takes a record dimension as template argument and a callable with a generic parameter as argument.
-This function is then called for each leaf of the record dimension tree with a record coord as argument:
-
-.. code-block:: C++
-
-    using RecordDim = llama::Record<
-        llama::Field,
-        llama::Field,
-        llama::Field,
-        llama::Field
-        > >
-    >;
-
-    MyFunctor functor;
-    llama::forEachLeaf(functor);
-
-    // functor will be called with an instance of
-    // * RecordCoord<0> for x
-    // * RecordCoord<1> for y
-    // * RecordCoord<2, 0> for z.low
-    // * RecordCoord<2, 1> for z.high
-
-Optionally, a subtree of the RecordDim can be chosen for iteration.
-The subtree is selected either via a `RecordCoord` or a series of tags.
-
-.. code-block:: C++
-
-    // "functor" will be called for
-    // * z.low
-    // * z.high
-    llama::forEachLeaf(functor, z{});
-
-    // "functor" will be called for
-    // * z.low
-    llama::forEachLeaf(functor, z{}, low{});
-
-    // "functor" will be called for
-    // * z.high
-    llama::forEachLeaf(functor, llama::RecordCoord<2, 1>{});
-
-The functor type itself needs to provide the :cpp:`operator()` with one templated parameter, to which
-the coordinate of the leaf in the record dimension tree is passed.
+This function's :cpp:`operator()` is then called for each leaf of the record dimension tree with a record coord as argument.
+A polymorphic lambda is recommended to be used as a functor.
 
 .. code-block:: C++
 
-    auto vd = view(23, 43);
-    llama::forEachLeaf([&](auto coord) {
-        vd(coord) = 1337.0f;
+    llama::forEachLeaf<Pixel>([&](auto coord) {
+        // coord is RecordCoord<0, 0>{}, RecordCoord<0, 1>{}, RecordCoord<0, 2>{} and RecordCoord<1>{}
     });
 
-    // or using a struct:
+Optionally, a subtree of the record dimension can be chosen for iteration.
+The subtree is selected either via a `RecordCoord` or a series of tags.
 
-    template
-    struct SetValueFunctor {
-        template
-        void operator()(Coord coord) {
-            vd(coord) = value;
-        }
-        VirtualRecord vd;
-        const Value value;
-    };
+.. code-block:: C++
+
+    llama::forEachLeaf<Pixel>([&](auto coord) {
+        // coord is RecordCoord<0, 0>{}, RecordCoord<0, 1>{} and RecordCoord<0, 2>{}
+    }, color{});
 
-    SetValueFunctor functor{1337.0f};
-    llama::forEachLeaf(functor);
+    llama::forEachLeaf<Pixel>([&](auto coord) {
+        // coord is RecordCoord<0, 1>{}
+    }, color{}, g{});
 
 A more detailed example can be found in the
 `simpletest example `_.
 
@@ -105,40 +60,32 @@ A more detailed example can be found in the
 
 View iterators
 --------------
 
-Iterators on views of any dimension are supported.
-Higher than 1D iterators however are difficult to get right if we also want to achieve well optimized assembly.
-Multiple nested loops seem to be optimized better than a single loop using iterators over multiple dimensions.
-
-Nevertheless, having an iterator to a view opens up the standard library for use in conjunction with LLAMA:
+Iterators on views of any dimension are supported and open up the standard library for use in conjunction with LLAMA:
 
 .. code-block:: C++
 
+    using Pixel = ...;
     using ArrayDims = llama::ArrayDims<1>;
     // ...
     auto view = llama::allocView(mapping);
+    // ...
 
-    for (auto vd : view) {
-        vd(x{}) = 1.0f;
-        vd(y{}) = 2.0f;
-        vd(z{}, low{}) = 3;
-        vd(z{}, high{}) = 4;
-    }
-    std::transform(begin(view), end(view), begin(view), [](auto vd) { return vd * 2; });
-    const float sumY = std::accumulate(begin(view), end(view), 0, [](int acc, auto vd) { return acc + vd(y{}); });
-
-    // C++20:
-
-    for (auto x : view | std::views::transform([](auto vd) { return vd(x{}); }) | std::views::take(2))
-    // ...
-
-Since virtual records interact with each other based on the tags and not the underlying mappings, we can also use iterators from multiple views together:
+    // range for
+    for (auto vd : view)
+        vd(color{}, r{}) = 1.0f;
 
-.. code-block:: C++
+    auto view2 = llama::allocView(...); // with different mapping
 
-    auto aosView = llama::allocView(llama::mapping::AoS{arrayDimsSize});
-    auto soaView = llama::allocView(llama::mapping::SoA{arrayDimsSize});
-    // ...
+    // layout changing copy
-    std::copy(begin(aosView), end(aosView), begin(soaView));
+    std::copy(begin(view), end(view), begin(view2));
 
-    auto innerProduct = std::transform_reduce(begin(aosView), end(aosView), begin(soaView), llama::One{});
+    // transform into other view
+    std::transform(begin(view), end(view), begin(view2), [](auto vd) { return vd(color{}) * 2; });
+
+    // accumulate using One as accumulator and destructure result
+    const auto [r, g, b] = std::accumulate(begin(view), end(view), One{},
+        [](auto acc, auto vd) { return acc + vd(color{}); });
+
+    // C++20:
+    for (auto red : view | std::views::transform([](auto vd) { return vd(color{}, r{}); }) | std::views::take(2))
+        // ...
diff --git a/docs/pages/mappings.rst b/docs/pages/mappings.rst
index 9aae72b300..be919d6767 100644
--- a/docs/pages/mappings.rst
+++ b/docs/pages/mappings.rst
@@ -113,7 +113,7 @@ One mapping
 -----------
 
 The One mapping is intended to map all coordinates in the array dimensions onto the same memory location.
-This is commonly used in the `llama::One` virtual record, but also offers interesting applications in conjunction with the `llama::mapping::Split` mapping.
+This is commonly used in the :cpp:`llama::One` virtual record, but also offers interesting applications in conjunction with the :cpp:`llama::mapping::Split` mapping.
 
 
 Split mapping
@@ -136,8 +136,8 @@ Split mappings can be nested to map a record dimension into even fancier combinations.
 
 .. _label-tree-mapping:
 
-Tree mapping
-------------------
+Tree mapping (deprecated)
+-------------------------
 
 WARNING: The tree mapping is currently not maintained and we consider deprecation.
diff --git a/docs/pages/virtualrecord.rst b/docs/pages/virtualrecord.rst
index 96dc22fca8..3e9890747b 100644
--- a/docs/pages/virtualrecord.rst
+++ b/docs/pages/virtualrecord.rst
@@ -32,10 +32,10 @@ Supplying the array dimensions coordinate to a view access returns such a :cpp:`
 This object can be thought of like a record in the :math:`N`-dimensional array dimensions space,
 but as the fields of this record may not be contiguous in memory,
 it is not a real object in the C++ sense and thus called virtual.
 
-Accessing subparts of a :cpp:`llama::VirtualRecord` is done using `operator()` and the tag types from the record dimension.
+Accessing subparts of a :cpp:`llama::VirtualRecord` is done using :cpp:`operator()` and the tag types from the record dimension.
 If an access describes a final/leaf element in the record dimension, a reference to a value of the corresponding type is returned.
-Such an access is called terminal. If the access is non-termian, i.e. it does not yet reach a leaf in the record dimension tree,
+Such an access is called terminal. If the access is non-terminal, i.e. it does not yet reach a leaf in the record dimension tree,
 another :cpp:`llama::VirtualRecord` is returned, binding the tags already used for navigating down the record dimension.
 
 A :cpp:`llama::VirtualRecord` can be used like a real local object in many places.
 It can be used as a local variable, copied around, passed as an argument to a function (as seen in the
@@ -56,8 +56,7 @@ This is useful when we want to have a single record instance e.g. as a local variable
 
     auto pixel2 = pixel; // independent copy
 
 Technically, :cpp:`llama::One` is a :cpp:`llama::VirtualRecord` which stores a scalar :cpp:`llama::View` inside, using the mapping :cpp:`llama::mapping::One`.
-This also has the unfortunate consequence that a :cpp:`llama::One` is now a value type with deep-copy semantic.
-We might address this inconsistency at some point.
+This also has the consequence that a :cpp:`llama::One` is now a value type with deep-copy semantics.
 
 
 Arithmetic and logical operatores
diff --git a/examples/nbody/nbody.cpp b/examples/nbody/nbody.cpp
index 93d1083569..980af479cc 100644
--- a/examples/nbody/nbody.cpp
+++ b/examples/nbody/nbody.cpp
@@ -1333,8 +1333,13 @@ namespace manualSoA_Vc
 auto main() -> int
 try
 {
+#if __has_include(<Vc/Vc>)
     using vec = Vc::Vector<FP>;
     // using vec = Vc::SimdArray;
+    constexpr auto SIMDLanes = vec::size();
+#else
+    constexpr auto SIMDLanes = 1;
+#endif
 
     const auto numThreads = static_cast<std::size_t>(omp_get_max_threads());
     const char* affinity = std::getenv("GOMP_CPU_AFFINITY");
@@ -1350,7 +1355,7 @@ SIMD lanes: {}
         PROBLEM_SIZE * sizeof(FP) * 7 / 1024,
         numThreads,
         affinity,
-        vec::size());
+        SIMDLanes);
 
     std::ofstream plotFile{"nbody.sh"};
     plotFile.exceptions(std::ios::badbit | std::ios::failbit);
@@ -1371,7 +1376,7 @@ set y2tics auto
 )",
         numThreads,
         affinity,
-        vec::size(),
+        SIMDLanes,
         PROBLEM_SIZE / 1024,
         common::hostname());
     plotFile << "\"\"\t\"update\"\t\"move\"\n";
diff --git a/include/llama/macros.hpp b/include/llama/macros.hpp
index 3d36061f97..2b3a43773a 100644
--- a/include/llama/macros.hpp
+++ b/include/llama/macros.hpp
@@ -59,6 +59,7 @@
 #    endif
 #endif
 #ifndef LLAMA_LAMBDA_INLINE
+/// Gives strong indication to the compiler to inline the attributed lambda.
 #    define LLAMA_LAMBDA_INLINE LLAMA_LAMBDA_INLINE_WITH_SPECIFIERS()
 #endif