Copying the contents of one memory buffer to another.
So far this has been tested on iPhone 13 mini 256GB.
- Open AppleNumericalComputing/iOSTester_01/iOSTester_01.xcodeproj with Xcode.
- Build a release build.
- Run the iOS App in the release build.
- Press 'Run' on the screen.
- Wait until the App finishes with 'finished!' on the log output.
- Copy and paste the log into 01_memcpy/doc_ios/make_log.txt.
- Run the following in the terminal.
$ cd 01_memcpy
$ grep '\(^INT\|^FLOAT\|^DOUBLE\|data element type\)' doc_ios/make_log.txt > doc_ios/make_log_cleaned.txt
$ python ../common/process_log.py -logfile doc_ios/make_log_cleaned.txt -specfile doc_ios/plot_spec.json -show_impl -plot_charts -base_dir doc_ios/
- You will get the PNG files in 01_memcpy/doc_ios/.
- memcpy() is 5x-10x faster than the compiler-optimized simple loop in C++. This is achieved by the unrolled and interleaved ldp & stp machine instructions, which maximally fill the pipeline.
- Multithreading the calls to memcpy() does not make it faster: the overhead of synchronizing multiple threads is amortized at around a problem size of 2 megabytes, but at around that size the problem becomes bound by memory I/O and apparently loses the benefit of multithreading.
- memcpy() utilizes the two particular instructions ldp & stp as stated above. The Clang++ compiler emits two ldps and then two stps with '-O3' by default, which is equivalent to a loop unrolling factor of 4. If explicit loop unrolling is specified, Clang emits interleaved ld[u]r & st[u]r instead.
- For Metal, MTLBlitCommandEncoder is in general faster than a shader with a simple parallel assignment such as out[ tid ] = in[ tid ];, but the difference is not significant.
- There is no noticeable difference in speed between the use of the shared and the managed MTL buffers.
- The cost of launching a kernel on Metal is amortized at around 256 megabytes. Around that size the use of Metal will be justified.
Copying the contents of one memory buffer to another can be considered a minimal I/O-bound problem that can be parallelized. The standard library function memcpy() is known to be highly optimized for the given system, and it seems to be the dominant way to copy data efficiently.
The purpose of this section is as follows:
- To measure the running time of memcpy() against other possible implementations.
- To measure the impact of several techniques for copying bulk memory, notably the selection of instructions, loop unrolling, and cache prefetch.
- To measure the running time of MTLBlitCommandEncoder (the Metal version of memcpy()) in comparison to a plain Metal shader implementation as well as memcpy() on the CPU.
- To study how memcpy() achieves its running time.
The following experiments are done with ./test_memcpy.cpp in this directory.
- Compiler: Apple clang version 13.0.0 (clang-1300.0.29.3) Target: arm64-apple-darwin20.6.0 Thread model: posix
- Device: Mac mini (M1, 2020), Chip: Apple M1, Memory: 8GB, macOS Big Sur Version 11.6
Please type make all in this directory to reproduce the results.
The following chart shows the mean running times taken to copy the given number of ints (4 bytes) for each implementation in log-log scale. The X-axis is the number of ints copied, and the Y-axis is the time taken in milliseconds.
- CPP BLOCK 1 1 : Plain C++ (with '-O3', with which Clang++ emits ldp & stp)
- CPP BLOCK 4 1 : Plain C++ with loop unrolling of factor 4 (with '-O3', with which Clang emits ldur & stur)
- CPP BLOCK 4 2 : Plain C++ with loop unrolling of factor 4 with 2 threads (thread synchronization done mainly with condition variables)
- MEMCPY 1 1 : memcpy() (with highly optimized loop-unrolled ldp & stp)
- MEMCPY 1 2 : memcpy() with two threads
- METAL_BLIT 0 0 : Metal BLIT command with shared MTL buffers
memcpy() performs best for all the problem sizes. Compared with the plain C++ implementation, memcpy() is 5x - 10x faster. The manual loop unrolling in C++ improves the speed, but not as much as memcpy(). The minimum overhead of launching 2 CPU threads is around 5 [μs], and apparently there is no benefit in using multiple threads. The minimum overhead of launching a Metal kernel is somewhere below 100 [μs].
The following chart shows the relative running times taken to copy the given number of ints (4 bytes) for each implementation in log-lin scale. The X-axis is the number of ints copied, and the Y-axis is the running time of each implementation relative to 'CPP BLOCK 1 1', which is fixed at 1.0.
- CPP BLOCK 1 1 : Plain C++ with '-O3' - baseline
- CPP BLOCK 2 1 : Plain C++ with loop unrolling of factor 2
- CPP BLOCK 4 1 : Plain C++ with loop unrolling of factor 4
- CPP BLOCK 8 1 : Plain C++ with loop unrolling of factor 8
- MEMCPY 1 1 : memcpy()
CPP BLOCK 1 1 uses non-interleaved ldp & stp with a loop unrolling factor of 2.
MEMCPY 1 1 uses interleaved ldp & stp with a loop unrolling factor of 8.
CPP BLOCK X 1 uses interleaved ld[u]r & st[u]r with a loop unrolling factor of X.
There is a clear advantage to explicit loop unrolling, either by a pragma or by manual expansion; Clang++ generates the same code for both. The difference between memcpy() and the unrolled code can be explained by the difference in the machine instructions used and in the order of the loads and stores in the loop body.
ld[u]r & st[u]r are the single-word load and store instructions, and ldp & stp are the pair-wise load & store instructions.
Please see ARM-software/optimized-routines/memcpy.S on GitHub, provided by ARM, for the assembler code of memcpy().
The following chart shows the effect of multithreading in comparison to the single-thread version fixed at 1.0, as well as the memcpy() version as a reference. Except for memcpy(), all the implementations use loop unrolling of factor 4.
- CPP BLOCK 4 1 : Plain C++ with loop unrolling of factor 4, single thread - baseline
- CPP BLOCK 4 2 : Plain C++ with loop unrolling of factor 4, 2 threads
- CPP BLOCK 4 4 : Plain C++ with loop unrolling of factor 4, 4 threads
- CPP BLOCK 4 8 : Plain C++ with loop unrolling of factor 4, 8 threads
- MEMCPY 1 1 : memcpy()
The overhead of synchronizing multiple threads is amortized at around a problem size of 1 megabyte (256K ints). For problems larger than 1 megabyte, the multithreaded versions are 40-60% faster than the single-thread version. In most cases memcpy() achieves the best running time.
The following chart shows the effect of splitting the memcpy() into consecutive blocks, each of which is handled by a separate thread.
- MEMCPY 1 1 : Single thread - baseline
- MEMCPY 1 2 : Memory split into 2 blocks, each handled by a separate thread
- MEMCPY 1 4 : Memory split into 4 blocks, each handled by a separate thread
- MEMCPY 1 8 : Memory split into 8 blocks, each handled by a separate thread
The overhead of synchronizing two threads is amortized at around 2 to 4 megabytes. There seems to be no clear benefit in multithreading, as this problem is purely I/O bound.
The following chart shows the difference among 4 Metal versions to copy data from one MTLBuffer to another.
- METAL DEFAULT_SHARED : The Metal shader with the simple assignments with two shared MTL buffers - baseline
- METAL DEFAULT_MANAGED : The Metal shader with the simple assignments with two managed MTL buffers
- METAL BLIT_SHARED : MTLBlitCommandEncoder with two shared MTL buffers
- METAL BLIT_MANAGED : MTLBlitCommandEncoder with two managed MTL buffers
The running times fluctuate significantly for sizes less than 256 megabytes.
For the shared MTL buffers it is assumed that the views into those buffers from both the CPU and the GPU are consistent before the commit to the command encoder and after the completion.
For the managed MTL buffers the following explicit synchronizations are required.
- [ _buf didModifyRange:NSMakeRange(_low_byte,_high_byte) ];
- [ _encoder synchronizeResource:_buf ];
The chart shows that MTLBlitCommandEncoder runs roughly 25% faster depending on the problem size, and there is no noticeable difference in speed between the shared MTL buffers and the managed MTL buffers.
This section reports three particular findings I have observed during the experiments. They are probably applicable only to my particular environment (Mac mini M1 2020 & Clang++ 13.0.0).
5.1. The Speed of memcpy() is Achieved by the Interleaved ldp & stp with the Loop Unrolling of Factor 8.
The reference assembler code of memcpy() is found in ARM-software/optimized-routines/memcpy.S on GitHub, provided by ARM.
Please see the following excerpt. This is the main part of the copy routine. It has 4 interleaved ldp & stp pairs, each of which loads and stores 16 consecutive bytes (two 64-bit words). In total 64 consecutive bytes are processed in one iteration.
L(loop64):
stp A_l, A_h, [dst, 16]
ldp A_l, A_h, [src, 16]
stp B_l, B_h, [dst, 32]
ldp B_l, B_h, [src, 32]
stp C_l, C_h, [dst, 48]
ldp C_l, C_h, [src, 48]
stp D_l, D_h, [dst, 64]!
ldp D_l, D_h, [src, 64]!
subs count, count, 64
b.hi L(loop64)
Clang++ emits ldp & stp for the following plain C++ code.
Command: clang++ -O3 -Wall -pthread -march=armv8-a+fp+simd -std=c++17 -S -c -o test_copy.S test_copy.cpp
test_copy.cpp
void copy_baseline(const int* const x, int* const y, const size_t num ) {
for ( size_t i = 0; i < num; i++ ) {
y[i] = x[i];
}
}
test_copy.S (snippet)
.globl __Z13copy_baselinePKiPii ; -- Begin function _Z13copy_baselinePKiPii
...
LBB0_4: ; =>This Inner Loop Header: Depth=1
ldp q0, q1, [x11, #-32]
ldp q2, q3, [x11], #64
stp q0, q1, [x10, #-32]
stp q2, q3, [x10], #64
subs x12, x12, #16 ; =16
Please note the order of those instructions: ldps first, and then stps.
However, if a pragma is added to the code, or the loop is explicitly unrolled by hand, Clang++ no longer emits ldp & stp. Please see the following code snippet, and note that Clang++ emits the instructions interleaved. The explicit unrolling improves the time compared with the default '-O3' above, but it does not come close to memcpy()'s time. The difference in the instructions used seems to explain the difference in the running time between 'CPP BLOCK X 1' and 'MEMCPY 1 1'.
Ex.)
test_copy.cpp
void copy_baseline2(
const int* const __attribute__((aligned(64))) x,
int* const __attribute__((aligned(64))) y,
const int num )
{
# pragma unroll 8
for ( int i = 0; i < num; i++ ) {
y[i] = x[i];
}
}
test_copy.S (snippet)
.globl __Z14copy_baseline2PKiPii ; -- Begin function _Z14copy_baseline2PKiPii
...
LBB1_4: ; =>This Inner Loop Header: Depth=1
ldur w13, [x12, #-16]
stur w13, [x11, #-16]
ldur w13, [x12, #-12]
stur w13, [x11, #-12]
ldur w13, [x12, #-8]
stur w13, [x11, #-8]
ldur w13, [x12, #-4]
stur w13, [x11, #-4]
ldr w13, [x12]
str w13, [x11]
ldr w13, [x12, #4]
str w13, [x11, #4]
ldr w13, [x12, #8]
str w13, [x11, #8]
ldr w13, [x12, #12]
str w13, [x11, #12]
add x9, x9, #8 ; =8
add x11, x11, #32 ; =32
add x12, x12, #32 ; =32
cmp w10, w9
b.ne LBB1_4
Please see the Appendix below for further details.
Cache prefetch with '__builtin_prefetch()' does not seem to have a significant effect. The cache prefetch is described in the Arm C/C++ Compiler reference guide. It seems to emit the prfm machine instruction for a cache-line prefetch.
I have tried it in the body of the for-loop at various positions with various parameters, but no improvement has been observed. As far as the code in test_memcpy.cpp is concerned, no prfm instruction is emitted, and the optimized memcpy.S above does not use it either. I would appreciate any help on how to use the cache prefetch appropriately.
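The following is a minimal sketch of one of the ways __builtin_prefetch() was tried inside the copy loop; the prefetch distance of 16 elements and the locality hints are illustrative values, not tuned ones.
#include <cstddef>

void copy_with_prefetch( const int* const x, int* const y, const size_t num )
{
    for ( size_t i = 0; i < num; i++ ) {
        __builtin_prefetch( &x[ i + 16 ], 0, 0 ); // prefetch for read, low temporal locality
        __builtin_prefetch( &y[ i + 16 ], 1, 0 ); // prefetch for write, low temporal locality
        y[i] = x[i];
    }
}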
This section briefly describes each of the implementations tested, with some key points in the code. They are executed as part of the test program in test_memcpy.cpp.
The top-level object in the 'main()' function is TestExecutorMemcpy, which is a subclass of TestExecutor found in ../common/test_case_with_time_measurements.h. It manages a single test suite, which consists of test cases. It arranges the input data, allocates memory, executes each test case multiple times, measures the running times, cleans up, and reports the results.
Each implementation type is implemented as a TestCaseMemcpy, which is a subclass of TestCaseWithTimeMeasurements in ../common/test_case_with_time_measurements.h. The main part is implemented in TestCaseMemcpy::run(), and it is the subject of the running time measurements.
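The following is a simplified sketch, not the repo's actual classes, of the pattern described above: the executor repeatedly invokes each test case's run() and records the elapsed time. The names (TestCaseSketch, measure_mean_ms) are illustrative.
#include <chrono>

struct TestCaseSketch {
    virtual ~TestCaseSketch() = default;
    virtual void run() = 0; // the part whose running time is measured
};

// returns the mean running time of run() in milliseconds over num_trials calls
double measure_mean_ms( TestCaseSketch& tc, const int num_trials )
{
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for ( int i = 0; i < num_trials; i++ ) {
        tc.run();
    }
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>( t1 - t0 ).count() / num_trials;
}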
class TestCaseMemcpy_baseline in test_memcpy.cpp
This is a plain for-loop, and it generates code with ldp & stp with a loop unrolling factor of 4.
for ( size_t i = 0; i < NUM ; i++ ) {
out[i] = in[i];
}
class TestCaseMemcpy_loop_unrolled in test_memcpy.cpp
This is an attempt to increase the factor of loop unrolling. The main part of the implementation is the following function.
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
switch ( m_factor_loop_unrolling ) {
case 1:
// #pragma unroll 1
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
}
break;
case 2:
// #pragma unroll 2
for ( size_t i = elem_begin; i < elem_end_past_one; i+=2 ) {
const T* const __attribute__((aligned(8))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(8))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
}
break;
case 4:
// #pragma unroll 4
for ( size_t i = elem_begin; i < elem_end_past_one; i+=4 ) {
const T* const __attribute__((aligned(16))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(16))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
out[2] = in[2];
out[3] = in[3];
}
break;
case 8:
default:
// #pragma unroll 8
for ( size_t i = elem_begin; i < elem_end_past_one; i+=8 ) {
const T* const __attribute__((aligned(32))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(32))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
out[2] = in[2];
out[3] = in[3];
out[4] = in[4];
out[5] = in[5];
out[6] = in[6];
out[7] = in[7];
}
}
}
class TestCaseMemcpy_multithread in test_memcpy.cpp
This is a multithreaded version of 'CPP BLOCK X 1'. The worker threads are managed by ThreadSynchronizer. The overhead of synchronization is around 5.2 [μs] per iteration for 4 worker threads.
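The following is a minimal sketch of the condition-variable synchronization pattern; the names are illustrative, and unlike the worker threads reused across iterations in the actual test case, this sketch spawns the threads per call for brevity.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

void run_copy_blocks( const int* in, int* out, const size_t num, const int num_threads )
{
    std::mutex              mtx;
    std::condition_variable cv_start;
    std::condition_variable cv_done;
    bool                    start     = false;
    int                     remaining = num_threads;

    auto worker = [&]( const size_t begin, const size_t end ) {
        {   // wait until the main thread signals the start of the iteration
            std::unique_lock<std::mutex> lk( mtx );
            cv_start.wait( lk, [&]{ return start; } );
        }
        for ( size_t i = begin; i < end; i++ ) {
            out[i] = in[i];
        }
        {   // report completion back to the main thread
            std::lock_guard<std::mutex> lk( mtx );
            if ( --remaining == 0 ) {
                cv_done.notify_one();
            }
        }
    };

    std::vector<std::thread> threads;
    const size_t block = num / num_threads;
    for ( int t = 0; t < num_threads; t++ ) {
        const size_t begin = t * block;
        const size_t end   = ( t == num_threads - 1 ) ? num : ( begin + block );
        threads.emplace_back( worker, begin, end );
    }
    {   // release all the workers at once
        std::lock_guard<std::mutex> lk( mtx );
        start = true;
    }
    cv_start.notify_all();
    {   // wait until all the workers have finished their blocks
        std::unique_lock<std::mutex> lk( mtx );
        cv_done.wait( lk, [&]{ return remaining == 0; } );
    }
    for ( auto& th : threads ) {
        th.join();
    }
}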
class TestCaseMemcpy_memcpy in test_memcpy.cpp
This is just a call to the standard library function memcpy() as follows.
memcpy( this->m_out, this->m_in, sizeof(T) * this->m_num_elements );
class TestCaseMemcpy_memcpy_multithread in test_memcpy.cpp
This is a multithreaded version of 'MEMCPY 1 1'. The worker threads are managed by ThreadSynchronizer. The overhead of synchronization is around 5.2 [μs] per iteration for 4 worker threads. As the chart above shows, there does not seem to be any significant benefit in using multiple threads.
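The following is a minimal sketch of how the buffer can be split into consecutive blocks, each copied by memcpy() on its own thread; the name memcpy_split is illustrative, and the actual test case reuses its worker threads, while this sketch spawns one std::thread per block for brevity.
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

void memcpy_split( void* dst, const void* src, const size_t num_bytes, const int num_threads )
{
    std::vector<std::thread> threads;
    const size_t block = num_bytes / num_threads;
    for ( int t = 0; t < num_threads; t++ ) {
        const size_t offset = t * block;
        const size_t len    = ( t == num_threads - 1 ) ? ( num_bytes - offset ) : block;
        // each thread copies one consecutive block
        threads.emplace_back( [=]{
            memcpy( static_cast<char*>(dst) + offset,
                    static_cast<const char*>(src) + offset,
                    len );
        } );
    }
    for ( auto& th : threads ) {
        th.join();
    }
}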
class TestCaseMemcpy_metal_kernel in test_memcpy.cpp
This is a Metal Compute kernel found in metal/memcpy.metal.
kernel void my_memcpy(
device const int* in [[ buffer(0) ]],
device int* out [[ buffer(1) ]],
device const memcpy_constants& c [[ buffer(2) ]],
const uint thread_position_in_grid [[ thread_position_in_grid ]]
) {
if ( thread_position_in_grid < c.num_elements ) {
out[ thread_position_in_grid ] = in[ thread_position_in_grid ];
}
}
This kernel is launched by metal/memcpy_metal_objc.h and metal/memcpy_metal_objc.mm. Please see performComputationKernel.
If the MTLBuffers are managed, instead of shared, then they are explicitly synchronized as follows:
- CPU->GPU
[_mIn didModifyRange: NSMakeRange(0, _mNumElementsInt * sizeof(int) ) ];
[_mConst didModifyRange: NSMakeRange(0, sizeof( struct memcpy_constants) ) ];
- GPU->CPU
id<MTLBlitCommandEncoder> blitEncoder = [ commandBuffer blitCommandEncoder ];
[ blitEncoder synchronizeResource:_mOut ];
[ blitEncoder endEncoding ];
The number of threads per thread-group is aligned at 32, with minimum 32 and maximum 1024. The number of thread-groups per grid is ⌈|V| / 1024⌉, where |V| denotes the number of ints copied.
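The following is a minimal sketch of that dispatch-size arithmetic written as plain C++; num_elements corresponds to |V|, the number of ints copied, and the names are illustrative.
#include <algorithm>
#include <cstddef>

struct DispatchSizes {
    size_t threads_per_threadgroup;
    size_t threadgroups_per_grid;
};

DispatchSizes dispatch_sizes( const size_t num_elements )
{
    // round the element count up to a multiple of 32, then clamp to [32, 1024]
    const size_t aligned = ( ( num_elements + 31 ) / 32 ) * 32;
    const size_t threads = std::min<size_t>( 1024, std::max<size_t>( 32, aligned ) );
    // ceil( |V| / 1024 )
    const size_t groups  = ( num_elements + 1023 ) / 1024;
    return DispatchSizes{ threads, groups };
}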
class TestCaseMemcpy_metal_blit in test_memcpy.cpp
This is similar to METAL DEFAULT_SHARED & DEFAULT_MANAGED, but uses the built-in function blitCommandEncoder instead.
id<MTLBlitCommandEncoder> blitEncoder = [ commandBuffer blitCommandEncoder ];
[ blitEncoder copyFromBuffer: _mIn
sourceOffset: 0
toBuffer: _mOut
destinationOffset: 0
size: sizeof(int)*_mNumElementsInt ];
Please see performComputationBlit: in metal/memcpy_metal_objc.h and metal/memcpy_metal_objc.mm.
This section logs my attempts to bring the C++ code closer to the performance of memcpy().
The machine code shown below is disassembled by otool -t -v bin/test_memcpy.
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
The corresponding code is as follows. As you can see, ldp & stp are used and the loop is unrolled by a factor of 4.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
00000001000073d4 ldp q0, q1, [x13, #-0x20]
00000001000073d8 ldp q2, q3, [x13], #0x40
00000001000073dc stp q0, q1, [x12, #-0x20]
00000001000073e0 stp q2, q3, [x12], #0x40
00000001000073e4 subs x14, x14, #0x10
00000001000073e8 b.ne 0x1000073d4
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma nounroll
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
The code does not change.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
00000001000073d4 ldp q0, q1, [x13, #-0x20]
00000001000073d8 ldp q2, q3, [x13], #0x40
00000001000073dc stp q0, q1, [x12, #-0x20]
00000001000073e0 stp q2, q3, [x12], #0x40
00000001000073e4 subs x14, x14, #0x10
00000001000073e8 b.ne 0x1000073d4
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 1
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
The generated code is still the same as the default.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
00000001000073d4 ldp q0, q1, [x13, #-0x20]
00000001000073d8 ldp q2, q3, [x13], #0x40
00000001000073dc stp q0, q1, [x12, #-0x20]
00000001000073e0 stp q2, q3, [x12], #0x40
00000001000073e4 subs x14, x14, #0x10
00000001000073e8 b.ne 0x1000073d4
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 2
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
The generated code changes. It no longer uses ldp & stp. Instead it uses ld[u]r & st[u]r.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
000000010000727c ldur w11, [x9, #-0x4]
0000000100007280 stur w11, [x10, #-0x4]
0000000100007284 ldr w11, [x9], #0x8
0000000100007288 str w11, [x10], #0x8
000000010000728c add x12, x12, #0x2
0000000100007290 cmp x12, x8
0000000100007294 b.lo 0x10000727c
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 2
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
}
The code still uses ld[u]r & st[u]r.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
000000010000727c ldur w11, [x9, #-0x4]
0000000100007280 stur w11, [x10, #-0x4]
0000000100007284 ldr w11, [x9], #0x8
0000000100007288 str w11, [x10], #0x8
000000010000728c add x12, x12, #0x2
0000000100007290 cmp x12, x8
0000000100007294 b.lo 0x10000727c
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
for ( size_t i = elem_begin; i < elem_end_past_one; i+=2 ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
}
The code still uses ld[u]r & st[u]r.
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
000000010000747c lsl x12, x9, #2
0000000100007480 add x13, x10, x12
0000000100007484 add x12, x11, x12
0000000100007488 ldr w14, [x13]
000000010000748c str w14, [x12]
0000000100007490 ldr w13, [x13, #0x4]
0000000100007494 str w13, [x12, #0x4]
0000000100007498 add x9, x9, #0x2
000000010000749c cmp x9, x8
00000001000074a0 b.lo 0x10000747c
...
The following code snippets show the C++ code for loop unrolling factors of 4 and 8. The generated code does not use ldp & stp.
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 4
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
0000000100007200 ldur w11, [x9, #-0x8]
0000000100007204 stur w11, [x10, #-0x8]
0000000100007208 ldur w11, [x9, #-0x4]
000000010000720c stur w11, [x10, #-0x4]
0000000100007210 ldr w11, [x9]
0000000100007214 str w11, [x10]
0000000100007218 ldr w11, [x9, #0x4]
000000010000721c str w11, [x10, #0x4]
0000000100007220 add x13, x13, #0x4
0000000100007224 add x10, x10, #0x10
0000000100007228 add x9, x9, #0x10
000000010000722c cmp x13, x8
0000000100007230 b.lo 0x100007200
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
# pragma unroll 4
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
0000000100007200 ldur w11, [x9, #-0x8]
0000000100007204 stur w11, [x10, #-0x8]
0000000100007208 ldur w11, [x9, #-0x4]
000000010000720c stur w11, [x10, #-0x4]
0000000100007210 ldr w11, [x9]
0000000100007214 str w11, [x10]
0000000100007218 ldr w11, [x9, #0x4]
000000010000721c str w11, [x10, #0x4]
0000000100007220 add x13, x13, #0x4
0000000100007224 add x10, x10, #0x10
0000000100007228 add x9, x9, #0x10
000000010000722c cmp x13, x8
0000000100007230 b.lo 0x100007200
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
for ( size_t i = elem_begin; i < elem_end_past_one; i+=4 ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
out[2] = in[2];
out[3] = in[3];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
0000000100007458 ldur w12, [x10, #-0x8]
000000010000745c stur w12, [x11, #-0x8]
0000000100007460 ldur w12, [x10, #-0x4]
0000000100007464 stur w12, [x11, #-0x4]
0000000100007468 ldr w12, [x10]
000000010000746c str w12, [x11]
0000000100007470 ldr w12, [x10, #0x4]
0000000100007474 str w12, [x11, #0x4]
0000000100007478 add x9, x9, #0x4
000000010000747c add x10, x10, #0x10
0000000100007480 add x11, x11, #0x10
0000000100007484 cmp x9, x8
0000000100007488 b.lo 0x100007458
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 8
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
this->m_out_aligned [i] = this->m_in_aligned [i];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
00000001000073cc ldur w12, [x11, #-0x10]
00000001000073d0 stur w12, [x10, #-0x10]
00000001000073d4 ldur w12, [x11, #-0xc]
00000001000073d8 stur w12, [x10, #-0xc]
00000001000073dc ldur w12, [x11, #-0x8]
00000001000073e0 stur w12, [x10, #-0x8]
00000001000073e4 ldur w12, [x11, #-0x4]
00000001000073e8 stur w12, [x10, #-0x4]
00000001000073ec ldr w12, [x11]
00000001000073f0 str w12, [x10]
00000001000073f4 ldr w12, [x11, #0x4]
00000001000073f8 str w12, [x10, #0x4]
00000001000073fc ldr w12, [x11, #0x8]
0000000100007400 str w12, [x10, #0x8]
0000000100007404 ldr w12, [x11, #0xc]
0000000100007408 str w12, [x10, #0xc]
000000010000740c add x9, x9, #0x8
0000000100007410 add x10, x10, #0x20
0000000100007414 add x11, x11, #0x20
0000000100007418 cmp x9, x8
000000010000741c b.lo 0x1000073cc
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
#pragma unroll 8
for ( size_t i = elem_begin; i < elem_end_past_one; i++ ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
00000001000073cc ldur w12, [x11, #-0x10]
00000001000073d0 stur w12, [x10, #-0x10]
00000001000073d4 ldur w12, [x11, #-0xc]
00000001000073d8 stur w12, [x10, #-0xc]
00000001000073dc ldur w12, [x11, #-0x8]
00000001000073e0 stur w12, [x10, #-0x8]
00000001000073e4 ldur w12, [x11, #-0x4]
00000001000073e8 stur w12, [x10, #-0x4]
00000001000073ec ldr w12, [x11]
00000001000073f0 str w12, [x10]
00000001000073f4 ldr w12, [x11, #0x4]
00000001000073f8 str w12, [x10, #0x4]
00000001000073fc ldr w12, [x11, #0x8]
0000000100007400 str w12, [x10, #0x8]
0000000100007404 ldr w12, [x11, #0xc]
0000000100007408 str w12, [x10, #0xc]
000000010000740c add x9, x9, #0x8
0000000100007410 add x10, x10, #0x20
0000000100007414 add x11, x11, #0x20
0000000100007418 cmp x9, x8
000000010000741c b.lo 0x1000073cc
...
inline void process_block(const int elem_begin, const int elem_end_past_one ) {
for ( size_t i = elem_begin; i < elem_end_past_one; i+=8 ) {
const T* const __attribute__((aligned(4))) in = &(this->m_in_aligned [i]);
T* const __attribute__((aligned(4))) out = &(this->m_out_aligned[i]);
out[0] = in[0];
out[1] = in[1];
out[2] = in[2];
out[3] = in[3];
out[4] = in[4];
out[5] = in[5];
out[6] = in[6];
out[7] = in[7];
}
__ZN28TestCaseMemcpy_loop_unrolledIiE3runEv:
...
0000000100007418 ldur w12, [x10, #-0x10]
000000010000741c stur w12, [x11, #-0x10]
0000000100007420 ldur w12, [x10, #-0xc]
0000000100007424 stur w12, [x11, #-0xc]
0000000100007428 ldur w12, [x10, #-0x8]
000000010000742c stur w12, [x11, #-0x8]
0000000100007430 ldur w12, [x10, #-0x4]
0000000100007434 stur w12, [x11, #-0x4]
0000000100007438 ldr w12, [x10]
000000010000743c str w12, [x11]
0000000100007440 ldr w12, [x10, #0x4]
0000000100007444 str w12, [x11, #0x4]
0000000100007448 ldr w12, [x10, #0x8]
000000010000744c str w12, [x11, #0x8]
0000000100007450 ldr w12, [x10, #0xc]
0000000100007454 str w12, [x11, #0xc]
0000000100007458 add x9, x9, #0x8
000000010000745c add x10, x10, #0x20
0000000100007460 add x11, x11, #0x20
0000000100007464 cmp x9, x8
0000000100007468 b.lo 0x100007418
...