ffts_real only works for single-precision #59

danra · 2017-07-03T13:46:02Z

Should also support double-precision

mcourteaux · 2017-07-11T11:33:08Z

I also get a segfault when trying to use the:

p = ffts_init_2d_real(n, n, sign);

plan, no matter how big I allocate these buffers.

mcourteaux · 2017-07-11T11:40:34Z

I modified the test.c file to use the plan ffts_init_2d_real.

martijn@Martijns-MacBook-Pro ~/ffts/build (master)$ lldb -o r -- ./ffts_test 512 -1
(lldb) target create "./ffts_test"
Current executable set to './ffts_test' (x86_64).
(lldb) settings set -- target.run-args  "512" "-1"
(lldb) r
ffts_test was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 56122 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
   187 	            __m128 t3 = _mm_load_ps(A + i);
   188 	            __m128 t4 = _mm_load_ps(B + i);
   189
-> 190 	            _mm_store_ps(out + i, _mm_add_ps(_mm_addsub_ps(
   191 	                _mm_mul_ps(t1, _mm_moveldup_ps(t3)),
   192 	                _mm_mul_ps(_mm_shuffle_ps(t1, t1, _MM_SHUFFLE(2,3,0,1)),
   193 	                _mm_movehdup_ps(t3))), _mm_addsub_ps(

Process 56122 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
    frame #1: 0x0000000100003000 ffts_test`ffts_execute_nd_real(p=0x0000000100600140, in=0x0000000101900800, out=0x0000000101b00000) at ffts_real_nd.c:96 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

linkotec · 2017-07-11T11:46:45Z

Try replacing _mm_store_ps with _mm_storeu_ps as there is no guarantee that the memory is correctly aligned.

mcourteaux · 2017-07-11T11:48:50Z

Check the pointers: in=0x0000000101900800, out=0x0000000101b00000. Looks fairly aligned to me. Are you sure your comment still holds?

mcourteaux · 2017-07-11T11:52:11Z

However, I tried what you said, and now it crashes here:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72  	                   op[x*TSIZE] = ip[x];
   73  	                */
   74  	                __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75  	                __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76  	                __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77  	                __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78  	                __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 56308 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
    frame #1: 0x0000000100003032 ffts_test`ffts_execute_nd_real(p=0x0000000100502b20, in=<unavailable>, out=0x0000000101200000) at ffts_real_nd.c:99 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

The value of w seems to be 257 (which looks like 2^8+1), and looks indeed something that screws with the alignment required.

mcourteaux · 2017-07-11T11:56:56Z

Also, the hardcoded double datatype cast for a "single-precision" compiled library looks odd. I am a complete newbie to this library tho.

linkotec · 2017-07-11T13:02:35Z

I have done very little testing with 2D or higher ranking transforms. But most of problems are related with alignment, as intrinsics used require strict 128 bit align.

Double datatype casting is used here to simplify following _mm_shuffle_pd instrinsics. We could have used _mm_load_ps (or _mm_loadu_ps) and then cast using _mm_castps_pd to _m128d datatype,

You could test with my benchFFTS, https://github.com/linkotec/benchFFTS. It supports tests for 2D, for example "bench_ffts -s ocb512x128"

mcourteaux · 2017-07-12T12:32:11Z

Some thing of course: (note the orb vs ocb)

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s ocb512x128
Problem: ocb512x128, setup: 317.00 us, time: 410.53 us, ``mflops'': 12771
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s orb512x128
[1]    93048 segmentation fault  ./bench_ffts -s orb512x128
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$

LLDB gives the same as previously:

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ lldb -o run -- ./bench_ffts -s orb512x128                                                                  [ruby-2.0.0p648]
(lldb) target create "./bench_ffts"
Current executable set to './bench_ffts' (x86_64).
(lldb) settings set -- target.run-args  "-s" "orb512x128"
(lldb) run
bench_ffts was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 93163 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000001000036ce bench_ffts`ffts_transpose(in=<unavailable>, out=0x0000000101042000, w=65, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72  	                   op[x*TSIZE] = ip[x];
   73  	                */
   74  	                __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75  	                __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76  	                __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77  	                __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78  	                __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 93163 launched: './bench_ffts' (x86_64)
(lldb)

mcourteaux · 2017-07-12T12:36:06Z

It looks like the problem is due to the "half spectrum" thing, which causes you to have N/2 + 1 coefficients. The constant coefficient is the one causing the +1 I believe. This is why the w is 64+1 = 65, and the transpose failing on this misaligned data. Note that I did not investigate this, but this is just what pops into my head, with my knowledge of fourier transforms.

jtojanen · 2017-07-12T12:40:59Z

I think you have found the problem. At the moment I don't have the time to fix it. One option is that you use 2D complex transform with imaginary part set to zero, so all input value are real numbers. It is slower but at least it works.

mcourteaux · 2017-07-12T12:58:49Z

Nice! 😄 Well, speed is a very important aspect of my application. So for now, I will use the complex transform. Once fixed, I might change it back to the real_to_complex_2d transform. However, if you believe the fix is fairly easy and can outline it, I might consider trying to fix this (with PR off course).

emmenlau · 2022-03-09T18:16:49Z

Hi, did anyone work on this? I'd like to use the 2d real transform too, would be great if it works...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ffts_real only works for single-precision #59

ffts_real only works for single-precision #59

danra commented Jul 3, 2017

mcourteaux commented Jul 11, 2017 •

edited

Loading

mcourteaux commented Jul 11, 2017

linkotec commented Jul 11, 2017

mcourteaux commented Jul 11, 2017

mcourteaux commented Jul 11, 2017 •

edited

Loading

mcourteaux commented Jul 11, 2017

linkotec commented Jul 11, 2017 •

edited

Loading

mcourteaux commented Jul 12, 2017 •

edited

Loading

mcourteaux commented Jul 12, 2017

jtojanen commented Jul 12, 2017

mcourteaux commented Jul 12, 2017 •

edited

Loading

emmenlau commented Mar 9, 2022

ffts_real only works for single-precision #59

ffts_real only works for single-precision #59

Comments

danra commented Jul 3, 2017

mcourteaux commented Jul 11, 2017 • edited Loading

mcourteaux commented Jul 11, 2017

linkotec commented Jul 11, 2017

mcourteaux commented Jul 11, 2017

mcourteaux commented Jul 11, 2017 • edited Loading

mcourteaux commented Jul 11, 2017

linkotec commented Jul 11, 2017 • edited Loading

mcourteaux commented Jul 12, 2017 • edited Loading

mcourteaux commented Jul 12, 2017

jtojanen commented Jul 12, 2017

mcourteaux commented Jul 12, 2017 • edited Loading

emmenlau commented Mar 9, 2022

mcourteaux commented Jul 11, 2017 •

edited

Loading

mcourteaux commented Jul 11, 2017 •

edited

Loading

linkotec commented Jul 11, 2017 •

edited

Loading

mcourteaux commented Jul 12, 2017 •

edited

Loading

mcourteaux commented Jul 12, 2017 •

edited

Loading