Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ffts_real only works for single-precision #59

Open
danra opened this issue Jul 3, 2017 · 12 comments
Open

ffts_real only works for single-precision #59

danra opened this issue Jul 3, 2017 · 12 comments

Comments

@danra
Copy link

danra commented Jul 3, 2017

Should also support double-precision

@mcourteaux
Copy link

mcourteaux commented Jul 11, 2017

I also get a segfault when trying to use the:

p = ffts_init_2d_real(n, n, sign);

plan, no matter how big I allocate these buffers.

@mcourteaux
Copy link

I modified the test.c file to use the plan ffts_init_2d_real.

martijn@Martijns-MacBook-Pro ~/ffts/build (master)$ lldb -o r -- ./ffts_test 512 -1
(lldb) target create "./ffts_test"
Current executable set to './ffts_test' (x86_64).
(lldb) settings set -- target.run-args  "512" "-1"
(lldb) r
ffts_test was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 56122 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
   187 	            __m128 t3 = _mm_load_ps(A + i);
   188 	            __m128 t4 = _mm_load_ps(B + i);
   189
-> 190 	            _mm_store_ps(out + i, _mm_add_ps(_mm_addsub_ps(
   191 	                _mm_mul_ps(t1, _mm_moveldup_ps(t3)),
   192 	                _mm_mul_ps(_mm_shuffle_ps(t1, t1, _MM_SHUFFLE(2,3,0,1)),
   193 	                _mm_movehdup_ps(t3))), _mm_addsub_ps(

Process 56122 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
    frame #1: 0x0000000100003000 ffts_test`ffts_execute_nd_real(p=0x0000000100600140, in=0x0000000101900800, out=0x0000000101b00000) at ffts_real_nd.c:96 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

@linkotec
Copy link
Contributor

Try replacing _mm_store_ps with _mm_storeu_ps as there is no guarantee that the memory is correctly aligned.

@mcourteaux
Copy link

Check the pointers: in=0x0000000101900800, out=0x0000000101b00000. Looks fairly aligned to me. Are you sure your comment still holds?

@mcourteaux
Copy link

mcourteaux commented Jul 11, 2017

However, I tried what you said, and now it crashes here:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72  	                   op[x*TSIZE] = ip[x];
   73  	                */
   74  	                __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75  	                __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76  	                __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77  	                __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78  	                __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 56308 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
    frame #1: 0x0000000100003032 ffts_test`ffts_execute_nd_real(p=0x0000000100502b20, in=<unavailable>, out=0x0000000101200000) at ffts_real_nd.c:99 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

The value of w seems to be 257 (which looks like 2^8+1), and looks indeed something that screws with the alignment required.

@mcourteaux
Copy link

Also, the hardcoded double datatype cast for a "single-precision" compiled library looks odd. I am a complete newbie to this library tho.

@linkotec
Copy link
Contributor

linkotec commented Jul 11, 2017

I have done very little testing with 2D or higher ranking transforms. But most of problems are related with alignment, as intrinsics used require strict 128 bit align.

Double datatype casting is used here to simplify following _mm_shuffle_pd instrinsics. We could have used _mm_load_ps (or _mm_loadu_ps) and then cast using _mm_castps_pd to _m128d datatype,

You could test with my benchFFTS, https://github.com/linkotec/benchFFTS. It supports tests for 2D, for example "bench_ffts -s ocb512x128"

@mcourteaux
Copy link

mcourteaux commented Jul 12, 2017

Some thing of course: (note the orb vs ocb)

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s ocb512x128
Problem: ocb512x128, setup: 317.00 us, time: 410.53 us, ``mflops'': 12771
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s orb512x128
[1]    93048 segmentation fault  ./bench_ffts -s orb512x128
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$

LLDB gives the same as previously:

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ lldb -o run -- ./bench_ffts -s orb512x128                                                                  [ruby-2.0.0p648]
(lldb) target create "./bench_ffts"
Current executable set to './bench_ffts' (x86_64).
(lldb) settings set -- target.run-args  "-s" "orb512x128"
(lldb) run
bench_ffts was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 93163 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000001000036ce bench_ffts`ffts_transpose(in=<unavailable>, out=0x0000000101042000, w=65, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72  	                   op[x*TSIZE] = ip[x];
   73  	                */
   74  	                __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75  	                __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76  	                __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77  	                __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78  	                __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 93163 launched: './bench_ffts' (x86_64)
(lldb)

@mcourteaux
Copy link

It looks like the problem is due to the "half spectrum" thing, which causes you to have N/2 + 1 coefficients. The constant coefficient is the one causing the +1 I believe. This is why the w is 64+1 = 65, and the transpose failing on this misaligned data. Note that I did not investigate this, but this is just what pops into my head, with my knowledge of fourier transforms.

@jtojanen
Copy link

I think you have found the problem. At the moment I don't have the time to fix it. One option is that you use 2D complex transform with imaginary part set to zero, so all input value are real numbers. It is slower but at least it works.

@mcourteaux
Copy link

mcourteaux commented Jul 12, 2017

Nice! 😄 Well, speed is a very important aspect of my application. So for now, I will use the complex transform. Once fixed, I might change it back to the real_to_complex_2d transform. However, if you believe the fix is fairly easy and can outline it, I might consider trying to fix this (with PR off course).

@emmenlau
Copy link

emmenlau commented Mar 9, 2022

Hi, did anyone work on this? I'd like to use the 2d real transform too, would be great if it works...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants