Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize audio node processing for simd avoiding bytemuck #53

Closed

Conversation

rhizoome
Copy link
Contributor

@rhizoome rhizoome commented Sep 7, 2024

EDIT: Although I do not like how the code looks, it gives a 10% to 22% performance improvement. Maybe you could give it a thought. Also as far as I can tell, the results are still correct, although the tests do not like it.

I was confused by the difference in execution-profile of tick() and process() based AudioNodes . So I looked deeper into the profile: The largest identifiable chunk that was spent outside process() calculating things, was 13% in a crate called bytemuck. I found out that this happens, because you need some tricks to access a f32 in a avx- or neon-value directly. So I thought what happens if we just access the complete data directly in one go? It turns out in that case it is just a transmute(), which should be a no-op.

As usual with optimisation, I am not sure if the ugly code is actually worth it. So please feel free to reject or heavily modify what I did. It is just a proposal.

tick()-based

optimized » /usr/bin/time target/release/fundsp_player bank_simd dummy --seconds 4000
        2.42 real         2.41 user         0.00 sys

original » /usr/bin/time target/release/fundsp_player bank_simd dummy --seconds 4000
        2.67 real         2.65 user         0.01 sys

-> 10% (SIMD in resonator bank)

optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 4000
        7.40 real         7.37 user         0.02 sys
        
original » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 4000
        9.57 real         9.53 user         0.03 sys
        
-> 22% (Normal resonator bank)

process()-based

optimized » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000
       10.53 real        10.45 user         0.06 sys
       
original » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000
       10.58 real        10.49 user         0.06 sys
       
-> 0.005% (as expected not much difference)

If you ignore the loop-unrolling part of the patch. Using uninitialised temp storage might still be interesting. It is nicer code, but only contributes about 1%. Different to the code for uninitialised BufferArray, I wanted the unsafe to be seen at the call site. Because if you use such a frame as input for calculations, the program will crash. This is true for the BufferArray case, too, so making it an unsafe function might be a good idea.

@rhizoome rhizoome marked this pull request as draft September 7, 2024 21:54
@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 7, 2024

I broke something.

image

It cannot the be too bad though, my examples still produce the correct spectrums.

@rhizoome rhizoome force-pushed the unrolled-loop-audio-node-tick branch 2 times, most recently from 7ef142f to 5f775dc Compare September 7, 2024 21:57
@rhizoome rhizoome marked this pull request as ready for review September 8, 2024 07:17
@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 8, 2024

I think it might be better to optimize this at the leaf-node of the graph, since there we know most of the time how many channels there are. I will try that tomorrow.

@SamiPerttu
Copy link
Owner

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 8, 2024

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

Yes on arm. I‘ll test my example on x86, to be sure. I was surprised that at_f32() has an overhead, maybe that is an arm thing.

@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 8, 2024

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

On x86 I have 20%.

This is the code I am running:

fn build_bank_current() -> impl AudioUnit {
    (noise()
        >> split()
        >> (resonator_hz(440.0, 50.0)
            | resonator_hz(440.0 * 2.0, 50.0)
            | resonator_hz(440.0 * 3.0, 50.0)
            | resonator_hz(440.0 * 4.0, 50.0)
            | resonator_hz(440.0 * 5.0, 50.0)
            | resonator_hz(440.0 * 6.0, 50.0)
            | resonator_hz(440.0 * 7.0, 50.0)
            | resonator_hz(440.0 * 8.0, 50.0))
        >> join())
        * 0.1
}

And this is how I am running it:

pub fn dummy(seconds: u32, build_name: &str) {
    let sink: [f32; 8] = [0.0; 8];
    let sink_ptr = sink.as_ptr() as *mut [f32; 8];
    let mut runner = Runner::new(build_name);
    let mut backend = runner.backend();
    let count: usize = SAMPLE_RATE as usize / MAX_BUFFER_SIZE * 2 * seconds as usize;
    for _ in 0..count {
        backend.process();
        // Avoid optimizations
        for wide in 0..BUFFER_LEN {
            for channel in 0..1 {
                let wide_buf = backend.buffer.at(channel, wide).to_array();
                unsafe {
                    ptr::write_volatile(sink_ptr, wide_buf);
                }
            }
        }
    }
}

Maybe I did something stupid. Here is the complete code: https://github.com/rhizoome/fundsp_player

@SamiPerttu
Copy link
Owner

I added this to 0.19.1.

@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 11, 2024

Thanks for trying it out! As promised I looked into optimising at the leaf:

It is even more efficient than the root optimisation, but the code is only nice for the trivial-case of one INPUT/OUTPUT. While looking around I saw that even the fn process() in many leaf AudioUnits uses at_f32(). So I think this means:

  • First we probably have to understand what is going on. I still can't believe that at_f32() has actually an overhead. Do I just force the CPU to fill the cache correctly? On the other hand, the fact that the optimisation is working even better in the one INPUT/OUTPUT case at the leaf implicates that at_f32() really has an overhead. But maybe it is implemented badly in wide or bytemuck?
    • Can we try to use unsafe in at_f32() and just force the f32 out of the register-value? (Just to compare performance, not as a solution) <---- [That is the next thing I try]
  • Maybe we can come up with a way to remove this impedance mismatch* (I currently can't) and I would help cleaning it up
  • Or I would just build my system and then optimise the things I use and try to upstream the optimisation on a case-by-case basis
    • Which probably means: making generic helper that does conversion from frame "by" channel to channel "by" frame and back

* The fact that because of f32x8 the memory is not ordered correctly for processing.

optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
        3.48 real         3.47 user         0.00 sys
        
not » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
        5.04 real         4.77 user         0.01 sys

-> 30%

diff --git a/src/biquad.rs b/src/biquad.rs
index 8b0d0b7a..1cc421b2 100644
--- a/src/biquad.rs
+++ b/src/biquad.rs
@@ -7,6 +7,7 @@

 use super::audionode::*;
 use super::math::*;
+use super::prelude::{BufferMut, BufferRef};
 use super::setting::*;
 use super::shape::*;
 use super::signal::*;
@@ -162,6 +163,19 @@ impl<F: Real> Biquad<F> {
     pub fn set_coefs(&mut self, coefs: BiquadCoefs<F>) {
         self.coefs = coefs;
     }
+
+    #[inline]
+    fn _tick(&mut self, x0: f32) -> f32 {
+        let x0 = convert(x0);
+        let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
+            - self.coefs.a1 * self.y1
+            - self.coefs.a2 * self.y2;
+        self.x2 = self.x1;
+        self.x1 = x0;
+        self.y2 = self.y1;
+        self.y1 = y0;
+        F::to_f32(y0)
+    }
 }

 impl<F: Real> AudioNode for Biquad<F> {
@@ -183,16 +197,25 @@ impl<F: Real> AudioNode for Biquad<F> {
     #[inline]
     fn tick(&mut self, input: &Frame<f32, Self::Inputs>) -> Frame<f32, Self::Outputs> {
         let x0 = convert(input[0]);
-        let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
-            - self.coefs.a1 * self.y1
-            - self.coefs.a2 * self.y2;
-        self.x2 = self.x1;
-        self.x1 = x0;
-        self.y2 = self.y1;
-        self.y1 = y0;
+        let y0 = self._tick(x0);
         [convert(y0)].into()
     }

+    fn process(&mut self, size: usize, input: &BufferRef, output: &mut BufferMut) {
+        for i in 0..full_simd_items(size) {
+            let simd_in = input.at(0, i);
+            let mut simd_out = output.at(0, i);
+            let array_in = simd_in.as_array_ref();
+            let array_out = simd_out.as_array_mut();
+            for x in 0..array_in.len() {
+                array_out[x] = self._tick(array_in[x]);
+            }
+        }
+
+        self.process_remainder(size, input, output);
+    }
+

@rhizoome
Copy link
Contributor Author

rhizoome commented Sep 11, 2024

I did the test of bypassing bytemuck there is no performance difference!!

-> So it is solving a cache issue!!

I do not like to solve cache issues in a multi-platform library like that. Because the advantage we have on desktop-computers might very well be a problem on embedded MCUs like cortex without a cache. I can test on cortex-M0, maybe llvm is intelligent enough to remove the copying, if not I am pretty sure the code in this PR is worse.

That is probably the reason why you did not see a difference, you had a more mixed load. I just run that one resonator-bank.

Solving impedance mismatch on the other hand would not harm MCUs. Let me know what you think! Do not feel obligated to do anything, I should build my complete system and then optimise. I was just doing this to get a feel for the performance of the resonator-bank, because that is definitely a core feature. I think I now know the numbers and whether this is optimisation is kept is not very important. I can always patch it on builds for desktop-computers. I am not afraid of patches.

@SamiPerttu
Copy link
Owner

Closing this. I decided for the time being to revert back to the earlier method. It was worth a shot, though.

@SamiPerttu SamiPerttu closed this Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants