-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize audio node processing for simd avoiding bytemuck #53
Conversation
7ef142f
to
5f775dc
Compare
I think it might be better to optimize this at the leaf-node of the graph, since there we know most of the time how many channels there are. I will try that tomorrow. |
The code is correct if you add a call to |
Yes on arm. I‘ll test my example on x86, to be sure. I was surprised that at_f32() has an overhead, maybe that is an arm thing. |
On x86 I have 20%. This is the code I am running: fn build_bank_current() -> impl AudioUnit {
(noise()
>> split()
>> (resonator_hz(440.0, 50.0)
| resonator_hz(440.0 * 2.0, 50.0)
| resonator_hz(440.0 * 3.0, 50.0)
| resonator_hz(440.0 * 4.0, 50.0)
| resonator_hz(440.0 * 5.0, 50.0)
| resonator_hz(440.0 * 6.0, 50.0)
| resonator_hz(440.0 * 7.0, 50.0)
| resonator_hz(440.0 * 8.0, 50.0))
>> join())
* 0.1
} And this is how I am running it: pub fn dummy(seconds: u32, build_name: &str) {
let sink: [f32; 8] = [0.0; 8];
let sink_ptr = sink.as_ptr() as *mut [f32; 8];
let mut runner = Runner::new(build_name);
let mut backend = runner.backend();
let count: usize = SAMPLE_RATE as usize / MAX_BUFFER_SIZE * 2 * seconds as usize;
for _ in 0..count {
backend.process();
// Avoid optimizations
for wide in 0..BUFFER_LEN {
for channel in 0..1 {
let wide_buf = backend.buffer.at(channel, wide).to_array();
unsafe {
ptr::write_volatile(sink_ptr, wide_buf);
}
}
}
}
} Maybe I did something stupid. Here is the complete code: https://github.com/rhizoome/fundsp_player |
I added this to 0.19.1. |
Thanks for trying it out! As promised I looked into optimising at the leaf: It is even more efficient than the root optimisation, but the code is only nice for the trivial-case of
optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
3.48 real 3.47 user 0.00 sys
not » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
5.04 real 4.77 user 0.01 sys -> 30% diff --git a/src/biquad.rs b/src/biquad.rs
index 8b0d0b7a..1cc421b2 100644
--- a/src/biquad.rs
+++ b/src/biquad.rs
@@ -7,6 +7,7 @@
use super::audionode::*;
use super::math::*;
+use super::prelude::{BufferMut, BufferRef};
use super::setting::*;
use super::shape::*;
use super::signal::*;
@@ -162,6 +163,19 @@ impl<F: Real> Biquad<F> {
pub fn set_coefs(&mut self, coefs: BiquadCoefs<F>) {
self.coefs = coefs;
}
+
+ #[inline]
+ fn _tick(&mut self, x0: f32) -> f32 {
+ let x0 = convert(x0);
+ let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
+ - self.coefs.a1 * self.y1
+ - self.coefs.a2 * self.y2;
+ self.x2 = self.x1;
+ self.x1 = x0;
+ self.y2 = self.y1;
+ self.y1 = y0;
+ F::to_f32(y0)
+ }
}
impl<F: Real> AudioNode for Biquad<F> {
@@ -183,16 +197,25 @@ impl<F: Real> AudioNode for Biquad<F> {
#[inline]
fn tick(&mut self, input: &Frame<f32, Self::Inputs>) -> Frame<f32, Self::Outputs> {
let x0 = convert(input[0]);
- let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
- - self.coefs.a1 * self.y1
- - self.coefs.a2 * self.y2;
- self.x2 = self.x1;
- self.x1 = x0;
- self.y2 = self.y1;
- self.y1 = y0;
+ let y0 = self._tick(x0);
[convert(y0)].into()
}
+ fn process(&mut self, size: usize, input: &BufferRef, output: &mut BufferMut) {
+ for i in 0..full_simd_items(size) {
+ let simd_in = input.at(0, i);
+ let mut simd_out = output.at(0, i);
+ let array_in = simd_in.as_array_ref();
+ let array_out = simd_out.as_array_mut();
+ for x in 0..array_in.len() {
+ array_out[x] = self._tick(array_in[x]);
+ }
+ }
+
+ self.process_remainder(size, input, output);
+ }
+ |
I did the test of bypassing bytemuck there is no performance difference!! -> So it is solving a cache issue!! I do not like to solve cache issues in a multi-platform library like that. Because the advantage we have on desktop-computers might very well be a problem on embedded MCUs like cortex without a cache. I can test on cortex-M0, maybe llvm is intelligent enough to remove the copying, if not I am pretty sure the code in this PR is worse. That is probably the reason why you did not see a difference, you had a more mixed load. I just run that one resonator-bank. Solving impedance mismatch on the other hand would not harm MCUs. Let me know what you think! Do not feel obligated to do anything, I should build my complete system and then optimise. I was just doing this to get a feel for the performance of the resonator-bank, because that is definitely a core feature. I think I now know the numbers and whether this is optimisation is kept is not very important. I can always patch it on builds for desktop-computers. I am not afraid of patches. |
Closing this. I decided for the time being to revert back to the earlier method. It was worth a shot, though. |
EDIT: Although I do not like how the code looks, it gives a 10% to 22% performance improvement. Maybe you could give it a thought. Also as far as I can tell, the results are still correct, although the tests do not like it.
I was confused by the difference in execution-profile of
tick()
andprocess()
basedAudioNodes
. So I looked deeper into the profile: The largest identifiable chunk that was spent outsideprocess()
calculating things, was 13% in a crate calledbytemuck
. I found out that this happens, because you need some tricks to access af32
in aavx-
orneon-value
directly. So I thought what happens if we just access the complete data directly in one go? It turns out in that case it is just atransmute()
, which should be a no-op.As usual with optimisation, I am not sure if the ugly code is actually worth it. So please feel free to reject or heavily modify what I did. It is just a proposal.
tick()-based
process()-based
optimized » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000 10.53 real 10.45 user 0.06 sys original » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000 10.58 real 10.49 user 0.06 sys -> 0.005% (as expected not much difference)
If you ignore the loop-unrolling part of the patch. Using uninitialised temp storage might still be interesting. It is nicer code, but only contributes about 1%. Different to the code for uninitialised
BufferArray
, I wanted theunsafe
to be seen at the call site. Because if you use such a frame as input for calculations, the program will crash. This is true for theBufferArray
case, too, so making it an unsafe function might be a good idea.