Proof of concept - C++ backend using Fastor Library #28
I can see that you have implemented a … You can bypass all that code and use one of Fastor's solvers if you want to get a bit more performance:

```cpp
// inversion based solve - very fast
Tensor<T,N> x = solve(A,b);
// or equivalent of what you are doing but faster using template recursion
Tensor<T,N> x = solve<SolveCompType::BlockLUPiv>(A,b);
// or exactly equivalent of what you are doing I guess
Tensor<T,N> x = solve<SolveCompType::SimpleLUPiv>(A,b);
```

You will gain a significant speed-up if your matrix sizes are less than … These solvers are not heavily tested with Visual Studio, but compiling with clang … Fastor's solvers are faster than Eigen, MKL and straightforward for-loop style implementations (such as yours) by quite a big margin. Here are the results from a comparison I did a while ago on my MacBook with Julia
This is quite exciting! It may take me bit to take a closer look, though. Regarding your questions:
Obviously, ACME was never meant to be used for effect simulation "in production", e.g. by wrapping it (including the whole Julia runtime) in a plugin (though I do think that would be a fun project to do). It's mainly intended as a test bed for algorithm and model development. So I've optimized it to a point where waiting for results is not too much of a torture, but haven't tried to squeeze the last bit of performance out of it. So being able to speed it up by quite a bit when switching to C++ is not unexpected. What I had envisioned (but haven't gotten around to actually doing so far) is to be able to "export" a model from Julia to C++. While this is trivial for all the matrices, it needs some thought on how to approach the functions defining the nonlinearities. Now, I'm not 100% clear: You have created the C++ code from the circuit description within C++, right? Or have you done it from Julia? Further, I've always shied away from reimplementing the solvers in C++. Now with the latter having been done by someone else... Tangentially, what happened to @maxprod2016?
Who is @maxprod2016? :) I perfectly understand the meaning of the project and its objective. ACME is already a very good and solid project; my goal is not to criticize it. However, for my personal case (old computer), ACME is extremely slow, not only in the running part: the discretization part takes several minutes, which is clearly unmanageable for testing and research. As you may know, I'm an autodidact and independent researcher. This is why I undertook to develop a version in C++ which is not only faster at exploiting the discrete model, but which reduces the discretization to a handful of seconds in release mode. Nevertheless, my C++ clone is not intended to replace your work: it was not written for that, it is not sufficiently well written, it is very limited in deployment (C++17, VS2019...), and it depends on a few auxiliary libraries which can constrain the ease of dissemination of the project.
And of course, there is the rough code reproducing Julia's SparseArrays library, which is hacky in how it simulates Julia's one-based indexing. For example, the …
You're right, the Fastor-based code is generated inside my C++ code, but it would not be so hard to potentially include such an idea inside the Julia ACME code. I did that very quickly and it is not very well coded, but it works for the proof of concept. For example, for the Diode element, the C++ code includes both the pre-generated encoded string of the non-linearity code and the precomputed lambda function used in the Armadillo-based internal process (the slow one):
This is of course not elegant, but it works for the current proof of concept; I will optimize it soon to externalize the template-like strings into a specific file. The indexing in the C++ lambda function … The string-template code is updated once the indexing is known, using a simple string replace, something like:
I still hope that this little study, because it is nothing other than that, is still interesting. I will update the code to replace the decomposition and activate the potentiometers for dynamic manipulation. @romeric, I'll answer you as soon as I've updated the code. For information, the LHS size is (for the current case) 7x7 without potentiometers, and 13x13 with activated potentiometers.
No worries, I didn't feel criticized in any way, just wanted to set this straight for others following along. (I would have doubted there are any, but @romeric proved me wrong...)
Ah, that's bad. Is that only the first time in a fresh Julia session (when a lot of internal code gets compiled), or every time? And is that for the SuperOver example or for a more complicated circuit? But maybe that's worth its own issue, to not derail this one.
Yes, I had considered doing something similar in the Julia project (Julia code + stringified C++ code template), but the introspection functionality of Julia might allow synthesizing C++ code from the Julia code, at least in not-too-complicated cases, i.e. only calls to a limited set of functions (…
You're right. @romeric (Roman) is the creator of the Fastor library; he helped me with the introduction of some useful functions for this project. romeric/Fastor#90
Ah OK, my fault (pure Julia newbie): I run the script from the Windows console, not inside a Julia instance, so I think each time I run my script, the code is compiled. After a quick test (Julia 1.4.2, 64 bits) on the SuperOver example with no other process on the computer, I get 61.01500 seconds overall timing for the discretization (without decomposition), 64.106999 (with decomposition):
When timing the function call itself, I get 39.420185 seconds (without decomposition), 41.874466 seconds (with decomposition)
In all cases, the same test in C++ is reduced, for me, to 0.374724 seconds (without decomposition), 0.509646 (with decomposition). I think my code is maybe not so bad :)
I've updated the code after removing the temporary isfinite helpers (thanks @romeric) and the … Unfortunately, the solver becomes unstable and goes to NaN after 7189 processed samples. Maybe I've done something wrong by replacing the …
That is a bit strange since … By the way, it seems like you are doing the LU decomposition twice, once for solving and once for the determinant:

```cpp
Tensor<T,N,N> L, U;
Tensor<size_t,N> p;
lu<LUCompType::SimpleLUPiv>(nleq->J, L, U, p);
Tensor<T,N> y = forward_subs(L, b);
Tensor<T,N> x = backward_subs(U, y); // x is your solution
T _det = product(diag(U)); // _det is your determinant
```
Oh, thank you Roman. So the best way to get both the decomposition and the determinant is to reintroduce the …
By the way, that allows me to keep the same design for the … Unfortunately, that doesn't change the problem. I think the method is slightly different; let me show you the different results of the Julia ACME LinearSolver (one-based pivot indices)
C++ ACME LinearSolver (zero-based pivot indices)
C++ Fastor LU Solver (SimpleLUPiv)
Note for @martinholters (if you read this): this is the first call to … The updated full code is still here: https://codeshare.io/5QEE4q So @romeric, I think the result difference comes from the method employed (the pivot-index difference) in the LU factorization. I'm not able to go deep into this problem; my science background is limited. But I'm sure that, because of the intensive calls to the linear solver during the process, the performance gain can be greatly increased using SIMD vectorization on small tensors (7x7, 13x13 ... up to ?). In any case, thank you very much for your help.
From a superficial glance: shouldn't …
@martinholters You're right.
Anyway, the problem stays the same as with the initial implementation. This implementation does resolve the double call. I'm currently investigating the question; maybe (because the Julia implementation works on a single factors matrix) the idea will be to keep your initial code and see with @romeric what the best way to optimize it is.
My bad for giving incorrect instructions; I copy-pasted the code in haste. You do need to apply the pivot, specifically in the forward-substitution step:

```cpp
Tensor<T,N> y = internal::forward_subs(L, p, b);
```

Also, I hope I am not daydreaming here, but a pivot is nothing but a permutation vector of rows (or columns), and you shouldn't get duplicated entries in a pivot vector (in your own case you do, which is a bit odd). Different permutations of rows (different pivots), however, do not impact the final result; in fact Julia, NumPy and Fastor will all give you different pivots while giving you the same final solution vector. Here is an example of solving your matrix
Julia
While the Fastor pivot vector
and the Julia pivot vector
are different. So maybe your LU decomposition does not do a conventional LU, or does not do what it is intended to do.
Same observation using the Julia standard lu! function
I'll take a look at the difference in implementation right now
Check your final solution vector to see if you are actually getting the same results
OK. I think the solution is in @martinholters' code comment:
The only difference is here:
while the initial implementation is:
The diagonal becomes the reciprocal of its value.
I guess you're referring to …
@martinholters it's my fault, I'm confused. I talked about pivot indices, but the variable name misled me. Sorry.
Hah, right! Sometimes I actually do leave helpful comments 😄 Yes, the idea here is to avoid repeated reciprocals/divisions when solving for multiple RHSs.
Ah cool 😄! Now the big question is: is the change "conventional" in the sense of @romeric's question?
It's not conventional in how the result is stored: Instead of storing L (except for the implicit ones on the diagonal) and U, it stores U with the diagonal elements inverted. Of course, this is dealt with in the back-substitution by replacing a division in the conventional implementation with a multiplication. As long as the LU decomposition and the solving part agree on the storage format, that shouldn't make much difference, though. Likewise, different pivoting should only result in different numerical round-off noise which I don't expect to be problematic here.
OK, thanks for the clarification, Martin. So this is not a method that could be implemented inside Fastor as-is. Therefore, there are two solutions. The first is to duplicate the current Fastor solver and modify it per the indications you provided. The other is to keep the current Julia algorithm and attempt to adapt the code with the help of Fastor's optimizations. I need @romeric's expertise to know what to do, in particular regarding the potential gain, because it would seem to be substantial.
@dorpxam, as long as your implementation is supposed to do a … If I were you, I would stick with your own (Julia or Julia-style) implementation and keep things at a high level using Fastor's tensors. Given that your matrix sizes are small and compile-time constants, the compiler will do a pretty good job optimising them. Fastor's internal solver routines (like most optimised BLAS routines) get pretty low-level to squeeze out that last bit of performance, but that is going to be a futile exercise for you here. Also, don't directly compare the result of a determinant …
Hi Martin,
First of all, thank you very much for all your scientific works, and especially for ACME that is a great project.
For a while now, I have been attempting to port ACME to C++ using Armadillo for linear algebra. Unfortunately, Armadillo's sparse matrices do not cover the full potential of Julia's SparseArrays standard library, in this case the maintenance of structural zeros. So I've written a C++ version of the Julia SparseArrays library.
After that, I followed the ACME code design using the latest C++ standards, with heavy usage of metaprogramming to keep pace with the flexibility of the Julia code. The discretization of the circuit is, for my own configuration, much faster in C++ than in the Julia code: I mean the computation of the incidence and topology matrices, the non-linear decomposition, and so on. But of course, there is no possibility of reproducing the symbolic trick for building the circuit.
For example, building the SuperOver circuit becomes:
For build and design simplification, the DiscreteModel class follows the PIMPL idiom and hides the model runner:
Now, I can run the circuit using a 30-second unprocessed guitar excerpt. No problem, I get the same result as the Julia ACME process... but in 57 seconds, where the excerpt takes 28 seconds with the Julia code! What?!
My fault: following the ACME design with lambda functions (for genericity) makes it unusable in such a process. Optimization with Intel MKL and cblas calls had no effect. Poor performance.
So after reading this: https://medium.com/@romanpoya/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982 , I took a look at the famous Fastor library.
After some code changes, I'm able to produce on the fly a static, header-only file that is a frozen version of the DiscreteModel of a circuit, using Fastor as the backend for the matrices and linear algebra. No dependencies except Fastor, which is header-only too. There are some changes to the design to allow some generic code: for example, LinearSolver, SimpleSolver and HomotopySolver become generic (template-based), and ParametricNonLinEq becomes abstract and template-based to allow multiple solver instances (same as ACME's core solvers module).
So, in a details subnamespace, you will find reusable code:
The other classes are produced programmatically, including compile-time static versions of the precomputed matrices as Fastor Tensors. Because of the possibility of multiple solvers for the non-linear parts, the code produces overridden classes of the ParametricNonLinEq base class; for the SuperOver case with fixed potentiometer values, only one:
Of course, this is not yet fully optimized, but this class can be used by the generic solver classes and run with a class that is a mixed version of ACME's DiscreteModel and ModelRunner:
As you can see, for the moment the SuperOver class can be instantiated with a compile-time chunk size for the I/O, fixed by default at 1024 columns. A static method returns the sampling rate used for the discretization of the circuit.
SuperOver-Fastor
The full code (single file header-only, no dependencies except Fastor) is available here for the proof of concept : https://codeshare.io/5QEE4q
Now, this is the performance report for my configuration. All tests were run 10 times on a 30-second audio sample file at 44100 Hz, with the SuperOver circuit discretized with the 3 potentiometers fixed at full value (1.0):
Julia, 32-bit and 64-bit
Visual Studio 2019, 32-bit and 64-bit
LLVM/Clang, 32-bit and 64-bit
Conclusion:
As you can see, the performance gain is substantial in both release builds (x86, x64). The LLVM/Clang build is generated from the version embedded in the Visual Studio 2019 IDE, which is possibly out of date. But the result is clear.
This is a first and possibly buggy attempt to use ACME with C++ so that it can be used inside an audio plugin. I need your help with some points that are unclear to me:
If you see some errors in my code, don't hesitate to correct me. Maybe this little study can be an entry point toward a built-in C++ backend code generator inside the Julia ACME core code. This is maybe an answer to the implementation aspects of your 2017 paper: "Automatic Decomposition of Non-Linear Equation Systems in Audio Effect Circuit Simulation".
Cheers,
Maxime Coorevits