FPGA utilization? #13
I've read some papers about FPGA-based sparse direct solvers. As far as I know, these are all research works, and I've never seen FPGAs used successfully in a practical sparse direct solver. Even for mature GPUs with mature CUDA, there are very few uses in practical sparse direct solvers. I believe there are unsolved challenges in applying FPGAs to practical sparse LU factorization. Another important question is whether FPGAs can really be faster than CPUs for sparse LU factorization. From my experience with GPU-based sparse LU factorization, the answer is pessimistic. The results reported in papers are usually misleading: they are not end-to-end comparisons, and the baselines are not the fastest CPU implementations. FPGAs and GPUs are not good at handling irregular problems, and sparse LU factorization is a representative irregular problem. Though CKTSO has a GPU module, it is faster than the CPU module only for relatively dense matrices. I believe the situation is similar for FPGAs, or even worse. In fact, the performance bottleneck is memory access, not computation, and FPGAs have no special mechanism for handling irregular memory access. From this point of view, I believe CPUs with large caches are still the best choice for sparse LU factorization of circuit matrices.
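To illustrate the irregularity being described (this is a minimal sketch with hypothetical data, not code from CKTSO or any real solver): in a left-looking sparse LU, each column update scatters through data-dependent row indices, so the hardware sees no regular stride to prefetch.

```python
# Hypothetical CSC-style storage of one factored column L(:, j):
# the nonzero row indices are data-dependent, not a fixed stride.
l_rowidx = [2, 5, 9]
l_values = [0.5, -1.25, 2.0]

# Dense work vector holding the column currently being factored.
x = [0.0] * 12
x[2], x[5], x[9] = 4.0, 8.0, -2.0
pivot = 2.0

# Left-looking update: x[i] -= L[i, j] * pivot for each nonzero row i.
# Which cache lines get touched is decided by the index list at
# runtime -- this is the irregular memory access pattern that large
# CPU caches absorb better than FPGA or GPU memory systems.
for i, lij in zip(l_rowidx, l_values):
    x[i] -= lij * pivot

print(x[2], x[5], x[9])  # 3.0 10.5 -6.0
```

The arithmetic per nonzero is trivial (one multiply-subtract); nearly all the cost is in the indexed loads and stores, which is why the bottleneck is memory access rather than computation.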
Thank you for the answer. My colleague has now directed me to an earlier paper that includes you as a co-author: FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations.
Yes, I was involved in that early paper; I contributed some parallelism ideas. You can see that we only tested very few cases, and they showed only a 2-3X speedup against KLU. Current CKTSO can easily achieve this speedup even using a single thread. From an academic research point of view, trying new hardware architectures to show how to optimize algorithm implementations for reconfigurable or massively parallel hardware has scientific significance. But for practical usage, the only issue is absolute performance: if a GPU or FPGA solver cannot be faster than a CPU, it has no practical significance. I believe there are also many other practical issues, rather than scientific problems, that need to be solved to really achieve higher performance than CPU solvers. If FPGAs can be faster for relatively dense matrices, that is also good.
I know this may sound like too much to hope for in the near future, but have you considered utilizing cloud FPGA services to achieve more parallel speedup? Do you have any experience in this field?
I've recently read a paper by Tarek Nechma who claims to have had success with it, though on local FPGA hardware.
Thank you for any answer or hint.