
FPGA utilization? #13

Open
gkovacsds opened this issue Oct 2, 2023 · 3 comments

Comments

@gkovacsds

I know this may sound like too much to expect in the near future, but have you considered utilizing cloud FPGA services to achieve further parallel speedups? Do you have any experience in this field?
I've recently read a paper by Tarek Nechma, who claims to have had success with it, though on local FPGA hardware.
Thank you for any answer or hint.

@chenxm1986
Owner

I've read some papers about FPGA-based sparse direct solvers. As far as I know, these are all research works; I've never seen FPGAs used successfully in a practical sparse direct solver. Even for mature GPUs and mature CUDA, there are very few uses in practical sparse direct solvers. I believe there are unsolved challenges in using FPGAs for practical sparse LU factorization solvers.

Another important question is whether FPGAs can really be faster than CPUs for sparse LU factorization. From my experience with GPU-based sparse LU factorization, the answer is pessimistic. The results reported in papers are usually misleading: they are not end-to-end comparisons, and the baselines are not the fastest CPU implementations. FPGAs and GPUs are not good at handling irregular problems, and sparse LU factorization is a representative irregular problem. Though CKTSO has a GPU module, it is faster than the CPU module only for relatively dense matrices. I believe the situation is similar for FPGAs, or even worse. In fact, the performance bottleneck is memory access, not computation, and FPGAs have no special mechanism for handling irregular memory access. From this point of view, I believe CPUs with large caches are still the best choice for sparse LU factorization of circuit matrices.
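To make the irregularity concrete, here is a minimal sketch (not CKTSO's actual code) of the sparse lower-triangular solve that dominates left-looking LU in the Gilbert-Peierls style; the array names and storage convention (CSC with a unit diagonal stored first in each column) are assumptions for illustration. The inner loop's indirect index `Li[p]` makes the memory access pattern data-dependent, which is exactly what caches handle well and fixed FPGA pipelines do not.

```c
/* Forward substitution L*x = b for a unit lower-triangular factor L
 * stored in compressed-sparse-column (CSC) form. Lp/Li/Lx are the
 * column pointers, row indices, and values; the unit diagonal is
 * assumed to be the first entry of each column. On input x = b,
 * on output x = the solution. */
void sparse_lsolve(int n, const int *Lp, const int *Li, const double *Lx,
                   double *x)
{
    for (int j = 0; j < n; j++) {
        double xj = x[j];                       /* diagonal entry is 1 */
        for (int p = Lp[j] + 1; p < Lp[j + 1]; p++) {
            x[Li[p]] -= Lx[p] * xj;  /* scattered write: Li[p] is irregular */
        }
    }
}
```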

@gkovacsds
Author

gkovacsds commented Nov 6, 2023

Thank you for the answer. My colleague has now directed me to an earlier paper that includes you as a co-author: "FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations".
Was that project and research direction a success, or did further practical application not really work out? We would be curious to hear more if you can tell us. We are researching this FPGA computation topic right now; we actually have a commercial circuit simulation package product.

@chenxm1986
Owner

Yes, I was involved in that early paper; I contributed some of the parallelism ideas. You can see that we tested only a few cases, and they showed only a 2-3X speedup against KLU. The current CKTSO can easily achieve this speedup even using a single thread. From an academic research point of view, trying new hardware architectures to show how to optimize algorithm implementations for reconfigurable or massively parallel hardware has scientific significance. But for practical usage, the only thing that matters is absolute performance: if a GPU or FPGA solver cannot be faster than a CPU, it has no practical significance. I believe there are also many other practical issues, rather than scientific problems, that need to be solved to really achieve higher performance than CPU solvers. That said, if FPGAs can be faster for relatively dense matrices, that is also good.
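For anyone who wants to reproduce a comparison against the KLU baseline mentioned above, this is a minimal sketch of a single solve using the standard SuiteSparse KLU API; the wrapper function name and the CSC input arrays (`Ap`/`Ai`/`Ax`) are illustrative, not part of KLU itself.

```c
#include <klu.h>

/* Solve A*x = b once with KLU. Ap/Ai/Ax hold A in compressed-sparse-column
 * form; b is overwritten with the solution. Returns 0 on success. */
int solve_with_klu(int n, int *Ap, int *Ai, double *Ax, double *b)
{
    klu_common common;
    klu_defaults(&common);                                 /* default parameters */
    klu_symbolic *sym = klu_analyze(n, Ap, Ai, &common);   /* fill-reducing ordering */
    if (!sym) return -1;
    klu_numeric *num = klu_factor(Ap, Ai, Ax, sym, &common);  /* numeric LU */
    if (!num) { klu_free_symbolic(&sym, &common); return -1; }
    klu_solve(sym, num, n, 1, b, &common);                 /* forward/back solve */
    klu_free_numeric(&num, &common);
    klu_free_symbolic(&sym, &common);
    return 0;
}
```

In a circuit-simulation loop, `klu_analyze` is typically done once and `klu_factor` (or `klu_refactor`) is repeated as the matrix values change, so end-to-end timings should measure the repeated factor/solve phase, not just one call.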
