GPU performance worse than CPU #110

touste · 2017-08-26T12:55:39Z

Hi, I've been using Simit for writing a piece of finite element code in order to reach real-time execution for computer-assisted surgery planning. Using this library has made my work much easier by taking care of kernel assembly and execution.

Based on the published paper, I was expecting a drastic performance improvement when executing on the GPU. However, this is not the case: the performance is actually worse than on the CPU.
Are there plans to update the GPU backend to reach the performance advertised in the paper? If not, is it possible for me to revert to a previous version of Simit that will compile and give me better performance?

As a user, I would also like to second the need for a roadmap (#58) of some sort, to give better visibility into future development. This would be very helpful for planning long-term development on the user's end.

Thanks!

fredrikbk · 2017-09-08T09:48:13Z

Hi @touste. Thank you for your interest, and I'm glad it has made your job easier! The GPU backend has sadly gone unmaintained for too long. Are you mapping arguments to the GPU every iteration, thus incurring costs in moving data? Could it be due to the cost of sending a run command to the GPU every time step? Perhaps @gkanwar can provide some further suggestions?

Regarding a roadmap, I completely agree. Simit has, unfortunately, been neglected for about a year while we have worked on the tensor algebra compiler (tensor-compiler.org). We built it to become the new compiler for Simit, but it has taken on a life of its own. However, we are trying to find the time to integrate the tensor compiler with Simit and use it to carry out several improvements to the Simit language.

gkanwar · 2017-09-09T03:25:57Z

Unfortunately, as @fredrikbk mentioned, the GPU backend certainly needs some maintenance at this point. I am happy to help work through the issues you're seeing, but it would be helpful if you can provide a small code sample that demonstrates the performance bug. Thanks!

touste · 2017-09-13T18:42:21Z

Thank you for your suggestions. I was indeed able to get better performance by reducing the number of mappings between the CPU and GPU. One thing that also helped was building Simit with the release flag. Now I get a 4x speedup on the GPU compared to the CPU, which is reasonable I suppose.
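To illustrate why reducing the number of mappings matters, here is a toy cost model. It is only a sketch: the function name and every number in it are made up for illustration, none of them are measurements from my code, and it does not use the Simit API at all.

```python
def total_time(steps, compute_per_step, transfer_cost, steps_per_transfer):
    """Estimate the total time for `steps` solver steps.

    A host<->device transfer costing `transfer_cost` is paid once per
    group of `steps_per_transfer` steps. All values are in arbitrary units.
    """
    # Ceiling division: a partial final group still needs one transfer.
    transfers = -(-steps // steps_per_transfer)
    return steps * compute_per_step + transfers * transfer_cost


# Hypothetical numbers, chosen only to show the effect.
cpu = total_time(1000, 4.0, 0.0, 1)            # no transfers on the CPU
gpu_every_step = total_time(1000, 1.0, 5.0, 1)  # map/unmap at every step
gpu_batched = total_time(1000, 1.0, 5.0, 100)   # map/unmap every 100 steps

print(cpu, gpu_every_step, gpu_batched)
```

With these made-up values, the GPU run that transfers at every step is slower than the CPU run even though each step computes faster, while the batched run is faster than both.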
@fredrikbk it's nice to know that there are still plans to improve the language. I was starting to worry about the project being abandoned.
On a side note, I tried to look for other solutions (e.g. Theano) to implement my code, but nothing approaches the combined simplicity and performance of Simit. This is perfect for me: as I'm more versed in mechanical engineering than computer science, I can rely on the language to achieve the best possible performance without digging too deep into GPU programming. By the way, taco also looks promising for writing continuum mechanics code, and I can't wait to see it integrated with Simit.
Performance-wise, are there plans to support vectorization and multiprocessing? Also, a nice feature would be the possibility to perform multiple assembly maps in parallel when possible.

Thanks again!

fredrikbk · 2017-09-13T21:08:17Z

@touste, yes, we want to get back to it. Integrating taco with Simit will let us make it much more general, including arbitrary blocked matrices and even general sparse tensor computations. It has just been a time management issue, since taco itself has required a lot of work over the last year. Also, we'll be talking to Nvidia about tricks to get fast GPU support in taco, so when they are combined you might get a nice GPU speedup.

I'm so glad to hear that it's useful to you! It really makes the work worth it.

In the meantime, taco is a C++ library, so you can use it separately from Simit if you wish. Of course, once it's integrated with Simit, especially with tensor assemblies, it will be much more convenient.

taco does support multiprocessing to some degree, so Simit will get that with taco integration. It also has some vectorization that it gets from the compiler, but I'm hoping to find a keen master's student to improve on that. Running multiple assemblies concurrently is a great idea that we'll think about.
