Denormal floats handling #1237
Care to mention processor ID? |
AMD Phenom II X6 1090, if it matters... It doesn't have AVX, SSE only.
These functions and macros change processor state (some bits in the processor's control registers), therefore they do affect how subsequent code (including imported DLLs) is executed by the processor, don't they? I don't remember exactly whether Windows maintains separate versions of these control registers for different threads. In any case, code running on the same thread is still affected, no matter which compiler was used to build it. |
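For illustration, a minimal sketch (assuming x86 with SSE; not taken from this thread) of how those bits live in the thread's MXCSR register and affect everything that runs afterwards on that thread:

```cpp
#include <xmmintrin.h>  // _mm_getcsr / _mm_setcsr
#include <cstdio>

int main() {
    unsigned before = _mm_getcsr();
    // Set FTZ (bit 15) and DAZ (bit 6): denormal results/inputs are flushed to zero.
    _mm_setcsr(before | 0x8040);
    std::printf("MXCSR before=0x%04x after=0x%04x\n", before, _mm_getcsr());
    // From here on, every SSE computation executed by THIS thread,
    // including code inside DLLs built with a different compiler, runs with FTZ/DAZ on.
    return 0;
}
```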
I asked whether passing the CONSISTENT_FPCSR build option to OpenBLAS solves your problem. Can you provide sample code that reproduces it? |
I'm going to try it; however, it will take quite some time to set up a build environment for OpenBLAS.
The problem occurs during fairly complex neural network training using my nntl project. I'll try to isolate the source of the problem and make sample code out of it to cut down the amount of unrelated code. What I have learned so far is this:
UPD: point 3 is incorrect, and OpenBLAS probably does NOT change how denormals are handled. I forgot that at the same time I added the accompanying call around the OpenBLAS routines, I also added the same denormal-disabling code to my worker thread pool startup routine (I had forgotten to do it earlier) - that is the real reason the spread of denormals through the numeric data got under control. My worker threads had been running with denormals enabled, even though the main thread did not. My bad. All of this led me to the conclusion that OpenBLAS does not take the user's intentions regarding denormals into account at all and (always?) keeps them turned on, slowing down computations. |
The fastest build environment is to install some Linux in a virtual machine and cross-compile a DYNAMIC_ARCH=1 DLL. You mentioned that the problem appeared after about a year - can you attribute it to a new OpenBLAS release, or to something that changed with your computers? The Wikipedia page on subnormal floats lists some methods to avoid them - scaling, log+exp, and increased precision, among others. You can also try simpler core types (see https://github.com/xianyi/OpenBLAS/tree/develop/kernel/x86_64 for the full list) via OPENBLAS_CORETYPE, to see if any of them makes your software behave differently. |
Thanks, I'll try this method.
There was no change in computer hardware, and I don't think the problem is related to an OpenBLAS release change. I have implemented some new algorithms that use functions which weren't used before: cblas_ssyrk() and cblas_ssymm(). Now it looks like in some special cases (which I have just stumbled onto) these algorithms may produce very small numbers, and OpenBLAS simply tries to preserve computational precision and uses denormals...
All of this is absolutely not an option. In a nutshell, neural network training is based on a gradient descent algorithm, which works quite reliably even with low-precision arithmetic (much lower than 32 bits - it has been shown that even 8 bits of floating point precision can be enough). This task needs performance far more than computational precision (and the same applies to the newly implemented algorithms from the previous paragraph), so simply dropping denormals is the best approach.
I'll try to provide it, but I need some time for it...
Well... Even if these cores don't mess with the processor's control registers, they won't be as fast as the core best suited to my processor. So it might be trading bad for worse, am I right? |
Possibly (cross-)compiling OpenBLAS with either the gcc flag "-ffast-math" or the VS flag /fp:fast will force denormals to zero. See https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x for some suggestions |
@martin-frbg Thanks, I'll try to recompile OpenBLAS with this flag. BTW, I don't know about gcc, but in VC this flag does nothing about denormals. Explicit code is still required to change the processor's state.
The FPCSR option (supposedly, maybe not always correctly) does exactly that on each thread. If you find an option that fixes your problem, that would be a nice proposal for the defaults used to build the binary DLL. |
Could be a mingw issue with control word settings not carrying over to threads (numpy/numpy#5194, though I only glanced through that). If the mingw build does not assume -mfpmath=sse by default, it may be using the legacy x87 FPU in 80-bit "extended precision" mode. |
The CSR is inherited (or not) well before the MSVC EXE can interfere. |
Here is a very simple example of how to get denormals with cblas_ssyrk() using normal source data: https://github.com/Arech/tmp/blob/master/DenormalsDemo/DenormalsDemo/DenormalsDemo.cpp One may check out the whole repository and try it live. I need a way to instruct OpenBLAS to flush all denormals to zero... BTW: this sample code fails to reproduce my previous claim that OpenBLAS changes the way denormals are handled by the processor. It just demonstrates how they appear in the data. It is a perfectly real-world example: there is a quite useful algorithm for reducing cross-covariances between neuron activations in a neural network that (obviously) requires computing a covariance matrix. This is done by first computing de-meaned activation values (subtracting its mean activation from each neuron) and then computing the covariance matrix C = 1/rowsCnt * A' * A, where A is the de-meaned matrix of neuron activations. Either the activation values may be very small, or they may almost all be near their mean. In either case, the de-meaned matrix A will be full of very small but still normalized FP values, which will induce denormalized floats in the matrix C. Still investigating who is in charge of changing the denormal handling settings in the original code... |
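For reference, a minimal sketch (matrix sizes and values here are illustrative assumptions, not taken from the linked demo) of how the covariance computation C = 1/rowsCnt * A' * A described above maps onto cblas_ssyrk(), with the 1/rowsCnt scaling folded into the alpha parameter:

```cpp
#include <cblas.h>
#include <vector>

int main() {
    const int rows = 1000, cols = 64;           // samples x neurons (illustrative sizes)
    std::vector<float> A(rows * cols, 1e-20f);  // de-meaned activations: tiny but still normal floats
    std::vector<float> C(cols * cols, 0.0f);    // receives the upper triangle of the covariance matrix

    // C = (1/rows) * A^T * A  (row-major A is rows x cols, so N = cols, K = rows)
    cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                cols, rows,
                1.0f / rows, A.data(), cols,
                0.0f, C.data(), cols);
    // Products of values around 1e-20f land near 1e-40, i.e. in the float denormal range.
    return 0;
}
```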
You must scale the matrix to better numbers and not hit the extremes (syrk has a parameter for that). Your input value range is quite limited. |
This task does not require a precise answer for very small values. It requires a fast answer (it's fine if it is very approximate for small values). There's absolutely no point in wasting precious cycles to scale data that could perfectly well be set to zero automatically.
Try 2 things. There is no instrumentation to pass the FP CSR to worker threads, nor is it automatic. _SCAL does not cost CPU cycles - it is memory-bound - while you waste 100x more cycles by inducing denormals. |
No change from the base version. Still produces denormals.
I don't get your point. Could you please elaborate?
The same as previous. Elaborate? |
The mingw defect applies to you too. Try the rebuild option. |
Still playing with the sample code. Just a reminder:
I've noticed that the run-time denormal settings seem to affect at least some part of the output matrix C. When denormals are disabled (…) However, if denormals are enabled (…) It seems that FpCsr & MxCsr are thread-local registers, and therefore every good thread-pool implementation must provide a way to apply the same denormal setting to every worker thread, or a way to execute custom code in every worker thread's context. BTW: why are none of the OpenBLAS thread pool functions exported? Should any of |
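A minimal sketch of that idea (the pool structure and the worker_init name are hypothetical, not OpenBLAS API; assumes SSE3 for the DAZ macro): since MXCSR is per-thread, each worker has to set FTZ/DAZ for itself, and setting it once in the main thread is not enough.

```cpp
#include <immintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_SET_DENORMALS_ZERO_MODE
#include <thread>
#include <vector>

// Hypothetical worker startup routine: runs once in each pool thread.
void worker_init() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // FTZ: denormal results become zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // DAZ: denormal inputs are treated as zero
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back([] { worker_init(); /* ...process jobs... */ });
    for (auto &t : pool) t.join();
    return 0;
}
```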
Important update: my claim that OpenBLAS might be changing how denormals are handled is probably false. I've updated the comment. Probably the only thing that should be done is to make denormal handling coherent across threads. I'm wondering why Intel says that for MKL it's enough to change the MXCSR only in the caller thread... Do they pass the relevant parts of the caller's MXCSR to the worker threads on each call? |
Can you verify IF what is done with the CSR is correct AND whether it addresses your issue? The MKL licence sort of prohibits reverse engineering; from the description it looks like they scan the input for denormals (wasting CPU cycles) and use the CSR as an input to select the right routine, and on finding a denormal they either zero it or waste even more cycles with a "software routine". |
I'm sorry, I don't understand the question... English is a foreign language for me. Could you please say in different words what you want me to verify? |
Can you make your new DLL as follows (on a Linux VM):
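(The exact command line was not preserved in this thread; based on the CONSISTENT_FPCSR and cross-compilation options discussed above, it was presumably something along these lines:)

```
make BINARY=64 DYNAMIC_ARCH=1 CONSISTENT_FPCSR=1 \
     HOSTCC=gcc CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran
```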
And test with that DLL... It is the only option that actually makes OpenBLAS manage the FPU configuration. Do we understand each other now? |
I think so. I'm setting up Debian 9 in a VM now... I'll write more once I have new info... |
I can help if you get stuck trying. |
Yep, thank you! Now I'll try to make an OpenBLAS build with the smallest possible run-time overhead and see if it helps to drive training time down a bit more... But that's another story, not related to this ticket. Regarding setting up a build with |
It just sets the FPCSRs on the threads once when they are started, unknowingly working around the MinGW problems. |
Mmm... Here is my problem description. Thanks in advance! |
You did run |
@martin-frbg, hmm...)))) Actually, I did use it sometimes, and it so happened that it didn't help. Probably because I'd used some invalid combination of flags. Thank you for the suggestion; I didn't know it was mandatory. However, the problem still persists. I've just executed
and still get the same error:
UPD1
However, as far as I understand,
UPD2
still fails with the same error as I get without it. I guess I'm doing something wrong, but I have no idea what it is... Could somebody please help? |
Cross-compilation should not run any tests. There is basically no chance they would ever succeed. |
Mmm, great, but how do I skip them in the build process? I don't see any test-related options in |
Dynamic arch, just like fpcsr, is initialized once at library load. It will not speed up your library calls. |
Can we call it a bug? Try something like TARGET=Atom and see whether it fails at the same place or not. |
Try removing the |
@brada4 , indeed it fails exactly at the same place.
@martin-frbg I have just tried it, but unfortunately it fails in a similar manner during another step:
Any ideas? |
Please attach the full build log (collectable on Linux with the 'script' command). |
Also it may be worth trying a current "develop" snapshot (or waiting till 0.2.20 is out; should not be long now, hopefully). At least your command line works fine for me with 0.2.20dev and x86_64-w64-mingw32.static-gcc/gfortran built by mxe. |
@brada4 Here is the log. Maybe it's not relevant anymore, because:
@martin-frbg works for me too! Thanks! |
Thanks for confirming that 0.2.20 will solve this issue. The FPCSR question is still open, and you want it to be as fast as possible too. |
So do I understand correctly that
|
@brada4 , correct. As you have said, building with
Any thoughts on how to improve run-time performance (especially for cblas_sgemm(), cblas_ssyrk() and cblas_ssymm()) are welcome!
True. I'm not sure it's due to the mingw defect, but I'm not competent in this area, so maybe.
I don't know how noticeable it is. Probably not very. For example, if Windows maintains a different floating point environment for each thread (unfortunately, I don't remember whether that's true, but it probably is), then it has to switch it many times a second when switching thread contexts. But regardless of that, what is the point of doing it? I can hardly imagine an app that mixes calculations with different FP settings in parallel. It just seems as redundant as an empty loop
Probably, yes. A function to disable denormals would be a great solution, but for me even a simple environment setting would work, like
ADDED: please note, I don't vote for dropping support for the CONSISTENT_FPCSR flag in its current form. It might be useful for some users. I just vote for some (very easy to implement) improvement.
Yes. There's even a corresponding comment in a
|
You now have it 10x faster. Do you think 1% more speedup will change anything? How big are your typical input dimensions? If they fit under the L3 cache size, you are better off with single-threaded OpenBLAS and doing the threading inside your own code. |
Dear @xianyi and everyone!
Is there a way to forbid denormals in OpenBLAS?
I tried to execute the following code (MSVC2015) before calling any OpenBLAS routines:
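(The snippet itself was not preserved here; a typical MSVC sequence for flushing denormals to zero in the calling thread looks roughly like this:)

```cpp
#include <float.h>  // _controlfp_s, _DN_FLUSH, _MCW_DN

// Flush denormals to zero for the calling thread before any OpenBLAS call (MSVC).
void disable_denormals() {
    unsigned int prev = 0;
    _controlfp_s(&prev, _DN_FLUSH, _MCW_DN);
}
```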
and it looked like it worked for some time (about a year)... Does it really affect OpenBLAS denormal handling (especially inside worker threads)?
I started to use some additional OpenBLAS functions and noticed that denormals started to reappear in the results, which slows down computations enormously. I strongly suspect that OpenBLAS is responsible for them. Perhaps the worker threads still have denormals enabled?...
How to get rid of them?
I'm using precompiled OpenBLAS-v0.2.19-Win64-int32.
Here is the complete list of used functions (if it helps):
UPD: indeed, cblas_ssyrk() is the first function that produces a denormalized float during execution... :-(
How to change denormals behaviour?
Maybe there is a way (reachable from outside) to execute custom code in the context of OpenBLAS's worker threads, if there's no predefined way to control denormals?