Changes for "Test 5" needed in xeigtst* #1043
Conversation
Does this happen with reference-BLAS or are you linking against an optimized BLAS implementation?
We are using an optimized OpenBLAS implementation, but we believe that the optimized OpenBLAS library functions correctly, since it passes all the other BLAS/LAPACK tests we run internally.
And xeigtst* Test 5 also passes if modified as described in this PR to require the two sets of eigenvalues to agree to within a small tolerance, rather than requiring that they be exactly the same.
The optimized BLAS implementation likely uses different kernels for the two cases. The kernels are selected based on matrix dimensions. The result is a slightly different accumulation order, as you analyzed. Reference-BLAS should produce the same result regardless of the matrix sizes.
Indeed, OpenBLAS will use a different algorithm (different blocking and hence different accumulation of rounding errors) on several hardware platforms when the matrix size is "small", and it will use GEMV instead of GEMM when one matrix dimension is equal to one.
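For illustration, a minimal self-contained sketch (my own example, not OpenBLAS code) of why accumulation order alone can change the last bits: summing the same four single-precision numbers in two different orders.

```fortran
program accum_order
   implicit none
   real :: x(4), s1, s2
   integer :: i
   x = [ 1.0e8, 1.0, -1.0e8, 1.0 ]
   ! Forward order: 1e8 + 1 rounds back to 1e8 (the ulp at 1e8 is 8),
   ! so the first 1.0 is lost and only the last one survives.
   s1 = 0.0
   do i = 1, 4
      s1 = s1 + x(i)
   end do
   ! Reverse order: both 1.0 contributions are absorbed into -1e8,
   ! so nothing survives the final cancellation.
   s2 = 0.0
   do i = 4, 1, -1
      s2 = s2 + x(i)
   end do
   print *, s1, s2   ! with round-to-nearest, typically prints 1.0 and 0.0
end program accum_order
```

The same mechanism applies inside a blocked GEMM: changing the blocking changes the order in which partial products are added, and hence the rounding of the last bits.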
With respect to the initial post of this discussion thread ( #1043 (comment) ) and the lines of code at Lines 659 to 667 in 9c0ef66:
Indeed, when we do not want the Schur factor T (so when WANTT is FALSE), then we should be able to safely skip the operations in the IF( WANTT ) statement above. The goal is to save some operations that are unnecessary when the Schur factor T is not desired. These operations should not affect the computation of the eigenvalues. They should only be working on pieces of H (which converges to T as the QR iteration algorithm converges) that need to be updated to get a correct T, but they do not impact the quasidiagonal of T (where the eigenvalues of A will be) in the subsequent operations. Therefore skipping these operations (WANTT is FALSE) or performing them (WANTT is TRUE) should result in exactly (bit-to-bit) the same eigenvalues. Now I might be mistaken, or there might be a bug in the code, or this might come from some part of the HSEQR algorithm that I do not understand, but that is my current understanding.
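To make the confinement argument concrete, a hedged sketch of my reading of it (the three-way partition and block names are my own notation, not from the code): with the active QR-iteration window as the middle block,

```latex
% H_{22} is the active window where the QR iteration computes
% eigenvalues; the WANTT-only updates touch H_{12} and H_{23},
% which feed the Schur factor T but never feed back into H_{22}.
H = \begin{pmatrix}
      H_{11} & H_{12} & H_{13} \\
      0      & H_{22} & H_{23} \\
      0      & 0      & H_{33}
    \end{pmatrix}
```

so, under this reading, skipping the H_{12} and H_{23} updates can only change T and Q, not the computed eigenvalues.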
⏸️ I would like to press the pause button on this. ⏸️ I am not saying "no", but I would like to give folks time to express their opinions and maybe allow for some more investigation. Three comments:

(1) Short version: I do not understand why the two sets of eigenvalues are not bit-to-bit the same.

Longer version: it is not obvious to me why, if provided with the same ILAENV parameters (so algorithmic block size, aggressive early deflation window size, etc.), HSEQR with eigenvectors and HSEQR without eigenvectors do not return exactly (bit-by-bit) the same eigenvalues. I thought both would. When you compute the eigenvectors, you perform some extra operations, but these operations should not impact the eigenvalues: they are confined to a part of the Schur form that does not influence the quasidiagonal part of T. (Or, at least, that is my understanding.) So either (a) my understanding is not correct, or (b) HSEQR with eigenvectors and HSEQR without eigenvectors (as used in this test) use different block sizes, or (c) the BLAS used in this test is not reproducible, or (d) maybe there is a weird bug somewhere. I would like to drill down a little on this to try to understand. In short, there is something here that I do not understand. I have not reviewed the code and I probably should.

(2) There are arguments for having software that returns exactly (bit-by-bit) the same eigenvalues whether or not eigenvectors are requested. (See a related discussion for the symmetric eigenvalue problem at #997 (comment).) Before giving up on this property, I would like to understand the impact on performance, the reason why, etc. I would rather have software that returns exactly the same eigenvalues whether or not eigenvectors are requested, if possible and if the cost is not prohibitive.

(3) If we give up on this property, then note the following. The call with eigenvectors returns the eigenvalues but also the Schur form (T and Q), so we can check a backward error with the Schur form. When we test the call with no eigenvectors, there is no Schur form, so we only have eigenvalues. For this reason, we compare the two sets of eigenvalues: the one from the call with no eigenvectors (which we want to check) against the one from the call with eigenvectors (which we trust, since we have computed the backward error). We currently compare the two sets bit-to-bit. (See the reason why above.) Now, if we cannot guarantee bit-to-bit identical sets of eigenvalues, then yes, we want to check some kind of relative error between the two eigenvalues, and we would like these relative errors to be close to machine precision. This is called a forward error. For the nonsymmetric eigenvalue problem, we might want to bring in the condition number of the eigenvalue as a multiplicative factor on the forward error, not use the forward error alone, unless we know that the eigenvalues are well conditioned.
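For reference, a hedged sketch of the standard first-order perturbation bound behind point (3) (my notation, not from the thread; s(λ_i) denotes the reciprocal condition number of a simple eigenvalue, the quantity xTRSNA returns):

```latex
% A backward error of size eps*||A|| in the computed Schur form is
% amplified by 1/s(lambda_i), where s(lambda_i) = |y_i^H x_i| with
% unit-norm right and left eigenvectors x_i and y_i.
\lvert \hat{\lambda}_i - \lambda_i \rvert
  \;\lesssim\;
  \frac{\varepsilon \, \lVert A \rVert_2}{s(\lambda_i)}
```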
For a software library like OpenBLAS: during the execution of a program that calls GEMM (say) twice with the exact same parameters (but at different times in the execution), does OpenBLAS guarantee bit-to-bit similar answers? For example, does OpenBLAS allow itself to change the block sizes of GEMM for two calls with similar parameters during the same execution of a program?
Repeat calls to GEMM with the exact same parameters should produce the exact same computation and result. "Similar" parameters might result in a different partitioning of the problem (depending on the number of threads available), which I think might affect rounding errors even though the actual GEMM kernel (and its block size) is constant for a given cpu. If the code paths taken in LAPACK produce vastly different matrix sizes, say one calling GEMM with dimensions 1000x1000 and the other 20x20 or sometimes even 1x20, you might end up in different assembly kernels, with results no longer guaranteed to be bit-to-bit identical. It gets a bit murkier when the eigenvector and non-eigenvector paths in LAPACK call two different BLAS functions to achieve the same goal, as there is no guarantee that the algorithms and implementations are exactly the same as in the Reference BLAS, only that the individual result is correct "to the computational precision of the data type" (or in practice, that it passes the tests from Reference-BLAS, among others).
This answers my question.
Yes, this is understood. I am asking about "exactly same" parameters. By "similar", I mean "exactly same".
Yes, this is understood and fine. If we create our own GEMM based on smaller GEMMs and compare different block sizes, we should not expect bit-to-bit reproducibility between two similar calls.
Yes, this is understood and fine. If we use two different BLAS libraries in an execution, we should not expect bit-to-bit reproducibility between two similar calls. I meant:
Even with FMA and vector instructions enabled, one could expect bit-to-bit reproducibility when the parameters are the same. The BLAS library would have to guarantee to have pretty much no "runtime" decision making. Or, if it makes some "runtime" decisions, then it needs to stick to these decisions for the whole execution. Given a set of input parameters, during a given execution, (a) the units on which the computation is done (CPU, GPU, etc.), (b) the block sizes, (c) the number of threads, (d) the microkernels, (e) the reduction tree, etc. would need to be fixed. The requirement on "(b) block sizes, (c) number of threads, (e) reduction tree" could be waived somewhat if one uses ideas expressed in https://www.siam.org/publications/siam-news/articles/reproducible-blas-make-addition-associative-again/ But for a given execution, an easy way to achieve reproducibility is to either have no runtime decision, or stick to a decision once one has been made. I am asking what I can expect from a BLAS library. Maybe the fact that the LDAs might not be the same leads to not-bit-to-bit-similar outputs. Maybe the fact that the pointers are not pointing to the same pieces of memory matters and leads to not-bit-to-bit-similar outputs.

Note that the MPI-3.0 standard reads (page 175, lines 9–13): "It is strongly recommended that MPI_REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of ranks." My understanding of this is that once MPI decides (at runtime or at compile time) on a reduction tree, it needs to stick to it for the whole execution, as long as the number of processes, the data size, and the data type are the same.
I just talked to @langou; I did not read the entire discussion here, sorry if I missed anything. I think that https://github.com/Reference-LAPACK/lapack/blob/master/SRC/dlaqr5.f#L787 is probably the bad guy here. A possible solution would be to split the gemm call into two pieces, one of which only gets called if wantt is true. Then the gemm call that is important for reproducibility is at least consistent. I have on one occasion noticed a situation where this was important. Slightly different rounding errors led to a different shift and a different eigenvalue converging first. But that issue was then exaggerated because my code was also bugged at the time. I don't know if this could lead to a big issue with the current HQR implementation. While I agree that requiring bitwise reproducibility for all implementations is too strict, if it is (relatively) easily attainable for reference LAPACK, I think a reasonable effort should be made to make it so.
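A minimal sketch of the split being proposed, under the assumption that the update can be partitioned by columns as described; all routine, variable, and bound names here are hypothetical placeholders, not the actual dlaqr5.f code:

```fortran
*     Hedged sketch of the proposed split (hypothetical names and
*     bounds, not the actual dlaqr5.f code).
      SUBROUTINE SPLITUP( WANTT, M, K, NCORE, NTAIL, IROW, JCOL,
     $                    U, LDU, H, LDH, WH, LDWH )
      LOGICAL            WANTT
      INTEGER            M, K, NCORE, NTAIL, IROW, JCOL, LDU, LDH, LDWH
      DOUBLE PRECISION   U( LDU, * ), H( LDH, * ), WH( LDWH, * )
      DOUBLE PRECISION   ONE, ZERO
      PARAMETER          ( ONE = 1.0D+0, ZERO = 0.0D+0 )
      EXTERNAL           DGEMM
*
*     Columns JCOL..JCOL+NCORE-1 feed back into the eigenvalue
*     iteration: update them with a GEMM whose shape does not
*     depend on WANTT.
      CALL DGEMM( 'N', 'N', M, NCORE, K, ONE, U, LDU,
     $            H( IROW, JCOL ), LDH, ZERO, WH, LDWH )
*     Columns JCOL+NCORE..JCOL+NCORE+NTAIL-1 matter only for the
*     Schur factor T: update them in a second GEMM guarded by WANTT.
      IF( WANTT ) THEN
         CALL DGEMM( 'N', 'N', M, NTAIL, K, ONE, U, LDU,
     $               H( IROW, JCOL+NCORE ), LDH, ZERO,
     $               WH( 1, NCORE+1 ), LDWH )
      END IF
      RETURN
      END
```

Because the shape of the first DGEMM no longer depends on WANTT, an optimized BLAS should select the same kernel and blocking in both job configurations.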
@martin-frbg We were using IBM's openxl compilers. As I recall, we saw this behavior on multiple Power machines (e.g. Power8). |
For a given set of input parameters, this is fixed (but not necessarily the same for the entire range of possible input parameters). I have no idea to what extent cpu hardware (and possible microcode flaws) might have an influence in the future, with two or three tiers of differently fast/capable cores within one cpu becoming popular. But all runtime decisions made in or "by" OpenBLAS are based on (1) the model of cpu and (2) the problem size, so they should always be fully reproducible. I do not think the physical location of data during individual runs plays a role on any supported architecture; the optimizations in OpenBLAS tend to be a lot more mundane: judicious loop unrolling and making use of fma instructions where available. A difference in LDA might lead to non-bit-to-bit identity (between cases with different LDA, of course) if blocking/loop unrolling is not equally well implemented for either dimension. ISTR there was a case discussed in some issue here fairly recently, where the result of some fairly complex calculation was compared against exact zero, and that comparison determined the further code flow... not sure if I can find it or if it was just a bad dream though...
I upvote this. While it is not directly related to this issue, I would like to highlight that deviations between job configurations are already accepted for some routines in LAPACK. GESVD, for example, uses different algorithms depending on whether or not the singular vectors are requested, and bitwise equality cannot be expected.
Correct. A cross-reference where this fact is mentioned is https://github.com/Reference-LAPACK/lapack/blob/master/SRC/dlaqr5.f#L787
@thijssteel wrote: "A possible solution would be to split the gemm call into two pieces, one of which only gets called if wantt is true. Then the gemm call that is important for reproducibility is at least consistent." @angsch wrote: "I upvote this." @langou writes: "I upvote this too."
To summarize where we are at: @thijssteel established that, even with a BLAS that returns bit-to-bit-similar answers for similar calls, GEEV from LAPACK might not return bit-to-bit-similar eigenvalues when called "with eigenvectors computation enabled" and called "without eigenvectors computation enabled". The reason is that some BLAS calls that affect the eigenvalue regions are done with different sizes (larger sizes for "with eigenvectors computation enabled", smaller sizes for "without eigenvectors computation enabled"). There is a possible fix: we could break up the BLAS call in the "with eigenvectors computation enabled" case so that the block sizes match those in the "without eigenvectors computation enabled" case, and then deal with the remaining region (which does not affect the eigenvalues) separately. (See @thijssteel's post.) Thanks to @frjohnst for pointing out this matter and thanks to @thijssteel for explaining why this happens.

WRT this PR, I am not sure what to do. This PR is relevant when we do not expect bit-to-bit-similar eigenvalues. (Which is the current state of LAPACK.) However, I am leery to merge the PR. I would rather that we fix LAPACK GEEV to return bit-to-bit-similar eigenvalues when called "with eigenvectors computation enabled" and called "without eigenvectors computation enabled". I feel that, if we merge this PR, then we'll be less motivated to work on LAPACK GEEV.
Please note that we see this issue for xGGEV and xGGEV3 (generalized eigenvalue) as well as xGEEV.
We are seeing a number of "Test 5" failures from the xeigtst* tests with *ed.in and *gd.in input files. We believe these are the consequence of limited numerical precision, and not the result of a computational error, so we would like to change "Test 5" as outlined below.

In the following, we will discuss only xeigtsts failures with sed.in, since all "Test 5" failures related to this PR follow a similar pattern.
Example of failures from xeigtsts run with sed.in:
These "fail"s are happening because the sed.in test compares eigenvalues computed from calls to SGEEV requesting eigenvalues only to eigenvalues from calls to SGEEV requesting both eigenvalues and eigenvectors, and demands that the eigenvalues (real and imaginary parts) be identical (original "Test (5)" from EIG/sdrvev.f).
As best we can determine, a matrix is updated as eigenvalues are computed, and these matrix updates ("H" in the code just below) may be different between the eigenvalues-only and eigenvalues+eigenvectors cases. For example, the snippet below is from SLAQR3, called eventually by SGEEV, where WANTT is true when eigenvectors are computed:
As best we can determine, from the mathematical point of view the eigenvalues should be identical in the two cases, but finite computer precision makes the two sets of eigenvalues differ slightly. To confirm this, the "Test (5)" comparison of the two sets of eigenvalues was changed to compare the relative eigenvalue differences to a small tolerance (instead of demanding exact bitwise equality):
(proposed modification for sdrvev.f)
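As a hedged sketch of the kind of comparison described above (WR/WI and WR1/WI1 follow the sdrvev.f naming for the two sets of computed eigenvalues; TEST5, TOL, VMAX, and TNRM are hypothetical placeholders, and the actual PR diff may differ):

```fortran
      SUBROUTINE TEST5( N, WR, WI, WR1, WI1, ULP, ULPINV, TOL, RES5 )
*     Hedged sketch of a tolerance-based "Test (5)" (names follow
*     sdrvev.f where possible; TEST5, TOL, VMAX, TNRM are hypothetical).
      INTEGER            N, J
      REAL               WR( * ), WI( * ), WR1( * ), WI1( * )
      REAL               ULP, ULPINV, TOL, RES5, VMAX, TNRM
      INTRINSIC          ABS, MAX
*
      DO 10 J = 1, N
*        Scale by the largest magnitude to get a relative difference;
*        the floor of one makes the test absolute for tiny eigenvalues
*        and guards against division by zero.
         VMAX = MAX( ABS( WR( J ) ), ABS( WI( J ) ),
     $               ABS( WR1( J ) ), ABS( WI1( J ) ), 1.0E0 )
         TNRM = ( ABS( WR( J )-WR1( J ) ) +
     $            ABS( WI( J )-WI1( J ) ) ) / VMAX
*        Fail only if the relative difference exceeds TOL*ULP,
*        instead of demanding exact bitwise equality.
         IF( TNRM.GT.TOL*ULP )
     $      RES5 = ULPINV
   10 CONTINUE
      RETURN
      END
```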
and with this change the test passes. So this PR proposes that "Test 5" executed for *ed.in and *gd.in should generally be changed in this manner. Note that the changes made for the generalized eigenvalue "Test 5" differ somewhat in the details (since alphas/betas are computed in this case instead of eigenvalues), but the basic idea is the same.