
Isolating issues in handling rank-deficiency #324

Merged
merged 56 commits into from
Sep 15, 2020

Conversation

@palday (Member) commented May 23, 2020

The Cholesky decomposition can fail and thus far it fails silently, which leads to the sporadic errors we get in testing against Julia nightly.

Possible solutions:

  • Use SVD or QR decompositions to compute pivot
  • Use SVD for more robust rank detection
  • ????

This initial pull request is just a proof of concept for where the Cholesky fails on the CI machines. No solution is implemented yet.

@codecov bot commented May 23, 2020

Codecov Report

Merging #324 into master will decrease coverage by 0.58%.
The diff coverage is 97.56%.


@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
- Coverage   95.44%   94.86%   -0.59%     
==========================================
  Files          23       23              
  Lines        1515     1576      +61     
==========================================
+ Hits         1446     1495      +49     
- Misses         69       81      +12     
Impacted Files Coverage Δ
src/MixedModels.jl 100.00% <ø> (ø)
src/linalg/pivot.jl 97.05% <97.05%> (ø)
src/femat.jl 100.00% <100.00%> (ø)
src/schema.jl 75.00% <0.00%> (-25.00%) ⬇️
src/mixedmodel.jl 80.95% <0.00%> (-19.05%) ⬇️
src/linalg/rankUpdate.jl 95.23% <0.00%> (-2.33%) ⬇️
src/arraytypes.jl 91.30% <0.00%> (-2.03%) ⬇️
src/generalizedlinearmixedmodel.jl 82.32% <0.00%> (-1.09%) ⬇️
src/linearmixedmodel.jl 98.91% <0.00%> (-0.27%) ⬇️
src/remat.jl 95.25% <0.00%> (-0.13%) ⬇️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c7c73b0...f1b8bdb.

@palday (Member, Author) commented May 24, 2020

Just had Cholesky pass on the nightly CI with this versioninfo:

 Julia Version 1.6.0-DEV.85
Commit 0413ef0e4d (2020-05-24 02:53 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)

@palday (Member, Author) commented May 24, 2020

And now from a failed run:

Julia Version 1.6.0-DEV.85
Commit 0413ef0e4d (2020-05-24 02:53 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

@palday (Member, Author) commented May 24, 2020

So it really does look like it's the interaction of Intel Skylake and the development version of Julia that's causing this problem.

@andreasnoack Has anything changed in LinearAlgebra since 1.4 that would make the linear algebra more sensitive to processor details?

@palday (Member, Author) commented May 24, 2020

Skylake processors do pass on Julia 1.4 / the older LLVM:

Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

@palday (Member, Author) commented May 25, 2020

So it's not just the newer LLVM version on Skylake. I've found a non-Xeon Skylake machine to run the tests on, and they don't fail (openSUSE 15.0):

Julia Version 1.6.0-DEV.90
Commit c832e47a0a (2020-05-25 08:56 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LOAD_PATH = @:/tmp/jl_Py0VU3

But even on the same nightly, they do fail on Skylake Xeon (Ubuntu 18.04):

Julia Version 1.6.0-DEV.90
Commit c832e47a0a (2020-05-25 08:56 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LOAD_PATH = @:/tmp/jl_6QaBGU

Meanwhile, the Broadwell CI machines continue to pass.

@andreasnoack (Member) commented:

> @andreasnoack Has anything changed in LinearAlgebra since 1.4 that would make the linear algebra more sensitive to processor details?

It might be the version of OpenBLAS. We upgraded recently, and that could plausibly affect AVX512 systems. Try comparing

julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.5  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=32"

There have been issues with AVX512 kernels in OpenBLAS in the past but it might also be that you need to use a different tolerance when determining the rank.
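The tolerance point can be illustrated with a small sketch (Python/NumPy here purely for illustration; MixedModels itself is Julia). On a nearly rank-deficient matrix, the detected rank flips depending on the cutoff applied to the singular values:

```python
import numpy as np

# Build a nearly rank-deficient matrix: the third column is almost
# a linear combination of the first two, up to 1e-10 noise.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 2))
X = np.hstack([X, X[:, :1] + X[:, 1:2] + 1e-10 * rng.standard_normal((100, 1))])

s = np.linalg.svd(X, compute_uv=False)

# With a strict tolerance the noise column counts toward the rank;
# with a looser one it does not.
strict = int(np.sum(s > 1e-12 * s[0]))
loose = int(np.sum(s > 1e-8 * s[0]))
print(strict, loose)  # 3 2
```

The same matrix can thus be judged full rank or rank deficient depending on the tolerance, which is exactly the sort of decision that tiny rounding differences between BLAS kernels can flip.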

@palday (Member, Author) commented May 25, 2020

Skylake machine where the tests passed:

"OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=32"

Skylake where the tests failed:

OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=32

@andreasnoack It does look like the OpenBLAS kernel selection (Haswell vs. SkylakeX) is playing a role here.

@andreasnoack (Member) commented:

You can try starting Julia with OPENBLAS_CORETYPE=haswell julia on the SkylakeX machine to verify that restricting OpenBLAS to the Haswell kernels actually fixes the issue. But again, it might just be due to differences in rounding if the tolerance is too strict.

@palday (Member, Author) commented Aug 8, 2020

There's a middle ground here that I can implement: wrap the BLAS pivoted Cholesky so that the order of the linearly independent columns isn't changed and the linearly dependent columns are all moved to the right. In other words, we let BLAS decide which columns are redundant but otherwise preserve the original column order.
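That middle ground is easy to sketch. A hypothetical helper (Python here for illustration; the actual implementation would be Julia) takes the pivot vector and rank reported by a BLAS pivoted Cholesky and rebuilds an ordering in which the independent columns keep their original relative order and the redundant columns all sit at the end:

```python
import numpy as np

def order_preserving_pivot(piv, rank):
    """Keep the linearly independent columns (the first `rank` entries of
    `piv`) in their original order; push the redundant columns to the right.
    `piv` and `rank` are what a BLAS pivoted Cholesky (e.g. dpstrf) reports."""
    independent = sorted(piv[:rank])
    dependent = sorted(piv[rank:])
    return np.asarray(independent + dependent)

# Suppose BLAS reported pivot [2, 0, 3, 1] with rank 3: columns 2, 0, 3 are
# independent and column 1 is redundant.
print(order_preserving_pivot([2, 0, 3, 1], 3))  # [0 2 3 1]
```

BLAS still makes the (architecture-dependent) call on which columns are redundant; the wrapper only makes the reported ordering predictable.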

@palday (Member, Author) commented Aug 8, 2020

I wrote that last comment and did some tests ... and it seems that the original test that started all of this is so ill-conditioned that even BLAS delivers different answers on different architectures as to which columns should be pivoted. But then again, so does the QR decomposition. In other words, swapping to QR won't save us here.

So then: either we have a really ill-conditioned example or there is a latent issue with the use of AVX instructions in OpenBLAS (or both).

@Nosferican (Collaborator) commented:

I guess we should define well what the best behavior should be. Do we want to keep the full-rank version based on the column ordering? Do we want to make sure to always prefer the intercept? Do we want to keep the full set of contrasts for a categorical variable over just a partial set? Sometimes in applied work one wants to make sure to include the feature of interest, yes, but other than that and the intercept I am not sure. I really dislike Stata's seemingly random algorithm for choosing which columns to drop, so having the choice be deterministic to some extent seems desirable. I would maybe use some measure of variability in the column as a way to "prioritize" the selected columns.

@andreasnoack (Member) commented Aug 10, 2020

A couple of thoughts I had after rereading this issue:

  1. Mixing rank-determination strategies is not a good idea, as they can easily disagree. Any solution should be based on a single procedure. (As I read the thread, this seems to be the consensus as well.)
  2. The pivoted QR and pivoted Cholesky are equivalent in full precision, so I believe it's expected that they give the same conclusion whenever you use the default tolerance; with a lower tolerance, the QR should be able to be more precise than the Cholesky. However, I don't think the extra precision would be useful in this application. (I might be wrong.)
  3. It's probably not possible to ensure that the rank determination is consistent across architectures in all cases. However, it should probably be consistent in most real applications, and it might therefore make sense to adjust the test if it's pathological.
  4. Maybe you want to consider a higher tolerance than the default one. It should make it less likely that noise accidentally increases the rank with a large effect on the coefficients. I guess it would also make the results less sensitive to the architecture.
  5. I'm wondering if it would be possible to get results that are easier to interpret when using the pivoted Cholesky if the variables were scaled differently. I think the intercept is lost because higher-order terms end up with a larger column norm than the intercept, but since the intercept doesn't have to be one, it is, in some sense, a self-created problem. Make the intercept 1e6 and it will have the largest norm and probably won't get dropped. Hence, it might be possible to use a scaling that ensures lower-order terms have a large norm and then apply the inverse scaling after the rank determination.
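Point 5 can be demonstrated with a toy column-pivoted QR (a greedy Gram–Schmidt sketch in Python/NumPy, purely illustrative; the `pivot_order` helper is hypothetical, not anything in the package). With a polynomial model matrix, the intercept has the smallest norm and is pivoted away from the front; pre-scaling it restores its priority:

```python
import numpy as np

def pivot_order(X):
    """Column order chosen by Gram-Schmidt with column pivoting:
    at each step take the remaining column with the largest norm,
    then project it out of the columns that are left."""
    X = X.astype(float).copy()
    remaining = list(range(X.shape[1]))
    order = []
    while remaining:
        norms = [np.linalg.norm(X[:, j]) for j in remaining]
        j = remaining.pop(int(np.argmax(norms)))
        order.append(j)
        q = X[:, j] / np.linalg.norm(X[:, j])
        for k in remaining:
            X[:, k] -= (q @ X[:, k]) * q
    return order

x = np.linspace(0.0, 100.0, 50)
X = np.column_stack([np.ones_like(x), x, x**2])  # intercept, linear, quadratic

first = pivot_order(X)[0]                           # quadratic term has the largest norm
scaled_first = pivot_order(X * [1e6, 1.0, 1.0])[0]  # intercept scaled to the largest norm
print(first, scaled_first)  # 2 0
```

Applying the inverse scaling after the rank determination would recover coefficients on the original scale, as suggested above.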

@palday (Member, Author) commented Sep 14, 2020

@dmbates If you care to review, we can be done with this mess! The pivoted QR is now what we're using; I've left the pivoted Cholesky in for now and made it clear in the docs that we make no promise to keep using the same algorithm for rank determination and pivoting.

Do check out the docs -- they're not my best explanation ever, but if they're good enough and I haven't been infelicitous in my simplifications, we can merge this change in and improve the docs later. Then the CIs will all be happy again.

@dmbates (Collaborator) commented Sep 15, 2020

I have made some suggested edits in rankdeficiency.md. These are suggestions only.

One thing I did try to distinguish more clearly is the difference between singularity of the covariance-matrix estimates for the random effects, which results in the conditional means lying on some kind of hyperplane in one or more dimensions, and singularity or rank deficiency of the random-effects model matrix, which essentially always occurs. The random-effects model matrix for any model with a random intercept contains a set of indicators for the levels of the grouping factor. Even without considering fulldummy, these columns sum to an intercept column, which makes the concatenation of X and Z rank deficient. So we always have some type of rank deficiency in the random-effects model matrix, but it is not a problem because of the shrinkage (regularization).
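The indicator-sum point is easy to verify numerically (a Python/NumPy sketch, illustrative only): with a random intercept, the level indicators in Z sum to a column of ones, so [X Z] loses exactly one rank.

```python
import numpy as np

# A grouping factor with 3 levels, each observed 4 times.
groups = np.repeat([0, 1, 2], 4)
Z = (groups[:, None] == np.arange(3)).astype(float)  # level indicators
X = np.column_stack([np.ones(12), np.arange(12.0)])  # intercept + covariate

# The indicator columns sum to the intercept column, so the
# concatenation of X and Z is rank deficient by exactly one.
XZ = np.hstack([X, Z])
print(XZ.shape[1], np.linalg.matrix_rank(XZ))  # 5 4
```

As noted above, this structural deficiency is harmless in the mixed model because the random effects are shrunk (regularized) rather than estimated by ordinary least squares.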

@palday (Member, Author) commented Sep 15, 2020

Thanks @dmbates! I think it reads better now and sharpening the distinction between the two types of singularity in the RE is a nice addition.

Also, I apparently cannot type redudant.

@palday (Member, Author) commented Sep 15, 2020

If you're happy, then let's squash and merge (and please remove all the redundant stuff from the long-form commit message).

@dmbates (Collaborator) commented Sep 15, 2020 via email

@palday (Member, Author) commented Sep 15, 2020

I can, but there's no need to do so: the squash will collapse everything into a single commit and you can edit the commit message through the web interface. Should I just do that?

@dmbates (Collaborator) commented Sep 15, 2020 via email
