Calculation of variance explained appears wrong #317

Closed
tsalo opened this issue May 30, 2019 · 9 comments
Labels
discussion (issues that still need to be discussed), TE-dependence (issues related to TE dependence metrics and component selection)

Comments

@tsalo
Member

tsalo commented May 30, 2019

Summary

Variance explained and normalized variance explained are calculated within dependence_metrics for PCA and ICA components. However, variance explained is also returned directly by the PCA-fitting method, and the varex values returned by the PCA fit differ from the ones calculated by our function.

Additional Detail

Here is where we calculate variance explained and normalized variance explained in dependence_metrics:

tedana/tedana/model/fit.py

Lines 152 to 154 in 65f89e1

varex[i_comp] = (tsoc_B[:, i_comp]**2).sum() / totvar * 100.
varex_norm[i_comp] = (utils.unmask(WTS, mask)[t2s != 0][:, i_comp]**2).sum() /\
    totvar_norm * 100.

The values are not very similar (for example, for PCA the "real" varex might be 37.8%, while the estimated varex is 57.6% and the estimated normalized varex is 53.8%), but they are highly correlated across components.
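As a reference point for the comparison above, here is a minimal NumPy sketch on synthetic data. It is not tedana code: the array names and sizes are made up, and PCA is done with a plain SVD rather than tedana's actual PCA routine. It shows that when the parameter estimates are fit to the same de-meaned data the PCA was run on, against z-scored component time series, the normalized sum of squared estimates reproduces the PCA's explained-variance percentages.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_vols = 200, 50

# Synthetic stand-in for optimally combined data (voxels x time),
# with a different scale per voxel, de-meaned over time.
voxel_scale = rng.uniform(0.5, 5.0, size=(n_voxels, 1))
data = rng.standard_normal((n_voxels, n_vols)) * voxel_scale
data -= data.mean(axis=1, keepdims=True)

# PCA via SVD: rows of vt are the component time series.
u, s, vt = np.linalg.svd(data, full_matrices=False)

# De-meaning removes one degree of freedom, so keep n_vols - 1 components.
n_comp = n_vols - 1
varex_pca = 100 * s[:n_comp] ** 2 / np.sum(s[:n_comp] ** 2)

# Z-score the component time series to build a mixing matrix (time x components).
comp_ts = vt[:n_comp].T
mmix = (comp_ts - comp_ts.mean(axis=0)) / comp_ts.std(axis=0)

# Parameter estimates of each component on each voxel (voxels x components).
betas = np.linalg.lstsq(mmix, data.T, rcond=None)[0].T

# Same normalization as the snippet above: squared estimates summed over
# voxels, divided by the total across all components.
varex_betas = 100 * (betas ** 2).sum(axis=0) / (betas ** 2).sum()

print(np.allclose(varex_pca, varex_betas))
```

In this idealized setting the two calculations agree, so a gap like 37.8% versus 57.6% would have to come from something this sketch does not model, for instance the PCA and the regression operating on differently scaled data. That is only a guess and is not established anywhere in this thread.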

tsalo added the discussion label (issues that still need to be discussed) on May 30, 2019
tsalo changed the title from "Calculation of variance explained within dependence_metrics appears wrong" to "Calculation of variance explained appears wrong" on May 30, 2019
@jbteves
Collaborator

jbteves commented Jun 1, 2019

Thanks for bringing this up. When you say that the variance explained is highly correlated across components, do you mean that the variance explained appears to at least grow with the amount of variance that a component is purported to explain?

@tsalo
Member Author

tsalo commented Jun 1, 2019

I guess you could say that. The values differ by calculation method but are correlated across methods. I have high confidence in one method (the values returned by the PCA fit), but it only works for PCA. So I think we need to either come up with a more accurate method to implement in dependence_metrics, or figure out how to calculate variance explained for ICA specifically (i.e., in tedica) so we can use the "official" version for both decompositions.

@tsalo
Member Author

tsalo commented Jul 18, 2019

We can calculate voxel-wise variance explained and then average it across voxels to get an estimate of component-wise variance explained. These values are very similar to the PCA-based variance explained values, and can be calculated for both PCA and ICA.

Here's what the code could look like in dependence_metrics:

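# De-mean each voxel's time series, then compute its total variance.
tsoc_dm = tsoc - np.mean(tsoc, axis=-1, keepdims=True)
totvar = np.var(tsoc_dm, axis=1)
...
LGR.info('Fitting TE- and S0-dependent models to components')
for i_comp in range(n_components):
    # Data predicted by this component alone (voxels x time).
    comp_pred_data = np.dot(mmix[:, i_comp:i_comp+1], tsoc_B[:, i_comp:i_comp+1].T).T
    compvar = np.var(comp_pred_data, axis=1)
    # Average the voxel-wise fraction of variance explained across voxels.
    comptable.loc[i_comp, 'variance explained'] = np.mean(compvar / totvar)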
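
For anyone who wants to try the idea outside of tedana, here is a self-contained sketch of the same calculation on synthetic data. The arrays are hypothetical stand-ins for tedana's variables (`mmix` as time x components, `tsoc` as voxels x time), and `tsoc_B` is estimated here with ordinary least squares, which is an assumption about how the parameter estimates are obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_vols, n_components = 100, 60, 5

# Hypothetical stand-ins for tedana's variables.
mmix = rng.standard_normal((n_vols, n_components))          # time x components
true_weights = rng.standard_normal((n_voxels, n_components))
noise = rng.standard_normal((n_voxels, n_vols))
tsoc = true_weights @ mmix.T + noise + 100.0                # voxels x time

# De-mean each voxel and compute its total variance over time.
tsoc_dm = tsoc - tsoc.mean(axis=-1, keepdims=True)
totvar = np.var(tsoc_dm, axis=1)

# Least-squares parameter estimates (voxels x components).
tsoc_B = np.linalg.lstsq(mmix, tsoc_dm.T, rcond=None)[0].T

varex = np.zeros(n_components)
for i_comp in range(n_components):
    # Data predicted by this component alone (voxels x time).
    comp_pred_data = np.outer(tsoc_B[:, i_comp], mmix[:, i_comp])
    compvar = np.var(comp_pred_data, axis=1)
    # Fraction of each voxel's variance explained, averaged over voxels.
    varex[i_comp] = np.mean(compvar / totvar)

print(varex)
```

Note that these values are fractions rather than percentages, and they need not sum to 1: residual noise is left unexplained, and components that are correlated with each other share variance.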

Otherwise, I can sort of see the logic behind using parameter estimates as a proxy for variance explained, since the IVs (the mixing matrix) are all supposed to be z-scored, but I think it would only make sense if the data were also z-scored. I also don't know why we have separate "variance explained" and "normalized variance explained" values.

@tsalo
Member Author

tsalo commented Jul 18, 2019

Since "normalized variance explained" is only used within the PCA decision tree, and "variance explained" is only used within the ICA decision trees (both v2.5 and v3.2), I propose that we merge the two. As mentioned above, I don't know what the conceptual difference is between the two measures. They both sum to standardized values (100 or 1), and they at least appear interchangeable.

@tsalo
Member Author

tsalo commented Jul 20, 2019

Okay, so it looks like squared parameter estimates do match up to variance explained, but only if both the DVs and the IVs have unit variance (which I believe means that the parameter estimates are actually beta values). If we z-score the mixing matrix and the optimally combined data within kundu_fit, and then square the resulting betas, we'll have voxel-wise estimates of variance explained. We can then just average them across voxels to get a variance explained value per component. I think this is a bit simpler than the method I described above.
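
The reasoning here can be checked numerically: for a single component, var(beta * m) / var(y) = beta**2 * var(m) / var(y), which reduces to beta**2 only when var(m) = var(y) = 1. The sketch below uses synthetic data and hypothetical helper names (`zscore`, `fit_betas`, `direct_varex`), with ordinary least squares standing in for however the estimates are actually computed in kundu_fit. It compares squared estimates against the directly computed per-component variance explained, once with z-scored inputs and once with raw inputs.

```python
import numpy as np

rng = np.random.default_rng(42)
n_voxels, n_vols, n_components = 50, 120, 4

# Synthetic mixing matrix (time x components) with unequal column variances,
# and synthetic data (voxels x time) built from it plus noise.
mmix = rng.standard_normal((n_vols, n_components)) * np.array([1.0, 2.0, 3.0, 4.0])
weights = rng.standard_normal((n_voxels, n_components))
data = weights @ mmix.T + 5.0 * rng.standard_normal((n_voxels, n_vols))


def zscore(x, axis):
    """Scale to zero mean and unit variance along the given axis."""
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)


def fit_betas(design, y):
    """Least-squares estimates, returned as voxels x components."""
    return np.linalg.lstsq(design, y.T, rcond=None)[0].T


def direct_varex(betas, design, y):
    """Mean over voxels of var(single-component prediction) / var(data)."""
    totvar = np.var(y, axis=1)
    return np.array([
        np.mean(np.var(np.outer(betas[:, i], design[:, i]), axis=1) / totvar)
        for i in range(design.shape[1])
    ])


# Case 1: both the mixing matrix and the data are z-scored.
mmix_z, data_z = zscore(mmix, axis=0), zscore(data, axis=1)
betas_z = fit_betas(mmix_z, data_z)
varex_from_betas = (betas_z ** 2).mean(axis=0)
varex_direct = direct_varex(betas_z, mmix_z, data_z)

# Case 2: raw mixing matrix and de-meaned (but not z-scored) data.
data_dm = data - data.mean(axis=1, keepdims=True)
betas_raw = fit_betas(mmix, data_dm)
varex_raw_sq = (betas_raw ** 2).mean(axis=0)
varex_raw_direct = direct_varex(betas_raw, mmix, data_dm)

print(np.allclose(varex_from_betas, varex_direct))   # z-scored: the two match
print(np.allclose(varex_raw_sq, varex_raw_direct))   # raw: they do not
```

The match in the z-scored case holds whether or not the components are correlated with each other, because it only concerns the variance of each component's own prediction. With correlated components, those per-component values can overlap and so should not be read as unique contributions.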

tsalo added the TE-dependence label (issues related to TE dependence metrics and component selection) on Oct 4, 2019
@emdupre
Member

emdupre commented Nov 8, 2019

Do we have a PR open for this?

@tsalo
Member Author

tsalo commented Nov 8, 2019

No PR. I don't know enough about variance explained in ICA to be sure about my conclusions. I was hoping that others could weigh in on it before trying to change it in tedana.

@emdupre
Member

emdupre commented Nov 8, 2019

I'd say we could roll this into #84. Maybe once we document what we're doing it will be clearer whether we need to change it?

@stale

stale bot commented Feb 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana! :tada:
