Mutual information is greater than information entropy #11
Hi, thanks for raising the issue. Both packages implement the Kozachenko–Leonenko estimator for entropy and the Kraskov et al. estimator for mutual information. At first I thought the difference might come down to the parameterisation, but then I noticed that you had already matched the
If you read the paper, you will notice that the correct equation (equation 8) is instead:

I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(nx + 1) + ψ(ny + 1)⟩
I tested my implementation against distributions for which the mutual information can be computed analytically, so I am fairly sure that this equation is not only the intended but also the correct one. Tl;dr: they may have put some parentheses in the wrong place.
Sorry, I don't see the difference between eq. (8) in the paper and mi (line 70 of _mutual_info.py) in sklearn. What is the problem with the brackets (parentheses)? Or maybe I didn't understand what you mean.
np.mean(digamma(nx+1) + digamma(ny+1)) != np.mean(digamma(nx+1)) + np.mean(digamma(ny+1)). The expression in the paper includes the left-hand term; the code in scikit-learn uses the term on the right.
By linearity of expectation, the expectation of a sum is equal to the sum of the expectations. So np.mean(digamma(nx+1) + digamma(ny+1)) == np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))
Only if nx and ny are uncorrelated. Which they are not.
Might be having a bit of a brain fart, though. I am having a cold, and every thought takes ages.
I think you are right. Had to run some numbers at the IPython prompt to help my reduced mental capacities understand basic math again. In that case, I don't know where the difference comes from, at least not today. I will look into it when my brain is in working condition again.
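For the record, linearity of expectation does hold even for correlated terms. A quick numeric check, using made-up illustrative neighbour counts nx and ny that are deliberately correlated:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
nx = rng.integers(1, 50, size=1000)
ny = nx + rng.integers(0, 5, size=1000)  # strongly correlated with nx

lhs = np.mean(digamma(nx + 1) + digamma(ny + 1))
rhs = np.mean(digamma(nx + 1)) + np.mean(digamma(ny + 1))
print(np.isclose(lhs, rhs))  # True: the mean of a sum is the sum of the means
```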
Thank you for your careful reply. Good health comes first; I wish you a speedy recovery. Maybe you'll know the answer once you have recovered from the cold.
Actually, I don't think there is a difference at all. The definitional or so-called "naive" estimator of the mutual information is: I(X;Y) = H(X) + H(Y) - H(X,Y). If we plug in your example:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

print(mutual_info_regression(x.reshape(-1, 1), y.reshape(-1), discrete_features=False, n_neighbors=3))
# [2.31452164]

hx = continuous.get_h(x, k=3)
hy = continuous.get_h(y, k=3)
hxy = continuous.get_h(np.c_[x, y], k=3)
mi = hx + hy - hxy
print(mi)
# 2.325853446732216
```

I would say those estimates are pretty close, given that we are using two different methods to get the result.
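As a further sanity check: for jointly Gaussian variables the mutual information has the closed form I = -0.5 log(1 - rho^2), and the example above is exactly that case (Var(X) = 1, Y = X + 0.1·Z, so Var(Y) = 1.01 and Cov(X,Y) = 1). A quick sketch comparing the estimates against the analytic value:

```python
import numpy as np

# Squared correlation for X ~ N(0, 1), Y = X + 0.1 * Z with Z ~ N(0, 1):
# rho^2 = Cov^2 / (Var(X) * Var(Y)) = 1 / 1.01
rho2 = 1.0 / 1.01
mi_analytic = -0.5 * np.log(1 - rho2)
print(mi_analytic)  # ~2.31 nats, matching both estimators above
```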
I find mi > hx, which puzzles me.
That continues to be a strong point. However, I am by now fully convinced that the entropy computations are fine:

```python
import scipy.stats as st
from entropy_estimators import continuous

distribution = st.norm(0, 1)
analytical = distribution.entropy()
empirical = continuous.get_h(distribution.rvs(10000), k=3)
print(analytical, empirical)
# 1.4189385332046727 1.4197821857006883
```

This leaves the following options:

1. The mutual information estimator is broken.
2. The naive entropy-based computation is broken.
3. The expectation that the mutual information cannot exceed the entropy does not hold here.

My money is on option 3.
Actually, options 1 and 2 are ruled out as possible explanations by the fact that the naive estimator that I implemented above returns the same result for the mutual information as the KSG estimator...
Thanks a lot. I'll take the time to study it again.
Mutual information is not necessarily less than information entropy. I was misled by a picture on Wikipedia. |
Why is mutual information not necessarily less than information entropy?
mi is larger than entropy. This is indeed a serious problem. |
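A minimal numeric illustration of why this is possible for continuous variables: the differential entropy of a narrow Gaussian is negative, while mutual information is always non-negative, so any positive MI already exceeds it. (The numbers here are illustrative, not taken from the example above.)

```python
import numpy as np

# Differential entropy of N(0, sigma^2) in nats: h = 0.5 * ln(2*pi*e*sigma^2).
# For small sigma this is negative, while I(X;Y) >= 0 always,
# so the mutual information trivially exceeds the (differential) entropy.
sigma = 0.01
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h)  # about -3.19 nats: negative!
```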
The fact that the mutual information can be larger than the entropy is due to the fact that the continuous Shannon entropy formula, H[X] = -∫ p(x) log p(x) dx, is not correct. Jaynes found that the correct formula should instead be H[X] = -∫ p(x) log(p(x) / m(x)) dx,¹ where m(x) is an invariant measure. This way, the entropy becomes invariant under rescaling of x. A first-order fix to make the entropy more meaningful is to use a scaler, e.g. RobustScaler, before performing entropy estimation. Nevertheless, this does not ensure that the mutual information stays below the entropy.
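The scale dependence is easy to check with scipy: rescaling X shifts its differential entropy by the log of the scale factor, h(aX) = h(X) + log|a|, so entropy values in nats depend on the units of the data:

```python
import numpy as np
import scipy.stats as st

h1 = st.norm(0, 1).entropy()  # differential entropy of N(0, 1)
h2 = st.norm(0, 2).entropy()  # N(0, 2): same shape, doubled scale
print(h2 - h1)  # log(2): entropy shifts by the log of the scale factor
```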
For example, in
How can
The H(X,Y) is from a 10000×10000-dimensional vector, and you can't just get a joint distribution from marginal distributions, so typically you have to assume it is a multivariable normal distribution, and that's what you do when the marginal distributions are normal. But if the marginal distribution is not normal, what can we assume?
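For what it's worth, the k-NN (Kozachenko–Leonenko) estimator used above is nonparametric: it estimates H(X,Y) directly from nearest-neighbour distances among the joint samples, with no normality assumption. A minimal sketch of the idea (my own, using scipy's KD-tree; not the actual implementation in entropy_estimators):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko k-NN entropy estimate in nats.

    samples: (N,) or (N, d) array; no distributional assumption is made.
    """
    x = samples.reshape(len(samples), -1)
    n, d = x.shape
    # Distance from each point to its k-th nearest neighbour (excluding itself).
    eps = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # Log volume of the d-dimensional unit ball.
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

rng = np.random.default_rng(0)
est = kl_entropy(rng.standard_normal(5000))
print(est)  # close to 1.4189, the analytic entropy of N(0, 1)
```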
@singledoggy You seem to have a separate question. Please open another issue. |
Another example is two independent normal distributions, as the
That’s a nice intuitive counterexample! (I propose closing this issue with that.) |
Hi everyone, to follow up on this discussion. As @singledoggy and I have explained, the reason that the mutual information of continuous variables can be larger than the entropies is that taking the limit of vanishing bin width drops a diverging term, so the differential entropy no longer bounds the mutual information. For anyone who actually needs to estimate a normalized mutual information, and for whom it is not enough to know why the Kraskov estimator fails: I've recently published an article, arXiv:2405.04980, where I discuss this issue in detail and present a generalization of the Kraskov estimator that is able to estimate normalized mutual information as well. The source code is available on GitHub at moldyn/NorMI.
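The diverging term is easy to see numerically: the discrete entropy of a binned continuous variable grows like log(1/Δ) as the bin width Δ shrinks, so the quantity that actually bounds the mutual information goes to infinity in the continuum limit. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

def binned_entropy(x, width):
    """Discrete Shannon entropy (nats) of x quantized to bins of given width."""
    counts = np.bincount(((x - x.min()) / width).astype(int))
    p = counts[counts > 0] / len(x)
    return -np.sum(p * np.log(p))

h1 = binned_entropy(x, 0.1)   # ~ h(X) + ln(1/0.1)
h2 = binned_entropy(x, 0.01)  # ~ h(X) + ln(1/0.01)
print(h2 - h1)  # close to ln(10): the binned entropy diverges as width -> 0
```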
@braniii Thanks for taking the time to explain the issue clearly and providing an alternative. I have linked your last comment at the top of the README to direct people your way. |
Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows: