feat: add simulated sparse data #13
You can use the `X_density` parameter of `make_correlated_data`.
That's great, thanks! But the current implementation doesn't really work when it comes to correlation + sparse, right?
Doesn't it? We create X the standard way, then decimate it: https://github.com/benchopt/benchopt/blob/main/benchopt/datasets/simulated.py#L93 Since the decimation is iid and independent of X, it seems to me that the correlation matrix is just multiplied by a constant factor.
No, I don't think so, since you're uniformly decimating it. If you have two columns, for instance, and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right? See here:

```python
import numpy as np
from benchopt.datasets.simulated import make_correlated_data

n = 10_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01

A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)
print(np.corrcoef(A.T))
#> [[1.         0.90090375 0.81264968]
#>  [0.90090375 1.         0.9021679 ]
#>  [0.81264968 0.9021679  1.        ]]

B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)
print(np.corrcoef(B.T.toarray()))
#> [[1.00000000e+00 9.28859390e-05 3.68998428e-03]
#>  [9.28859390e-05 1.00000000e+00 7.56397951e-03]
#>  [3.68998428e-03 7.56397951e-03 1.00000000e+00]]
```
Let $d$ be the column density (`X_density`), so each entry of X is kept independently with probability $d$. At the numerator of the correlation you get a factor $d^2$ (both entries must survive the decimation), while at the denominator you get a factor $d$ (the variance of each decimated column is scaled by $d$). So the correlation gets multiplied by $d$.
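For reference, here is that computation written out (my notation, not from the thread; assume zero-mean, unit-variance columns $x_i, x_j$ with correlation $\rho$, and independent Bernoulli masks $\delta_i, \delta_j$ with $\mathbb{E}[\delta] = d$ and $\delta^2 = \delta$):

$$
\mathrm{cov}(\delta_i x_i,\, \delta_j x_j) = \mathbb{E}[\delta_i]\,\mathbb{E}[\delta_j]\,\mathbb{E}[x_i x_j] = d^2 \rho,
\qquad
\mathrm{var}(\delta_i x_i) = \mathbb{E}[\delta_i]\,\mathbb{E}[x_i^2] = d,
$$

so

$$
\mathrm{corr}(\delta_i x_i,\, \delta_j x_j) = \frac{d^2 \rho}{\sqrt{d}\,\sqrt{d}} = d\,\rho .
$$

With $d = 0.01$ and $\rho = 0.9$ this predicts off-diagonal correlations around $0.009$.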
Same snippet but with 1e6 samples:

```
[[1.         0.89992156 0.80996923]
 [0.89992156 1.         0.8999785 ]
 [0.80996923 0.8999785  1.        ]]
[[1.         0.00910993 0.00873657]
 [0.00910993 1.         0.00923818]
 [0.00873657 0.00923818 1.        ]]
```

so the correlation $\rho$ gets multiplied by 0.01 (the value of `X_density`).

Regarding: "and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right" — yes, but once in a while (1 out of n_samples), the indices match and you get a non-zero expectation.
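A standalone way to check the $d\rho$ prediction without going through benchopt (a minimal sketch of my own, applying a plain iid Bernoulli mask to two correlated Gaussian columns):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, density = 1_000_000, 0.9, 0.01

# Two zero-mean, unit-variance columns with correlation rho.
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# Decimate each column with its own iid Bernoulli(density) mask.
mask1 = rng.random(n) < density
mask2 = rng.random(n) < density

print(np.corrcoef(x1 * mask1, x2 * mask2)[0, 1])
# roughly density * rho = 0.009, up to Monte Carlo noise
```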
Right, but you don't get the nominal 0.9; isn't that what you'd want even when X is sparse?
If you want independent supports from one column to the other (a legitimate assumption IMO), I suppose it's not possible to have a correlation higher than the column density, but if you find a way I'm interested!
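Under the iid-mask model sketched above (my framing), that cap follows directly: since $|\rho| \le 1$,

$$
|\mathrm{corr}(\delta_i x_i,\, \delta_j x_j)| = d\,|\rho| \le d,
$$

so with independent supports of density $d$, the correlation of the decimated columns can never exceed the column density.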
Well... I guess that depends on what you consider the zeros to be. If you think they are values just like the non-zeros, then I don't see why the supports should be independent. If you consider them to be missing data completely at random, then sure, it would not make sense to have it be correlated. If we consider binary data instead, does that change things for you? Because there is of course a lot of very sparse data with binary values (e.g. microarray data) and highly correlated columns, and you cannot simulate that type of data unless you allow the sparsity pattern to be correlated too.
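One way to get that kind of data (a sketch of my own, not existing benchopt code: the function name, the quantile-thresholding approach, and the assumption that `make_correlated_data` returns roughly standard-normal columns are all mine) is to threshold correlated Gaussian columns at the quantile matching the target density, so the supports inherit the correlation:

```python
import numpy as np
from scipy import sparse
from scipy.stats import norm

from benchopt.datasets.simulated import make_correlated_data


def make_correlated_binary_data(n_samples, n_features, rho=0.9, density=0.01,
                                random_state=None):
    """Sketch: binary sparse X with correlated columns and correlated supports.

    Draw correlated (approximately standard-normal) columns, then set an entry
    to 1 whenever it exceeds the (1 - density) normal quantile. Because the
    underlying Gaussians are correlated, the sparsity patterns stay correlated
    instead of being independent across columns.
    """
    Z, _, _ = make_correlated_data(
        n_samples, n_features, rho=rho, random_state=random_state
    )
    threshold = norm.ppf(1 - density)  # marginal P(entry == 1) ~= density
    return sparse.csc_matrix((Z > threshold).astype(float))


X = make_correlated_binary_data(100_000, 3, rho=0.9, density=0.01,
                                random_state=0)
print(X.mean(axis=0))              # column densities, close to 0.01
print(np.corrcoef(X.toarray().T))  # off-diagonal correlations well above 0.01
```

The binary correlations will not equal the Gaussian $\rho$ exactly (thresholding attenuates them), but they can sit far above the column density, which is impossible with independent supports.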
Consider adding functions to simulate sparse data (binary X) with a correlation structure, which should be useful when benchmarking in the p >> n regime.