
feat: add simulated sparse data #13

Open

jolars opened this issue May 23, 2022 · 9 comments
@jolars
Collaborator

jolars commented May 23, 2022

Consider adding functions to simulate sparse data (binary X), with correlation structure, which should be useful when benchmarking in the p >> n regime.

@mathurinm
Collaborator

You can use the X_density parameter in https://benchopt.github.io/generated/benchopt.datasets.simulated.make_correlated_data.html#benchopt.datasets.simulated.make_correlated_data

@jolars
Collaborator Author

jolars commented May 23, 2022

> You can use the X_density parameter in https://benchopt.github.io/generated/benchopt.datasets.simulated.make_correlated_data.html#benchopt.datasets.simulated.make_correlated_data

That's great, thanks! But the current implementation doesn't really work when it comes to correlation + sparsity, right?

@mathurinm
Collaborator

Doesn't it? We create X the standard way, then decimate it: https://github.com/benchopt/benchopt/blob/main/benchopt/datasets/simulated.py#L93

Since the decimation is iid and independent of X, it seems to me that the correlation matrix is just multiplied by X_density, so the correlation structure is preserved.

@jolars
Collaborator Author

jolars commented May 24, 2022

No, I don't think so, since you're uniformly decimating it. If you have two columns, for instance, and only keep a single nonzero value in each column, then it's very likely that these two values are going to be at two very different indices, right? See here:

import numpy as np
from benchopt.datasets.simulated import make_correlated_data

n = 10_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01

# Dense case: empirical correlations should be close to rho (here 0.9).
A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)

print(np.corrcoef(A.T))
#> [[1.         0.90090375 0.81264968]
#>  [0.90090375 1.         0.9021679 ]
#>  [0.81264968 0.9021679  1.        ]]

# Sparse case: X_density=0.01 decimates X, so B is returned as a sparse matrix.
B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)

print(np.corrcoef(B.T.toarray()))
#> [[1.00000000e+00 9.28859390e-05 3.68998428e-03]
#>  [9.28859390e-05 1.00000000e+00 7.56397951e-03]
#>  [3.68998428e-03 7.56397951e-03 1.00000000e+00]]

@mathurinm
Collaborator

Let $\delta_i$ be the decimation Bernoulli variable, with expectation $\rho$ (here $\rho$ is the column density X_density, not the correlation parameter). Note that $\delta_i = \delta_i^2$.
For the correlation between two different features $x$ and $x'$, I have at the numerator:

$$E\left[\sum_i \delta_i x_i \, \delta_i' x_i'\right] = \rho^2 \, E\left[\sum_i x_i x_i'\right]$$

(this only works when $\delta_i$ is independent from $\delta_i'$, i.e. when we are looking at the correlation between two different features),

while at the denominator:

$$\sqrt{E\left[\sum_i \delta_i^2 x_i^2\right]} = \sqrt{\rho \, E\left[\sum_i x_i^2\right]}$$

So the numerator picks up a factor $\rho^2$, while the denominator picks up $\sqrt{\rho}$ twice (once for each of the two features). Thus, in total, the new correlation is multiplied by $\rho$ outside the diagonal?

@mathurinm
Collaborator

Same snippet but with 1e6 samples:
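(Presumably the same code as in the snippet above with n = 1_000_000; reconstructed here for reference, assuming the other parameters and random_state are unchanged.)

import numpy as np
from benchopt.datasets.simulated import make_correlated_data

# Same setup as before, but with 1e6 samples.
n = 1_000_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01

# Dense case.
A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)
print(np.corrcoef(A.T))

# Sparse (decimated) case.
B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)
print(np.corrcoef(B.T.toarray()))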

[[1.         0.89992156 0.80996923]
 [0.89992156 1.         0.8999785 ]
 [0.80996923 0.8999785  1.        ]]
[[1.         0.00910993 0.00873657]
 [0.00910993 1.         0.00923818]
 [0.00873657 0.00923818 1.        ]]

so the off-diagonal correlations get multiplied by 0.01 ($\rho$, the density)

Regarding: "and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right"
yes, but once in a while (1 out of n_samples), the indices match and you get a non zero expectation.

@jolars
Collaborator Author

jolars commented May 24, 2022 via email

@mathurinm
Collaborator

If you want independent supports from one column to the other (a legitimate assumption IMO), I suppose that it's not possible to have correlation higher than the column density (with the decimation scheme above, the off-diagonal correlation is the density times the original correlation, which is at most the density), but if you find a way I'm interested!

@jolars
Collaborator Author

jolars commented May 24, 2022

> If you want independent supports from one column to the other (a legitimate assumption IMO) I suppose that it's not possible to have correlation higher than the column density, but if you find a way I'm interested !

Well... I guess that depends on what you consider the zeros to be. If you think of them as values just like the non-zeros, then I don't see why the supports should be independent. If you consider them to be data missing completely at random, then sure, it would not make sense for the supports to be correlated.

If we consider binary data instead, does that change things for you? There is of course a lot of very sparse data with binary values (e.g. microarray data) and highly correlated columns, and you cannot simulate that type of data unless you allow the sparsity pattern itself to be correlated.
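One possible way to simulate that kind of data, sketched below for reference, is to threshold correlated Gaussians (a Gaussian-copula-style construction; this is not part of benchopt, and the function name, the Toeplitz correlation choice, and the density/rho parameters are only illustrative assumptions):

import numpy as np
from scipy.stats import norm

def make_correlated_binary(n, p, rho=0.9, density=0.01, random_state=0):
    # Sketch: binary X whose supports are correlated, via a thresholded
    # Gaussian copula. For illustration only; not an existing benchopt function.
    rng = np.random.default_rng(random_state)
    # Latent Gaussian with correlation rho**|i - j| between columns i and j.
    corr = rho ** np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
    L = np.linalg.cholesky(corr)
    Z = rng.standard_normal((n, p)) @ L.T
    # Threshold so that each column has the requested expected density.
    threshold = norm.ppf(1 - density)
    return (Z > threshold).astype(float)

X = make_correlated_binary(100_000, 3)
print(X.mean(axis=0))    # column densities, around 0.01
print(np.corrcoef(X.T))  # off-diagonal correlations well above the density

The point is just that the support correlation is then controlled by rho rather than being capped at the column density.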
