feat: add simulated sparse data #13
You can use the `X_density` parameter of `make_correlated_data`.
That's great, thanks! But the current implementation doesn't really work when it comes to correlation + sparse, right?
Doesn't it? We create X the standard way, then decimate it: https://github.com/benchopt/benchopt/blob/main/benchopt/datasets/simulated.py#L93 Since the decimation is iid and independent of X, it seems to me that the correlation matrix is just multiplied by a constant factor.
No, I don't think so, since you're uniformly decimating it. If you have two columns, for instance, and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right? See here:

```python
import numpy as np
from benchopt.datasets.simulated import make_correlated_data

n = 10_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01

A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)
print(np.corrcoef(A.T))
#> [[1.         0.90090375 0.81264968]
#>  [0.90090375 1.         0.9021679 ]
#>  [0.81264968 0.9021679  1.        ]]

B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)
print(np.corrcoef(B.T.toarray()))
#> [[1.00000000e+00 9.28859390e-05 3.68998428e-03]
#>  [9.28859390e-05 1.00000000e+00 7.56397951e-03]
#>  [3.68998428e-03 7.56397951e-03 1.00000000e+00]]
```
Let $d$ be the column density (`X_density`), so each entry of X is kept independently with probability $d$. At the numerator of the correlation you get a factor $d^2$ (both entries must survive the decimation), while at the denominator you get a factor $d$ (the variance of each decimated column is scaled by $d$). So the correlation gets multiplied by $d$.
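For reference, here is that computation written out (my notation, not from the thread; assume zero-mean, unit-variance columns $x_i, x_j$ with correlation $\rho$, and independent Bernoulli masks $\delta_i, \delta_j$ with $\mathbb{E}[\delta] = d$ and $\delta^2 = \delta$):

$$
\mathrm{cov}(\delta_i x_i,\, \delta_j x_j) = \mathbb{E}[\delta_i]\,\mathbb{E}[\delta_j]\,\mathbb{E}[x_i x_j] = d^2 \rho,
\qquad
\mathrm{var}(\delta_i x_i) = \mathbb{E}[\delta_i]\,\mathbb{E}[x_i^2] = d,
$$

so

$$
\mathrm{corr}(\delta_i x_i,\, \delta_j x_j) = \frac{d^2 \rho}{\sqrt{d}\,\sqrt{d}} = d\,\rho .
$$

With $d = 0.01$ and $\rho = 0.9$ this predicts off-diagonal correlations around $0.009$.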
Same snippet but with 1e6 samples:

```
[[1.         0.89992156 0.80996923]
 [0.89992156 1.         0.8999785 ]
 [0.80996923 0.8999785  1.        ]]
[[1.         0.00910993 0.00873657]
 [0.00910993 1.         0.00923818]
 [0.00873657 0.00923818 1.        ]]
```

so the correlation $\rho$ gets multiplied by 0.01 (the value of `X_density`).

Regarding: "and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right" — yes, but once in a while (1 out of n_samples), the indices match and you get a non-zero expectation.
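A standalone way to check the $d\rho$ prediction without going through benchopt (a minimal sketch of my own, applying a plain iid Bernoulli mask to two correlated Gaussian columns):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, density = 1_000_000, 0.9, 0.01

# Two zero-mean, unit-variance columns with correlation rho.
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# Decimate each column with its own iid Bernoulli(density) mask.
mask1 = rng.random(n) < density
mask2 = rng.random(n) < density

print(np.corrcoef(x1 * mask1, x2 * mask2)[0, 1])
# roughly density * rho = 0.009, up to Monte Carlo noise
```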
Right, but you don't get the nominal 0.9; isn't that what you'd want even when X is sparse?
If you want independent supports from one column to the other (a legitimate assumption IMO), I suppose it's not possible to have a correlation higher than the column density, but if you find a way I'm interested!
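Under the iid-mask model sketched above (my framing), that cap follows directly: since $|\rho| \le 1$,

$$
|\mathrm{corr}(\delta_i x_i,\, \delta_j x_j)| = d\,|\rho| \le d,
$$

so with independent supports of density $d$, the correlation of the decimated columns can never exceed the column density.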
Well... I guess that depends on what you consider the zeros to be. If you think they are values just like the non-zeros, then I don't see why the supports should be independent. If you consider them to be missing data completely at random, then sure, it would not make sense to have it be correlated. If we consider binary data instead, does that change things for you? Because there is of course a lot of very sparse data with binary values (e.g. microarray data) and highly correlated columns, and you cannot simulate that type of data unless you allow the sparsity pattern to be correlated too.
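One way to get that kind of data (a sketch of my own, not existing benchopt code: the function name, the quantile-thresholding approach, and the assumption that `make_correlated_data` returns roughly standard-normal columns are all mine) is to threshold correlated Gaussian columns at the quantile matching the target density, so the supports inherit the correlation:

```python
import numpy as np
from scipy import sparse
from scipy.stats import norm

from benchopt.datasets.simulated import make_correlated_data


def make_correlated_binary_data(n_samples, n_features, rho=0.9, density=0.01,
                                random_state=None):
    """Sketch: binary sparse X with correlated columns and correlated supports.

    Draw correlated (approximately standard-normal) columns, then set an entry
    to 1 whenever it exceeds the (1 - density) normal quantile. Because the
    underlying Gaussians are correlated, the sparsity patterns stay correlated
    instead of being independent across columns.
    """
    Z, _, _ = make_correlated_data(
        n_samples, n_features, rho=rho, random_state=random_state
    )
    threshold = norm.ppf(1 - density)  # marginal P(entry == 1) ~= density
    return sparse.csc_matrix((Z > threshold).astype(float))


X = make_correlated_binary_data(100_000, 3, rho=0.9, density=0.01,
                                random_state=0)
print(X.mean(axis=0))              # column densities, close to 0.01
print(np.corrcoef(X.toarray().T))  # off-diagonal correlations well above 0.01
```

The binary correlations will not equal the Gaussian $\rho$ exactly (thresholding attenuates them), but they can sit far above the column density, which is impossible with independent supports.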
Consider adding functions to simulate sparse data (binary X) with a correlation structure, which should be useful when benchmarking in the p >> n regime.