Cannot generate data with cluster structure #14

PeterZZQ · 2022-01-18T01:01:25Z

Hi,

I wish to use sergio to benchmark the data imputation method. When I run the run_sergio.ipynb example and visualize the clean expression data, I cannot see any clear cluster structure from the visualization (there should be 9 clusters according to the setting in the example function if I understand the readme file correctly). Please see the screenshot below for the code and visualization result:

I also visualize the data after adding technical noise, still there is no cluster structure:

May I know how I can fix the problem? Thanks!

The text was updated successfully, but these errors were encountered:

PayamDiba · 2022-01-21T23:17:22Z

let's start with the clean simulated data (i.e. before adding technical noise). The clean data, since represents the true mRNA levels in each cell, does not requires library size normalization. So, you can try using the clean simulated data, as it is, for clustering. Also, it would be good if you could first run PCA on the clean data to extract the first few PCs (e.g. top 20 PCs) and then use that as an input for UMAP. Also can you color the cells according to their cell type assignment so that we could see the global structure. Please make these changes and let me know how the results look like. After that we can move to noisy simulated data.

PeterZZQ · 2022-01-21T23:51:54Z

Ok it shows me something like this:

The code that I used

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)

# Plot
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 15)
counts_true = expr_clean.T
x_pca = pca_op.fit_transform(counts_true)
x_umap = tsne_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba · 2022-01-22T20:59:12Z

Can you try log1p transformation on the clean simulated data before applying PCA? i.e. first transform the clean data with log(1+data) then apply PCA followed by UMAP or tSNE. Let me know the results.

PeterZZQ · 2022-01-22T21:09:05Z

Yes, here is the result:

The code that I used, the simulation part is from the run_sergio.ipynb:

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)


from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 15)
counts_true = expr_clean.T
# counts_true = counts_true/np.sum(counts_true, axis = 1)[:, None] * 100
counts_true = np.log1p(counts_true)
x_pca = pca_op.fit_transform(counts_true)
x_umap = tsne_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba · 2022-01-22T21:16:31Z

mmm ... it looks strange. Can you try simulating another GRN (the one with 400 genes). Also can you plot the umap results (I think your plots above are for tSNE).

PeterZZQ · 2022-01-22T21:33:34Z

Yes, it seems to be better for 400 genes.

The code

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
# tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 50)
counts_true = expr_clean.T
# counts_true = counts_true/np.sum(counts_true, axis = 1)[:, None] * 100
counts_true = np.log1p(counts_true)
x_pca = pca_op.fit_transform(counts_true)
x_umap = umap_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PeterZZQ · 2022-01-22T21:36:31Z

May I know why there are line-like structures? Feels like it still hasn't reached the steady-state. Should I add more safety_steps? Currently is 0:

PayamDiba · 2022-01-22T21:43:17Z

Great, for 100 genes I think you should lower the "noise_params" (maybe setting it to 0.1 or 0.3 instead of 1).

The line-like structure is due to the clean nature of simulated data, after adding technical noise you will get more familiar structures. I think you don't need to add safety steps, simulations are already started from regions close to steady-state region.

After adding technical noise, you need to do all standard normalizations including library size normalizations followed by log1p transformation.

PeterZZQ · 2022-01-22T21:53:07Z

Ok after I reduce the noise_params to 0.1, the result for 100 gene is like

But after I add the technical noise the result is like:

Here is the code for adding technical noise, should I adjust some parameters?

"""
Add outlier genes
"""
expr_O = sim.outlier_effect(expr, outlier_prob = 0.01, mean = 0.8, scale = 1)

"""
Add Library Size Effect
"""
libFactor, expr_O_L = sim.lib_size_effect(expr_O, mean = 4.6, scale = 0.4)

"""
Add Dropouts
"""
binary_ind = sim.dropout_indicator(expr_O_L, shape = 6.5, percentile = 82)
expr_O_L_D = np.multiply(binary_ind, expr_O_L)

"""
Convert to UMI count
"""
count_matrix = sim.convert_to_UMIcounts(expr_O_L_D)

"""
Make a 2d gene expression matrix
"""
count_matrix = np.concatenate(count_matrix, axis = 1)

PeterZZQ · 2022-01-22T21:59:36Z

And I normalized and log-transformed the observed count

from umap import UMAP 
import matplotlib.pyplot as plt
umap_op = UMAP(n_components = 2, min_dist = 0.1)
counts_obs = count_matrix.T
counts_obs = counts_obs/(np.sum(counts_obs, axis = 1)[:, None] + 1e-6)* 100
counts_obs = np.log1p(counts_obs)
x_umap = umap_op.fit_transform(counts_obs)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba · 2022-01-22T22:01:14Z

Can you plot UMAP instead? Also, incorrect noise parameters can result in excessive technical noise which in turn causes inaccurate low-dimensional representation. You can read the noise parameters used for each data from the supplementary files of the paper. I'll soon add to the github a python script for optimization of noise parameters with respect to a provided real data.

PeterZZQ · 2022-01-23T04:47:51Z

Ok I tried adjusting the technical noise parameters. I changed themean of lib_size_effect from 4.6 to 6.8, and the percentile of dropout_indicator from 82 to 50. The visualization seems to be well clustered this time, but I'm not sure if that is a reasonable change because the original values don't seem to be randomly decided.

PayamDiba · 2022-01-23T04:57:58Z

Reducing dropout percentile from 82 to ~65 or so should also result in good clusters (even without changing "library_size_effect"). The parameters of technical noise should be tuned with respect to a user-defined real data (such that certain statistical measures of real and simulated data become comparable.) I'll soon add an automated python script for such an optimization.

anisioti · 2023-06-23T11:26:20Z

Hello!
Thank you @PeterZZQ for raising the issue and @PayamDiba for the insightful replies and the nice work!
I am currently working on an application that studies graph inference in settings with heterogeneity. I want to test it using Sergio and relatively small graphs (at most 25 nodes). I saw the comment above about using lower values of the "noise_params" and it indeed makes a lot of difference on how the resulting data look like. I wanted to ask if you have any high level explanation on why smaller values of this noise parameter are more appropriate for graphs with lower number of nodes/genes. Thank you in advance!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot generate data with cluster structure #14

Cannot generate data with cluster structure #14

PeterZZQ commented Jan 18, 2022

PayamDiba commented Jan 21, 2022

PeterZZQ commented Jan 21, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022 •

edited

Loading

PeterZZQ commented Jan 22, 2022 •

edited

Loading

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 23, 2022

PayamDiba commented Jan 23, 2022

anisioti commented Jun 23, 2023

Cannot generate data with cluster structure #14

Cannot generate data with cluster structure #14

Comments

PeterZZQ commented Jan 18, 2022

PayamDiba commented Jan 21, 2022

PeterZZQ commented Jan 21, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 22, 2022 • edited Loading

PeterZZQ commented Jan 22, 2022 • edited Loading

PayamDiba commented Jan 22, 2022

PeterZZQ commented Jan 23, 2022

PayamDiba commented Jan 23, 2022

anisioti commented Jun 23, 2023

PeterZZQ commented Jan 22, 2022 •

edited

Loading

PeterZZQ commented Jan 22, 2022 •

edited

Loading