AggregateExpression results in contradictory fold change values depending on the number of features #8682

babakml · 2024-03-26T17:34:48Z

Dear Seurat team,

I came across an issue when performing pseudobulk using seurat, and comparing the result to my pseudobulk result obtained by edgeR. It appeared that when I performed pseudobulk in seurat using AggregateExpression, and DESeq2 as the statistical test, some of the fold changes were in the opposite direction compared to edgeR results obtained by glmLRT. I calculated the fold change manually in the count matrix in edgeR and verified that the calculation was done correctly.
I started to do the same in seurat and found out that for some genes (here I use the example of Ltf), a negative fold change is calculated for the 19 months, although the aggregate counts in the 19 months samples are higher than in the 6 months samples:

19 months:

hom-19-months_1974	hom-19-months_1920	hom-19-months_1946	hom-19-months_1973
0.01591993	0.01808362	0.03045049	0.02116870

6 months:

hom-6-months_2090	hom-6-months_2070	hom-6-months_2139
0.004710646	0.001877463	0.004511580

I tried to manually calculate the fold changes (the result is consistent with the values I get from Foldchange function and Findmarkers). Since the values are normalized, I used the following function:

#using normalized counts
mean.fxn_norm <- function(x){log(x = (rowSums(x = expm1(x = x)) + 1)/NCOL(x), base = 2)}

data.1 <- mean.fxn_norm(x)
data.2 <- mean.fxn_norm(x)
fc_norm <- (data.1 - data.2)

the resulting value for "Ltf" using mean.fxn_norm for the 19 months (data.1) is -1.8801676 and for the 6 months (data.2) is -1.5690043. Therefore, (data.1 - data.2) returns -0.3111632. Which mathematically is true, but given the fact that the 19 months values are higher than the 6 months values, it is weird.

Then I tried to perform AggregateExpression on only a few genes ("Xkr4" and "Ltf"). This way, the normalized values for Ltf were as the following:

19 months:

hom-19-months_1974	hom-19-months_1920	hom-19-months_1946	hom-19-months_1973
2.875093	2.973051	3.712841	3.066862

6 months:

hom-6-months_2090	hom-6-months_2070	hom-6-months_2139
1.637985	1.205193	1.757088

here, the resulting fold change for Ltf is 2.562777027, so a positive change, as expected, but opposite to the previous calculation.

I spent some time to figure out how the normalization processes inside the AggregateExpression works, such that it returns values that might result in contradictory fold change values depending on the number of genes, but I couldn't reach any conclusion.

I would be thankful if you answer to my following questions:

Is the problem here with the nomalization process in AggregateExpression? or one should use another method to calculate the fold change to reach the correct result? When I use the formula for raw counts and scaled data, I get a positive fold change value (different values, but both are positive).
Aren't fold change calculations using normalized, raw and scaled data supposed to give the same result? As there are three different formula for each type of data available in seurat, I supposed the aim of this would be to minimise the difference among the results, but here, I get 1.87 for scaled, 1.96 for raw, and -0.311 for normalized data. Here are the formula I used for scaled and raw data, respectively:

mean.fxn_scaled <- function(x){rowMeans(x)}

mean.fxn_counts <- function(x){log(x = (rowSums(x = x) + 1)/NCOL(x), base = 2)}

How should one handle the issue with the number of features when running AggregateExpression? Is there a way to adjust the normalization process to avoide the above mentioned issue?

Thank you very much in advance for your comments, and for your great work developing Seurat.
Babak

mhkowalski · 2024-04-05T20:12:21Z

Hello,

Thanks for your question. I wouldn't recommend calculating fold changes from aggregated counts in a pseudobulk object, as this might be confounded by how many cells (and how deeply sequenced these cells are). For example, if you have more cells in Group A compared to Group B, the aggregate counts of a gene may be higher in Group A even though the relative expression may be lower, which is reflected in the normalized data.

I'm a bit confused on the data that you show. Is 0.01591993 the normalized data after pseudobulking, for example? I'm not able to reproduce the fold-changes that you calculate, so it would be helpful to provide a more reproducible example with either a seurat object or just a vector of values. If the data is higher in one group, I would expect the logFC's to be positive.

Also, I'll comment that AggregateExpression normalizes the data automatically (as if you had run NormalizeData), but does so based on only the features that you provide (so the values will be different using a subset of features). I would recommend running AggregateExpression on all features and then subsetting later if you prefer.

I hope this is helpful.

rsatija · 2024-06-24T17:03:39Z

Thanks for the question - closing this issue now as I think @mhkowalski gave a very complete answer.

roi-meir · 2024-10-13T08:25:38Z

The reason for the higher fold change for group 6 months even though the expression is lower is the division of the pseudocount by the number of samples. See the issue here for more details #9346

Changing the mean fxn to

mean.fxn_norm <- function(x){log(x = (rowMeans(x = expm1(x = x)) + 0.000001, base = 2)}

will get you the expected direction (at least on the normalized count)

rsatija closed this as completed Jun 24, 2024

DelongZHOU mentioned this issue Oct 8, 2024

Get an expression table that matches what FoldChange function uses #9376

Closed

roi-meir mentioned this issue Oct 13, 2024

FoldChange results depend on the group size #9346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AggregateExpression results in contradictory fold change values depending on the number of features #8682

AggregateExpression results in contradictory fold change values depending on the number of features #8682

babakml commented Mar 26, 2024 •

edited

Loading

mhkowalski commented Apr 5, 2024

rsatija commented Jun 24, 2024

roi-meir commented Oct 13, 2024

AggregateExpression results in contradictory fold change values depending on the number of features #8682

AggregateExpression results in contradictory fold change values depending on the number of features #8682

Comments

babakml commented Mar 26, 2024 • edited Loading

mhkowalski commented Apr 5, 2024

rsatija commented Jun 24, 2024

roi-meir commented Oct 13, 2024

babakml commented Mar 26, 2024 •

edited

Loading