Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AggregateExpression results in contradictory fold change values depending on the number of features #8682

Closed
babakml opened this issue Mar 26, 2024 · 3 comments

Comments

@babakml
Copy link

babakml commented Mar 26, 2024

Dear Seurat team,

I came across an issue when performing pseudobulk using seurat, and comparing the result to my pseudobulk result obtained by edgeR. It appeared that when I performed pseudobulk in seurat using AggregateExpression, and DESeq2 as the statistical test, some of the fold changes were in the opposite direction compared to edgeR results obtained by glmLRT. I calculated the fold change manually in the count matrix in edgeR and verified that the calculation was done correctly.
I started to do the same in seurat and found out that for some genes (here I use the example of Ltf), a negative fold change is calculated for the 19 months, although the aggregate counts in the 19 months samples are higher than in the 6 months samples:

19 months:

hom-19-months_1974 hom-19-months_1920 hom-19-months_1946 hom-19-months_1973
0.01591993 0.01808362 0.03045049 0.02116870

6 months:

hom-6-months_2090 hom-6-months_2070 hom-6-months_2139
0.004710646 0.001877463 0.004511580

I tried to manually calculate the fold changes (the result is consistent with the values I get from Foldchange function and Findmarkers). Since the values are normalized, I used the following function:

#using normalized counts
mean.fxn_norm <- function(x){log(x = (rowSums(x = expm1(x = x)) + 1)/NCOL(x), base = 2)}

data.1 <- mean.fxn_norm(x)
data.2 <- mean.fxn_norm(x)
fc_norm <- (data.1 - data.2)

the resulting value for "Ltf" using mean.fxn_norm for the 19 months (data.1) is -1.8801676 and for the 6 months (data.2) is -1.5690043. Therefore, (data.1 - data.2) returns -0.3111632. Which mathematically is true, but given the fact that the 19 months values are higher than the 6 months values, it is weird.

Then I tried to perform AggregateExpression on only a few genes ("Xkr4" and "Ltf"). This way, the normalized values for Ltf were as the following:

19 months:

hom-19-months_1974 hom-19-months_1920 hom-19-months_1946 hom-19-months_1973
2.875093 2.973051 3.712841 3.066862

6 months:

hom-6-months_2090 hom-6-months_2070 hom-6-months_2139
1.637985 1.205193 1.757088

here, the resulting fold change for Ltf is 2.562777027, so a positive change, as expected, but opposite to the previous calculation.

I spent some time to figure out how the normalization processes inside the AggregateExpression works, such that it returns values that might result in contradictory fold change values depending on the number of genes, but I couldn't reach any conclusion.

I would be thankful if you answer to my following questions:

  1. Is the problem here with the nomalization process in AggregateExpression? or one should use another method to calculate the fold change to reach the correct result? When I use the formula for raw counts and scaled data, I get a positive fold change value (different values, but both are positive).
  2. Aren't fold change calculations using normalized, raw and scaled data supposed to give the same result? As there are three different formula for each type of data available in seurat, I supposed the aim of this would be to minimise the difference among the results, but here, I get 1.87 for scaled, 1.96 for raw, and -0.311 for normalized data. Here are the formula I used for scaled and raw data, respectively:

mean.fxn_scaled <- function(x){rowMeans(x)}

mean.fxn_counts <- function(x){log(x = (rowSums(x = x) + 1)/NCOL(x), base = 2)}

  1. How should one handle the issue with the number of features when running AggregateExpression? Is there a way to adjust the normalization process to avoide the above mentioned issue?

Thank you very much in advance for your comments, and for your great work developing Seurat.
Babak

@mhkowalski
Copy link
Contributor

Hello,

Thanks for your question. I wouldn't recommend calculating fold changes from aggregated counts in a pseudobulk object, as this might be confounded by how many cells (and how deeply sequenced these cells are). For example, if you have more cells in Group A compared to Group B, the aggregate counts of a gene may be higher in Group A even though the relative expression may be lower, which is reflected in the normalized data.

I'm a bit confused on the data that you show. Is 0.01591993 the normalized data after pseudobulking, for example? I'm not able to reproduce the fold-changes that you calculate, so it would be helpful to provide a more reproducible example with either a seurat object or just a vector of values. If the data is higher in one group, I would expect the logFC's to be positive.

Also, I'll comment that AggregateExpression normalizes the data automatically (as if you had run NormalizeData), but does so based on only the features that you provide (so the values will be different using a subset of features). I would recommend running AggregateExpression on all features and then subsetting later if you prefer.

I hope this is helpful.

@rsatija
Copy link
Collaborator

rsatija commented Jun 24, 2024

Thanks for the question - closing this issue now as I think @mhkowalski gave a very complete answer.

@roi-meir
Copy link
Contributor

The reason for the higher fold change for group 6 months even though the expression is lower is the division of the pseudocount by the number of samples. See the issue here for more details #9346

Changing the mean fxn to

mean.fxn_norm <- function(x){log(x = (rowMeans(x = expm1(x = x)) + 0.000001, base = 2)}

will get you the expected direction (at least on the normalized count)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants