AggregateExpression results in contradictory fold change values depending on the number of features #8682
Comments
Hello, thanks for your question. I wouldn't recommend calculating fold changes from aggregated counts in a pseudobulk object, as this can be confounded by how many cells each group contains (and how deeply those cells were sequenced). For example, if you have more cells in Group A than in Group B, the aggregate counts of a gene may be higher in Group A even though the relative expression is lower, which is what the normalized data reflects.

I'm a bit confused about the data that you show. Is 0.01591993 the normalized data after pseudobulking, for example? I'm not able to reproduce the fold changes that you calculate, so it would be helpful to provide a more reproducible example with either a Seurat object or just a vector of values. If the data is higher in one group, I would expect the logFCs to be positive.

Also, note that AggregateExpression normalizes the data automatically (as if you had run NormalizeData), but does so based on only the features that you provide, so the values will be different when you use a subset of features. I would recommend running

I hope this is helpful.
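The effect of the feature set on the normalized values can be sketched in base R without Seurat. This is only an illustration: it assumes the normalization behaves like the default log-normalization of NormalizeData (count divided by the column total, times a scale factor of 10000, then log1p), and the gene names, sample names, and counts are made up.

```r
# Toy pseudobulk counts: 3 genes x 2 samples (made-up values, not from the issue).
counts <- matrix(
  c(10, 100, 890,     # sample s1
    30, 100, 3870),   # sample s2
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

# Log-normalization in the style assumed above:
# divide each column by its total, multiply by a scale factor, take log1p.
log_normalize <- function(m, scale.factor = 10000) {
  log1p(sweep(m, 2, colSums(m), "/") * scale.factor)
}

# Normalize once with all genes and once with a two-gene subset.
norm_all    <- log_normalize(counts)
norm_subset <- log_normalize(counts[c("geneA", "geneB"), , drop = FALSE])

# The column totals differ between the two runs, so geneA gets different values
# and, in this toy example, even a different direction between s1 and s2.
print(norm_all["geneA", ])
print(norm_subset["geneA", ])
```

In this toy matrix the raw count of geneA is higher in s2, yet with all genes included its normalized value is higher in s1, because s2 has a much larger total. Restricting the input to two genes changes the totals and flips the direction again.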
Thanks for the question - closing this issue now as I think @mhkowalski gave a very complete answer.
The reason the 6 months group gets the higher fold change even though its expression is lower is that the pseudocount is divided by the number of samples. See issue #9346 for more details. Changing the mean function to
mean.fxn_norm <- function(x){log(x = rowMeans(x = expm1(x = x)) + 0.000001, base = 2)}
will get you the expected direction (at least on the normalized counts).
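A small base-R sketch of this effect, under stated assumptions: the normalized values are invented, and the group sizes (4 and 3 samples) are hypothetical, since the issue does not state them. The names `old`, `young`, `fxn_sum`, and `fxn_mean` are illustrative only.

```r
# Hypothetical normalized (log1p-scale) values for one gene in two groups.
# "old" has 4 samples and the higher expression; "young" has 3 samples.
old   <- matrix(c(0.020, 0.022, 0.021, 0.023), nrow = 1, dimnames = list("gene", NULL))
young <- matrix(c(0.004, 0.003, 0.004),        nrow = 1, dimnames = list("gene", NULL))

# Mean function quoted in the issue: the pseudocount 1 is added to the sum and
# then divided by the number of samples, so it effectively contributes 1/n.
fxn_sum <- function(x) log(x = (rowSums(x = expm1(x = x)) + 1) / NCOL(x), base = 2)

# Variant with a small pseudocount added to the mean itself.
fxn_mean <- function(x) log(x = rowMeans(x = expm1(x = x)) + 0.000001, base = 2)

fc_sum  <- fxn_sum(old)  - fxn_sum(young)   # sign is driven by the group sizes
fc_mean <- fxn_mean(old) - fxn_mean(young)  # sign follows the expression values

print(c(fc_sum = unname(fc_sum), fc_mean = unname(fc_mean)))
```

With values this small, the 1/n term dominates the first function, so the larger group ends up with the lower value regardless of its expression. The second function is not affected by group size in this way.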
Dear Seurat team,
I came across an issue when performing pseudobulk analysis in Seurat and comparing the result to the pseudobulk result I obtained with edgeR. When I performed pseudobulk in Seurat using AggregateExpression, with DESeq2 as the statistical test, some of the fold changes were in the opposite direction compared to the edgeR results obtained by glmLRT. I calculated the fold change manually from the count matrix in edgeR and verified that the calculation was done correctly.
I then did the same in Seurat and found that for some genes (here I use Ltf as an example), a negative fold change is calculated for the 19 months group, although the aggregate counts in the 19 months samples are higher than in the 6 months samples:
19 months:
6 months:
I tried to calculate the fold changes manually (the result is consistent with the values I get from the FoldChange function and FindMarkers). Since the values are normalized, I used the following function:
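# using normalized counts
mean.fxn_norm <- function(x){log(x = (rowSums(x = expm1(x = x)) + 1)/NCOL(x), base = 2)}
# x.19m and x.6m are placeholder names for the normalized values of each group
data.1 <- mean.fxn_norm(x.19m)  # 19 months samples
data.2 <- mean.fxn_norm(x.6m)   # 6 months samples
fc_norm <- (data.1 - data.2)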
The resulting value for "Ltf" using mean.fxn_norm is -1.8801676 for the 19 months (data.1) and -1.5690043 for the 6 months (data.2). Therefore, (data.1 - data.2) returns -0.3111632. This is arithmetically correct, but it is counterintuitive given that the 19 months values are higher than the 6 months values.
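One way to see why mean.fxn_norm can produce this sign is to write it out for a group of $n$ samples with normalized values $x_1, \dots, x_n$. This is a sketch of the algebra only; it assumes the two groups contain different numbers of samples ($n_1$ and $n_2$), which the post does not state.

$$
\log_2\!\left(\frac{\sum_{j=1}^{n}\left(e^{x_j}-1\right)+1}{n}\right)
= \log_2\!\left(\bar{m} + \frac{1}{n}\right),
\qquad
\bar{m} = \frac{1}{n}\sum_{j=1}^{n}\left(e^{x_j}-1\right)
$$

When $\bar{m} \ll 1/n$, the value is approximately $-\log_2 n$, so the difference between two groups is approximately $\log_2(n_2/n_1)$. That is negative whenever the first group has more samples, whatever the expression values are.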
Then I tried to perform AggregateExpression on only a few genes ("Xkr4" and "Ltf"). This way, the normalized values for Ltf were as the following:
19 months:
6 months:
Here, the resulting fold change for Ltf is 2.562777027: a positive change, as expected, but opposite to the previous calculation.
I spent some time trying to figure out how the normalization inside AggregateExpression works, and why it returns values that can result in contradictory fold changes depending on the number of genes, but I couldn't reach a conclusion.
I would be thankful if you could answer the following questions:
mean.fxn_scaled <- function(x){rowMeans(x)}
mean.fxn_counts <- function(x){log(x = (rowSums(x = x) + 1)/NCOL(x), base = 2)}
Thank you very much in advance for your comments, and for your great work developing Seurat.
Babak