RData object "deseq2.dds.RData" and gene count matrix "salmon.merged.gene_counts.tsv" not consistent between 2 runs on the same data #585
Any idea what is going on here @lpantano @j-andrews7? Going to try and get a release together soonish, so it'd be great if we can mop up, fix, and test all of these Salmon issues.
rnaseq/modules/local/process/deseq2_qc.nf, line 24 (commit 3643a94)
I still consider that a major issue; there should be a disclaimer indicating that there are issues with the merged tables and that the DESeq2 object may not be totally kosher. @Solyris83, can you confirm you used the exact same commands/inputs on the two different runs?
rnaseq_2runs.zip For the 2nd run, I am using the STAR index created in the first run.
PARs are regions where X and Y share homologous sequence, so the PAR genes are going to be very similar, if not identical, to a non-PAR gene (e.g. ENSG00000002586.20 and ENSG00000002586.20_PAR_Y). In cases where salmon can't differentiate between two genes for a given read, there is some amount of randomness in how counts are assigned. @rob-p may be able to give us a little insight into that process. I'd be inclined to just remove all the PAR transcripts from my annotation file.
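Removing the PAR transcripts before indexing could be done with a small script. Below is a hypothetical Python sketch (not part of the pipeline; the GTF layout assumed is GENCODE-style, where PAR copies carry a `_PAR_Y` suffix on the gene ID):

```python
# Hypothetical helper: drop _PAR_Y entries from GENCODE-style GTF lines,
# and optionally their non-PAR counterparts as well. This is an
# illustrative sketch, not code from nf-core/rnaseq or salmon.

def filter_par_lines(lines, drop_counterparts=False):
    """Return GTF lines whose gene_id does not carry the _PAR_Y suffix."""
    par_base_ids = set()
    kept = []
    for line in lines:
        if "_PAR_Y" in line:
            # Record the base ENSG id so the non-PAR copy can be dropped too.
            for field in line.split(";"):
                if "gene_id" in field:
                    gid = field.split('"')[1]
                    par_base_ids.add(gid.replace("_PAR_Y", ""))
            continue
        kept.append(line)
    if drop_counterparts:
        kept = [l for l in kept
                if not any(gid in l for gid in par_base_ids)]
    return kept
```

With `drop_counterparts=True` this also removes the chrX copies of the PAR genes, mirroring the "remove those with the PAR keyword along with their non-PAR genes" approach described later in this thread.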
Hi, thanks for the detailed report. Can you check whether the salmon output files (for those genes where you see a large difference) are the same across runs? I am trying to figure out whether the issue is in the processing of the salmon files or upstream. Also, 3.0 has a bug (as mentioned), so the merged files are not totally correct anyway, but it would be good to determine exactly when the difference arises. Thanks!
I believe this is expected behavior due to the EM algorithm salmon uses to deal with multimapping to the transcriptome. In this case, my guess is that these sequences are very slightly different, so they don't get collapsed during salmon indexing, but the vast majority of mappings to each are impossible to distinguish, hence the differences in each run of the model. See the relevant bit from the salmon FAQ:
I'm no expert, but I don't believe EM is deterministic, so this is going to happen.
Hi @j-andrews7, regarding the PAR genes: I initially thought those were the culprits causing the problem and tried to remove the genes with the PAR keyword, along with their non-PAR counterparts. But there is still a small population of genes (I remember counting 80) with this problem, which is when I started this thread. Given this known salmon behavior, would it be possible to have featureCounts count directly on the STAR output and GTF coordinates instead of salmon estimating gene counts? @lpantano, sorry, I am not exactly sure what would be a good check. Could you name a few files you think are good candidates?
featureCounts has its own issues and just tosses multi-mappers. So rather than those genes having slightly variable counts, they'll be dramatically undercounted. The other genes you see exhibiting this behavior are likely pseudogenes. As far as I know, there is no way to "seed" runs such that they are exactly reproducible, but those run-to-run variations should be quite small.
There is a nice discussion of this here as well: COMBINE-lab/salmon#613. Rob Patro confirms that my comment above is correct: there is no way to ensure completely identical runs.
Description added in #598 |
I have successfully run nf-core/rnaseq v3.0 with the docker profile on a dataset of 4 samples twice, to check the reproducibility of the results.
I am checking the gene count matrix "salmon.merged.gene_counts.tsv". Comparing gene by gene between run 1 and run 2 produces 81 genes that are not pseudo-autosomal (PAR keyword in the Ensembl ID) and have more than 1 read count difference between the 2 runs. Some of these have over a hundred reads difference between the 2 runs.
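A comparison along these lines can be sketched in a few lines of Python. This is a hypothetical reconstruction of the check, not the script actually used; it assumes the merged counts table has `gene_id` and `gene_name` columns followed by one column per sample:

```python
# Hedged sketch: list genes whose counts differ by more than `threshold`
# reads between two runs, skipping PAR entries. Column layout
# (gene_id, gene_name, samples...) is an assumption about the
# salmon.merged.gene_counts.tsv format.
import csv
import io

def diverging_genes(tsv_a, tsv_b, threshold=1.0):
    def load(text):
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        # Skip the header row; key by gene_id, keep per-sample counts.
        return {r[0]: [float(x) for x in r[2:]] for r in rows[1:]}
    a, b = load(tsv_a), load(tsv_b)
    diverging = []
    for gene in a.keys() & b.keys():
        if "PAR" in gene:
            continue  # skip pseudo-autosomal entries
        if any(abs(x - y) > threshold for x, y in zip(a[gene], b[gene])):
            diverging.append(gene)
    return sorted(diverging)
```

Genes returned by a check like this would be the candidates to inspect in the per-sample salmon outputs, as suggested above.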
May I know if that is expected, and how I can "seed" the run to make the results reproducible when I re-run it?
Regards
Zhihui