RData object "deseq2.dds.RData" and gene count matrix "salmon.merged.gene_counts.tsv" not consistent between 2 runs on the same data #585
Any idea what is going on here @lpantano @j-andrews7? Going to try and get a release together soonish, so it'd be great if we can mop up, fix, and test all of these Salmon issues.
rnaseq/modules/local/process/deseq2_qc.nf, line 24 (commit 3643a94)
I still consider that a major issue; there should be a disclaimer indicating that there are issues with the merged tables and that the DESeq2 object may not be totally kosher. @Solyris83, can you confirm you used the exact same commands/inputs on the two different runs?
rnaseq_2runs.zip For the 2nd run, I am using the STAR index created in the first run.
PARs are regions where X and Y share homologous sequence, so the PAR genes are going to be very similar, if not identical, to a non-PAR gene (e.g. ENSG00000002586.20 and ENSG00000002586.20_PAR_Y). In cases where salmon can't differentiate between two genes for a given read, there is some amount of randomness in how counts are assigned. @rob-p may be able to give us a little insight into that process. I'd be inclined to just remove all the PAR transcripts from my annotation file.
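Removing the PAR transcripts before indexing could be done with a small script. Below is a hypothetical Python sketch (not part of the pipeline; the GTF layout assumed is GENCODE-style, where PAR copies carry a `_PAR_Y` suffix on the gene ID):

```python
# Hypothetical helper: drop _PAR_Y entries from GENCODE-style GTF lines,
# and optionally their non-PAR counterparts as well. This is an
# illustrative sketch, not code from nf-core/rnaseq or salmon.

def filter_par_lines(lines, drop_counterparts=False):
    """Return GTF lines whose gene_id does not carry the _PAR_Y suffix."""
    par_base_ids = set()
    kept = []
    for line in lines:
        if "_PAR_Y" in line:
            # Record the base ENSG id so the non-PAR copy can be dropped too.
            for field in line.split(";"):
                if "gene_id" in field:
                    gid = field.split('"')[1]
                    par_base_ids.add(gid.replace("_PAR_Y", ""))
            continue
        kept.append(line)
    if drop_counterparts:
        kept = [l for l in kept
                if not any(gid in l for gid in par_base_ids)]
    return kept
```

With `drop_counterparts=True` this also removes the chrX copies of the PAR genes, mirroring the "remove those with the PAR keyword along with their non-PAR genes" approach described later in this thread.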
Hi, thanks for the detailed report. Can you check whether the salmon output files (for those genes where you see a large difference) are the same across runs? I am trying to figure out whether the issue is in the processing of the salmon files or upstream. Also, 3.0 has a bug (as mentioned), so the merged files are not totally correct anyway, but it would be good to determine exactly when the difference arises. Thanks!
I believe this is expected behavior due to the EM algorithm salmon uses to deal with multimapping to the transcriptome. In this case, my guess is that these sequences are very slightly different, so they don't get collapsed during salmon indexing, but the vast majority of mappings to each are impossible to distinguish, hence the differences in each run of the model. See the relevant bit from the salmon FAQ:
I'm no expert, but I don't believe EM is deterministic, so this is going to happen.
Hi @j-andrews7, regarding the PAR genes: I initially thought those were the culprits causing the problem and tried to remove the genes with the PAR keyword, along with their non-PAR counterparts. But there is still a small population of genes (I remember counting 80) with this problem, which is when I started this thread. Given this known salmon behavior, would it be possible to have featureCounts count directly on the STAR output and GTF coordinates instead of salmon estimating gene counts? @lpantano, sorry, I am not exactly sure what would be a good check. Could you name a few files you think are good candidates?
featureCounts has its own issues and just tosses multi-mappers. So rather than those genes having slightly variable counts, they'll be dramatically undercounted. The other genes you see exhibiting this behavior are likely pseudogenes. As far as I know, there is no way to "seed" runs such that they are exactly reproducible, but those run-to-run variations should be quite small.
There is a nice discussion of this here as well: COMBINE-lab/salmon#613. Rob Patro confirms that my comment above is correct: there is no way to ensure completely identical runs.
Description added in #598 |
I have successfully run nf-core/rnaseq v3.0 with the docker profile on a dataset of 4 samples twice, to check the reproducibility of the results.
I am checking the gene count matrix "salmon.merged.gene_counts.tsv". Comparing gene by gene between run 1 and run 2 produces 81 genes that are not pseudo-autosomal (PAR keyword in the Ensembl ID) and have more than 1 read count difference between the 2 runs. Some of these have over a hundred reads difference between the 2 runs.
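A comparison along these lines can be sketched in a few lines of Python. This is a hypothetical reconstruction of the check, not the script actually used; it assumes the merged counts table has `gene_id` and `gene_name` columns followed by one column per sample:

```python
# Hedged sketch: list genes whose counts differ by more than `threshold`
# reads between two runs, skipping PAR entries. Column layout
# (gene_id, gene_name, samples...) is an assumption about the
# salmon.merged.gene_counts.tsv format.
import csv
import io

def diverging_genes(tsv_a, tsv_b, threshold=1.0):
    def load(text):
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        # Skip the header row; key by gene_id, keep per-sample counts.
        return {r[0]: [float(x) for x in r[2:]] for r in rows[1:]}
    a, b = load(tsv_a), load(tsv_b)
    diverging = []
    for gene in a.keys() & b.keys():
        if "PAR" in gene:
            continue  # skip pseudo-autosomal entries
        if any(abs(x - y) > threshold for x, y in zip(a[gene], b[gene])):
            diverging.append(gene)
    return sorted(diverging)
```

Genes returned by a check like this would be the candidates to inspect in the per-sample salmon outputs, as suggested above.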
May I know if that is expected, and how I can "seed" the run to make the results reproducible when I re-run it?
Regards
Zhihui