Option --no-sort-output with dedup #669

Open
Rayan21100 opened this issue Nov 15, 2024 · 4 comments
Comments

@Rayan21100

Rayan21100 commented Nov 15, 2024

Hi everyone!

Thank you for this amazing and user-friendly tool!

I'm doing bulk RNAseq analysis:

  • I used fastp for trimming and QC analysis
  • STAR for alignment
  • UMI Tools for deduplication
  • Salmon for quantification

The salmon documentation includes this note:

Read / alignment order

Salmon, like eXpress, uses a streaming inference method to perform transcript-level quantification. One of the fundamental assumptions of such inference methods is that observations (i.e. reads or alignments) are made “at random”. This means, for example, that alignments should not be sorted by target or position. If your reads or alignments do not appear in a random order with respect to the target transcripts, please randomize / shuffle them before performing quantification with Salmon.

I know my BAM files are not sorted after STAR, as I didn't use the sorted-output option. I saw that UMI Tools has a --no-sort-output option, but I couldn't find how the sorting is done: by name? By genomic position? Do you think I should specify --no-sort-output to use the output of dedup with salmon?

Thanks in advance!

@Rayan21100 Rayan21100 changed the title from "Option --no-sort-output with deduce" to "Option --no-sort-output with dedup" Nov 15, 2024
@IanSudbery
Member

Without --no-sort-output, the reads coming out of dedup will be sorted by genome position. Definitely not right for salmon. However, even with it, the order still won't be fully random. Also, the pairing information isn't quite what salmon would expect (i.e. exactly one read 2 alignment per read 1 alignment, with the pairing information of each pointing at the other). I suggest that you add the following steps to your protocol:

  • fastp for trimming and QC analysis
  • STAR for alignment
  • UMI Tools for deduplication
  • samtools sort -n to sort by read name
  • UMI Tools prepare-for-rsem to make the pairing info more salmon friendly.
  • Salmon for quantification

Remember that you need to align to the transcriptome if you are going to use salmon for quantification.
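For anyone wiring these extra steps into a script, here is a minimal sketch of the three post-dedup commands, built as argument lists and printed rather than run. The file names, the `build_commands` helper and the `lib_type` default are placeholders of mine, not something from this thread; set the library type to whatever matches your protocol and check each tool's `--help` before relying on it.

```python
import shlex
from pathlib import Path


def build_commands(dedup_bam, transcripts_fa, out_dir, lib_type="A"):
    """Return the post-dedup steps as argv lists (nothing is executed).

    dedup_bam      : transcriptome-aligned BAM written by umi_tools dedup
    transcripts_fa : the transcript FASTA the reads were aligned against
    lib_type       : salmon library type; "A" is only a placeholder default
    """
    out = Path(out_dir)
    name_sorted = out / "dedup.namesorted.bam"   # hypothetical file name
    prepared = out / "dedup.prepared.bam"        # hypothetical file name
    return [
        # 1. Bring all alignments with the same read name together.
        ["samtools", "sort", "-n", "-o", str(name_sorted), str(dedup_bam)],
        # 2. Make the pairing information consistent for downstream tools.
        ["umi_tools", "prepare-for-rsem",
         f"--stdin={name_sorted}", f"--stdout={prepared}"],
        # 3. Quantify in salmon's alignment-based mode.
        ["salmon", "quant", "-t", str(transcripts_fa), "-l", lib_type,
         "-a", str(prepared), "-o", str(out / "salmon_quant")],
    ]


for cmd in build_commands("dedup.bam", "transcripts.fa", "results"):
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would run the steps in order and stop at the first failure.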

@Rayan21100
Author

Rayan21100 commented Nov 17, 2024

Thank you for your answer! So would you say that it's better to use --no-sort-output?
I was planning to use samtools collate to shuffle the reads while keeping them with their pairs, but I do indeed get pair-related errors with salmon, so I might use prepare-for-rsem. Do you know if the reads will be randomized in that case? I imagine that samtools sort -n will randomize them anyway (or maybe I could use samtools collate instead?), but I was wondering how the order is kept by prepare-for-rsem.

Thanks in advance!

@Rayan21100
Author

Update: I tried both sort and collate before prepare-for-rsem and then quantified with salmon. I haven't looked closely at the salmon output yet, but at least it now runs without error.
However, I get a lot of warnings during prepare-for-rsem in both cases:
2024-11-17 13:32:15,404 WARNING Alignment VH01309:279:AACMHJMHV:2:2506:64491:25895:UMI_ATTTTTTA 419 ENST00000619423 2053 has no mate -- skipped
I have 164 reads with no mates. Do you think I should remove them with samtools view?
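
To see what kind of alignments are being skipped, the warnings can be tallied by SAM flag. This is a rough sketch: the regular expression is modelled on the single warning line quoted above and assumes the fields after "Alignment" are read name, flag, reference name and position; it is not taken from the umi_tools source. The flag bit meanings are the standard ones from the SAM specification.

```python
import re
from collections import Counter

# Assumed layout: "... WARNING Alignment <name> <flag> <ref> <pos> has no mate -- skipped"
NO_MATE = re.compile(
    r"WARNING Alignment (?P<name>\S+) (?P<flag>\d+) (?P<ref>\S+) (?P<pos>\d+) "
    r"has no mate -- skipped"
)

# Standard SAM FLAG bits, lowest bit first.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of the bits set in a SAM flag."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


def summarise_no_mate(log_lines):
    """Count 'has no mate' warnings, keyed by the numeric SAM flag."""
    by_flag = Counter()
    for line in log_lines:
        match = NO_MATE.search(line)
        if match:
            by_flag[int(match.group("flag"))] += 1
    return by_flag


example = (
    "2024-11-17 13:32:15,404 WARNING Alignment "
    "VH01309:279:AACMHJMHV:2:2506:64491:25895:UMI_ATTTTTTA 419 "
    "ENST00000619423 2053 has no mate -- skipped"
)
for flag, count in summarise_no_mate([example]).items():
    print(count, flag, decode_flag(flag))
# prints: 1 419 ['paired', 'proper_pair', 'mate_reverse', 'read2', 'secondary']
```

For the line above, flag 419 decodes to a secondary read 2 alignment, i.e. one of several placements of the same read on the transcriptome.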

@IanSudbery
Member

No, prepare-for-rsem should do that for you. Collate should be fine - we just need all reads with the same name to be together. Note that this will be more than just the two reads from a pair - if you are aligning to a transcriptome, reads will map multiple times, to the different transcripts of the same gene.
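
A toy illustration of why same-name reads need to sit together, and why a name group can hold more than two alignments. The records are made-up tuples rather than real BAM records, and pairing mates by transcript is a simplification for the example, not a description of how prepare-for-rsem works internally.

```python
from itertools import groupby

# Hypothetical records: (read_name, is_read1, transcript).
collated = [
    ("readA", True, "tx1"),
    ("readA", False, "tx1"),
    ("readA", True, "tx2"),    # same pair, second transcript of the gene
    ("readA", False, "tx2"),
    ("readB", True, "tx3"),    # its read 2 alignment on tx3 is missing
]


def group_by_name(records):
    """Group consecutive records that share a read name.

    groupby only merges adjacent items, so the input must already be
    name-sorted or collated for each read to end up in a single group.
    """
    return [(name, list(group))
            for name, group in groupby(records, key=lambda rec: rec[0])]


def unmated(group):
    """Return alignments on transcripts that lack the other read of the pair."""
    read1_tx = {tx for _, is_read1, tx in group if is_read1}
    read2_tx = {tx for _, is_read1, tx in group if not is_read1}
    complete = read1_tx & read2_tx
    return [rec for rec in group if rec[2] not in complete]


for name, group in group_by_name(collated):
    print(name, len(group), "alignments,", len(unmated(group)), "without a mate")

# The same records interleaved, as position order could leave them:
interleaved = [collated[0], collated[4], collated[1], collated[2], collated[3]]
print(len(group_by_name(interleaved)), "groups instead of 2")
```

With the collated input, readA forms one group of four alignments and readB one group with a single mate-less alignment; with the interleaved input, readA is split into two groups and its pairs can no longer be matched in a single pass.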
