Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No Splicing Events written in any output file,although in stdout I can see many that were counted. #89

Closed
mariakondili opened this issue Feb 17, 2021 · 17 comments

Comments

@mariakondili
Copy link

mariakondili commented Feb 17, 2021

Hello,
I find it really weird to obtain empty files in results folder (A3SS.MATS.JC/JCEC, etc.)
while no Error is poping up, and I am seeing in my slurm.out the following :

gtf: 25.90836787223816
There are 57820 distinct gene ID in the gtf file
There are 196520 distinct transcript ID in the gtf file
There are 35613 one-transcript genes in the gtf file
There are 1196293 exons in the gtf file
There are 24864 one-exon transcripts in the gtf file
There are 21893 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.398824
Average number of exons per transcript is 6.087386
Average number of exons per transcript excluding one-exon tx is 6.824282
Average number of gene per geneGroup is 7.482150
statistic: 0.03287625312805176
novel: 1314.3629331588745
The splicing graph and candidate read have been saved into /shared/projects/mbnl_dct/human_cellline/rMATS_SplicingCounts//Ctrl_vs_DM1_Splicing//tmp//2021-02-17-18:19:13_583532.rmats
save: 0.30976057052612305
loadsg: 0.05762648582458496

Done processing each gene from dictionary to compile AS events
Found 38132 exon skipping events
Found 2040 exon MX events
Found 12653 alt SS events
There are 7727 alt 3 SS events and 4926 alt 5 SS events.
Found 5558 RI events

ase: 2.775017023086548
count: 0.6100132465362549
Processing count files.
Done processing count files.

Only the " fromGTF." files per splicing type are not empty.
The others have only headers without any gene record.

Input are BAM files from STAR alignment, gtf is given same as one for STAR, there are 3 replicates per condition, and I run in cluster with 16 cpus-per-task.

My command :

group1="control_bam.txt"
group2="treatment_bam.txt"
gtf="gencode.v19.annotation.gtf"
cores=16

python ${rmats_dir}/rmats.py  --b1 ${group1}  --b2  ${group2} \
--gtf ${gtf} --readLength 100 \
--od ${outDir} --tmp  ${outDir}/tmp  \
-t "paired"  --libType fr-firststrand   \
--nthread ${cores} --tstat ${cores}  \
--task both

Could sb give me any hints on what may be happening and how to find out ?

  • I tried with --libType unstranded too, and I get the same. Reads should anyway be on the first-strand.
  • Also tried to run for 1 replicate per condition, always same.
@EricKutschera
Copy link
Contributor

If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files

@mariakondili
Copy link
Author

Thanks Eric ! I launched 4.1.1, and found the following information in stdout:

read outcome totals across all BAMs
USED: 355595054
NOT_PAIRED: 0

NOT_NH_1: 200445609
NOT_EXPECTED_CIGAR: 8808873
NOT_EXPECTED_READ_LENGTH: 0

NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 102232863
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 779927
CLIPPED: 160580484
total: 828442810

I have results of Splicing events now, because I corrected readLength =101 from =100(!!!), when I saw NOT_EXPECTED_READ_LENGTH: 458607844 !

Can you tell me though, what the other parameters mean, and if there is something else bad in my bams that I can take care of ?
NOT_NH_1
NOT_EXPECTED_CIGAR
EXON_NOT_MATCHED_TO_ANNOTATION
JUNCTION_NOT_MATCHED_TO_ANNOTATION

@EricKutschera
Copy link
Contributor

NOT_NH_1
This means that the alignment did not have the tag NH=1, but rmats requires reads to be uniquely mapped. The NH tag is described in this document: https://samtools.github.io/hts-specs/SAMtags.pdf

NH i Number of reported alignments that contain the query in the current record

And from the STAR manual: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf

The number of loci Nmap a read maps to is given by NH:i:Nmap field

NOT_EXPECTED_CIGAR
This means that the cigar in the alignment did not match one of the expected patterns. The expected cigars are either "M" for an exon read, or ("MNM", "MNMNM", ...) for junction reads. Cigar operations are described in: https://samtools.github.io/hts-specs/SAMv1.pdf. It could be that the alignments have insertion or deletion cigar operations (I, D)

EXON_NOT_MATCHED_TO_ANNOTATION
This means that the cigar matched the expected exon read pattern "M", but the aligned region was not part of an annotated exon from the input gtf

JUNCTION_NOT_MATCHED_TO_ANNOTATION
This means that the cigar matched the expected junction read pattern ("MNM", "MNMNM", ...), but the aligned junctions did not match up with annotated exons from the input gtf

@mariakondili
Copy link
Author

Thanks so much for the explanations. I ll see about the non-annotated exons,if we are interested. Very nice improvement of the latest version of rMATS anyway !
Also, I d like to say that the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.

@EricKutschera
Copy link
Contributor

the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.

rMATS turbo v4.1.1 is available on bioconda: https://anaconda.org/bioconda/rmats

@barbarainb
Copy link

Hi there I have a similar problem but i dont really know how to fix it.

I am using a non-model species (Gerbillus gerbillus) mapped to Meriones unguiculatus in HISAT2, the reads are paired-end and the length of them is 150 bp. I am using .sorted.bam files and the following command to run rmats:

$ rmats.py --b1 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b1_C_gg_hippocampus.txt --b2 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b2_H_gg_hippocampus.txt --gtf /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/m_unguiculatus.gtf -t paired --readLength 150 --variable-read-length --nthread 2 --od /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus --tmp /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus

the result files are all empty even though in the end says "done" and describes several splicing events:
image

in my "read_outcomes_by_bam" file i have this:

/group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG03HpA.sorted.bam
USED: 0
NOT_PAIRED: 6888988
NOT_NH_1: 3368169
NOT_EXPECTED_CIGAR: 1121087
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 8043308
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 12764646
CLIPPED: 9106837
TOTAL_FOR_BAM: 41293035
/group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG04HpA.sorted.bam
USED: 0
NOT_PAIRED: 5903915
NOT_NH_1: 3243639
NOT_EXPECTED_CIGAR: 1065826
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 7497960
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 11515982
CLIPPED: 8710206
TOTAL_FOR_BAM: 37937528
/group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG05HpA.sorted.bam

(...)

So i dont really know whats the problem here, i already went to the FASTQC files even, to confirm the length and it is correct (...), could someone maybe help me here? I am not very experienced with this program I would really appreciate some help
Thanks in advance

@EricKutschera
Copy link
Contributor

From the read outcomes about 20% of the reads are CLIPPED. You can use those reads by passing --allow-clipping to rmats

About 15% of the reads are NOT_PAIRED which means the aligner did not mark them as properly paired

The main issue seems to be EXON_NOT_MATCHED_TO_ANNOTATION and JUNCTION_NOT_MATCHED_TO_ANNOTATION which together are about 50% of the reads. Those outcomes happen when the read passes the other filters but rmats is unable to match up the alignment to the annotated exons and splice sites. That could happen if the --gtf given to rmats doesn't agree with the reference files used to align the reads. Another potential cause of NOT_MATCHED_TO_ANNOTATION is that the reads just don't align well with the reference. Since you are using a non-model species maybe the reference files don't match the actual exons in your data

@barbarainb
Copy link

Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :

Done processing each gene from dictionary to compile AS events
Found 7053 exon skipping events
Found 443 exon MX events
Found 830 alt SS events
There are 477 alt 3 SS events and 353 alt 5 SS events.
Found 1131 RI events

but when I check the result files, I only have headers again.

The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).

the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...

Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?

Thanks again for your kind reply, help is much appreciated
kind regards :)

@EricKutschera
Copy link
Contributor

rmats can detect events using information from the GTF and from the bams. There are different files that report the events detected based on different information as described in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

fromGTF.[AS_Event].txt: All identified alternative splicing (AS) events derived from GTF and RNA
fromGTF.novelJunction.[AS_Event].txt: Alternative splicing (AS) events which were identified only after considering the RNA (as opposed to analyzing the GTF in isolation). This does not include events with an unannotated splice site
fromGTF.novelSpliceSite.[AS_Event].txt: This file contains only those events which include an unannotated splice site. Only relevant if --novelSS is enabled

If the output has USED: 0 then the fromGTF novel files should be empty and all of the events were detected using the gtf. The final output files like [AS_Event].MATS.JC.txt are filtered unless rmats is run with --statoff. The filter removes events without supporting reads for each sample group and each isoform: https://github.com/Xinglab/rmats-turbo/blob/v4.1.2/rmats.py#L335
The MATS output files would be filtered to just the headers if there are no used reads

@Portulaca666
Copy link

Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :

Done processing each gene from dictionary to compile AS events Found 7053 exon skipping events Found 443 exon MX events Found 830 alt SS events There are 477 alt 3 SS events and 353 alt 5 SS events. Found 1131 RI events

but when I check the result files, I only have headers again.

The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).

the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...

Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?

Thanks again for your kind reply, help is much appreciated kind regards :)

Hi . I also used HISAT2 and the same with you !

@Portulaca666
Copy link

Can you tell me what should I can do to deal with this matter ?
image
I only have the header in not FromGTF files .

@Portulaca666
Copy link

gtf: 175.58477663993835
There are 59251 distinct gene ID in the gtf file
There are 207289 distinct transcript ID in the gtf file
There are 35878 one-transcript genes in the gtf file
There are 2214932 exons in the gtf file
There are 23283 one-exon transcripts in the gtf file
There are 18592 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.498489
Average number of exons per transcript is 10.685237
Average number of exons per transcript excluding one-exon tx is 11.910747
Average number of gene per geneGroup is 1144.576891
statistic: 0.06993341445922852

read outcome totals across all BAMs
USED: 2640635110
NOT_PAIRED: 243814325
NOT_NH_1: 642401120
NOT_EXPECTED_CIGAR: 28491631
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 172196967
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 31942217
CLIPPED: 235498435
total: 3994979805

@EricKutschera
Copy link
Contributor

The output shows a lot of used reads USED: 2640635110. If you only have header lines in the output files like SE.MATS.JC.txt then it might be due to certain samples that don't have supporting read counts. You could try looking at the counts after re-running with --statoff. Here's a related post: https://groups.google.com/g/rmats-user-group/c/6X2kIyoQBZI/m/byMMhxLRAwAJ

@tud03125
Copy link

If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files

I have rMATS verison 4.3.0, and I am also getting similar issues, where rMATS seem to show results and not show any errors, yet the output files, esp. JC and JCEC ones, are empty. Does it have to be version 4.1.1 specifically?

@EricKutschera
Copy link
Contributor

4.3.0 also shows the read outcome totals. They are printed to stdout and also written to the --tmp directory in files like *_read_outcomes_by_bam.txt

@tud03125
Copy link

@EricKutschera They do produce those. But still, when I ran rMATS through conda run rmats.py, using my USCS-formatted BAM files, only the "fromGTF.[AS event].txt" file produced information, while the rest: "fromGTF.novelSpliceSite.[AS event].txt," "JC.raw.input.[AS event}.txt," "JCEC.raw.input.[AS event].txt," "[AS event].MATS.JC.txt" and "[AS event].MATS.JCEC.txt" files only produced the header while the rest is blank. I was asking if I should lower the rMATS version to 4.1.1, or is 4.3.0 shouldn't be making any of these issues?

I'll attach some files here to show you what I meant (all the rMATS files as .txt, and RI-focused results just to not overload them, and since that one's specific to our research):
2024-10-30-18_52_07_654130_read_outcomes_by_bam.txt
2024-10-30-18_52_07_654130_0.txt
2024-10-30-18_52_07_654130_1.txt
2024-10-30-18_52_07_654130_2.txt
2024-10-30-18_52_07_654130_3.txt
2024-10-30-18_52_07_654130_4.txt
2024-10-30-18_52_07_654130_5.txt
2024-10-30-18_52_07_654130_6.txt
2024-10-30-18_52_07_654130_7.txt
2024-10-30-18_52_07_654130_8.txt
2024-10-30-18_52_07_654130_9.txt
summary.txt
fromGTF.RI.txt
fromGTF.novelSpliceSite.RI.txt
fromGTF.novelJunction.RI.txt
JC.raw.input.RI.txt
JCEC.raw.input.RI.txt
RI.MATS.JC.txt
RI.MATS.JCEC.txt

@EricKutschera
Copy link
Contributor

From the read outcomes file, only about 2,000 reads were USED out of about 800 million. About 65% were filtered for NOT_EXPECTED_READ_LENGTH and about 10% were filtered for CLIPPED

If your reads are a fixed length you can double check the value of --readLength. You could also use --variable-read-length and --allow-clipping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants