Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.7.8a 2-pass produces corrupt SAM #1220

Closed
schnabelr opened this issue Apr 24, 2021 · 8 comments
Closed

2.7.8a 2-pass produces corrupt SAM #1220

schnabelr opened this issue Apr 24, 2021 · 8 comments
Labels

Comments

@schnabelr
Copy link

STAR finishes normally. However, samtools sort complains about the file(s)

[W::sam_read1] Parse error at line 35603
samtools sort: truncated file. Aborting

And picard ValidateSamFile confirms issues.

picard ValidateSamFile I=HFD.93471.56804.R.AP.02.DUP.P.Aligned.out.sam MODE=SUMMARY
SAMFormatException on record 17662801
ERROR   2021-04-23 14:09:52     ValidateSamFile SAMFormatException on record 17662801

This happens for every file that I've tried and is reproducible. I'll only use one fastq pair as an example of the error/issue.
If I look at the SAM file, the RNAME position contains data that doesn't correspond to a chr/contig name.
This is cow so the chromosomes are named 1..29,X,Y,MT and NKLS* for unplaced contigs. Note lines 1,3-5,7-8 and the last line.

grep -v -e '^@' HFD.93471.56804.R.AP.02.DUP.P.Aligned.out.sam | awk '{print $3}' | sort -h | uniq -c
      6
  13890 *
      2 AS:i:90
      1 @@@DDDDDH?DDHCFGGE661?9CF?CFHGCHDGHIIGEBEHIEHIGHFH
      2 HI:i:1
1382559 MT
      5 NH:i:1
      2 NH:i:2
    266 NKLS02000210.1
      2 NKLS02000361.1
<snip>
     60 NKLS02002210.1
      2 RG:Z:HFD.93471.56804.R.AP.02
  63430 X
    412 Y
  24860 1
 546092 2
<snip>
 407952 28
 174062 29
      1 60

I run this via a Perl pipeline and was able to exclude issues with the input FASTQ and also the reference and GTF.
Everything works/completes normally using the exact same reference fasta and GTF to build the genome using 2.5.2b.
I also excluded compilation issues by testing with the binaries distributed with the source and get same type of errors.
I also tested using the master (2.7.8a_2021-03-08) and saw the same behavior. Note, when I rebuild the genome for each of these different tests they are in their own directories.

This appears somewhat similar to report #1209. However, I'm only using 6 threads for the alignment so I don't think it's a thread number issue per se. Please let me know if there are any other tests that I could do or parameters to change that may help fix/diagnose this.

Attached are the various logs.
2.7.8a_Log.out is the genomeGenerate log.
2.5.2b_Log.out is the genomeGenerate log.
Example log files
HFD.93471.56804.R.AP.02.DUP.P.Log.out

2.7.8a_Log.out.txt
2.5.2b_Log.out.txt
HFD.93471.56804.R.AP.02.DUP.P.Log.out.txt

Steps to build the reference. The rest is in the logs above for the example.

uname -a
Linux 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Build 2.7.8a reference

module purge
module load star/2.7.8a
STAR --runThreadN 40 \
 --runMode genomeGenerate \
 --genomeDir /storage/hpc/group/UMAG/REF_GENOME/STAR/9913/ensembl_ars1.2.103_2.7.8a \
 --genomeFastaFiles /storage/hpc/group/UMAG/REF_GENOME/Bos_taurus.ARS-UCD1.2.dna.toplevel.103.Btau_5.0.1_chrY.fa \
 --sjdbGTFfile /storage/hpc/group/UMAG/REF_GENOME/Bos_taurus.ARS-UCD1.2.Btau_5.0.1Y.103.gtf \
 --sjdbOverhang 99

Build 2.5.2b reference (This works fine)

module purge
module load star/star-2.5.2b
STAR --runThreadN 40 \
 --runMode genomeGenerate \
 --genomeDir /storage/hpc/group/UMAG/REF_GENOME/STAR/9913/ensembl_ars1.2.103_2.5.2b \
 --genomeFastaFiles /storage/hpc/group/UMAG/REF_GENOME/Bos_taurus.ARS-UCD1.2.dna.toplevel.103.Btau_5.0.1_chrY.fa \
 --sjdbGTFfile /storage/hpc/group/UMAG/REF_GENOME/Bos_taurus.ARS-UCD1.2.Btau_5.0.1Y.103.gtf \
 --sjdbOverhang 99
@alexdobin
Copy link
Owner

Hi @schnabelr

thanks for detailed report!
samtools sort seems to think the Aligned.out.sam file is truncated.
Is 17662801th line (that picard does not like) the last line of the file, and does it look truncated?
If not, please post this line.

The truncation may happen if you run out of disk space during a run, though it's strange that it worked fine with the older version.

Cheers
Alex

@schnabelr
Copy link
Author

@alexdobin
Disk space is not an issue, last I checked I have 163T free where I'm running this.

Well this is interesting. For the input for this sample, I have a total of 10 paired fastq files and 20 unpaired (after trimming with trimmomatic one of the reads was removed) for a total of 30 STAR output SAM files being produced.

I counted the number of lines in each of the SAM files and also ran picard ValidateSamFile on each file. For all 10 of the PE files ValidateSam gave the same error SAMFormatException on record NNNN and in every case the NNNN number was way higher than the number of lines in the SAM file. So I think that NNNN value isn't actually a line number. For every one of the SE file, picard ValidateSamFile gave the same WARNING:MISSING_TAG_NM XXXX where XXXX is a different number. I didn't look into these.

So it appears that for PE input data it is producing a truncated file and for the SE input data it is warning about a missing tag. As I said previously, all of this data works with 2.5.2b. I just ran 100 transcriptomes last night, including this example, and they all finished just fine with 2.5.2b.

Would it be of any value to repeat this with a STAR version between these two? If so, just let me know which one. It really doesn't take much effort. Alternatively, I can put the reference.fa, GTF and pair of fastq files somewhere for you grab if that would help.

Bob

@alexdobin
Copy link
Owner

Hi Bob,

thanks for investigating it. The NM tag absence is OK, since you did not specify it in --outSAMattrbiutes. You can suppress picard's checking of this tag, but it look like SE mapping is OK.

You are not using any uncommon parameter combinations, so the problem is either dataset specific, or system specific.

Can you do a couple of more checks with 2.7.8a?

  1. In my experience, picard record corresponds to the line number, though I do not remember if it counts lines in the SAM header. Also, samtools reported truncation at line 35k, while picard complained about line 17M - how many lines are actually in the Aligned.out.sam file? How does the tail of the file look?
  2. Use a different filesystem, preferrably local disk, not network. Some users reported issues with network-attached filesystems, though it's very rare.
  3. Map it without any extra parameters, i.e. only specify --genomeDir and --readFilesIn

Thanks!
Alex

@schnabelr
Copy link
Author

Alex,
TL;DR It turn out the offending flag is --outFilterType BySJout. That is the only one that produces the issue.

Map it without any extra parameters, i.e. only specify --genomeDir and --readFilesIn

That worked. So to try and figure out if the issue was one (or more) of the flags, I ran them sequentially, adding one at a time and checking the sam output until something broke. That's how I found out it was --outFilterType BySJout. Without that one, all the others worked.

For this run with --outFilterType BySJout , I grabbed the last 4 line from the sam.

tail -n4 Run5Aligned.out.sam
HWI-700167R:573:C6W1GACXX:1:2310:5346:72568     99      19      29483137        60      13S37M  =       29485251        2160    GTGAACCTCCCGGCTCTTCACTGATGATTTTTGTGTGAACCTCCCGGCTC       @@@DDDDDBHH@AHGIGBDHGHGHIGIIIIIGFCFH@@BGDFEEHAEGBG      NH:i:1  HI:i:1  AS:i:80 nM:i:0  RG:Z:HFD.93471.56804.R.AP.02
HWI-700167R:573:C6W1GACXX:1:2310:5346:72568     147     19      29485251        60      1S46M   =       29483137        -2160   TCCGCTTCCTCAGCTTGTCTCTTGTAAGATTTCACCTTTGCTTGCAG D@G?:?*1GDGAE:CBE@D@@FFDDF:CFB?C<<>FBC:D?DDD@@?  NH:i:1  HI:i:1  AS:i:80 nM:i:0  RG:Z:HFD.93471.56804.R.AP.02
HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    99      19      29489287        60      11M163N39M      =       29489574        2102    GCGCTCTTGGCCTTTATCTCCTCTTCCAGCTGCCTTTTCAGTTCTTCAAT       CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJIJJIJJIJJJJJJII      NH:i:1  HI:i:1  AS:i:99 nM:i:1  RG:Z:HFD.93471.56804.R.AP.02
HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    147     19      29489574        60      6M1765N44M      =       29489287        -2102   ATTCACCTGATTCTGTCTGCAGACGCGCTCTCTGAGTCGTTAGGTCATTG       JJIF@JJJJJJJJJJJJJJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC      NH:i:1  HI:i:1  AS:i:99 nM:i:1  RG:Z:HFD.93471.56804.R.AP.02

I then go back to the output without any flags (that is complete) and pull out the matching reads. The second line below corresponds to the last line of the corrupt file. For the 3rd line below, am I reading that CIGAR correctly as a 149,879 base gap?

grep -n HWI-700167R:573:C6W1GACXX:1:2311:11631:46288 Aligned.out.sam

4751828:HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    99      19      29489287        3       11M163N39M      =       29489574        2102    GCGCTCTTGGCCTTTATCTCCTCTTCCAGCTGCCTTTTCAGTTCTTCAAT       CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJIJJIJJIJJJJJJII      NH:i:2  HI:i:1  AS:i:99 nM:i:1
4751829:HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    147     19      29489574        3       6M1765N44M      =       29489287        -2102   ATTCACCTGATTCTGTCTGCAGACGCGCTCTCTGAGTCGTTAGGTCATTG       JJIF@JJJJJJJJJJJJJJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC      NH:i:2  HI:i:1  AS:i:99 nM:i:1
4751830:HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    355     19      29339571        3       11M149879N39M   =       29489574        151818  GCGCTCTTGGCCTTTATCTCCTCTTCCAGCTGCCTTTTCAGTTCTTCAAT       CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJIJJIJJIJJJJJJII      NH:i:2  HI:i:2  AS:i:98 nM:i:0
4751831:HWI-700167R:573:C6W1GACXX:1:2311:11631:46288    403     19      29489574        3       6M1765N44M      =       29339571        -151818 ATTCACCTGATTCTGTCTGCAGACGCGCTCTCTGAGTCGTTAGGTCATTG       JJIF@JJJJJJJJJJJJJJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC      NH:i:2  HI:i:2  AS:i:98 nM:i:0

Hopefully this may give you an idea of where something is going wrong.
Bob

@alexdobin
Copy link
Owner

Hi Bob:

thanks for the thorough investigation!
This is a bug, I could reproduce it on my test dataset. For some reason, a string of 0-bytes is added between the SAM records. I will fix it shortly. Issue #1209 is indeed related, I agree.

Cheers
Alex

@alexdobin
Copy link
Owner

Hi Bob,

the 2.7.9a release should have fixed this issue, please try it out.

Cheers
Alex

@schnabelr
Copy link
Author

Alex,
I tested this and it is now working. Thanks for the quick fix!
Bob

@alexdobin
Copy link
Owner

Hi Bob,

great! Many thanks for helping to find the bug!

Cheers
Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants