report multi-mapping chimeric reads in BAM format #802

suhrig · 2020-01-03T20:33:30Z

Hi Alex,

This is a pull request which enables STAR to report multimapping chimeric reads in BAM format as discussed in issue #462.

The SAM format specification and SAM tags specification are not explicit on how to represent multimapping chimeric alignments. BWA reports them as secondary alignments without supplementary flag/tag. However, this has the major disadvantage that the supplementary nature of these alignments is not evident from the alignment record. Fusion detection tools that build on STAR would thus have a hard time extracting the relevant alignments. I contemplated what would be the best way and came up with the following:

All supplementary chimeric alignments should be flagged as supplementary (0x800 flag set). Exactly one of the supplementary chimeric alignments should be flagged primary (0x100 flag unset); all others should be flagged as secondary (0x100 flag set) in addition to the supplementary flag. This is in line with the initial discussion about introducing the supplementary flag, specifically with this mail and this mail.
The supplementary alignment with the highest alignment score is marked primary. In case of ties, the marking is arbitrary. This is recommended by the SAM format specification for normal (non-chimeric) reads and it makes sense to apply the same logic to chimeric reads.
The SAM format specification says it is arbitrary which segment should be considered the representative (a.k.a. anchor alignment) and which the supplementary. Therefore, we can stick to what STAR has done with uniquely mapping in the past, i.e., consider the mates mapping in close proximity as the representative and the (usually) short chimeric segment as the supplementary.
The SAM tags specfication requires that supplementary alignments point to their related representative alignment using the SA tag. Likewise, the representative alignment must point to the supplementary using the SA tag. This becomes an issue, when only the supplementary maps to multiple loci, while the representative maps to a single locus. Since the SAM format specification allows each tag to appear at most once, the representative would have to opt for one of the multimapping supplementaries. This issue can be resolved by also duplicating the representatives, such that each supplementary has exactly one related representative and vice versa. At first glance, this may seem as if this would generate a lot of redundant SAM records. But it makes the most sense to me for several reasons: (1) The case that the representative maps uniquely and the supplementary maps mutiple times really is a special case of the more general case that both map multiple times. Admittedly, it is a frequent case, but it is nevertheless a special case. It does not make sense to treat it differently than the generalized situation, because this would require quite a bit of additional code in STAR to remove the redundant SAM records, and tools that build on STAR would have to disentangle it again. (2) Multimapping non-chimeric paired-end reads are already treated by STAR like this: even when only one of them maps to multiple loci, the other one is duplicated, too. Doing the same for multimapping chimeric reads would only be consistent. (3) As stated above, it is the only way to fill the SA tags with consistent values.
All chimeric alignments with the same QNAME should have a MAPQ < 255 and the same value for the NH tag, even if only the supplementary segment maps multiple times. This makes sense for the same reasons given in the previous point.
Analogous to the multimapping supplementaries, exactly one of the representatives should be primary (0x100 flag unset), while all others should be flagged secondary (0x100 flag set). Related supplementaries and representatives should have identical HI tags. Thus, multimapping chimeric reads are represented exactly like uniquely mapping chimeric reads when they are segregated by their HI tags, which makes it very easy to adapt existing fusion detection methods to STAR's new ability. In the simplest case, a user could just remove all alignments marked as secondary and use an existing fusion detection algorithm without any modification.

In accordance with these conclusions, I adapted the code of STAR:

I moved the BAM-related code from ReadAlign::chimericDetectionOldOutput() to a new function ChimericAlign::chimericBAMoutput(), because you said that the ChimericAlign class should host the chimeric detection code in the future. I named the function analogously to the existing ChimericAlign::chimericJunctionOutput() function. The alternative would have been to have a lot of redundant code in the ReadAlign and ChimericAlign classes.
ReadAlign::chimericDetectionOldOutput() now simply makes a call to the new function and does not contain any BAM-related code anymore.
Since the ChimericAlign class is a separate class from the ReadAlign class, I had to make some members/methods of ReadAlign public that were previously private: outBAMoneAlign, outBAMoneAlignNbytes, alignBAM(). I hope this does not conflict with abstract concepts of the ReadAlign class. Alternatively, we could write getters and setters to do the same, but I don't really see a point in doing do.
Similarly, I had to make the RA member of the ChimericAlign class public, because it is passed as an argument to the new ChimericAlign::chimericBAMoutput() function. This can be undone, once the old chimeric detection code has been removed, because RA only needs to be passed as an argument for the call from ReadAlign::chimericDetectionOldOutput(). The ChimericAlign class could access it directly due to it being a member of the class.
Once the old chimeric detection code has been removed, the parameters al1, al2, RA, and P can also be removed from the new function ChimericAlign::chimericBAMoutput(), because they are only needed for the call from ReadAlign::chimericDetectionOldOutput(). I deliberately named them exactly like the members of the ChimericAlign class, because they are in fact the same, and cleaning up the code after the old chimeric detection code has been removed will then be as easy as wiping the parameters.
The new function ChimericAlign::chimericBAMoutput() is currently declared static. This declaration can also be removed when the old chimeric detection code has been deleted.

I tested the new code on a few samples and also made sure that STAR still generates the same output when the old code is used (i.e., --chimMultimapNmax 0).

Looking forward to your feedback,
Sebastian

suhrig · 2020-01-31T08:26:00Z

I noticed that the modification causes a segmentation fault when the --peOverlapNbasesMin option is used. I will figure out why and report back.

suhrig · 2020-02-10T20:29:47Z

Hi Alex,

Now, reporting multi-mapping chimeric reads in BAM format also works with --peOverlapNbasesMin. The reason for the segfault was that the PE overlap merge function made a copy of a ReadAlign object (to hold the SE read), but did not pass along the pointers to the BAM output streams (outBAMunsorted and outBAMcoord). After fixing this, the segfaults disappeared, but the SAM records had an invalid format: the sequence was all N and the CIGAR string was incorrect, the reason being: Before writing the chimeric alignment to the BAM file, the alignments need to be converted back to PE format. This was apparently not necessary when writing in the proprietary junctions format and was therefore not done previously. To solve this issue, I applied the same code that is used for the old chimeric detection code. Since the code is now called in two places, I moved it to a separate function ReadAlign::peOverlapChimericSEtoPE() to avoid code redundancy. The commit looks big, but most of it is just moving this code from ReadAlign::chimericDetectionPEmerged() to the new function.

I tested my modifications with a few samples and made sure that the results are identical to the results of the master branch wherever applicable. I successfully tested various combinations of --peOverlapNbasesMin, --chimMultimapNmax, --chimOutType, single-end, paired-end, soft-clipped, and hard-clipped chimeric alignments. I also searched for additional calls to ChimericDetection::chimericDetectionMult() in the code base to avoid further surprise segfaults when certain parameters are used, but the one in the PE overlap merge code was the only other instance.

While reading through the code, I noticed that the ChimericDetection class has a few member variables that aren't used anywhere. The function ChimericDetection::chimericDetectionMult() makes used of equally named variables, but they are all local. And since they are not used anywhere else, there is no reason to keep them as class member variables. To avoid confusion between local and class member variables, I removed them in the last commit.

Regards,
Sebastian

alexdobin · 2020-02-27T22:40:43Z

Hi Sebastian,

thanks a lot for this great contribution! I am sorry I am slow with incorporating this and other pull requests. I hope to get to it next month...

Cheers
Alex

suhrig · 2020-02-27T22:49:11Z

No worries. We all know what it's like ...

suhrig · 2020-07-22T09:12:09Z

Hi Alex, after testing this on a large cohort, including samples of massive size (>1 billion reads), I discovered a memory leak. STAR's memory footprint keeps growing while it's running. I will try to find the leak.

suhrig · 2020-07-28T09:02:49Z

I fixed the leaks. Actually it was two, and the leaks were not introduced by this pull request, but were already present in the main branch of STAR. I guess nobody every noticed, because the leak is triggered by chimeric reads only and those usually make up just a small fraction (1-2%) of the overall reads, so STAR's memory footprint grew by just a few GB. In my tests, I ran STAR on huge samples (1 billion reads) with very sensitive chimeric alignment parameters and --chimMultimapNmax set to 1000, such that a single chimeric read could give rise to hundreds of chimeric alignments. Then, the memory leak became obvious with STAR consuming ~150 GB.

Here is a command to reproduce the memory leak even with the stable version of STAR. It uses awk to generate an infinite stream of chimeric reads. Watch how STAR's memory footprint keeps growing.

STAR --runThreadN 4 --genomeDir human_genome --genomeLoad NoSharedMemory --readFilesIn <(awk 'BEGIN{while (1) print ">"(i++)"\nGCGCGAAGTCAACTTGAAACCTACAAGAGACAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACT"}') --outSAMtype BAM Unsorted --chimSegmentMin 10 --chimOutType WithinBAM

alexdobin · 2020-07-31T03:16:24Z

Hi Sebastian,

thanks a lot for catching the memory leak. The same problem is indeed present in the chimericDetectionOldOutput function. I am not going to fix it now, but rather pull in all of your code - next week, after I make 2.7.5b bug fix release tomorrow. Your code will be separate release 2.7.6a.

Cheers
Alex

alexdobin · 2020-09-19T19:20:31Z

Hi Sebastian,

thank you very much again for this excellent contribution!
After all the delays, I finally pulled it in, without any issues.
I will release 2.7.6a shortly.

Cheers
Alex

suhrig · 2020-09-20T10:25:34Z

Awesome! Thank you so much!

report multi-mapping chimeric reads in BAM format

bfffee4

suhrig added 2 commits February 10, 2020 19:49

make --peOverlapNbasesMin compatible with --chimOutType WithinBAM

10e563b

remove unused members of class ChimericDetection

bf39ba4

suhrig mentioned this pull request Feb 23, 2020

FusionCatcher Dataset suhrig/arriba#47

Closed

alexdobin mentioned this pull request Feb 26, 2020

Empty Chimeric SAM Files Created By Default #816

Open

fix memory leak of chimeric BAM output

b368154

alexdobin merged commit ef89953 into alexdobin:master Sep 19, 2020

alexdobin mentioned this pull request Sep 19, 2020

Release 2.7.6a #1033

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

report multi-mapping chimeric reads in BAM format #802

report multi-mapping chimeric reads in BAM format #802

suhrig commented Jan 3, 2020

suhrig commented Jan 31, 2020

suhrig commented Feb 10, 2020 •

edited

Loading

alexdobin commented Feb 27, 2020

suhrig commented Feb 27, 2020

suhrig commented Jul 22, 2020

suhrig commented Jul 28, 2020

alexdobin commented Jul 31, 2020

alexdobin commented Sep 19, 2020

suhrig commented Sep 20, 2020

report multi-mapping chimeric reads in BAM format #802

report multi-mapping chimeric reads in BAM format #802

Conversation

suhrig commented Jan 3, 2020

suhrig commented Jan 31, 2020

suhrig commented Feb 10, 2020 • edited Loading

alexdobin commented Feb 27, 2020

suhrig commented Feb 27, 2020

suhrig commented Jul 22, 2020

suhrig commented Jul 28, 2020

alexdobin commented Jul 31, 2020

alexdobin commented Sep 19, 2020

suhrig commented Sep 20, 2020

suhrig commented Feb 10, 2020 •

edited

Loading