-
Notifications
You must be signed in to change notification settings - Fork 513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
report multi-mapping chimeric reads in BAM format #802
report multi-mapping chimeric reads in BAM format #802
Conversation
I noticed that the modification causes a segmentation fault when the |
Hi Alex, Now, reporting multi-mapping chimeric reads in BAM format also works with I tested my modifications with a few samples and made sure that the results are identical to the results of the While reading through the code, I noticed that the Regards, |
Hi Sebastian, thanks a lot for this great contribution! I am sorry I am slow with incorporating this and other pull requests. I hope to get to it next month... Cheers |
No worries. We all know what it's like ... |
Hi Alex, after testing this on a large cohort, including samples of massive size (>1 billion reads), I discovered a memory leak. STAR's memory footprint keeps growing while it's running. I will try to find the leak. |
I fixed the leaks. Actually it was two, and the leaks were not introduced by this pull request, but were already present in the main branch of STAR. I guess nobody every noticed, because the leak is triggered by chimeric reads only and those usually make up just a small fraction (1-2%) of the overall reads, so STAR's memory footprint grew by just a few GB. In my tests, I ran STAR on huge samples (1 billion reads) with very sensitive chimeric alignment parameters and Here is a command to reproduce the memory leak even with the stable version of STAR. It uses awk to generate an infinite stream of chimeric reads. Watch how STAR's memory footprint keeps growing.
|
Hi Sebastian, thanks a lot for catching the memory leak. The same problem is indeed present in the chimericDetectionOldOutput function. I am not going to fix it now, but rather pull in all of your code - next week, after I make 2.7.5b bug fix release tomorrow. Your code will be separate release 2.7.6a. Cheers |
Hi Sebastian, thank you very much again for this excellent contribution! Cheers |
Awesome! Thank you so much! |
Hi Alex,
This is a pull request which enables STAR to report multimapping chimeric reads in BAM format as discussed in issue #462.
The SAM format specification and SAM tags specification are not explicit on how to represent multimapping chimeric alignments. BWA reports them as secondary alignments without supplementary flag/tag. However, this has the major disadvantage that the supplementary nature of these alignments is not evident from the alignment record. Fusion detection tools that build on STAR would thus have a hard time extracting the relevant alignments. I contemplated what would be the best way and came up with the following:
All supplementary chimeric alignments should be flagged as supplementary (
0x800
flag set). Exactly one of the supplementary chimeric alignments should be flagged primary (0x100
flag unset); all others should be flagged as secondary (0x100
flag set) in addition to the supplementary flag. This is in line with the initial discussion about introducing the supplementary flag, specifically with this mail and this mail.The supplementary alignment with the highest alignment score is marked primary. In case of ties, the marking is arbitrary. This is recommended by the SAM format specification for normal (non-chimeric) reads and it makes sense to apply the same logic to chimeric reads.
The SAM format specification says it is arbitrary which segment should be considered the representative (a.k.a. anchor alignment) and which the supplementary. Therefore, we can stick to what STAR has done with uniquely mapping in the past, i.e., consider the mates mapping in close proximity as the representative and the (usually) short chimeric segment as the supplementary.
The SAM tags specfication requires that supplementary alignments point to their related representative alignment using the
SA
tag. Likewise, the representative alignment must point to the supplementary using theSA
tag. This becomes an issue, when only the supplementary maps to multiple loci, while the representative maps to a single locus. Since the SAM format specification allows each tag to appear at most once, the representative would have to opt for one of the multimapping supplementaries. This issue can be resolved by also duplicating the representatives, such that each supplementary has exactly one related representative and vice versa. At first glance, this may seem as if this would generate a lot of redundant SAM records. But it makes the most sense to me for several reasons: (1) The case that the representative maps uniquely and the supplementary maps mutiple times really is a special case of the more general case that both map multiple times. Admittedly, it is a frequent case, but it is nevertheless a special case. It does not make sense to treat it differently than the generalized situation, because this would require quite a bit of additional code in STAR to remove the redundant SAM records, and tools that build on STAR would have to disentangle it again. (2) Multimapping non-chimeric paired-end reads are already treated by STAR like this: even when only one of them maps to multiple loci, the other one is duplicated, too. Doing the same for multimapping chimeric reads would only be consistent. (3) As stated above, it is the only way to fill theSA
tags with consistent values.All chimeric alignments with the same
QNAME
should have aMAPQ
< 255 and the same value for theNH
tag, even if only the supplementary segment maps multiple times. This makes sense for the same reasons given in the previous point.Analogous to the multimapping supplementaries, exactly one of the representatives should be primary (
0x100
flag unset), while all others should be flagged secondary (0x100
flag set). Related supplementaries and representatives should have identicalHI
tags. Thus, multimapping chimeric reads are represented exactly like uniquely mapping chimeric reads when they are segregated by theirHI
tags, which makes it very easy to adapt existing fusion detection methods to STAR's new ability. In the simplest case, a user could just remove all alignments marked as secondary and use an existing fusion detection algorithm without any modification.In accordance with these conclusions, I adapted the code of STAR:
I moved the BAM-related code from
ReadAlign::chimericDetectionOldOutput()
to a new functionChimericAlign::chimericBAMoutput()
, because you said that theChimericAlign
class should host the chimeric detection code in the future. I named the function analogously to the existingChimericAlign::chimericJunctionOutput()
function. The alternative would have been to have a lot of redundant code in theReadAlign
andChimericAlign
classes.ReadAlign::chimericDetectionOldOutput()
now simply makes a call to the new function and does not contain any BAM-related code anymore.Since the
ChimericAlign
class is a separate class from theReadAlign
class, I had to make some members/methods ofReadAlign
public that were previously private:outBAMoneAlign
,outBAMoneAlignNbytes
,alignBAM()
. I hope this does not conflict with abstract concepts of theReadAlign
class. Alternatively, we could write getters and setters to do the same, but I don't really see a point in doing do.Similarly, I had to make the
RA
member of theChimericAlign
class public, because it is passed as an argument to the newChimericAlign::chimericBAMoutput()
function. This can be undone, once the old chimeric detection code has been removed, becauseRA
only needs to be passed as an argument for the call fromReadAlign::chimericDetectionOldOutput()
. TheChimericAlign
class could access it directly due to it being a member of the class.Once the old chimeric detection code has been removed, the parameters
al1
,al2
,RA
, andP
can also be removed from the new functionChimericAlign::chimericBAMoutput()
, because they are only needed for the call fromReadAlign::chimericDetectionOldOutput()
. I deliberately named them exactly like the members of theChimericAlign
class, because they are in fact the same, and cleaning up the code after the old chimeric detection code has been removed will then be as easy as wiping the parameters.The new function
ChimericAlign::chimericBAMoutput()
is currently declaredstatic
. This declaration can also be removed when the old chimeric detection code has been deleted.I tested the new code on a few samples and also made sure that STAR still generates the same output when the old code is used (i.e.,
--chimMultimapNmax 0
).Looking forward to your feedback,
Sebastian