trivial spark tool for extracting original SAM records based on a file … #3589
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3589 +/- ##
===============================================
+ Coverage 79.736% 79.753% +0.017%
- Complexity 18164 18173 +9
===============================================
Files 1218 1219 +1
Lines 66671 66688 +17
Branches 10430 10431 +1
===============================================
+ Hits 53161 53186 +25
+ Misses 9297 9288 -9
- Partials 4213 4214 +1
Some minor changes... what about integration testing?
private String outputSAM;

@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
          fullName = "requireRG", optional = true)
I would define constants for the full and short names of all these arguments that do not reference the StandardArgumentDefinitions constants.
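A minimal sketch of what such constants might look like (the constant names and values here are hypothetical examples, not the names the tool actually uses):

```java
// Gather the argument names that are not covered by StandardArgumentDefinitions
// into named constants, so the @Argument annotations and any tests that build
// command lines reference a single definition.
public final class ExtractorArgumentNames {
    // Hypothetical values for illustration only.
    public static final String READ_NAME_FILE_LONG_NAME = "readNameFile";
    public static final String READ_NAME_FILE_SHORT_NAME = "rnf";

    private ExtractorArgumentNames() {} // no instances; constants only
}
```

The annotation would then read `@Argument(fullName = ExtractorArgumentNames.READ_NAME_FILE_LONG_NAME, ...)`, so a later rename touches exactly one place.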
Address comments above and below.
Added a short name for readNameFile.
final Broadcast<HashSet<String>> namesToLookForBroadcast = ctx.broadcast(parseReadNames());

final JavaRDD<GATKRead> reads =
        getUnfilteredReads().repartition(80)
why 80? magic number?
Historically put there, before using writeReads, for performance tuning. Removed.
final HashSet<String> namesToLookFor = new HashSet<>();
String line;
while ( (line = rdr.readLine()) != null ) {
    namesToLookFor.add(line.replace("@", "")
Is there a guarantee that such sequences (@, /1, /2) cannot appear in the middle of a read name? Otherwise you could use anchored regexes ("^@") and ("/\d$").
Perhaps then you would call replaceAll or replaceFirst instead.
Also you could use BufferedReader's Stream<String> lines():

rdr.lines().map(s -> s.replace(...).replace(...)).collect(Collectors.toSet());

But remember to catch UncheckedIOException.
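Putting the two suggestions together, a self-contained sketch of the parsing (class and method names are mine; the real tool reads from BucketUtils.openFile(readNameFile) rather than an in-memory string):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Set;
import java.util.stream.Collectors;

public class ReadNameParser {
    // Anchored regexes: strip only a leading '@' and a trailing "/<digit>"
    // mate suffix, so the same characters in the middle of a name survive.
    static String normalize(final String name) {
        return name.replaceFirst("^@", "").replaceFirst("/\\d$", "");
    }

    // Stream-based equivalent of the while-loop in the PR. Over a real file,
    // wrap the terminal operation in a try/catch for UncheckedIOException,
    // which is what BufferedReader.lines() throws on I/O errors.
    static Set<String> parse(final BufferedReader rdr) {
        return rdr.lines().map(ReadNameParser::normalize).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        final BufferedReader rdr =
                new BufferedReader(new StringReader("@read1/1\nname@middle/2\n"));
        System.out.println(parse(rdr)); // {read1, name@middle} in some order
    }
}
```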
try ( final BufferedReader rdr =
          new BufferedReader(new InputStreamReader(BucketUtils.openFile(readNameFile))) ) {
    final HashSet<String> namesToLookFor = new HashSet<>();
Be as unspecific as possible... here namesToLookFor could be declared as Set<String>.
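The principle in a tiny self-contained sketch (the names here are mine, not from the PR):

```java
import java.util.HashSet;
import java.util.Set;

public class InterfaceTyping {
    // Declare against the least specific type that offers what callers need:
    // consumers rely only on Set semantics, so the backing implementation can
    // change (e.g. to a TreeSet) without touching any call site.
    static Set<String> emptyNames() {
        return new HashSet<>();
    }

    public static void main(String[] args) {
        final Set<String> names = emptyNames();
        names.add("read1");
        System.out.println(names.contains("read1")); // prints "true"
    }
}
```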
done
@BetaFeature
public final class ExtractOriginalAlignmentRecordsByNameSpark extends GATKSparkTool {
    private static final long serialVersionUID = 1L;
    private final Logger localLogger = LogManager.getLogger(ExtractOriginalAlignmentRecordsByNameSpark.class);
I think that logger has already been initialized appropriately; you don't need to declare another logger here unless I'm missing something.
I've been doing this for a while. Seems unnecessary. Removed. Thanks!
@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
          fullName = "requireRG", optional = true)
private boolean require = false;
Seems that require is never used.
removed
Done for now, please take a look at the suggested changes.
Please address comments.
Did the requested changes. Also added an integration test. Back to you @vruano. Thanks!
It's weird that PrintReadsSpark requires a coordinate-sorted bam; we should fix that.
@lbergelson There's an old ticket #929
public void testExtractOriginalAlignmentRecordsByNameSparkRunnableLocal() throws IOException {

    final File tempWorkingDir = BaseTest.createTempDir("extractOriginalAlignmentRecordsByNameSparkIntegrationTest");
    tempWorkingDir.deleteOnExit();
createTempDir already calls IOUtils.deleteRecursivelyOnExit(). In any case, deleteOnExit() won't work on directories; at best, it deletes the directory if it happens to be empty.
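To illustrate why, a hedged sketch using only the JDK (GATK's IOUtils.deleteRecursivelyOnExit handles the shutdown-hook registration in the real code): deleting a populated tree needs a depth-first walk, removing children before their parents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class RecursiveDelete {
    // Walk the tree and sort paths in reverse order (deepest first) so files
    // are removed before the directories containing them. A plain delete on a
    // non-empty directory fails, which is why deleteOnExit() is not enough.
    static void deleteRecursively(final Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Create a temp dir with one file, delete the tree, report whether it is gone.
    static boolean demo() {
        try {
            final Path dir = Files.createTempDirectory("extractTest");
            Files.writeString(dir.resolve("names.txt"), "read1");
            deleteRecursively(dir);
            return !Files.exists(dir);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true"
    }
}
```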
done
expectedHeader = readsSource.getHeader();
expectedRecords = Utils.stream(readsSource.iterator()).filter(r -> r.getName().equals("asm013903:tig00002"))
        .sorted(Comparator.comparingInt(GATKRead::getAssignedStart))
        .map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());
blank line.
done
try (final ReadsDataSource readsSource = new ReadsDataSource(IOUtils.getPath(tempWorkingDir + "/names.bam"))) {
    Assert.assertEquals(expectedHeader, readsSource.getHeader());
    final List<SAMRecord> samRecords = Utils.stream(readsSource.iterator())
            .map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());
    Assert.assertEquals(expectedRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList()),
I think expected and actual in Assert.assertEquals are in the reverse order from its signature.
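For context, TestNG's Assert.assertEquals takes (actual, expected), the reverse of JUnit's (expected, actual), so swapping them makes failure messages report the two values backwards. A sketch with a hand-rolled stand-in (TestNG itself isn't pulled in here) that mimics the parameter order and message shape:

```java
import java.util.Objects;

public class AssertOrder {
    // Hypothetical stand-in for TestNG's Assert.assertEquals: actual first,
    // expected second, with a TestNG-style failure message.
    static void assertEqualsTestNG(final Object actual, final Object expected) {
        if (!Objects.equals(actual, expected)) {
            throw new AssertionError(
                    "expected [" + expected + "] but found [" + actual + "]");
        }
    }

    public static void main(String[] args) {
        // Correct order: the value the code produced first, then the
        // expectation. With the arguments swapped, a failure would claim the
        // expected value was "found" and vice versa.
        final String actualSamString = "read1\t0\tchr1";
        assertEqualsTestNG(actualSamString, "read1\t0\tchr1"); // passes
    }
}
```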
done
        .filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();
localLogger.info("Found these many alignments: " + reads.count());

getUnfilteredReads().filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();
I wonder what the CPU/wall-time impact of this cache is, just for the sake of outputting the count in the INFO message below. Is it really worth it?
There's a count after the write, so I prefer to cache it.
Yeah... but is the insight you get by outputting the count worth the extra CPU/wall time? In any case, that one was just a comment, no change needed.
@SHuang-Broad looks good, you can merge at your discretion.
…containing read names (force-pushed from 5539d40 to 4669ad4)
…containing read names.

PrintReadsSpark requires the bam to be coordinate-sorted; this doesn't. I find myself frequently using this tool for looking at the SAM records of some templates. Maybe useful for others as well. So I'm putting it in, but hiding it from the help display.
@vruano mind reviewing?