
trivial spark tool for extracting original SAM records based on a file … #3589

Merged
SHuang-Broad merged 1 commit into master from sh_sam_extraction_by_readnames on Sep 24, 2017

Conversation

SHuang-Broad
Contributor

…containing read names. PrintReadsSpark requires the bam to be coordinate-sorted; this tool doesn't.

I find myself frequently using this tool to look at the SAM records of particular templates. It may be useful for others as well, so I'm putting it in, but hiding it from the help display.

@vruano mind reviewing?
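
For context, a minimal sketch of the core idea, assuming the helper names that appear in the diff excerpts below (parseReadNames, getUnfilteredReads): broadcast the set of requested read names and keep only the alignments whose name is in that set.

final Broadcast<Set<String>> namesToLookForBroadcast = ctx.broadcast(parseReadNames());
final JavaRDD<GATKRead> selectedReads =
        getUnfilteredReads()
                .filter(read -> namesToLookForBroadcast.getValue().contains(read.getName()));
// selectedReads can then be written out with the tool's usual writeReads helper.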

@codecov-io

codecov-io commented Sep 19, 2017

Codecov Report

Merging #3589 into master will increase coverage by 0.017%.
The diff coverage is 88.235%.

@@               Coverage Diff               @@
##              master     #3589       +/-   ##
===============================================
+ Coverage     79.736%   79.753%   +0.017%     
- Complexity     18164     18173        +9     
===============================================
  Files           1218      1219        +1     
  Lines          66671     66688       +17     
  Branches       10430     10431        +1     
===============================================
+ Hits           53161     53186       +25     
+ Misses          9297      9288        -9     
- Partials        4213      4214        +1
Impacted Files Coverage Δ Complexity Δ
...ls/ExtractOriginalAlignmentRecordsByNameSpark.java 88.235% <88.235%> (ø) 6 <6> (?)
...er/tools/spark/sv/discovery/AlignmentInterval.java 88.889% <0%> (+0.463%) 52% <0%> (+1%) ⬆️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 78.571% <0%> (+1.948%) 39% <0%> (ø) ⬇️
...ute/hellbender/tools/spark/sv/utils/FileUtils.java 24% <0%> (+24%) 2% <0%> (+2%) ⬆️

@vruano vruano self-assigned this Sep 19, 2017
Contributor

@vruano vruano left a comment


Some minor changes... what about integration testing?

private String outputSAM;

@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
fullName = "requireRG", optional = true)
Contributor


I would define constants for full and short names of all these arguments that do not make reference to the StandardArgumentDefinitions constants.
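
For illustration, a hedged sketch of that suggestion (the constant names and the short name here are hypothetical, not the PR's final code): declare the full and short names once as constants and reference them from the @Argument annotation.

public static final String READ_NAME_FILE_FULL_NAME = "readNameFile";
public static final String READ_NAME_FILE_SHORT_NAME = "readNames";   // hypothetical short name

@Argument(doc = "file containing the read names to extract, one per line",
        fullName = READ_NAME_FILE_FULL_NAME,
        shortName = READ_NAME_FILE_SHORT_NAME)
private String readNameFile;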

Contributor

@vruano vruano Sep 21, 2017


Address comments above and below.

Contributor Author


added short name for readNameFile

final Broadcast<HashSet<String>> namesToLookForBroadcast = ctx.broadcast(parseReadNames());

final JavaRDD<GATKRead> reads =
getUnfilteredReads().repartition(80)
Contributor


why 80? magic number?

Contributor Author


Historically put there for performance tuning, before writeReads was used.
Removed.

final HashSet<String> namesToLookFor = new HashSet<>();
String line;
while ( (line = rdr.readLine()) != null ) {
namesToLookFor.add(line.replace("@", "")
Contributor


Is there a guarantee that such sequences (@, /1, /2) cannot appear in the middle of a read name? Otherwise you could use regexes ("^@" and "/\d$").

Contributor


perhaps then you would call replaceAll or replaceFirst instead.

Contributor

@vruano vruano Sep 19, 2017


Also you could use BufferedReader's Stream<String> lines():

rdr.lines().map(s -> s.replace(...).replace(...)).collect(Collectors.toSet());

But remember to catch UncheckedIOException.
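
Putting those suggestions together, a minimal sketch of the parsing helper (illustrative only, not the PR's actual code; the usual java.io and java.util.stream imports and a GATKException wrapper are assumed): read the names with BufferedReader.lines(), strip a leading "@" and a trailing "/1" or "/2" with regexes, and catch the UncheckedIOException the stream may throw.

private Set<String> parseReadNames() {
    try (final BufferedReader rdr =
                 new BufferedReader(new InputStreamReader(BucketUtils.openFile(readNameFile)))) {
        return rdr.lines()
                  .map(s -> s.replaceFirst("^@", "").replaceAll("/[12]$", ""))
                  .collect(Collectors.toSet());
    } catch (final IOException | UncheckedIOException e) {
        throw new GATKException("Could not read names from " + readNameFile, e);
    }
}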


try ( final BufferedReader rdr =
new BufferedReader(new InputStreamReader(BucketUtils.openFile(readNameFile))) ) {
final HashSet<String> namesToLookFor = new HashSet<>();
Contributor


Be as unspecific as possible... here namesToLookFor could be declared as Set<String>.

Contributor Author


done

@BetaFeature
public final class ExtractOriginalAlignmentRecordsByNameSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;
private final Logger localLogger = LogManager.getLogger(ExtractOriginalAlignmentRecordsByNameSpark.class);
Contributor


I think that logger has already been initialized appropriately; you don't need to declare another logger here unless I'm missing something.

Contributor Author


I've been doing this for a while. Seems unnecessary. Removed. Thanks!
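
If, as noted above, the base class already initializes a logger, the tool can simply use the inherited field; roughly (a sketch under that assumption):

// no localLogger field needed; the inherited logger is assumed to be available
logger.info("Found these many alignments: " + reads.count());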


@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
fullName = "requireRG", optional = true)
private boolean require = false;
Contributor


Seems that require is never used.

Contributor Author


removed

@vruano
Contributor

vruano commented Sep 19, 2017

Done for now, please take a look at the suggested changes.
@SHuang-Broad

@vruano vruano assigned SHuang-Broad and unassigned vruano Sep 19, 2017
Contributor

@vruano vruano left a comment


Please address comments.


Contributor Author

@SHuang-Broad SHuang-Broad left a comment


Did the requested changes and also added an integration test.
Back to you @vruano. Thanks!

@lbergelson
Member

It's weird that PrintReadsSpark requires a coordinate-sorted bam; we should fix that.

@SHuang-Broad
Contributor Author

@lbergelson There's an old ticket #929

@SHuang-Broad SHuang-Broad assigned vruano and unassigned SHuang-Broad Sep 22, 2017
public void testExtractOriginalAlignmentRecordsByNameSparkRunnableLocal() throws IOException {

final File tempWorkingDir = BaseTest.createTempDir("extractOriginalAlignmentRecordsByNameSparkIntegrationTest");
tempWorkingDir.deleteOnExit();
Contributor


createTempDir already calls IOUtils.deleteRecursivelyOnExit().

In any case, deleteOnExit() won't work for directories; at best, it will delete the directory only if it is empty.
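
In other words, a minimal sketch of the reviewer's point (assuming the BaseTest/IOUtils test helpers behave as described):

// createTempDir already registers the directory for recursive deletion at JVM exit,
// so an explicit deleteOnExit() is redundant; File.deleteOnExit() would not remove a
// non-empty directory anyway.
final File tempWorkingDir = BaseTest.createTempDir("extractOriginalAlignmentRecordsByNameSparkIntegrationTest");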

Contributor Author


done

expectedHeader = readsSource.getHeader();
expectedRecords = Utils.stream(readsSource.iterator()).filter(r -> r.getName().equals("asm013903:tig00002"))
.sorted(Comparator.comparingInt(GATKRead::getAssignedStart)).map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());

Contributor


blank line.

Contributor Author


done

try (final ReadsDataSource readsSource = new ReadsDataSource(IOUtils.getPath(tempWorkingDir+"/names.bam"))) {
Assert.assertEquals(expectedHeader, readsSource.getHeader());
final List<SAMRecord> samRecords = Utils.stream(readsSource.iterator()).map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());
Assert.assertEquals(expectedRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList()),
Contributor


I think expected and actual in Assert.assertEquals are in the reverse order relative to its signature.

Contributor Author


done
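
For reference, a short sketch of the corrected argument order, assuming the test uses TestNG's org.testng.Assert, whose signature is assertEquals(actual, expected):

final List<String> actual = samRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList());
final List<String> expected = expectedRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList());
Assert.assertEquals(actual, expected);   // TestNG puts the actual value first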

.filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();
localLogger.info("Found these many alignments: " + reads.count());

getUnfilteredReads().filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();
Contributor


I wonder what the CPU/wall-time impact of this cache is, just for the sake of outputting the count in the INFO message below. Is it really worth it?

Contributor Author


There's a count after the write, so I prefer to cache it.

Contributor


Yeah... but is the insight you get by outputting the count worth the extra CPU/wall time? In any case, that one was just a comment; no change needed.
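
For context, a rough sketch of the trade-off being discussed (variable names are taken from the snippets above; the writeReads signature is assumed): caching the filtered RDD lets the count() logged after the write reuse the filtered reads instead of re-reading and re-filtering the input, at the cost of extra memory.

final JavaRDD<GATKRead> reads =
        getUnfilteredReads()
                .filter(read -> namesToLookForBroadcast.getValue().contains(read.getName()))
                .cache();                  // keep the filtered reads around
writeReads(ctx, outputSAM, reads);         // write them out...
logger.info("Found these many alignments: " + reads.count()); // ...then count without recomputing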

@vruano
Contributor

vruano commented Sep 23, 2017

@SHuang-Broad looks good, you can merge at your discretion.

@vruano vruano assigned SHuang-Broad and unassigned vruano Sep 23, 2017
@SHuang-Broad SHuang-Broad merged commit ac89f91 into master Sep 24, 2017
@SHuang-Broad SHuang-Broad deleted the sh_sam_extraction_by_readnames branch September 24, 2017 16:46