-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAM-843] Aggressively project out metadata fields. #844
Conversation
Test PASSed. |
sc.loadParquetAlignments(args.inputPath) | ||
} else if (args.forceLoadParquet || | ||
args.limitProjection || | ||
args.useAlignedReadPredicate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if either of these are present along with -force_load_[something other than parquet]
, should there be a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, not a bad idea. I've added logging for this at https://github.com/bigdatagenomics/adam/pull/844/files#diff-1ee985bcfcfad885cfc143944371af07R161.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it weird to have args.limitProjection || args.useAlignedReadPredicate
cause interpretation of a file as parquet? would we rather throw an error if someone tries to load a .bam
file with one of these flags? I'm not sure, just wondering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to throw an IAE.
Test FAILed. Build result: FAILUREGitHub pull request #844 of commit b878311 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/844/merge^{commit} # timeout=10 > git branch -a --contains 02db391edc474258ceb81bcab9913d4a5b6f9bfd # timeout=10 > git rev-parse remotes/origin/pr/844/merge^{commit} # timeout=10Checking out Revision 02db391edc474258ceb81bcab9913d4a5b6f9bfd (origin/pr/844/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 02db391edc474258ceb81bcab9913d4a5b6f9bfdFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
b878311
to
0455377
Compare
Test PASSed. |
@@ -28,5 +28,5 @@ import org.bdgenomics.formats.avro.AlignmentRecord | |||
*/ | |||
object AlignmentRecordField extends FieldEnumeration(AlignmentRecord.SCHEMA$) { | |||
|
|||
val contig, start, end, mapq, readName, sequence, mateAlignmentStart, cigar, qual, recordGroupId, recordGroupName, readPaired, properPair, readMapped, mateMapped, readNegativeStrand, mateNegativeStrand, firstOfPair, secondOfPair, primaryAlignment, failedVendorQualityChecks, duplicateRead, mismatchingPositions, attributes, recordGroupSequencingCenter, recordGroupDescription, recordGroupRunDateEpoch, recordGroupFlowOrder, recordGroupKeySequence, recordGroupLibrary, recordGroupPredictedMedianInsertSize, recordGroupPlatform, recordGroupPlatformUnit, recordGroupSample, mateContig, origQual, supplmentaryAlignment = SchemaValue | |||
val contig, start, end, mapq, readName, sequence, mateAlignmentStart, cigar, qual, recordGroupId, recordGroupName, readPaired, properPair, readMapped, mateMapped, readNegativeStrand, mateNegativeStrand, firstOfPair, secondOfPair, primaryAlignment, failedVendorQualityChecks, duplicateRead, mismatchingPositions, attributes, recordGroupSequencingCenter, recordGroupDescription, recordGroupRunDateEpoch, recordGroupFlowOrder, recordGroupKeySequence, recordGroupLibrary, recordGroupPredictedMedianInsertSize, recordGroupPlatform, recordGroupPlatformUnit, recordGroupSample, mateContig, origQual, supplementaryAlignment, secondaryAlignment = SchemaValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth wrapping this line? hard to tell what changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wrap the line, the scripts/format-source
script helpfully unwraps the line. Sigh! I agree though.
0455377
to
a70d776
Compare
Test PASSed. |
Thanks! |
[ADAM-843] Aggressively project out metadata fields.
Resolves #843. I have tested this on our cluster and seen a 2.5x performance improvement for INDEL realignment and MarkDups, and a 1.5x performance improvement for BQSR. Additionally, we see approximately a 3-4x reduction in shuffle size.