[ADAM-783] Write @SQ header lines in sorted order. #784

Merged
merged 2 commits into from
Aug 21, 2015

Conversation

fnothaft
Member

This change resolves #783 and #760. Specifically, we now write the SAM/BAM @SQ header lines using the same lexicographic ordering that we use for sorting records, and we write the @HD line to note that the output is sorted in coordinate order.
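
To make the intended layout concrete, here is a minimal sketch in Python (not ADAM's actual Scala code). The function name `build_header`, the `VN:1.4` version tag, and the sequence lengths are assumptions made up for illustration; the point is only that `@HD` carries `SO:coordinate` and the `@SQ` lines follow plain lexicographic order of the sequence names.

```python
def build_header(seq_lengths):
    """Return SAM header lines: one @HD line, then @SQ lines sorted by name.

    seq_lengths maps a reference sequence name to its length.
    The VN value is an assumption; use whatever version the writer targets.
    """
    lines = ["@HD\tVN:1.4\tSO:coordinate"]
    # sorted() on strings is lexicographic, so "chr10" sorts before "chr2".
    for name in sorted(seq_lengths):
        lines.append(f"@SQ\tSN:{name}\tLN:{seq_lengths[name]}")
    return lines


# Illustrative lengths only, not real chromosome sizes.
header = build_header({"chr2": 2000, "chr10": 1000, "chr1": 3000})
print("\n".join(header))
```

Lexicographic order places `chr10` before `chr2`, so the header produced here lists `chr1`, `chr10`, `chr2`, matching a lexicographic sort of the records by reference name.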

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/848/

Build result: FAILURE

GitHub pull request #784 of commit 2b3ed5e automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/784/merge^{commit} # timeout=10
 > git branch -a --contains 53d70081fdfe9797c24be895796e68d8f567ec80 # timeout=10
 > git rev-parse remotes/origin/pr/784/merge^{commit} # timeout=10
Checking out Revision 53d70081fdfe9797c24be895796e68d8f567ec80 (origin/pr/784/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 53d70081fdfe9797c24be895796e68d8f567ec80
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@ryan-williams
Member

lgtm from a quick pass

@fnothaft
Member Author

Whoops! Forgot to add the test collateral...

@fnothaft fnothaft added this to the 0.17.1 milestone Aug 20, 2015
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/849/

@fnothaft
Member Author

Rebased.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/852/

@fnothaft
Member Author

Rerebased.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/854/

@fnothaft
Member Author

This now resolves #760 as well. Can I get review/merge?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/855/

@fnothaft
Member Author

Ping on review/merge... This is the last issue pending for the 0.17.1 release.

@ryan-williams
Member

looking

@fnothaft
Member Author

Thanks @ryan-williams!

@ryan-williams
Member

This lgtm; one question: this optionally sorts the header lines to match the sort of the reads that can be optionally done with adamSortReadsByReferencePosition, right?

That seems reasonable, but feels a little backwards, since the spec says that the @SQ order should define the read order.

If ADAM will only ever output a lex-sort of the @SQs, and the corresponding "coordinate" sort of the reads, that's fine. I have a few more thoughts about making transform round-trips identical byte-for-byte, but I'll put that in a separate issue.

@ryan-williams
Member

Finally, I don't know what the preferred way of merging is these days :)

@fnothaft
Member Author

Thanks for the review!

@ryan-williams for now, we're just emitting coordinate sorted order. The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec), but a queryname impl is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordQueryNameComparator.java and a duplicate impl is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordDuplicateComparator.java. I believe queryname is just sorting lexicographically by read name.

I agree it is a bit backwards, but I think an equivalent way to read it is that you need to have the same lexicographic order for both the reads and the header.
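
As a rough illustration of that reading, the sketch below (Python, with a hypothetical helper `is_consistent` and made-up reads) checks whether a list of `(RNAME, POS)` pairs is ordered the same way as a given list of `@SQ` names. It ignores unmapped reads and everything else a real validator would need.

```python
def is_consistent(sq_names, reads):
    """Return True if reads are ordered by @SQ index first, then by position.

    sq_names: reference names in the order their @SQ lines appear.
    reads: list of (rname, pos) tuples in file order.
    """
    rank = {name: i for i, name in enumerate(sq_names)}
    keys = [(rank[rname], pos) for rname, pos in reads]
    # Sorted means every adjacent pair is non-decreasing.
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Reads sorted lexicographically by reference name, then position.
lex_sorted_reads = [("chr1", 5), ("chr10", 1), ("chr2", 7)]

lex_header = ["chr1", "chr10", "chr2"]        # @SQ lines in lexicographic order
karyotypic_header = ["chr1", "chr2", "chr10"]  # @SQ lines in numeric order

print(is_consistent(lex_header, lex_sorted_reads))         # True
print(is_consistent(karyotypic_header, lex_sorted_reads))  # False
```

The second call shows the kind of mismatch this PR is about: lexicographically sorted reads paired with a header that uses a different `@SQ` order.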

We can continue using the merge button for now! After I've got scripts ready, we can cut over.

@ryan-williams
Member

The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec)

FWIW, queryname is in the SAM spec, and is what you described:

duplicate is not, and there's this note about it not being in the spec in htsjdk.

My point, here and in #794, is that the SAM spec doesn't actually say that @SQ should be sorted anywhere, just that "The order of @SQ lines defines the alignment sorting order". coordinate sort of the reads is then defined not lexicographically based on the @SQ name, but by whatever order the @SQs are already in, with POS as a secondary sort.

Arguably we should be able to leave @SQs in whatever order they come to us in (and sort reads to conform to that) independently of whether we also want to lex-sort the @SQs (and therefore also lex-sort the reads by RNAME).
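
A minimal sketch of that alternative, in Python with hypothetical names (`coordinate_sort`, toy reads) and no claim about how ADAM would implement it: the `@SQ` lines are left in their incoming order, and reads are sorted by each reference's index in that order, with POS as the secondary key. Unmapped reads are not handled.

```python
def coordinate_sort(sq_names, reads):
    """Sort (rname, pos) reads by the order of the @SQ lines, then by position.

    sq_names: reference names in the order their @SQ lines appear (not re-sorted).
    """
    rank = {name: i for i, name in enumerate(sq_names)}
    return sorted(reads, key=lambda read: (rank[read[0]], read[1]))


# The incoming header happens to list chr2 before chr1.
incoming_sq_order = ["chr2", "chr1"]
reads = [("chr1", 10), ("chr2", 30), ("chr2", 5)]

sorted_reads = coordinate_sort(incoming_sq_order, reads)
print(sorted_reads)  # [('chr2', 5), ('chr2', 30), ('chr1', 10)]
```

Here the header is never rewritten; only the reads move, which is the behaviour described above as sorting reads to conform to the existing `@SQ` order.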

Anyway, unless this sounds like such a good idea to you that you want to do it here, I'm fine to merge this and then address the possibility of @SQs in non-lex order (and sorting reads to match them) in #794.

This change resolves bigdatagenomics#783. Specifically, now we write the SAM/BAM @SQ header
lines using the same lexicographic ordering that we use for sorting records.
Sets the header line "@HD" sort order to "coordinate" when saving a sorted file
in BAM.
@fnothaft
Member Author

Rebased.

@ryan-williams
Member

k, i'll merge when test passes

@fnothaft
Member Author

@ryan-williams doesn't that snippet just specify what the coordinate sort order is?

I get what you're saying. I am personally OK with rewriting the @SQ lines to lexicographic order. We could have a different sorting approach that would take in the full sequence dictionary, but it would be a bit more difficult to implement and would probably be a bit slower. Let's look into this more as part of #794.

@ryan-williams
Member

doesn't that snippet just specify what the coordinate sort order is?

ah, you mean that they don't explicitly say what queryname sort is? that makes sense. I guess we're all just to assume that they mean lex-sort by QNAME.
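
Under that assumption (and only that assumption), a queryname sort reduces to a plain string comparison on QNAME. The Python sketch below uses made-up reads; htsjdk's comparator may apply additional tie-breaking that is not shown here.

```python
# Each toy read is (qname, rname, pos); only qname is used as the sort key.
reads = [
    ("read10", "chr1", 5),
    ("read2", "chr1", 1),
    ("read1", "chr2", 9),
]

# Assumed interpretation: lexicographic comparison of the read name string.
by_qname = sorted(reads, key=lambda read: read[0])

print([read[0] for read in by_qname])  # ['read1', 'read10', 'read2']
```

As with the reference names, lexicographic order puts `read10` before `read2`.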

Yea, we can discuss further on #794; always sorting @SQs, and then also sorting reads, seems reasonable, but the reads' side of it should maybe be implemented in terms of the @SQ order. We sort of backed into this "always lex-sort @SQs" approach after having been non-spec-compliant.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/859/

ryan-williams added a commit that referenced this pull request Aug 21, 2015
[ADAM-783] Write @SQ header lines in sorted order.
@ryan-williams ryan-williams merged commit de3fa55 into bigdatagenomics:master Aug 21, 2015
Development

Successfully merging this pull request may close these issues.

Sequence dictionary is written in incorrect order
4 participants