-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAM/BAM sorting order clarification #5
Comments
Picard implements a lexicographic sort (see SAMRecordQueryNameComparator.java) while samtools implements the fancy sort you've seen. This was discussed extensively a while back with a view to describing something in the spec (see this Sort algorithm thread from samtools-devel in Nov/Dec 2010), but no consensus eventuated. |
Is there any likelihood that the definition of "queryname sorted" will be unified between samtools and Picard? |
At this point there I personally can provide little motivation to finalize the sort specification and more importantly provide an efficient implementation in both samtools and htsjdk. |
Reading your linked issue further, it looks like you would want to guarantee we somehow break ties by having the first end of a pair before the second end of a pair. In fact, we could get away with ordering by the various or selected bits in the flag field. |
Nils, thank you so much for your reply. Yes, breaking ties in such a way was what I had in mind. Having cause to read the SAM spec more closely, I now realise that no guarantee is made about the relative order of the ends of a paired-end read. I understand that it may not even by desirable to do this tie breaking due to an associated cost. I'm also unsure of the implications for multi-fragment reads. For now I will seek to make my own code more robust to handling paired-end reads in the appropriate fashion. Thanks again. |
Redirected from the above, +1 to using plain old string ordering everywhere if at all possible. This is complicated a little by Samtools having made releases where "-n" means "fancy mixed sort", so it would be improper to break existing use cases by changing its meaning, but adding "-N" to mean "lex ordering" would be a big improvement, plus altering the help text to make clear that "-n" produces surprising results. As it stands there are dozens of unpleasant surprises waiting for the user, whenever "-n" sorted results are used in concert with datasets from Picard, Coreutils sort, or practically anything else treating a plain string. With regard to chr10 coming before chr2 etc, this is already taken care of by coordinate order being declared in the header; Qnames being sorted numerically like this is pretty but a subtle bug waiting to bite the user. |
|
My particular surprises arose from (a) users attempting comm(1) style comparison between a Samtools-sorted and a Picard-sorted dataset, and (b) trying to develop an htslib tool that does an "annotated merge" (like samtools merge, but adds attributes to the records indicating where they came from, using qname sort to easily correspond identical records in one input or the other). I found that I kept finding bugs relating to sort order... because again users were trying to use it with Samtools and Picard datasets interchangably. Basically the unpleasant surprises occur due to inconsistency. Samtools likes one order, Picard, Coreutils, Java's String.compareTo etc like another, and the orders look similar enough to be deceive a casual viewer into thinking we have done the lexical sort that is naturally expected of a string type. This is why I think the "-n" option should not be annotated "Sort by read name" because that conceals very important detail! Instead it should say "Sort by read name, using mixed string and integer ordering (see man page for details); -N Sort by read name, using lexical ordering" or similar. This would have saved me much pain! |
To be sure, there is a samtools documentation issue here.
That's an oversimplification. Coreutils respects LC_COLLATE et al, so you can surprise yourself just by using |
True enough, hence the big scary warning in sort's man page. This doesn't mean we should introduce (or conceal) extra ways to confuse ourselves, however :) |
Was going to have a go at putting a patch for Samtools together, but noticed a problem: Samtools has been writing SO:queryname for its mixed-string-integer ordering, whilst htsjdk has been writing SO:queryname for its lexicographic ordering. This means that a file that has been samtools sort -n'd and is then Picard-sorted will be passed through unchanged due to the SO tag, despite the actual ordering required being different. As I see it there are two choices:
Personally I favour #2 because it will produce the most sane results in the long run. |
(regarding the above, whilst the SO: tag is being messed with, perhaps it would be a good idea to support the tie-breaking cases with secondary keys, represented as a comma-separated SO (or perhaps an SO2 tag if you don't want to confuse old scripts), e.g. SO:queryname,fragment or SO:coordinate,queryname) |
Concurring strongly with @smowton: the 'queryname' sort-order should be unambiguous, and fancy sorting of the samtools type is pointless and trips people up people who want to use venerable Unix tools like join (1) and comm (1) |
@lh3 - one use case where the difference in sort order is causing problems for me involves scaffold names from a draft assembly, in which scaffold10021.1 and scaffold100211 are used as scaffold names for different scaffolds. These are sorted in different orders by samtools sort (...21.1 before 211) and the sort command-line utility (...211 before ...21.1), and that causes a problem in using bedtools coverage with the -sorted option. This does not involve the -n option to samtools sort, so it seems the sort order by reference sequence name is also not lexicographical. |
@rwhetten: That's unrelated. Samtools sort (without But first you probably want to look at |
@jmarshall - thanks for the clarification. I had tried including a genome file in the bedtools coverage command line, and that was compatible with the BAM input to coverage, but not with the BED file input, which was not sorted in the same order as the BAM file or genome file. I did not use the bedtools sort -faidx command, so I need to do more reading. |
Fix RFC3986 encoding for "%"
Will be "fixed" by #361 for some definition of fixed: acknowledge the issue, but not defining the one true sort order. |
commit ecd40f0 Author: Tim Hefferon <theffero@nih.gov> Date: Mon Dec 18 13:43:12 2017 -0500 editorial improvements commit a1b6ad4 Merge: 986e835 7aa24be Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Fri Dec 15 12:41:29 2017 -0500 Merge pull request samtools#5 from samtools/master Fix RFC3986 encoding for "%" commit 986e835 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:48:21 2017 -0500 Update VCFv4.3.tex commit c598717 Merge: 46042a8 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:41:20 2017 -0500 Merge pull request samtools#4 from thefferon/revert-whitespace Update VCFv4.3.tex commit 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:39:55 2017 -0500 Update VCFv4.3.tex commit 46042a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:31:48 2017 -0500 Add files via upload commit 1d3c079 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:35:38 2017 -0500 cleanup samtools#1 commit 6d30b09 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:33:39 2017 -0500 incorporated @d-cameron's changes from PR samtools#266 commit 1de4ac4 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:25:41 2017 -0500 backed out whitespace changes commit 406e9da Merge: 489b696 ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:52 2017 -0500 Merge pull request samtools#3 from thefferon/thefferon-patch-1 Add files via upload commit ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:20 2017 -0500 Add files via upload commit 489b696 Merge: 527e04b 2f915a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Tue Dec 5 15:54:37 2017 -0500 Merge pull request samtools#2 from samtools/master bringing my fork up to date commit 527e04b Merge: 7c5259c 85a0fef Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Nov 16 16:21:31 2017 -0500 Merge pull request samtools#1 from samtools/master bringing my fork up to date commit 7c5259c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Aug 9 16:25:24 2017 -0400 First go at changes requested in pull request samtools#231 Still evaluating effect of these changes on pdf layout... commit cf2ffa9 Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov> Date: Tue Aug 8 11:22:22 2017 -0400 Made significant updates to Section 3, INFO keys used for structural variants # Conflicts: # VCFv4.3.tex
Reading
samtools
source code revealed that thequeryname
sorting order is far from being lexicographic: https://github.com/samtools/samtools/blob/develop/bam_sort.c#L13It would also be good to require reads with same position to be sorted by strand.
The text was updated successfully, but these errors were encountered: