Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM/BAM sorting order clarification #5

Open
lomereiter opened this issue Nov 16, 2013 · 17 comments
Open

SAM/BAM sorting order clarification #5

lomereiter opened this issue Nov 16, 2013 · 17 comments
Labels

Comments

@lomereiter
Copy link

Reading samtools source code revealed that the queryname sorting order is far from being lexicographic: https://github.com/samtools/samtools/blob/develop/bam_sort.c#L13

It would also be good to require reads with same position to be sorted by strand.

@jmarshall
Copy link
Member

Picard implements a lexicographic sort (see SAMRecordQueryNameComparator.java) while samtools implements the fancy sort you've seen. This was discussed extensively a while back with a view to describing something in the spec (see this Sort algorithm thread from samtools-devel in Nov/Dec 2010), but no consensus eventuated.

@PeteHaitch
Copy link

Is there any likelihood that the definition of "queryname sorted" will be unified between samtools and Picard?

@nh13
Copy link
Member

nh13 commented Mar 18, 2015

At this point there I personally can provide little motivation to finalize the sort specification and more importantly provide an efficient implementation in both samtools and htsjdk.

@nh13
Copy link
Member

nh13 commented Mar 18, 2015

Reading your linked issue further, it looks like you would want to guarantee we somehow break ties by having the first end of a pair before the second end of a pair. In fact, we could get away with ordering by the various or selected bits in the flag field.

@PeteHaitch
Copy link

Nils, thank you so much for your reply. Yes, breaking ties in such a way was what I had in mind. Having cause to read the SAM spec more closely, I now realise that no guarantee is made about the relative order of the ends of a paired-end read. I understand that it may not even by desirable to do this tie breaking due to an associated cost. I'm also unsure of the implications for multi-fragment reads.

For now I will seek to make my own code more robust to handling paired-end reads in the appropriate fashion. Thanks again.

@smowton
Copy link

smowton commented May 29, 2015

Redirected from the above, +1 to using plain old string ordering everywhere if at all possible. This is complicated a little by Samtools having made releases where "-n" means "fancy mixed sort", so it would be improper to break existing use cases by changing its meaning, but adding "-N" to mean "lex ordering" would be a big improvement, plus altering the help text to make clear that "-n" produces surprising results.

As it stands there are dozens of unpleasant surprises waiting for the user, whenever "-n" sorted results are used in concert with datasets from Picard, Coreutils sort, or practically anything else treating a plain string.

With regard to chr10 coming before chr2 etc, this is already taken care of by coordinate order being declared in the header; Qnames being sorted numerically like this is pretty but a subtle bug waiting to bite the user.

@lh3
Copy link
Member

lh3 commented May 29, 2015

  1. samtools sort starts to put read_1 before read_2 since 0.1.19, released on 03/19/2013.
  2. I agree it would be good to add option "-N". It is easy to implement.
  3. Nonetheless, I would be interested in the "dozens of unpleasant surprises" caused by differences in name sorting. No one has reported a concrete example as I remember. @PeteHaitch was mentioning a problem about how to break tie, not about name sorting.

@smowton
Copy link

smowton commented May 29, 2015

My particular surprises arose from (a) users attempting comm(1) style comparison between a Samtools-sorted and a Picard-sorted dataset, and (b) trying to develop an htslib tool that does an "annotated merge" (like samtools merge, but adds attributes to the records indicating where they came from, using qname sort to easily correspond identical records in one input or the other). I found that I kept finding bugs relating to sort order... because again users were trying to use it with Samtools and Picard datasets interchangably.

Basically the unpleasant surprises occur due to inconsistency. Samtools likes one order, Picard, Coreutils, Java's String.compareTo etc like another, and the orders look similar enough to be deceive a casual viewer into thinking we have done the lexical sort that is naturally expected of a string type. This is why I think the "-n" option should not be annotated "Sort by read name" because that conceals very important detail! Instead it should say "Sort by read name, using mixed string and integer ordering (see man page for details); -N Sort by read name, using lexical ordering" or similar. This would have saved me much pain!

@jmarshall
Copy link
Member

To be sure, there is a samtools documentation issue here.

Samtools likes one order, Picard, Coreutils, Java's String.compareTo etc like another

That's an oversimplification. Coreutils respects LC_COLLATE et al, so you can surprise yourself just by using sort alone. These problems are nothing new.

@smowton
Copy link

smowton commented May 29, 2015

True enough, hence the big scary warning in sort's man page. This doesn't mean we should introduce (or conceal) extra ways to confuse ourselves, however :)

@smowton
Copy link

smowton commented Jun 5, 2015

Was going to have a go at putting a patch for Samtools together, but noticed a problem: Samtools has been writing SO:queryname for its mixed-string-integer ordering, whilst htsjdk has been writing SO:queryname for its lexicographic ordering. This means that a file that has been samtools sort -n'd and is then Picard-sorted will be passed through unchanged due to the SO tag, despite the actual ordering required being different.

As I see it there are two choices:

  1. Continue to write SO:queryname even though the orders differ; accept that Picard will barf (if verifying order) or silently emit incorrect results (if not).
  2. Switch samtools to write SO:queryname_mixed (or htsjdk to write SO:queryname_lex); if you like add queryname_mixed ordering to htsjdk; accept that users' scripts might break, and old versions of Samtools or htsjdk will become confused by the new ordering (hopefully they will assume the unknown ordering means the file should be assumed unordered?)

Personally I favour #2 because it will produce the most sane results in the long run.

@smowton
Copy link

smowton commented Jun 5, 2015

(regarding the above, whilst the SO: tag is being messed with, perhaps it would be a good idea to support the tie-breaking cases with secondary keys, represented as a comma-separated SO (or perhaps an SO2 tag if you don't want to confuse old scripts), e.g. SO:queryname,fragment or SO:coordinate,queryname)

@plijnzaad
Copy link

Concurring strongly with @smowton: the 'queryname' sort-order should be unambiguous, and fancy sorting of the samtools type is pointless and trips people up people who want to use venerable Unix tools like join (1) and comm (1)

@rwhetten
Copy link

@lh3 - one use case where the difference in sort order is causing problems for me involves scaffold names from a draft assembly, in which scaffold10021.1 and scaffold100211 are used as scaffold names for different scaffolds. These are sorted in different orders by samtools sort (...21.1 before 211) and the sort command-line utility (...211 before ...21.1), and that causes a problem in using bedtools coverage with the -sorted option. This does not involve the -n option to samtools sort, so it seems the sort order by reference sequence name is also not lexicographical.
If the decision is not to change how samtools sort works, either with or without the -n option, a workaround might be to extract the sorting code as a stand-alone utility that users like me could apply to other files that have to be sorted in the same order as a BAM file.

@jmarshall
Copy link
Member

jmarshall commented Jul 25, 2016

@rwhetten: That's unrelated.

Samtools sort (without -n, i.e., coordinate sort) sorts the scaffolds according to the order in which they appear in the SAM/BAM/CRAM headers, which (if the aligner used is doing the usual thing) is the order in which they appear in the reference assembly .fasta file. So if you want them to appear in some particular order, you should order your .fasta file accordingly.

But first you probably want to look at bedtools coverage -sorted -g assembly.fa, which, along with bedtools sort -faidx assembly.fa.fai, will I believe make bedtools follow the same reference sequence ordering.

@rwhetten
Copy link

@jmarshall - thanks for the clarification. I had tried including a genome file in the bedtools coverage command line, and that was compatible with the BAM input to coverage, but not with the BED file input, which was not sorted in the same order as the BAM file or genome file. I did not use the bedtools sort -faidx command, so I need to do more reading.

@jmarshall jmarshall mentioned this issue Jun 1, 2017
cyenyxe pushed a commit to cyenyxe/hts-specs that referenced this issue Sep 22, 2018
@jkbonfield
Copy link
Contributor

Will be "fixed" by #361 for some definition of fixed: acknowledge the issue, but not defining the one true sort order.

d-cameron pushed a commit to d-cameron/hts-specs that referenced this issue Dec 18, 2019
commit ecd40f0
Author: Tim Hefferon <theffero@nih.gov>
Date:   Mon Dec 18 13:43:12 2017 -0500

    editorial improvements

commit a1b6ad4
Merge: 986e835 7aa24be
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Fri Dec 15 12:41:29 2017 -0500

    Merge pull request samtools#5 from samtools/master

    Fix RFC3986 encoding for "%"

commit 986e835
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:48:21 2017 -0500

    Update VCFv4.3.tex

commit c598717
Merge: 46042a8 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:41:20 2017 -0500

    Merge pull request samtools#4 from thefferon/revert-whitespace

    Update VCFv4.3.tex

commit 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:39:55 2017 -0500

    Update VCFv4.3.tex

commit 46042a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:31:48 2017 -0500

    Add files via upload

commit 1d3c079
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:35:38 2017 -0500

    cleanup samtools#1

commit 6d30b09
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:33:39 2017 -0500

    incorporated @d-cameron's changes from PR samtools#266

commit 1de4ac4
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:25:41 2017 -0500

    backed out whitespace changes

commit 406e9da
Merge: 489b696 ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:52 2017 -0500

    Merge pull request samtools#3 from thefferon/thefferon-patch-1

    Add files via upload

commit ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:20 2017 -0500

    Add files via upload

commit 489b696
Merge: 527e04b 2f915a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Tue Dec 5 15:54:37 2017 -0500

    Merge pull request samtools#2 from samtools/master

    bringing my fork up to date

commit 527e04b
Merge: 7c5259c 85a0fef
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Nov 16 16:21:31 2017 -0500

    Merge pull request samtools#1 from samtools/master

    bringing my fork up to date

commit 7c5259c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Aug 9 16:25:24 2017 -0400

    First go at changes requested in pull request samtools#231

    Still evaluating effect of these changes on pdf layout...

commit cf2ffa9
Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov>
Date:   Tue Aug 8 11:22:22 2017 -0400

    Made significant updates to Section 3, INFO keys used for structural variants

# Conflicts:
#	VCFv4.3.tex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants