
bcftools 1.10 fails to index files with wrong INFO/END values but bcftools 1.9 worked fine #1154

Closed
freeseek opened this issue Feb 7, 2020 · 21 comments · Fixed by samtools/htslib#1021

@freeseek (Contributor) commented Feb 7, 2020

The following code reproduces the error:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Oz -o output.vcf.gz \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
bcftools index -ft output.vcf.gz

The problem here is that the 1000 Genomes project team made a mess and forgot to liftOver the END coordinates of the VCF. I am not sure whether this counts as a bcftools bug, but the VCF specification does not really call for the END field to always be bigger than the POS field, which is what is causing the problem here. I do find it odd that tabix and previous versions of bcftools worked fine with these odd cases, though.

If this is considered appropriate bcftools behavior, then I think that, at a minimum, bcftools norm should also check that END is always larger than POS.
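
A quick way to spot such records before indexing is to scan for INFO/END values smaller than POS. A minimal sketch in Python (illustrative only, not a full VCF parser; a similar check could likely be done with a bcftools expression such as -e 'INFO/END<POS'):

```python
import re

def end_before_pos(vcf_lines):
    """Yield (CHROM, POS, END) for records whose INFO/END precedes POS."""
    for line in vcf_lines:
        if line.startswith('#'):
            continue  # header lines carry no coordinates
        fields = line.rstrip('\n').split('\t')
        pos = int(fields[1])
        m = re.search(r'(?:^|;)END=(\d+)', fields[7])
        if m and int(m.group(1)) < pos:
            yield fields[0], pos, int(m.group(1))

# The offending record from this thread, with REF/ALT trimmed for brevity:
record = "X\t4461913\trs57675010\tGAAG\tG\t.\t.\tEND=4380006\n"
print(list(end_before_pos([record])))  # [('X', 4461913, 4380006)]
```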

@freeseek (Contributor, Author) commented Feb 7, 2020

Also, more importantly, if I output the VCF to a binary format and then I try to index as follows:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o output.bcf \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
bcftools index -f output.bcf

I get this very weird error:

[E::hts_idx_check_range] Region 4461912..4299347302 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7
index: failed to create index for "output.bcf"

It seems like the encoding to the binary format failed to check that END is greater than POS, causing some sort of integer overflow.
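
The 4299347302 in the message is consistent with a 32-bit wrap: if the negative record length (END minus POS) gets reinterpreted as an unsigned 32-bit value when the end coordinate is reconstructed, the result lands exactly 2^32 beyond the true END. A sketch of that arithmetic (an assumption about the mechanism, not a reading of the htslib source):

```python
beg = 4461913 - 1          # 0-based start, as shown in the error message
end = 4380006              # the (wrong) INFO/END from the record
rlen = end - beg           # negative record length: -81906

# If rlen is truncated to unsigned 32 bits before being added back:
wrapped_end = beg + (rlen & 0xFFFFFFFF)
print(wrapped_end)          # 4299347302, matching the error message
print(wrapped_end - 2**32)  # 4380006, the original END value
```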

@pd3 (Member) commented Feb 10, 2020

Can you please try with the latest htslib version? This is a duplicate of samtools/htslib#1006 and was addressed by samtools/htslib@ea879e1.

@freeseek (Contributor, Author)

I have just re-cloned and re-compiled bcftools and I get the exact same errors. I have noticed, though, that the download of the VCF snippet with bcftools view often fails, sometimes yielding a VCF without variants, but if I retry many times it eventually works. I am not sure why this is; maybe it is a temporary issue with the EBI website at the time of this writing.

@freeseek (Contributor, Author)

Actually, I am noticing an HTSlib problem that is unrelated but also seems like a serious HTSlib bug, so it might be worth reporting. At the moment the EBI website seems to return 404 errors for files that are actually present. When I run the following command:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz

I get three possible outputs:

Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not parse header

Or I get a VCF without variants, or I get a VCF with both the rs6640136 and rs57675010 variants.

The first output is okay, as it means the EBI website returned a 404 answer. The third outcome is the desired one. The second outcome seems to be due to the EBI website serving the header fine but returning a 404 when bcftools tried to fetch the X:4380006-4461913 region. This is troublesome because bcftools did not throw any error, which does not seem like desired behavior.

@pd3 (Member) commented Feb 10, 2020

I am not able to reproduce this; I only get a correct warning with the latest version:

$ bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[W::vcf_parse] INFO/END=4380006 is smaller than POS at X:4461913

$ bcftools index rmme.bcf 

$ bcftools --version
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d

Make sure the latest version is used and a truncated version of the remote index is not present in the local directory.

@freeseek (Contributor, Author)

I have the same version as you do.

$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
index: failed to create index for "output.vcf.gz"
$ bcftools index -f output.bcf   
[E::hts_idx_check_range] Region 4461912..4299347302 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7
index: failed to create index for "output.bcf"
$ bcftools --version          
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Also, I am sure it is not just me, as this bug was reported to me by someone in Finland who installed HTSlib and BCFtools from GitHub just a few days ago.

@pd3 (Member) commented Feb 10, 2020

You are only showing the indexing step. Can you make sure the BCF was created by the new version as well? Either remove and recreate it, or look in the header for the version stamp.

Otherwise, after some experimentation I see that several problems can occur:

  1. Existing but empty index file:
$ rm -f ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi  && touch ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi

$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::hts_idx_load3] Could not load local index file 'ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi'
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not load index
  2. Existing but truncated index file:
$ wget -O xxx.tbi http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi
$ dd if=xxx.tbi of=ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi bs=1 count=100
$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
[E::hts_idx_load3] Could not load local index file 'ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi'
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not load index
  3. Failed download because of remote 404 errors:
$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::hts_open_format] Failed to open file "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz" : No such file or directory
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: No such file or directory

The error message in the last case could be made more informative and hint that the problem may be temporary; otherwise it seems to be working OK.

@freeseek (Contributor, Author)

I am sorry to mix two different bugs, but it is not a problem of the index file being truncated. That was downloaded once and for all, and it copied fine. What happens to me occasionally is that I run:

$ bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz

which runs without any errors or warnings. And then I get:

$ bcftools view rmme.bcf | grep -v ^##
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

This means the header loaded okay, but then I get no bgzf_read or hts_idx_load3 error. Somehow EBI just did not send anything. I am not sure whether HTSlib can tell that it was a 404 error message.
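
Until this failure mode is surfaced by the library itself, a caller can defend against such silently empty downloads by checking that the fetched region actually produced records. A minimal sketch (a hypothetical guard, not an HTSlib or bcftools feature):

```python
def has_records(vcf_text):
    """Return True if the VCF text contains at least one non-header record."""
    return any(line and not line.startswith('#')
               for line in vcf_text.splitlines())

header_only = ("##fileformat=VCFv4.1\n"
               "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
with_data = header_only + "X\t4461533\trs6640136\tA\tG\t.\t.\t.\n"

print(has_records(header_only))  # False: header came through, data did not
print(has_records(with_data))    # True
```

A wrapper script could retry the download whenever this check fails, since the truncation appears to be intermittent.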

@freeseek (Contributor, Author)

Sorry, you are right: the latest version of bcftools no longer produces the malformed BCF file. Maybe this got mixed up with a malformed index file or something. However, the following recreates the problem with vcf.gz files:

$ bcftools --version
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ (echo "##fileformat=VCFv4.1"
echo "##contig=<ID=X,length=155270560>"
echo "##INFO=<ID=END,Number=1,Type=Integer,Description=\"End coordinate of this variant\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "X\t4461533\trs6640136\tA\tG\t.\t.\t."
echo -e "X\t4461913\trs57675010\tGAAGTCATTGTACAAGGAAAACATTCTGTATGCACTAAGATTTCCCAAAAGAT\tG\t.\t.\tEND=4380006") | \
  bcftools view --no-version -Oz -o output.vcf.gz
[W::vcf_parse] INFO/END=4380006 is smaller than POS at X:4461913
$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
index: failed to create index for "output.vcf.gz"

Meanwhile, bcftools 1.9 was capable of indexing this just fine. You can close the issue.

@jmarshall (Member)

> But the VCF specification does not really call for the END field to always be bigger than the POS field, which is what is causing the problem here.

In fact it does — see samtools/hts-specs#266 (comment) and samtools/hts-specs#436, and also #1153.

> I do find it odd that tabix and previous versions of bcftools worked fine with these odd cases though.

In fact they did not. Prior to 1.10, records with INFO/END < POS could lead to bcftools index producing a broken index that would fail to return any variants for certain queries. See this biostars posting, which led to the addition of the error message you're seeing in samtools/htslib#917:

$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913

It is unclear whether your query is returning no results due to this problem with a previously built broken index. Do your query coordinates vary, or are you saying you are intermittently getting different results for X:4380006-4461913?

@freeseek (Contributor, Author)

Okay, I have made a mess of a bug report, but here is the bug, reproducible with the latest version of bcftools:

bcftools view -i 'ID="rs59780433"' -Ou -G -r X:85725113-86470037 \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz | \
  bcftools norm -Ob -o output.bcf -m -any && \
  bcftools index -f output.bcf

This is a quicker way to reproduce it:

(echo "##fileformat=VCFv4.1"
echo "##contig=<ID=X,length=155270560>"
echo "##INFO=<ID=END,Number=1,Type=Integer,Description=\"End coordinate of this variant\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "X\t86470037\trs59780433\tTTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG\tTGGTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG,T\t.\t.\tEND=85725113") | \
  bcftools norm --no-version -m -any -Ob -o output.bcf && \
  bcftools index -f output.bcf

In both cases the error I get is the following:

[E::hts_idx_check_range] Region 86470036..4380692409 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7

The problem seems to arise from bcftools norm splitting a variant with INFO/END smaller than POS. I previously dissected the bug incorrectly.
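
The min_shift/n_lvls hint in the error follows from the CSI layout: with min_shift bits for the smallest bin and 3 extra bits per level, an index can address positions up to roughly 2^(min_shift + 3*n_lvls). The wrapped coordinate 4,299,347,302 exceeds even 2^32, hence the suggestion of n_lvls >= 7. A sketch of the bound (my reading of the CSI scheme, worth verifying against the spec):

```python
def csi_max_pos(min_shift, n_lvls):
    """Largest coordinate a CSI index with these parameters can address."""
    return 1 << (min_shift + 3 * n_lvls)

print(csi_max_pos(14, 5))               # 536870912 (2^29, the default)
print(csi_max_pos(14, 6) > 4299347302)  # False: six levels still too small
print(csi_max_pos(14, 7) > 4299347302)  # True: seven levels suffice
```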

@pd3 (Member) commented Feb 11, 2020

I see it now. The norm command creates a new VCF record and htslib does not check the sanity of the END tag when writing out BCF.

@pd3 pd3 added the bug label Feb 11, 2020
pd3 added a commit to pd3/htslib that referenced this issue Feb 12, 2020
This is an extension of ea879e1 which added checks for reading
VCF.

Resolves samtools/bcftools#1154
daviesrob pushed a commit to daviesrob/htslib that referenced this issue Feb 27, 2020
@PlatonB commented Jun 12, 2020

@jmarshall @pd3 This bug recently hit pysam. It wasn't in v0.15.4, it appeared in v0.16.0.

@jmarshall (Member)

This is a feature, not a bug. The previous behaviour of not diagnosing the invalid INFO/END value and producing an unusable index was buggy. See #1154 (comment), in particular the biostars posting linked from there.

What program has produced the INFO/END values that you are seeing diagnostic messages about?

@PlatonB commented Jun 13, 2020

@jmarshall I tried to index the same chrX archive from 1000 Genomes as @freeseek.

Just in case, here is the error message:

[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
X.vcf.gz.tbi... Traceback (most recent call last):
  File "ld_triangle.py", line 444, in <module>
    prep_single_proc = PrepSingleProc(args)
  File "ld_triangle.py", line 95, in __init__
    self.intgen_convdb_path = collect_intgen_data(self.intgen_dir_path)
  File "/home/platon/_0_Диссертация/00_Программы/ld-tools-master/backend/collect_intgen_data.py", line 133, in collect_intgen_data
    tabix_index(intgen_vcf_path, preset='vcf')
  File "pysam/libctabix.pyx", line 1035, in pysam.libctabix.tabix_index
OSError: building of index for /home/platon/_0_Диссертация/Exp/LD_heatmap_test/1000G/X.vcf.gz failed

@susanfairley

@jmarshall, for background, the file being discussed is a liftover of 1000 Genomes phase three from GRCh37 to GRCh38. The END coordinate issue was introduced by the liftover process (which in this case was fairly complicated, going via dbSNP). Brief documentation (including this issue) is here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/README_GRCh38_liftover_20170504.txt. Ensembl share a version of this data where (I believe) these cases have been filtered out; I notice that this isn't mentioned in the existing README (we will aim to update it). It is also probably worth highlighting that there are increasing numbers of call sets for the 1000 Genomes samples called directly on GRCh38, where liftover artefacts are not an issue and all of GRCh38 was included in the calling. We are happy to take questions on the data at info@1000genomes.org.

@PlatonB commented Jun 16, 2020

As far as I understand, liftover was applied to all chromosomes. Then why does this problem occur only for chrX?

@freeseek (Contributor, Author)

I think chromosome X received special treatment. Notice, for example, that they forgot to liftOver the PAR1 and PAR2 regions...

@susanfairley

@freeseek - regarding chromosome X, as noted, the liftover process for this data set was somewhat atypical/complicated, going via dbSNP (rather than liftOver directly from GRCh37 to GRCh38). Given that chromosome X frequently suffers from anomalies associated with the PARs, etc., I find it plausible that there are some problems specific to chromosome X, but I'm not aware that the processing of chromosome X was any different from the others (based on the information available to me).

@PlatonB - I think the safest approach is to treat these case by case. This could be down to circumstances that only occur on chromosome X, but I'm afraid I can't guarantee that. You are correct that the liftover was applied to all chromosomes.

@jkbonfield (Contributor)

This doesn't change the fact that the data is erroneous; otherwise the newer code wouldn't be complaining. If 1000 Genomes have erroneous data then they need to be informed. Previously the code used these invalid fields without checking and could silently give the wrong results. There's nothing we can do to make it give the correct results, as the primary data needs fixing.

@susanfairley

@jkbonfield, agreed that this is a data issue and not an issue with the software - the current software behaviour described (not allowing the error to go undetected) sounds correct to me. For the data, 1000 Genomes are aware - this has been listed as a known issue for two years. We will, however, point people to the filtered version of the files hosted by Ensembl.

PlatonB added a commit to PlatonB/ld-tools that referenced this issue Oct 28, 2020
- Last time I implemented the filtering-out of repeated insertions of differing sizes incorrectly. The new algorithm is definitely correct.
- The CHROM-POS-ID table of the conversion DB is no longer populated in chunks. Not that much data goes into it, so there is no real worry about running out of RAM.
- Temporarily worked around the problem of indexing the chrX file with new versions of Pysam (samtools/bcftools#1154). For chrX, a ready-made index will now be downloaded from the 1000 Genomes FTP. Why didn't I do the same for all 1000G archives? Because then htslib would spam you with numerous warnings (see pysam-developers/pysam#877).
- Made the code that checks whether certain files exist more elegant.
wstokes pushed a commit to GSLBiotech/htslib that referenced this issue Jan 6, 2021
commit a7a90fe913f8a466f32f6e284cf46653944acd6f
Merge: fd0f895 cd0a84b
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Tue Sep 22 13:15:49 2020 +0100

    Release 1.11

commit cd0a84b490e910c085b987e3026f724be98349bf
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Tue Sep 22 13:14:15 2020 +0100

    More minor NEWS fixes

commit 80161279b755d83df6e0dbc52942f625df414452
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Fri Sep 18 11:44:04 2020 +0100

    Add back assert.h to vcf.h.

    This undoes a change in #1060.

    It's a philosphical point.  On one hand we have an inclusion of a
    spurious file that we do not utilise ourselves.  On the other hand we
    have an API breakage where old software (in this case bcftools 1.10)
    attemting to compile against current htslib no longer succeeds because
    we've changed which symbols we declare.

    This is unclean, but I think backwards compatibity trumps tidyness in
    this case.

commit 75e2017bb097cb47722ac8796de10036d21c9625
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu Sep 17 09:43:36 2020 +0100

    Minor NEWS updates

    Reword tabix --separate-regions description. Fix PR#1094 text -- the
    original has <space> in the description (edit the PR comment to see it).
    Credit @lczech. Trivial formatting fixes. Spelling mistakes PR surely
    too trivial to mention :-).

commit 1748a2dd34e5ff4b4800fcf04dbd5860d8c30f22
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Sep 16 10:10:46 2020 +0100

    More NEWS updates

commit d50a3a531c597a24644c6878fc0068d0d1cc0a0d
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date:   Mon Sep 14 20:49:22 2020 +0100

    News update Sept 2020.

commit af6dbb49b2d522bfd6d9762bd9329b28aa0b1e70
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Wed Sep 16 10:17:50 2020 +0100

    Spelling corrections [minor] (PR #1137)

commit 170047656473a7fadf9f23f3afdfac46b9c52b21
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Mon Aug 17 17:18:27 2020 +0100

    Handle q==0.0 case in kt_fisher_exact() and add tests

    When some of kt_fisher_exact()'s input arguments are very large,
    hypergeo(...) is dominated by the subtracted lbinom() so returns
    exp(-750 or less), which underflows, hence hypergeo() returns 0.0.

    When q is 0.0, kt_fisher_exact()'s loops never execute and which of
    i and j is closer to n11 is indeterminate. To avoid this problem
    a special case is added for q == 0.0 which determines the output
    values by comparing n11 to the location of the peak of the
    hypergeometric distribution.

    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>

commit 9067dee60e91eafba657b13e502307d500dcd489
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Fri Sep 11 14:16:28 2020 +0100

    Fixed a crash in bcf_record_check.

    Commit 3ac8a04 breaks the check if hdr is passed in as NULL.  This
    never happens in htslib tests, but does in bcftools.

commit 8bab82bdb8c2613e1ca7bd5573d7c12117a2dc02
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu Sep 10 07:05:16 2020 +0000

    Check against the proper type lenght.

commit 34ba6c726925efceaa2cc995d7d4b0409907f331
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Sep 8 18:23:24 2020 +0100

    Add sam_format_aux1 and bam_aux_get_str APIs. (PR #1134)

    sam_format_aux1 is extracted out from sam_format1_append, and
    indeed is the lion's share of this function.  It is useful as an external
    API for anything that is generating its own tags or a partial tag list in
    SAM format, such as "samtools fastq".  The tag key, type and data
    are passed in as separate pointers to aid usage with bam_aux_get
    which returns 2 bytes into the in-memory representation of a tag.
    It also allows the function to be used in cases where these are not
    stored together, for example the internal CRAM representation.

    bam_aux_get_str is a wrapper around sam_format_aux1 which
    simplifies the case of converting a single tag from a record in
    a bam1_t struct.

commit 3ac8a04f8f6071be0901a9ddcda296f58b2bcf0c
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Wed Aug 26 14:29:55 2020 +0100

    Sped up bcf_check_record.

    This is mainly by speeding up bcf_dec_typed_int1_safe, but also by
    some other code in there to streamline the function and keep the
    loops tighter.

    The effect on clang is slight.  On gcc9 reading an uncompressed BCF
    file can be a bit faster (up to 15% on one file), bringing the speed
    of gcc and clang to similar speeds when using -O3.  (Gcc -O2 is still
    noticably poorer than clang -O2 on this test though.)

    Examples on CHM1_CHM13_2.1.gatk_raw.gvcf.ubcf and gcc9 -O3.

    Before:
               1549.79 msec task-clock                #    0.999 CPUs util
            5549055377      cycles                    #    3.581 GHz
           16764915993      instructions              #    3.02  insn per

    After:
               1345.83 msec task-clock                #    0.999 CPUs util
            4806341286      cycles                    #    3.571 GHz
           13942053336      instructions              #    2.90  insn per

    Record checking is still around 50% of the entire decode CPU cost though.

commit f4f7f24914251744a928eb6f2aa8e89fa5788d0c
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Sep 1 12:28:33 2020 +0100

    Improvement (personal preference) to previous C99 fix.

    In #498 I removed the warning:

        vcf.c:3030:26: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
                     d->allele[i] = tmp.l;

    by storing a pointer starting from a fixed "" string, and later on
    subtracting it back again to get the length.  While that works, it's
    pretty icky and I think undefined behaviour as it uses pointer
    arithmetic outsides the bounds of an object.  (In reality it's fine on
    all modern systems as the days of segmented pointer archictures are
    long gone.)

    Instead I now cast the length into a char* as the size of allele is
    never going to grow beyond the size of a pointer, and then cast back
    prior to adding to the d->als pointer.

    [Ideally we wouldn't need these shenanigans at all and keep it in index
    space instead of pointers so growing memory doesn't require a lot of
    pointer juggling, but bcf_dec_t is a public structure.]

commit e9447f86ec386798bdf0cc9ee3646f982328e321
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Thu Aug 27 12:22:01 2020 +0100

    Remove now unused ks_resize2 function.

    NB: This never made it into a release so it isn't an ABI breaking
    change and nor does it need to be listed in the NEWS file.

commit e1d314931fe6b2c8d8bc5d0330b989d33adb175d
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu Aug 27 11:09:08 2020 +0300

    Have realloc and free in the same layer of kstring (#1129)

    Bring realloc back to the inline function, so that the allocation and freeing
    of memory are done by the calling layer.
    Replace the roundup to a power of 2 with a simpler formula that multiplies
    the desired realloc size with 1.5.
    Try to allocate a maximum of SIZE_MAX/2 bytes.

commit 78648e2a93aaec850ae47fa56cc5c68deb924fa0
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Aug 25 16:52:59 2020 +0100

    Bug fix to the pileup constructor.

    We call constructor and destructor for each new bam object that is
    added to and removed from the pileup.  Pileup also takes its own copy
    of the bam1_t struct as sam_read1 loops may be using the same
    structure over and over again.

    However the constructor was called with the passed in bam1_t instead
    of pileups own copy of it (unlike destructor).  Hence anything the
    constructor attempts to cache, such as the location of an aux tag,
    may get written over.

commit 0456cec92e06f4354bbdde26dd3a3b9d3b8bb0ae
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Fri Jul 31 11:16:59 2020 +0100

    Clarify that htsFile fields are not to be used directly

commit adeeb8870eadf3e005ae0f17e773d7d90cf58624
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu Jul 30 11:27:20 2020 +0100

    Add bcf_ prefix to variant_t struct

    This struct is used only as the type of a field within bcf_dec_t,
    which itself is an internal field of bcf1_t. So user code is very
    unlikely to be using this identifier directly.

commit 71b9e4ff763e97f66e0a139a4f3b2974075fb317
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Sat Jul 11 12:34:42 2020 +0100

    Rename public HTSlib struct tags to omit leading underscores

    File-scope identifiers starting with an underscore are reserved to
    the compiler, and there is nothing wrong with a struct tag matching
    a typedef name.

    In cases where foo_t is actually a typedef to a pointer, rename the
    struct tag from __foo_t to foo_s to distinguish it from foo_t.

commit 2fb3b7e178304eff76381c630e54e6408a94f834
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu Jul 30 10:53:04 2020 +0100

    Remove unused aux_key_t typedef

    This was added during PR #608 but accidentally not removed when the
    code using the typedef was removed later in the PR's development.

commit 3c6f04be2fa45ee8eb4c2f9a171f2ba8a6032588
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Wed Jul 29 14:09:14 2020 +0100

    Add struct tags for public typedeffed structs

    This enables user code to forward declare these types with
    e.g. "struct htsFile;" without including <htslib/hts.h>.
    This commit adds tags for typedefs that previously didn't have one.

commit 7ecaaf90d4515fa9747e7536234722e5d31e514a
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Aug 20 17:43:27 2020 +0100

    Ignore errors when closing the input file in the fuzzer

    Since 3855184b, hts_close() returns non-zero if earlier
    errors were detected on the file handle.  As this frequently
    happens with fuzz input, it should not be treated as a fuzzing
    failure.

commit 4162046b28a7d9d8a104ce28086d9467cc05c212
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Aug 13 15:45:59 2020 +0100

    Tidy bcf_hdr_subset() error handling

    Gather clean-up code in one place, fixing some
    cases that were incompletely handled.

    Catch more memory errors directly.

commit 564baf0364762cdd7fadde03ff333c379ec078f8
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Aug 13 12:21:29 2020 +0100

    Improve clean-up in hts_tpool_init()

    Allocate hts_tpool::t_stack earlier so it's easier to clean up
    if it fails.

    Add error handling code to tidy up any threads that got created
    prior to a pthread_create() failure.  Rename variable i to
    t_idx so it's easier to see how it links between the thread creation
    and the clean-up code.  Adds a dependency on htslib/hts_log.h so
    it can use hts_log_error().

commit 260c5a35ad855788baba16c5419ae50b6d23061a
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Mon Aug 10 09:59:50 2020 +0100

    Improve bcf_hdr_format error checking.

    It now handles ksprintf failing, and all the calls to it now also
    check for failure and bubble it up one layer.  Everything that calls
    that layer already had checks, although mostly these are absent in
    bcftools.

    (That's an entirely new repository and PR though.)

    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>

commit 6e759d374751f2743811bedfb761d6feacbced6b
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Mon Aug 10 09:46:04 2020 +0100

    Improved return from bcf_hdr_parse_line to enable error detection.

    The function returns NULL for both failure and also for running out of
    text.  Hence it's executed in a loop until it returns NULL, assuming
    success.

    On successful end it also sets *len to 0, so now on error we
    explicitly set *len to non-zero so it is possible to use that as
    a side channel for disambiguating EOT from error.  Values returned
    in *len are documented in the header file.

    If bcf_hdr_parse_line() returns NULL due to a malformed line,
    bcf_hdr_parse() can use the returned value in *len to skip to
    the next one instead of searching for the newline itself.

    To stop multiple warning messages from being printed on finding
    malformed lines, make hts_hdr_parse_line() responsible for logging
    both cases that it traps and remove the corresponding logging from
    hts_hdr_parse().  Use hts_strprint() when printing bad lines to
    sanitise any unprintable characters.

    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>

commit 3855184b58ff51735e9a70282a286702a24ae7f3
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Thu Aug 6 11:11:35 2020 +0100

    More mem-checking & threading hardening (write side).

    - Failing to allocate the structure to store a thread result now shuts
      down all processing queues.  This avoids potential deadlock.

    - Failure in the thread pool now sets the shutdown signal to 2 instead
      of 1, offering a simple way to determine error-induced shutdown
      instead of normal shutdown.

    - BGZF now checks for errors in bgzf_close.  Previously some errors
      were caught, but still returned 0 from close and weren't propagated
      back to the application.

    - khash failure in cram_encode_aux could leave fd->metrics_lock
      locked.  Delete unused hash entry if cram_new_metrics() fails.

    - Improved hts_tpool_process_flush to avoid deadlocks by periodically
      checking for shutdown.  Also fixed a bug in a previous commit where
      a shutdown queue could lead to flush returning before n_processing
      hits zero, which in turn could lead to memory being freed for
      running jobs.

    - Check for string_ndup failure in cram_encode_aux.

    - Removed double free in error handling of cram_stats_encoding.

    - As with SAM reader, the threaded writer now uses the
      hts_tpool_process_shutdown function instead of ..._destroy to avoid
      a race condition between that and fd->q = NULL.

    - Don't attempt to free h->target_name elements when they haven't been
      initialised yet due to error in creation.

    - Fix potential double free in sam_format_worker (on error).

commit 8ae25df4bfba9c2bc9b4ac1f6c9fa1de70d5eec9
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Wed Aug 5 12:22:55 2020 +0100

    More protection against running out of memory (all formats read side).

    These have been found by using a malloc replacement (added to the
    Makefile plus header include using CFLAGS="-g -include malloc_fuzz.h")
    so that the Nth memory allocation will always fail.  We then loop from
    1 upwards until we succeed, checking every place for a proper handling
    of the error.

    Eg:

        for i in `seq 1 20000`
        do
            echo $i
            FUZZ=$i ./test/test_view -@4 -B in.bam 2>&1
            test "$?" ">" 1 && break
        done

    Changes include

    - bgzf MT reading now sets mt->command = CLOSE and has additional
      calls to shutdown the queue on memory errors.
      This is checked for in bgzf_check_EOF to avoid deadlock.
    - bgzf now has a pool_create() error check.
    - Distinguish between errors vs would-block in cram_decode_slice_mt
    - Make cram_drain_rqueue cope with failed jobs.
    - Improve cram_close to cope with shutting down after failed
      initialisation, such as rqueue not being initialised.
    - Check for realloc failure in refs_load_fai.
    - Fix segfault in cram_decode_slice if cram_get_ref failed.
    - Removed abort in load_hfile_plugins.  It now returns -1 and is
      checked for in find_scheme_handler.
    - Documented return values in hts_getline
    - Attempt to propagate memory error in kgetline2 to file pointer, but
      this only works for hgets (not gets).  Fundamentally this is
      unsolvable with the current API.
    - Fixed memory tear-down in sam_parse_worker so it doesn't claim to
      have allocated more than it managed.
    - sam_set_thread_pool now copes with failure to create the queue.
    - Tabix index_load now checks calloc return.
    - Extra allocation return checks in hts_tpool_process_init and
      hts_tpool_init.
    - Check ksprintf worked in vcf fix_chromosome.

    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
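
    The fuzzing harness described above can be sketched as a malloc
    wrapper.  The real malloc_fuzz.h is not shown in this log, so the
    names below are illustrative; only the fail-on-the-Nth-allocation
    idea is taken from the commit:

```c
#include <stdlib.h>

/* Minimal sketch of the allocation-fuzzing idea: a malloc wrapper
 * counts allocations and fails the Nth one, with N taken from the
 * FUZZ environment variable (-1 disables fuzzing). */
static long fuzz_countdown = -1;           /* -1 = fuzzing disabled */

static void fuzz_init(void)
{
    const char *s = getenv("FUZZ");
    fuzz_countdown = s ? atol(s) : -1;
}

static void *fuzz_malloc(size_t size)
{
    if (fuzz_countdown > 0 && --fuzz_countdown == 0)
        return NULL;                       /* simulate ENOMEM on call N */
    return malloc(size);
}
```

    Looping FUZZ=1,2,3,... as in the shell snippet above then walks the
    simulated failure through every allocation site in turn.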

commit 773b6ed3c962c434901e0194607bbc8fc6c2fde2
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Mon Aug 3 11:25:03 2020 +0100

    Fixed deadlocks and crashes in threaded SAM parsing.

    There was a race with the closing code, caused by early termination
    (eg failure during parsing).  This has been solved by waiting for the
    sam_dispatcher_read to acknowledge and return a SAM_CLOSE_DONE call.

    There were also some crashes caused by aborting even earlier, eg
    failing to launch all the threads due to running out of memory.

    Note extremely low memory may still lead to crashes in this code (but
    no more deadlocks).  It's challenging to cover even the most basic
    structures failing to initialise.  In practical scenarios it should
    be robust.

    (Via Rob Davies): sam_dispatcher_read now uses
    hts_tpool_process_shutdown instead of hts_tpool_process_destroy.  It
    also removes any change from SAM_CLOSE_DONE to SAM_CLOSE or SAM_NONE
    to avoid race conditions or deadlocks.

    Fixed a thread pool bug where worker threads could still launch new
    jobs after a queue (process) had been shutdown provided the pool
    itself hadn't.  We now check both pool and process.

    Similarly fixed hts_tpool_process_flush to work after shutdown instead
    of wedging.  However we don't use this so the change is not strictly
    needed for this particular fix.

    Finally improved on the error handling so we use errno over EIO if
    errno is set, to avoid masking the true source of error.

commit 0e8a6e21e8e30182041cb0b83407de952b521ffd
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Wed Aug 5 10:08:43 2020 +0100

    Fix bug in the line reallocation for threaded SAM decoding.

    (The +NM/2 was on the wrong side of the comparison!)

    The +8 changes here may not be necessary, but we did it for some
    allocs so I'm just being consistent and making sure it's always got 8
    more than it uses.  Either it's not necessary in which case those
    changes are harmless, or it is necessary in which case we were
    sometimes not always leaving at least 8 bytes remaining.

    I believe this fixes https://github.com/samtools/samtools/issues/1293

commit a79009b38ce83e39bcbc8f54c00cf203621aa5bb
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Wed Aug 5 10:17:46 2020 +0100

    Don't trash errno in bam_tag2cigar.

    bam_aux_get sets errno=ENOENT when we look for a tag and don't find
    it.  Restore it to the old value when failing to find CG tags as
    not finding the tag is not an error.

    The consequence of samtools/samtools#1296 was that some errors
    found while decoding a BAM file got reported as "No such file
    or directory" prior to this commit.

    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
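
    The save/restore pattern from this commit can be sketched in
    isolation.  find_tag() and find_tag_optional() below are illustrative
    stand-ins, not htslib functions; only the errno discipline mirrors
    the fix:

```c
#include <errno.h>
#include <string.h>

/* A lookup that flags "not found" via errno (as bam_aux_get does with
 * ENOENT) must not clobber the caller's errno when absence of the tag
 * is not an error. */
static const char *find_tag(const char *rec, const char *tag)
{
    const char *p = strstr(rec, tag);
    if (!p) errno = ENOENT;            /* lookup reports absence via errno */
    return p;
}

static const char *find_tag_optional(const char *rec, const char *tag)
{
    int saved_errno = errno;           /* remember the caller's errno */
    const char *p = find_tag(rec, tag);
    if (!p) errno = saved_errno;       /* absence is fine: restore errno */
    return p;
}
```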

commit ab045d8c9d13a023ed873bd4788e851dd52937f2
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Aug 4 15:24:20 2020 +0100

    Add missing Makefile dependency [minor]

commit 31f0a76d338c9bf3a6893b71dd437ef5fcaaea0e
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date:   Wed Jul 22 09:55:16 2020 +0100

    Update dates in LICENSE and program versions.

commit f098e4da5728e8ba4ca4351b9a1b945d17dd8982
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date:   Fri Jul 17 12:01:59 2020 +0100

    Update the copyright to 2020 where needed.

commit a2fdf3b2fa29844a2d48c4d786471f05bd7821a9
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Sat May 9 13:42:28 2020 +0100

    Consider alignments which consume no reference bases to have length 1

    In bam_endpos(), similarly to reads that are unmapped or have no CIGAR,
    consider reads whose CIGARs consume no reference bases to have length 1.
    Thus consider them to cover one reference position (rather than zero)
    for indexing purposes. Fixes samtools/samtools#1240.

    Use bam_cigar2rlen() directly in bam_plp_push() to avoid this and other
    adjustments. (This function already ignores unmapped reads separately.)

    Update code that sets the bin field accordingly: use bam_endpos() where
    applicable; when not all fields have been set up, add similar code to
    make adjustments when rlen==0.
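
    The rule can be sketched as follows.  cigar_rlen() and endpos() are
    toy stand-ins for bam_cigar2rlen()/bam_endpos(), operating on a CIGAR
    string rather than BAM's packed encoding:

```c
#include <stdlib.h>
#include <string.h>

/* Reference length consumed by a text CIGAR: only M/D/N/=/X consume
 * reference bases; I/S/H/P do not. */
static long cigar_rlen(const char *cigar)
{
    long rlen = 0;
    while (*cigar) {
        char *end;
        long n = strtol(cigar, &end, 10);
        if (!*end) break;                      /* malformed: trailing number */
        if (strchr("MDN=X", *end)) rlen += n;  /* ops that consume reference */
        cigar = end + 1;
    }
    return rlen;
}

/* The bam_endpos() rule described above: a read whose CIGAR consumes
 * no reference bases (e.g. "5I") still covers one reference position
 * for indexing purposes. */
static long endpos(long pos, const char *cigar)
{
    long rlen = cigar_rlen(cigar);
    return pos + (rlen > 0 ? rlen : 1);
}
```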

commit 302843d739d65d3192c52558b154c25f2a00c103
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Tue Jul 28 15:59:14 2020 +0300

    [tabix] Add --separate-regions option (#1108)

    * Add tabix --separate-regs option.

    * Rename option to --separate-regions.
    Print the region name on top of the corresponding group.
    Add unit test and minor refactoring.

commit 3a359cd61322a2fc546198f72cb36988924ab765
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jul 28 12:01:52 2020 +0100

    Download index files atomically in idx_test_and_fetch() (PR #1112)

    * Reuse cram_populate_ref()'s atomic downloading in idx_test_and_fetch()

    Extract filename uniquification into a new hts_open_tmpfile() function
    declared in hts_internal.h. No code in hts.c currently uses pthreads
    directly, so instead of mixing pthread_self() into tmpname, mix in the
    pointer value of one of the arguments -- which will vary across threads.

    Test whether the local index file already exists using access() rather
    than a trial fopen() into local_fp, which is no longer a FILE*.

    * Improve cram_populate_ref() download error handling

    Clean up the temporary file if any of hwrite/hclose/chmod/rename
    fail. Putting hclose() first in the if condition ensures it is always
    executed, and putting rename() last ensures that the file is still
    at path_tmp in any failing case.
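
    The atomic-update pattern described above can be sketched with plain
    stdio.  save_atomically() is an illustrative name; the filename
    uniquification (mixing in a pointer value) is omitted here:

```c
#include <stdio.h>
#include <string.h>

/* Write to a temporary path and only rename() onto the final path once
 * write and close have both succeeded, so a partial file never sits at
 * the destination. */
static int save_atomically(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    FILE *fp = fopen(tmp, "wb");
    if (!fp) return -1;

    /* fclose() is first in the if condition so it always runs, and
     * rename() is last so the file is still at tmp in any failing case,
     * mirroring the ordering from the commit above. */
    int write_failed = fwrite(data, 1, len, fp) != len;
    if (fclose(fp) != 0 || write_failed || rename(tmp, path) != 0) {
        remove(tmp);           /* clean up; the final path is untouched */
        return -1;
    }
    return 0;
}
```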

commit da595880d134b29fb62f726b72bdc6302ffc9ccc
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Jul 21 10:19:56 2020 +0100

    Fixed a (harmless) CRAM uninitialised data access.

    This was caused by the fix in PR #1094.

    The problem there was data without explicit quality scores being set
    (CF data series with "preserve qual scores" not set) could be coupled
    with explicit Q or q feature codes to add qualities at specific
    sites.  Without quals, the default is "*" (a run of Q255), but we
    can't just amend a few qualities here and there as that leads to a mix
    of Q255 and Q-something-else.  We memset from Q255 to Q-something-else
    on the first explicit quality change.

    The flaw is that this check is applied even on data where the
    "preserve qual scores" CF value is set, but we haven't initialised the
    quality string to all-Q255 in that scenario.

    However it was totally harmless: if we preserve quality scores,
    then the entire array of quality values is subsequently
    overwritten (after applying any per-base qualities - dumb I know, but
    it was never intended for that scenario).  So irrespective of the
    outcome of the uninitialised check and whether or not the memset to
    Q30 happened, the data still ends up the same.

    Note curiously this does not show up with -fsanitize=address, but
    valgrind does detect it.  Hence the test harness was passing fine.

commit dcd4b7304941a8832fba2d0fc4c1e716e7a4e72c
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Mon Jul 13 11:48:49 2020 +0100

    Fix check for VCF record size

    The check for excessive record size in vcf_parse_format() only
    looked at individual fields.  It was therefore possible to
    exceed the limit and overflow fmt_aux_t::offset by having
    multiple fields with a combined size that went over INT_MAX.
    Fix by including the amount of memory used so far in the check.

    Credit to OSS-Fuzz
    Fixes oss-fuzz 24097
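
    The corrected check can be sketched as follows.  grow_checked() is an
    illustrative helper, not the htslib code; the point is that the
    running total, not just the new field, must be bounded:

```c
#include <limits.h>
#include <stddef.h>

/* Rather than testing each field's length in isolation, include the
 * memory already used, so several fields cannot jointly overflow an
 * int-sized offset. */
static int grow_checked(size_t used, size_t field_len)
{
    if (field_len > INT_MAX || used > (size_t)INT_MAX - field_len)
        return -1;             /* combined size would exceed INT_MAX */
    return 0;
}
```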

commit df31e599a3d3778804b6a7fda9bb5fdc683ee34c
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Fri Jul 10 16:53:43 2020 +0100

    Avoid crash for `bcf_update_alleles(hdr, rec, NULL, 0)`

    Updating a bcf1_t to have 0 alleles is not particularly meaningful, but
    it is a valid state as that's what bcf_init() and bcf_clear() produce.
    Hence it is a valid thing for bcf_update_alleles() to do too, so calling
    it with nals==0 should not crash. Instead set rlen the same way that
    bcf_init() and bcf_clear() leave it. Fixes #994.

    Also fix bcf_get_format_values() allocation failure check [minor].
    Hat tip @reedacartwright.

commit c7c7fb56dba6f81a56a5ec5ea20b8ad81ce62a43
Author: daviesrob <rmd+git@sanger.ac.uk>
Date:   Wed Jul 15 09:03:25 2020 +0100

    Fix hfile_libcurl breakage when using libcurl 7.69.1 or later (#1105)

    Curl update https://github.com/curl/curl/pull/5050 (pause: bail out
    on bad input) made curl_easy_pause() return an error if passed
    a disconnected CURL handle.  In hfile_libcurl this could happen
    when libcurl_read() or libcurl_close() was called after all
    bytes had been read; or all the time in restart_from_position().

    Fix by checking for fp->finished in libcurl_read() and
    libcurl_close(); and by removing the curl_easy_pause() in
    restart_from_position().

    Thanks to @jmarshall for tracking down the exact libcurl change
    that caused the incompatibility.

commit 9c357445c5dab64db4000497193975fa12f63225
Author: Ilya Vorontsov <vorontsov.i.e@gmail.com>
Date:   Wed Jul 8 15:52:24 2020 +0300

    typo "Number=R" instead of "Number=G" in error msg

commit 2c49d1913b05e754e1f180fc64d187ba0baba1a8
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Jun 24 18:44:24 2020 +0100

    Don't call exit(1) on invalid INFO numbers in vcf_format()

    Instead, set errno = EINVAL and return -1, as for other types of
    invalid record.

commit 6fc307b777cb20ea52820d88e7e7c2c645edfd4b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Jun 24 17:02:14 2020 +0100

    Catch invalid dictionary indexes in vcf_format()

    bcf_record_check() can't check the validity of references to the
    header dictionaries when using the iterator interface as it doesn't
    have access to the header.  Add a check to vcf_format() to catch any
    invalid records that make it that far.

    Changes the existing, incomplete check for FORMAT ids so
    that it returns -1 instead of calling abort().

    Adds 'const' qualifier on 'rec' to bcf_seqname() and
    bcf_seqname_safe() so they can be used without casts in vcf_format().

commit 1d6b68263a58786b6ab88f579e2866ad9e13c17b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Jun 24 16:29:12 2020 +0100

    Better catch BCF CHROM/FILTER/FORMAT ids absent from the header

    It is possible to make a sparse set of ids in the VCF
    header using IDX= annotations.  This can cause problems
    if a BCF record refers to one of the indexes that does
    not exist.

    Update bcf_record_check() so that these bad records are caught
    while files are being read.  It already did this for INFO; this
    adds similar checks for CHROM, FILTER and FORMAT data.

    Credit to OSS-Fuzz
    Fixes oss-fuzz 23670

commit 832e4ba7b43b68a53ef2277cfbc643b85c8d32fd
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Tue Jul 7 18:46:05 2020 +0100

    Fix paths for Windows tests.

commit 818b271f8be6d71258364aff69075ede97a49054
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Jul 1 18:33:18 2020 +0100

    Add synced reader no-index tests

commit 6e0ed10ef42f393628ee4a0e628a8acc5023ed0a
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Wed Jun 24 14:46:17 2020 +0100

    Support for simultaneous reading of unindexed VCF/BCF files.

    In order for this to work, the caller must ensure that the input
    files have chromosomes in the same order and consistent with the
    order of sequences in the header.

    Replaces #1076.

commit c9fdcbf2bb1e700f388a3302914c6f272e3f84ff
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Fri Jun 12 16:43:36 2020 +0100

    Don't dlclose() plugins in atexit() handler

    Prevents a segfault when HTSlib has been imported using dlopen(),
    a plugin that itself links against HTSlib is in use and
    hts_lib_shutdown() was not called before dlclose(htslib).

    If hts_lib_shutdown() is not called, dlclose(htslib) will
    not remove libhts.so if any plugins linking it are installed.
    In this case the atexit() handler runs at program exit where
    it can clean up any memory still in use.  There's no need to
    remove the plugin shared libraries (which causes the segfault
    as it removes libhts.so while code inside it is running) as
    the runtime will tidy them up anyway.

    If hts_lib_shutdown() has been called, the plugin shared
    libraries will already have been removed.

commit 913fccf7edeb4e959428996e040b958645dc0b2b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Fri Jun 12 12:55:13 2020 +0100

    Add --enable-plugins to travis (and fix -ldl configure test breakage)

    Using Address Sanitizer on Ubuntu Xenial with gcc-8 somehow
    breaks the configure test to see if -ldl is needed, making it
    incorrectly decide that it isn't.  This causes a link failure
    when building test/plugins-dlhts when the compiler decides
    that it really did want -ldl after all.  Oddly, other
    outputs that reference dlopen(), dlsym() etc. have already
    built successfully by this point.

    Fix by making configure test for dlsym() instead of dlopen(),
    which seems to work better.

commit 9a15df9d394777c45abcdebf27dd65fbaf1a8b79
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jun 23 13:40:45 2020 +0100

    Log when calling hts_lib_shutdown/dlclose/etc

    Facilitates adding debugging printfs to e.g. hfile_shutdown()
    to observe exactly where in the sequence they occur.

commit 00c61d2c8d83324494d4edab42c452c2d3cc26fc
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jun 16 08:06:49 2020 +0100

    Provide separate dlsym() wrappers for functions and data

    Avoid -pedantic warnings about data pointer <-> function pointer
    conversions (which are required to work on POSIX platforms so that
    dlsym() works, but are nonetheless warned about by compilers).

    As we're providing wrappers around dlsym() anyway, we can provide
    separate function and data wrappers; change load_plugin()'s return
    type as it (practically) always returns a function. This uses the
    workaround recommended in POSIX Issue 6 dlsym() (since removed in
    Issue 7 by [1]; at least for GCC, type-punning through a union
    might be preferable). Hat tip Rob Davies.

    [1] https://www.austingroupbugs.net/view.php?id=74

commit 2cb229aedbdf81c968a9030521a9b7ed42cce599
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Sat May 16 09:01:12 2020 +0100

    Link libhts.so into plugins on Unix&macOS; add hts_lib_shutdown()

    When libhts.so/.dylib has been dynamically loaded with RTLD_LOCAL
    (which is the default on Linux), plugins that use hts_always_remote()
    etc will not be loadable unless they make libhts's symbols available
    themselves.

    When libhts.so/.dylib has been dynamically loaded, some plugins have been
    opened, and dlclose(htslib) is called: htslib is not actually unloaded
    (or its atexit functions e.g. hfile_exit() called) as the plugins are still
    using libhts symbols. Eventually at program exit, all atexit functions are
    called; so the plugins are closed by hfile_exit(), htslib is suddenly
    unloaded when the final plugin is unloaded, and a segfault occurs because
    this all happens within hfile_exit() -- which has just been unloaded.

    To break this cycle, introduce hts_lib_shutdown() which must be called
    before dlclose(htslib), if that is called explicitly prior to program exit.
    This unloads the plugins so that dlclose() can immediately unload htslib.

    Add test exercising libhts via dlopen(). This ensures that all available
    hFILE plugins have been loaded, then calls hts_lib_shutdown() and
    dlclose(htslib). As this is the first htslib test that tests things
    linked to the shared library, we need to set $LD_LIBRARY_PATH etc so
    that the freshly-built libhts.so/.dylib/etc is used.

    Add with-shlib.sh script that creates a suitable temporary libdir
    containing *only* the freshly-built shared libhts. On Windows platforms,
    this directory is added to %PATH% so shouldn't contain anything else.
    On macOS, having hfile_*.bundle files in $DYLD_LIBRARY_PATH would reduce
    the ability to test against different sets of plugins by varying $HTS_PATH.

commit 82d01ba63b20eed8905ea60c2e1cfe0f96c73bab
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Fri Jul 3 14:41:19 2020 +0300

    Implement vcf_open_mode method (PR #1096)

    Which can detect the opening mode of a file, based on its extension.

commit cb44caaebb25ec334f257b96f9c207cfa6f346c7
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Wed Jun 24 07:32:51 2020 +0100

    Check for null pointer in source string.

commit 29d06c2f84dc20fbc5f94ca1b181292433b69a21
Author: daviesrob <rmd+git@sanger.ac.uk>
Date:   Thu Jul 2 13:07:21 2020 +0100

    Improve bam_aux_update_str() (#1088)

    Makes bam_aux_update_str() easier to use by not insisting that
    the byte count passed in via the len parameter should include
    the NUL-terminator for the string.  This makes it less likely
    that an invalid tag will be generated by forgetting to include
    the NUL, and also allows tags to be made from part of a longer
    string.  To support this, it is made to handle adding new tags
    itself instead of delegating to bam_aux_append().

    Where existing tags are replaced, bam_aux_update_str() is
    changed to more accurately calculate how much extra space is
    required.  A call to bam_aux_del() is removed and the tag
    shuffling code updated to reduce the number of memmove() calls
    needed to get the following tags in the right place.

    Adds some extra tests to ensure growing and shrinking tags
    both work correctly.  Also fixes a bug in the test/sam.c
    test_mempolicy() function, which was not adding the correct
    number of bytes when appending the ZZ tag to each alignment
    record.

commit 33bfcfa37df8beae67d3b0874c7253fe14d8929e
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Jun 30 17:01:27 2020 +0100

    Fix mpileup regression between 1.9 and 1.10.

    When RNEXT / PNEXT / TLEN are set to the uninitialised values (* / 0 /
    0) the overlap_push function was escaping the overlap removal code.

    We now still look for overlaps in this case.  It's not as efficient
    when it doesn't have these values as we can't optimise the case of
    detecting non-overlapping reads, but we do get the correct answer again.

commit 3d55140fead0a42762abc7c62032e3ee1265f30f
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Fri Jun 26 12:29:48 2020 +0100

    Fix support for decoding CRAM 'q' feature.

    This was incorrectly marching along seq / ref / cig-len, which broke
    the CIGAR string and base calls.

    Test file: https://github.com/jkbonfield/hts-specs/blob/CRAM_tests/test/cram/passed/1005_qual.cram

commit 146e6a28c948e935add16e106f77482cfc7e383d
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Thu Jun 25 14:18:07 2020 +0100

    Set CRAM default qual for lossy qual modes.

    If we have lossy quality enabled and couple that with the use of 'B',
    'q' or 'Q' features, CRAM starts off with QUAL being all 255 (as per
    BAM spec and "*" quality) and then starts to modify individual
    qualities as dictated by the specific features.

    However that then produces ASCII quality <space> (q=-1) for the
    unmodified bases.  Instead we use ASCII quality "?" (q=30), as per
    htsjdk.  We still use quality 255 for sequences with no modifications
    at all, though.

    NB: Neither htslib nor scramble can produce test data to demonstrate
    this bug.  I used gdb to tweak the CF value (b cram_encode.c:2769 and
    then set cr->cram_flags=0) along with using a SAM file using IUPAC
    codes in the sequence.

commit 808affd6c25d2df7c59ab8d463dc76015bcbc31a
Author: Damien Zammit <damien@zamaudio.com>
Date:   Sat Jun 27 02:17:23 2020 +1000

    Fix dirty default build by including latest pkg.m4 instead of using aclocal.m4 (PR #1091)

commit 1e1186923f67d82edb181ce6da9ab7d38a20e3eb
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu Jun 25 14:07:30 2020 +0100

    Fix a bug that caused cram_codec_decoder2encoder to return an error code.

commit 313a9fc1e3df139bb32d81d9b660a43a5dc9483a
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Fri Jun 19 15:27:08 2020 +0100

    CRAM fix for MD tag generation with "b" FC.

    We don't normally generate these features, using "B" instead.  However
    some hand crafted CRAM files have shown we didn't create the correct
    MD tag due to forgetting to clear the distance since last edit, giving
    rise to the last numeric in MD being high.

    Eg: MD:Z:0A98T0 came out as MD:Z:0A98T98 if that last T edit was a "b"
    feature.

commit 16f62c5663bc09fda7c87f78322d1c561f483055
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Fri Jun 5 19:23:19 2020 +0100

    Avoid warnings from recent and upcoming Clang and GCC releases

    Prevent GCC 10 strncpy() warning by telling the compiler that
    cram_file_def::file_id is not a C-style NUL-terminated string.

    GCC 10 warns for `int_var = uint32_var = (uint32_t) -1`.

    Clang 11 warns that float doesn't have enough precision to represent
    RAND_MAX (0x7fffffff) exactly. Recode using uint64_t arithmetic rather
    than (32-bit) float arithmetic.

commit 34c15969fa49d78b07c649374fb008fb3ed1e98a
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Fri Jun 5 14:35:42 2020 +0100

    Add documentation to hts_itr_t structure.

    Add a few minor changes.

commit d3147150b4f90a75746500137357f63bb8530d81
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Wed May 6 15:30:03 2020 +0100

    Fix the multi-threaded CRAM multi-region iterator regression.

    Fixes #1061

    This appears to have been introduced during bfc9f0d and fixed for BAM
    only in 6149ea6.  The effect on multi-threading decoding for CRAM was
    significant.

    This fix takes a totally different approach, which not only fixes the
    regression but surpasses the original performance.  CRAM can do CRAM_OPT_RANGE queries, used by
    the single version of the iterator, which informs the decode of both
    start and end range positions.  When threading this means we don't
    preemptively decode the next container unless it'll actually be used.
    This same logic is now used in the multi-region iterator, although
    it's complex.

    The general strategy is as follows:

    - Ensure we know the next container start from index.  This needs a
      small tweak to cram_index struct as the next container isn't quite
      the same as this container + slice offset + slice size (sadly).  I
      think it's missing the size of the container struct itself.

    - When being given a start..end offset, step into the reg_list to find
      the corresponding chr:start-end range.

    - Produce a new CRAM_OPT_RANGE_NOSEEK option that does the same job
      as the old CRAM_OPT_RANGE opt but without the seek.  This is
      necessary because hts.c does the seek for us (and the pseek/ptell
      only work with each other and not within the cram subdir code itself).

    - Identify neighbouring file offsets and merge together their
      correponding ranges so a block of adjacent offsets becomes a single
      CRAM_OPT_RANGE query.  We cache the end offset (.v) in
      iter->end so we can avoid duplicating the seek / range request
      in subsequent intervals.

    - Manage the EOF vs EOR (end of range) return values.  For EOR we have to
      do the incrementing ourselves as we need to restart the loop without
      triggering the end-of-file or end-of-multi-iterator logic below it.

    - Tweak the region sorting a bit so ties are resolved by the max
      value.  We want the index into reg_list to also be sorted, so we can
      accurately step through them.  Note this logic is CRAM specific, as
      the sorting wouldn't work on BAI anyway due to the R-tree.

    Some benchmarks on ~37,000 regions returning ~110 million seqs across
    NA06985.final.cram:

    CRAM-1.10 no threads:  real  4m37.361s   user  4m21.128s   sys  0m7.988s
    CRAM-1.10 -@16:        real  3m14.371s   user 28m48.872s   sys  4m52.442s
    CRAM-dev -@16:         real  5m55.670s   user 61m0.005s    sys 11m23.147s
    CRAM-current -@16:     real  1m55.701s   user  5m54.234s   sys  0m51.699s

    The increase in user time between unthreaded and 16 threads is modest.
    System time increase is considerable, but this appears to primarily be
    down to glibc malloc inefficiencies.  Using libtcmalloc.so instead gives:

    1M CRAM-current -@16:  real  1m30.882s   user  6m9.103s    sys  0m4.960s

    We're still nowhere near using 16 threads, but this is because the
    average size of each region is significantly smaller than 16
    containers worth.

commit 9de45b72b3a3f3df7a0ffda3554aeab1f6da50ba
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue May 12 17:28:21 2020 +0100

    Further speed up of VCF parser (formats).

    The if/else-if checks are now a switch statement, juggled in order a
    bit.  Also the fmt[j].max_m style of code is now f->max_m, with f
    incremented along with j.

    The impact on gcc builds is minor (maybe 1%), but for clang this was
    8-9% speed improvement on a 1000 genome multi-sample VCF.  Example
    timings for clang:

    Previous commit
          23333.77 msec task-clock                #    1.000 CPUs utilized
       83667512727      cycles                    #    3.586 GHz
      199145555089      instructions              #    2.38  insn per cycle
       43099743981      branches                  # 1847.097 M/sec
         665687093      branch-misses             #    1.54% of all branches

    This commit
       75967289857      cycles                    #    3.585 GHz
      195076309580      instructions              #    2.57  insn per cycle
       43265084488      branches                  # 2041.736 M/sec
         640008186      branch-misses             #    1.48% of all branches

commit b1acab6219c12c3a6cc127b37e0b03bcf4b90c3f
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu Apr 23 13:23:03 2020 +0100

    Implement and use the `hts_str2dbl` method (by @jkbonfield)

    For faster conversion of strings to floating point values.  It
    speeds up the conversion of values which match the regexp
    [+-]?[0-9]*[.]?[0-9]* and have at least one and no more than 15
    digits in total.  Values that don't match (for example they include
    an exponent) are handled by passing to strtod().

    Co-Authored-By: James Bonfield <jkb@sanger.ac.uk>
    Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
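
    The idea can be sketched as below.  str2dbl() is an illustrative
    stand-in for hts_str2dbl(), not the real implementation; only the
    fast-path/fallback split described in the commit is reproduced:

```c
#include <stdlib.h>

/* Fast path for values matching [+-]?[0-9]*[.]?[0-9]* with 1..15
 * digits; anything else (exponents, inf/nan, too many digits) falls
 * back to strtod(). */
static double str2dbl(const char *s, char **end)
{
    const char *p = s;
    int neg = 0, digits = 0;
    unsigned long long ip = 0;
    double scale = 1;

    if (*p == '+' || *p == '-') neg = (*p++ == '-');
    while (*p >= '0' && *p <= '9') { ip = ip * 10 + (*p++ - '0'); digits++; }
    if (*p == '.') {
        p++;
        while (*p >= '0' && *p <= '9') {
            ip = ip * 10 + (*p++ - '0');   /* keep all digits in one integer */
            scale *= 10;                   /* and track the decimal point */
            digits++;
        }
    }
    if (digits < 1 || digits > 15 || *p == 'e' || *p == 'E')
        return strtod(s, end);             /* unusual form: use the libc path */
    if (end) *end = (char *)p;
    return (neg ? -1.0 : 1.0) * (double)ip / scale;
}
```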

commit 6d7da635eaa744f8140e10be90995864da381c58
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Tue Apr 21 11:42:23 2020 +0100

    Speed up GT parsing

    * Parse numeric GTs as unsigned integers with at most 30 bits.
      (no more than 30 bits are needed due to the maximum permitted
       value of the field).

    * Pull error checking out of the innermost loop.
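
    The fast-path idea can be sketched like this.  parse_gt_allele() is
    an illustrative name, not the htslib function; only the 30-bit bound
    from the commit is reproduced:

```c
/* Numeric genotype alleles need at most 30 bits, so the inner loop can
 * parse them as plain unsigned ints with a cheap bound check instead
 * of a general strtol(). */
static int parse_gt_allele(const char *s, const char **end)
{
    unsigned int v = 0;
    const char *p = s;
    while (*p >= '0' && *p <= '9') {
        if (v > (1u << 30) / 10)
            return -1;                     /* next digit would pass 30 bits */
        v = v * 10 + (unsigned)(*p++ - '0');
        if (v >= (1u << 30))
            return -1;                     /* over the permitted maximum */
    }
    if (p == s) return -1;                 /* no digits at all */
    if (end) *end = p;
    return (int)v;
}
```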

commit 9c4c64d86a1711a1f67add84a42e8d7a823d33c5
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Wed Feb 19 14:27:37 2020 +0000

    Extract functionality from `vcf_parse` into individual parsing methods.

commit 4a783b84276997187c3a168d161429a23a2e1548
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu Feb 6 12:58:11 2020 +0000

    Use hts_str2int, instead of strtol, in the VCF parser.

commit 08488f7a7910d8fdedf9c0d5ba7a882cdcabedc0
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu May 21 00:18:17 2020 +0100

    Use including-file-relative paths to top-level private headers

    Similarly to the previous commits, includes of these headers from
    files in subdirectories can be subverted if $CPATH/etc or $(CFLAGS)
    activates a directory containing them. This is unlikely, but the
    particularly misguided might perhaps add a previous HTSlib source
    directory to $C_INCLUDE_PATH.

    To avoid that scenario, and for consistency with all the other
    intra-HTSlib #includes, add ../ to #includes of top-level headers
    in subdirectories.

    With this, the only remaining #includes in HTSlib requiring `-I.` (as
    is hard-coded in Makefile) are each *.c file's <config.h> inclusion.
    (Using <> for that is as recommended by the autoconf documentation.)

commit 1ce80dd6e006b6554d1387d013bbfba942eaf824
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jan 7 16:06:15 2020 +0000

    Use including-file-relative paths to cram/*.h headers

    Similarly to the previous commit, includes of cram/*.h from files in
    subdirectories can be subverted if $CPATH/etc or $(CFLAGS) activates
    a directory containing cram/*.h files. This is less likely than the
    previous case of previously-installed public htslib/*.h headers, but
    note that e.g. Debian for a time shipped cram/*.h headers, and the
    particularly misguided might perhaps add a previous HTSlib source
    directory to $C_INCLUDE_PATH.

    Avoid this by removing cram/ from cram/*.h #includes in cram/*.[ch]
    (and adding ../ elsewhere) and converting any instances of <> to "".

commit e569d806ea84b70793d08f290b85e294257d7814
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jan 7 15:32:47 2020 +0000

    Use including-file-relative paths to public htslib/*.h headers

    The full preprocessor search path for #include "..." is typically:

        1) directory of the file containing the #include directive
        2) environment variables $CPATH, $C_INCLUDE_PATH, etc
        3) command line options -I etc
        4) system include directories

    When a top-level *.[ch] file #includes "htslib/foo.h", the nearby
    header file within the source tree is found due to (1). When a file
    in a subdirectory (htslib/*.h, cram/*.[ch], or test/*.c) does the
    same #include, the nearby header file is intended to be found due
    to (3) via the hard-coded `-I.` in the Makefile. (`.` here being the
    compiler process's $PWD, i.e., the top-level HTSlib source directory.)

    However the latter can be subverted if there is an "htslib" directory
    containing foo.h in $CPATH, $C_INCLUDE_PATH, or on the command line
    prior to the hard-coded `-I.`. This will be the case if users have
    made a previous HTSlib installation accessible via $CPATH/etc or by
    adding a -I option to $(CFLAGS) (rather than $(CPPFLAGS)).

    Avoid this by adding ../ to #includes in subdirectories (and always
    using "" rather than <>), so that all includes of htslib/*.h within
    HTSlib are found due to (1). Fixes #347.

commit e3cdb1df911116978c40c7fbfb3fd69ad20c3c1e
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Wed May 20 07:36:36 2020 +0100

    Use the new hts_strprint() instead of dump_char()

commit 3a73df587add7a2bc7600a57a3ab06ff629475f5
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Jan 21 14:07:05 2020 +0000

    Improve SAM parser unrecognised RNAME warnings

    Fix tid vs mtid typo in RNEXT parsing. Fix "urecognized" typo.
    Use __VA_ARGS__ to simplify the _parse_err macros. Use hts_strprint()
    to add the unrecognised reference names to the warning messages.

commit 9d5c6533edbdf8eba7f4bc621695a802099ffa37
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue May 19 22:36:16 2020 +0100

    Add new hts_strprint() utility function

    Prints user-supplied text, optionally with quote marks, truncated
    to a sensible length and with unprintable characters escaped.
    Thus it is usable on potentially malicious input data.

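The escaping behaviour this commit describes can be illustrated with a standalone sketch. Note this is not `hts_strprint()`'s real signature; the function name and parameters below are hypothetical, and only the principle (escape unprintables, bound the output) matches the commit message:

```c
#include <ctype.h>
#include <stdio.h>

/* Standalone sketch: copy printable bytes through, escape
 * unprintable ones as \xNN, and stop before the output budget is
 * exhausted so malicious input cannot overflow the buffer. */
static char *sanitize_str(const char *s, char *out, size_t outlen) {
    size_t o = 0;
    for (; *s && o + 5 < outlen; s++) {
        unsigned char c = (unsigned char) *s;
        if (isprint(c)) out[o++] = (char) c;
        else o += (size_t) snprintf(&out[o], outlen - o, "\\x%02x", c);
    }
    out[o] = '\0';
    return out;
}
```
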
commit e05d8dcd435876c6ec7635fd3819ca7915e302f3
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Mon May 18 15:27:16 2020 +0100

    Fix test data to have "=" for RNEXT when matching RNAME.

commit c0e65a1ec4db767808ffcde1b44305051aea11c2
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Fri May 15 13:41:41 2020 +0100

    Mention TBI chromosome length limit in tabix man page

commit 382867a850b74e7285166a67ee3243560cd974ac
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Fri May 15 17:19:35 2020 +0100

    Fix reference length check. (PR #1068)

    Credit to OSS-Fuzz
    Fixes oss-fuzz 22231

commit fc431ebc252166f47f798e20954c5ccf8be83a8c
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu May 14 09:28:17 2020 +0100

    Take even more care in converting uintN_t to intN_t

    This code converts positive and negative values from uintN_t to intN_t
    separately, but was being undone by having different types on the two
    sides of the colon, and hence their common type being uintN_t -- thus
    requiring an implicit conversion to intN_t after all!

    (int8_t and int16_t are smaller than int so this matters even less
    for these types, but it's good to follow the pattern anyway.)

    Uncovered via -Wsign-compare, and grepping the code for /:.*(int/ etc.
    Followup to dc3303b6da542f74c965b3f3f50a58776f87104b.

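The ternary-with-matching-types pattern the commit describes can be sketched as follows (illustrative only, not the actual htslib code). The point is that both arms of the conditional must already be `int32_t`; otherwise the usual arithmetic conversions make the common type `uint32_t` and an implicit conversion back to `int32_t` happens after all:

```c
#include <stdint.h>

/* Hypothetical helper: convert a uint32_t holding a two's-complement
 * bit pattern to int32_t, handling positive and negative ranges
 * separately so no out-of-range implicit conversion is needed. */
static int32_t u32_to_i32(uint32_t u) {
    return u < 0x80000000U
        ? (int32_t) u                                   /* positive range */
        : (int32_t) (u - 0x80000000U) - 0x7fffffff - 1; /* negative range */
}
```
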
commit ea16e9d736911f757d830b7e718a7281758123ed
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Wed May 13 22:58:19 2020 +0100

    Always check for malformed end < beg ranges

    Check this too in the case that the "change of chromosome" if-statement
    above was also executed.

commit 315d08dca68bdb7fcbd7386108307a4ab48d2246
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Wed May 13 22:43:25 2020 +0100

    Prevent -Wsign-compare warnings in public htslib/*.h headers

    (Also fix nearby htslib/hfile.h whitespace.)

commit b6d48aaa3809767417422595277eac8fbfe56d34
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Tue May 12 12:44:46 2020 +0100

    Tolerate PG lines with invalid PP entries already present in the file,
    but log a warning.

commit cdf5d18768ba29975b70de3194e1fd9ab0c5f062
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Thu May 7 12:57:15 2020 +0100

    Fix a test case.

commit e49b6665ec0d1383a5574459959c7a7a925daeed
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Wed May 6 16:30:04 2020 +0100

    Compute the PG chain lengths and don't count leaves as starting points.

    If there are only leaves (no chain longer than 1), choose the last entry.
    Redo the PG links before adding a new PG line.

commit cb8a05138e3904f7993d71026d3ee463e0cc80e9
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Wed Apr 22 11:11:58 2020 +0100

    Print position of offending VCF/BCF records to help with debugging

    Adds bcf_seqname_safe which returns "(unknown)" when the sequence
    name cannot be determined.

commit 8859b092f23ece485391a251be616feb2422ad70
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Feb 18 11:13:43 2020 +0000

    Remove spurious system headers from public HTSlib headers

    Reduce unnecessary inclusion of in particular <assert.h> and <stdio.h>.

commit 0344174128c7a234c6a2d2a23e2ae9483ee4b434
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Thu Mar 19 11:11:05 2020 +0000

    Use ks_expand() instead of kroundup32+realloc [minor]

    Use newer kstring utility function to avoid explicit low-level munging.
    Report any memory allocation error in bgzf_getline(); ks_getuntil2()
    has no error handling ability, so ignores ks_expand()'s return value.

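The growth-and-error-reporting pattern referred to here can be shown with a toy growable buffer mirroring the kstring idiom (`kstring_t` itself lives in `htslib/kstring.h`; the type and function below are a standalone sketch, not htslib code):

```c
#include <stdlib.h>

/* Toy growable string: l = used length, m = capacity, s = data. */
typedef struct { size_t l, m; char *s; } buf_t;

/* Grow capacity geometrically until l + need + 1 bytes fit, and
 * report allocation failure instead of silently ignoring it. */
static int buf_ensure(buf_t *b, size_t need) {
    if (b->l + need + 1 <= b->m) return 0;  /* already big enough */
    size_t m = b->m ? b->m : 16;
    while (m < b->l + need + 1) m <<= 1;
    char *s = realloc(b->s, m);
    if (!s) return -1;                      /* caller sees the error */
    b->s = s;
    b->m = m;
    return 0;
}
```
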
commit 48c4e9a4e6b7ded874630f4ff84f458c77402266
Author: daviesrob <rmd+git@sanger.ac.uk>
Date:   Wed Apr 29 12:35:04 2020 +0100

    Add tabix --cache option; make bgzf cache work better with hfile_libcurl (#1053)

    * Add tabix bgzf block caching.

    * Preserve buffer contents for deferred seek in hfile_libcurl.

    Commit ad2e24f added delayed seeks to hfile_libcurl so that
    calling hseek() lots of times without an hread() in between would
    not cause unnecessary web requests (the bgzf block cache tends to
    do this).  While it mostly worked, a small amount of data that
    has been read but not consumed yet was being dropped by hseek().
    Where this data was actually needed after all the hseek()ing,
    it would cause hfile_libcurl() to have to rewind slightly resulting
    in a new web request.  Again, this can frequently happen when
    using the bgzf block cache.

    To prevent this, make hfile_libcurl take a copy of the data that
    would have been lost.  If a later read with a deferred seek hits
    the preserved region, it can be returned from the cached data
    without having to start a new web request.  Once this cached
    data has been used up, the original web connection can carry
    on reading from where it left off.

    * Fix libcurl_read() when not all preserved data is used up

    fp->preserved_bytes should not be altered.

    Assertion at line 780 needs to be changed as it's no longer
    true that fp->base.begin and fp->base.end will always point to
    the start of the buffer.

commit 2e36fa6ea4d753fc9aac0c6259078bdc9720a930
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Tue Mar 31 15:29:30 2020 +0100

    Fix memory leak on failure in cram_decode_slice()

    Make the error case decrement the reference count on any used
    refs and free the refs array.

    Credit to OSS-Fuzz
    Fixes oss-fuzz 21393

commit f58a6f3796f94f3b7ed933b24af57c13e0215282
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Tue Mar 31 12:17:03 2020 +0100

    Document special END tag handling in bcf_update_info()

commit 88a588a7b083c1a24051ce8b9ef16c70e78b9f65
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Mar 4 11:18:33 2020 +0000

    Add tests for VCF file invalid END tag handling

commit e784f64950b91cdba8fc8e6f203218957139fd91
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Feb 27 10:00:21 2020 +0000

    Try to handle BCF files with negative rlen fields.

    This can happen where VCF files with invalid END tag values (less
    than the reference position) have been converted to BCF.

    bcf_read1_core() is changed to treat rlen as a signed value (which
    is what the BCF2 specification says).  This means the maximum
    rlen now supported by HTSlib for BCF files is reduced to 2^31.

    If rlen is negative, bcf_record_check() will now print out a
    warning.  It will also attempt to fix up rlen by substituting
    the length of the reference allele.

    A call to bcf_record_check() is added to bcf_readrec() so
    reads using iterators will be protected from bad input data.
    Unfortunately checks involving the header have to be disabled
    for this interface as it isn't available.

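The substitution described above can be sketched with a hypothetical helper (this is not the actual `bcf_record_check()` code; it assumes a 0-based position and a 1-based inclusive END, so the record length is `end - pos`):

```c
#include <string.h>

/* Hypothetical sketch of the rlen fix-up: a non-positive
 * END-derived length is replaced by the length of the REF allele. */
static long fixup_rlen(long pos0, long end1, const char *ref) {
    long rlen = end1 - pos0;               /* normally the record length */
    if (rlen <= 0) rlen = (long) strlen(ref);
    return rlen;
}
```
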
commit 6a14289be5b349afcb99bf730bbcbb38ffa6703a
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Feb 27 09:54:42 2020 +0000

    Add check for invalid VCF END tags to tabix

    Where END is less than or equal to the reference position, a
    warning is printed and the length of the REF allele (which has
    already been measured) will be used instead.

commit d704389e051e5fdc8e69530233b10b33c653176f
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Thu Feb 20 09:46:19 2020 +0000

    Make sure rlen is updated only from non-missing END values.

commit db079d5be536d170d9868127857b66834de3caa3
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Wed Feb 12 13:50:12 2020 +0800

    Prevent negative rlen on erroneous INFO/END in bcf_update_info()

    This is an extension of ea879e19b8 which added checks for reading
    VCF.

    Resolves https://github.com/samtools/bcftools/issues/1154

commit 0bff7c1cb60f3465e86e6e5c826f37a5948d8d04
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Mar 11 11:17:43 2020 +0000

    Fix missing newline in error messages [minor]

commit b34a1ae1a91f7878eabfd288de92a4c7d29dab2e
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Mar 11 11:06:23 2020 +0000

    Make hts_itr_next() return an error on seek failure

    Return code needs to be less than -1, or the caller will
    incorrectly interpret it as end of data, causing it to silently
    finish too early.

    Also adds error logging messages.

commit 90d3002401c8813c0ed64c314e0877a65b68bcd2
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Mar 11 10:25:16 2020 +0000

    Make bgzf_read_block() print more error messages

    To make it easier to find out where BGZF reads have failed.
    It prints the offset of the BGZF block where the read / decompress
    failed, which makes it much easier to locate bad parts of files.
    Unfortunately it can't report the file name as it's not stored
    by bgzf or hfile, but hopefully one of the callers will divulge
    it.

    This introduces a small change of behaviour as block_address is
    now updated after an empty block has been read.  This may change
    the location stored in indexes of files with empty blocks,
    however the index should be more efficient as it will now point
    to the location where data starts instead of the previous
    empty block.

commit 04b71e6b10bdc55cd288a45118a9a563b67add5f
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date:   Tue Mar 17 18:52:37 2020 +0000

    Move the bulk of ks_resize() to a non-inlined helper function

commit ba8d1a11bed8f660594ffd104149df31bb31ba0c
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Tue Mar 17 12:41:00 2020 +0000

    Unify kroundup macros and fix gcc warnings

    Replace several kroundup32 and kroundup_size_t macro definitions
    with a new unified implementation.  The new one will work
    correctly for all integer types (both signed and unsigned) up to
    64 bits.  An attempt to round up that would overflow results
    in the maximum value that can be stored in the given type.

    The implementation avoids conditionals involving large values
    in an attempt to fix some gcc warnings in samtools and bcftools.
    They appeared after commit 29c294e which fixed a bug where kroundup
    would wrap around to zero on large values.  The warnings are
    caused by jump threading optimisations that create code paths
    for the large value cases, which gcc then warns about. See:
    https://developers.redhat.com/blog/2019/03/13/understanding-gcc-warnings-part-2/

    Includes extra tests to ensure the new kroundup works as desired.

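The saturating round-up behaviour can be illustrated with a 64-bit sketch. The real kroundup is a type-generic macro covering signed and unsigned types; this standalone function (for inputs >= 1) only shows the principle:

```c
#include <stdint.h>

/* Round x (>= 1) up to the next power of two.  The shift cascade
 * propagates the highest set bit into every lower bit, so ++x then
 * yields the next power of two.  If that increment wraps to zero,
 * the round-up overflowed and we saturate at UINT64_MAX instead of
 * turning a subsequent realloc into a free. */
static uint64_t roundup64(uint64_t x) {
    --x;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16; x |= x >> 32;
    ++x;
    return x ? x : UINT64_MAX;  /* 0 means the round-up overflowed */
}
```
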
commit 9a1035586382ef3272e81c2a5f9eeba876986115
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Tue Mar 17 18:07:37 2020 +0000

    Prevent a double free in case of failure. (PR #1047)

commit b22e03d805866c04ea02a4b04c9b58a167853418
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Fri Mar 6 17:02:31 2020 +0000

    Add crypt4gh file detection

    Adds format detection for crypt4gh files (see
    https://samtools.github.io/hts-specs/crypt4gh.pdf).

    If hts_hopen() finds a crypt4gh file, it will try to re-open it
    via the hfile_crypt4gh() plug-in.  Detection occurs at the
    same point as htsget, so it does both in a loop to allow for
    crypt4gh served via htsget.

    Actual decryption of the file requires an external plug-in,
    available from https://github.com/samtools/htslib-crypt4gh
    As this may not be present, a dummy handler is included that
    prints an error message indicating that it is needed.  This
    is overridden by the real one when it is available.

commit c6ce820d5ccfeca1c38e6644d72ca6543da01eb1
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Wed Mar 11 15:30:28 2020 +0100

    Acknowledge the existence of the symbolic <NON_REF> allele produced by GATK (PR #1045)

commit 29c294e6842a56ba3b9a24a24a5f6de1575b0961
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Tue Mar 10 10:31:45 2020 +0000

    Fixed a raft of integer overflows in VCF land.

    - Cast data into size_t before multiplication to avoid wrapping around
      int32.

    - Added checks for return values to align_mem and ks_resize

    - Simplified the byzantine calculation in align_mem

    - Fixed kroundup_size_t and kroundup32 so they cannot wrap around to
      zero and turn the realloc into a free.

    - Also added a check for ~2Gb on total length of FORMAT fields, which
      nullifies the need for some of the above.  We may wish to remove
      this at some point if we want to cope with truly mammoth
      multi-sample data, and the above fixes means doing so will not
      expose bugs.

      However for now this check adds protection against malformed data
      creating excessive memory usage and CPU requirements.

    Credit to OSS-Fuzz
    Fixes oss-fuzz 21139
    Fixes oss-fuzz 20881

commit 8ff8bb328f5d4c2981933ff7847374ddc747d572
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Mon Mar 9 11:45:22 2020 +0000

    Improved CRAM error messages in cram_decode_slice (PR #1042)

    To help diagnose #1038 (Embedded reference issues) by reporting
    which ref_id and region is affected.

commit 3b1ff68575c9db582a122048a7e95aa2b1a47b11
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Thu Mar 5 21:39:48 2020 +0000

    Add tabix --verbosity option

commit 31b23540491e114e7cbfde790f5129c26c18eef4
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Wed Mar 4 14:12:20 2020 +0000

    Add more error checking to the tabix application

    Check for error returns from bcf_itr_next(), hts_getline()
    and tbx_itr_next() so failures can be reported properly.

    Add checks in other places that could fail, mostly memory
    allocations and writes to the output file.

    Note that bcf_itr_querys() and tbx_itr_querys() currently return
    NULL for both general failures and those caused by the requested
    region not being present.  As tabix silently skips missing regions,
    it is currently not possible to detect errors in these functions.

commit 9979fac560319507cefe5d5682617304d609b589
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date:   Mon Mar 2 08:16:20 2020 -0800

    Free allocated SN strings when encountering an error. (PR #1034)

    Upon successful completion of sam_hdr_create, ownership of the
    newly created SN strings is given to sam_hdr_t::target_name, which
    takes care of freeing them at the end. But if an error occurs, the
    handover might fail and the allocated strings need to be cleaned up
    by iterating through the hash table instead.

    Credit to OSS-Fuzz
    Fixes oss-fuzz 20888

commit 6149ea6f779939451b56e6f63b28409baead2990
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date:   Mon Feb 24 09:41:15 2020 +0000

    Make multi-region iterator better at skipping unneeded data

    Revert change to hts_itr_t::off in bfc9f0d so the `max` field
    can be used to store information about which interval goes
    with each entry.

    Don't merge hts_itr_t::off entries for different intervals.

    Make hts_itr_multi_next() skip offsets that will return no
    data as they are for intervals that have already been completed.

    Deal with possible failure of reg2intervals() in
    hts_itr_multi_bam().

commit 27b670444a372809586eabd921d1f568c8cc6c30
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Fri Feb 21 11:22:11 2020 +0000

    Add BAI linear index usage for multi-region iterator

commit 45715e421e1ce9d15d8880a24454ec7dc42bf4bc
Author: James Bonfield <jkb@sanger.ac.uk>
Date:   Thu Feb 20 14:21:12 2020 +0000

    Add use of BAI linear index to BGZF querys.

    Previously it used the bin "loff" field for BAI, to match the CSI data
    structure (which no longer has a linear index).  Unfortunately when
    coupled with the compress_binning() function this causes BAI to seek
    to data earlier than is necessary when bins have been compressed.
    This generally only affects low coverage data sets and has minimal
    impact on deep data as the bins are too large to remove.

    For CSI we have no option of using the linear index, so this does not
    change behaviour.  Sadly this means CSI indices are less efficient to
    use than BAI for many small queries on shallow data sets.

    Some benchmarks on a shallow 1.9x coverage data set, using 1000
    randomly picked regions on chr1 each between 1 bp and 10k bp in size.

    Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
    -----------------------------------------------------
    htslib-BAI  develop  1.488   164.1        2.43
    htsjdk-BAI  develop  0.502    53.6        8.27
    htslib-BAI  this     0.498    53.5        2.43
    htsjdk-BAI  this     0.530    53.3        8.27
    htslib-CSI  this     1.532   163.3        0.49

    For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the
    default of 10000, demonstrating the usefulness of changing this if you
    need fine grained random access.

    CRAM-s10k   this     22.12   602.1        0.11
    CRAM-s1k    this     6.062    92.7        0.93
    CRAM-s300   this     5.115    42.1        2.98

    Htslib BAI index, samtools 1.10
    -------------------------------

    $ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R

    File: 9827_2#49.bam
        Num. open           1
        Num. close          1
        Num. seeks          969
        Num. read           6040
        Bytes read          164144977

    File: 9827_2#49.bam.bai
        Num. open           1
        Num. close          1
        Num. read           40
        Bytes read          2431088

    real    0m1.488s
    user    0m1.448s
    sys     0m0.036s

    Htslib BAI index, this commit
    -----------------------------

    io_trace -S -r 9827_2 -- …