Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent negative rlen on erroneous INFO/END when reading BCF #1021

Merged
merged 6 commits into from
Apr 3, 2020

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Feb 12, 2020

This is an extension of ea879e1 which added checks for reading
VCF.

Resolves samtools/bcftools#1154

@jmarshall
Copy link
Member

The same comment applies to this as to the previous ea879e1 commit (see #1006 (comment)):

HTSlib 1.10 already has code to detect this problem (via PR #917), though due to other factors it is probably triggered only when indexing .vcf.gz not .bcf and there is room for improvement.

Your patch is probably an improvement, though it now produces BCFs with rlen fields that don't match the original VCF POS and INFO/END fields. Shouldn't it have been raised as a pull request so it could be reviewed?

Happily this now is a PR, so there is an opportunity to consider the rlen values that are written by both commits, in both vcf_parse() (the previous commit) and now the same code change applied to bcf_update_info().

In particular, line->rlen is not updated at all in the emits-a-warning case. This didn't matter too much for vcf_parse because it always calls bcf_clear1 first so rlen is left as 0. It does matter for bcf_update_info which will leave the previous value of rlen unchanged.

@pd3
Copy link
Member Author

pd3 commented Feb 18, 2020

From what I can see the code in vcf_parse does the correct thing, rlen is set by https://github.com/samtools/htslib/blob/develop/vcf.c#L2589.

Arguably, also the minimal change to bcf_update_info in this pull request does exactly and only what it is supposed to do: make sure that rlen is modified only when INFO/END is present and when it makes sense. Otherwise rlen should be set by _bcf1_sync_alleles.

Maybe there are other situations I am unable to forsee, in that case it would be helpful to be more specific when raising objections, ideally with a test case highlighting the problem

@daviesrob
Copy link
Member

Doesn't _bcf1_sync_alleles() have the same problem, as it also uses the END tag without checking that it's greater than POS?

Should we actually be keeping these bad END tags? (Or maybe fixing them?) Part of the problem in samtools/bcftools#1154 was that the bad value was retained and then came back again via bcf_update_info(). Presumably there are other places where a similar thing could happen - would it be easier to prevent them all by fixing the END tag on reading?

As for rlen, vcf_parse() leaves it as the length of REF, which is probably the best that could be hoped for. bcf_read1_core() uses the value in the file, which could be a problem if the file was made by a version of htslib before ea879e1. Should we also check for negative rlen when reading BCF, and correct it to length of REF if necessary?

@pd3
Copy link
Member Author

pd3 commented Feb 19, 2020

Ah, that's right, _bcf1_sync_alleles should be checked as well.

As for the fixing the END tag - possibly. I am not sure what the philosophy of the library should be in terms of sanitizing: recover from errors internally and leave the data as is or fix the data when possible as well? We don't know the correct value of END so the only sensible thing would be to aggressively set to missing. This may not be a bad thing as it will force the users to fix the files. However, this change will not be transparent, one would expect bcftools view to display the data as is, including potential non-critical errors.

I am happy to be convinced either way though.

@daviesrob
Copy link
Member

None of this code checks for missing values on END either, although the new checks will reject them with a somewhat misleading error message. We would need to either fix the value, remove the tag completely, or ensure every place that uses the value checks that it's valid (which isn't too many places as it turns out).

@pd3
Copy link
Member Author

pd3 commented Feb 20, 2020

Added a check in _bcf1_sync_alleles and for the missing INFO/END value.

@pd3 pd3 requested a review from daviesrob February 26, 2020 15:04
@daviesrob
Copy link
Member

A few more fixes are needed to get these broken files to work properly, especially if you try to index them. I now have solutions for tabix and reading BCF files made with versions of HTSlib prior to this fix. I'll push them up here tomorrow, and maybe rebase if you don't mind a forced push.

I've also noticed that bcf_update_info() can set rlen to zero here. Is that strictly correct? Currently I'm treating zero rlen as a manifestation of this problem, but I could relax it to just negative values.

@pd3
Copy link
Member Author

pd3 commented Feb 27, 2020

The line is correct. It removes an existing INFO/END annotation and assigns rlen=0 in case there is no allele set.

@jmarshall
Copy link
Member

This is similar to what I was getting at in my previous comment. For #917 we considered 0 to be meaningless and invalid but (barely) acceptable.

When there is no INFO/END field, rlen is set to the length of the REF field. It seems to me that this would be the value to set it to for an invalid INFO/END value too. Or a case could be made for 1 — but not for 0.

@pd3
Copy link
Member Author

pd3 commented Feb 27, 2020

Note this part of the code handles the case of a partially constructed BCF record. If there is no REF allele set, it is correct to set rlen to 0.

@jmarshall
Copy link
Member

As I’m sure you agree is obvious, my comment applies to the patches to vcf_parse() (in the previous commit) and bcf_update_info().

@pd3
Copy link
Member Author

pd3 commented Feb 27, 2020

No, not obvious. I have no idea what you are talking about. My comment was about the line @daviesrob highlighted in bcf_update_info().

@daviesrob
Copy link
Member

I've rebased and pushed up my extra commits.

For the value of rlen in the event of an invalid END tag:

  • In vcf_parse() it will be set to the length of REF here.

  • tbx_parse1() will use the length of REF, as set here.

  • For BCF files, bcf_record_check() will try to fix a negative rlen using the length of the REF allele collected here. Both bcf_read() and bcf_readrec() now call bcf_record_check().

  • In bcf_update_info(), the existing rlen will be left unchanged. Hopefully after the changes to VCF and BCF reading, the initial value will make some sort of sense.

This all works fine if you actually have a REF. If n_alleles == 0 (which is strictly not allowed as the specification says REF is mandatory) or the reference has zero length (which does not seem to be disallowed, but is likely a bad idea) then rlen will end up with a value of zero.

For bcf_update_info(), it's arguable that it should really treat bad END values as an error instead of a warning. If a program is setting END, it really ought to be told that the value it's trying to use is not sensible as it possibly indicates a bug. If such a check had been present, the 1000genomes team would have spotted the problem with the files mentioned in samtools/bcftools#1154. Of course this would push some complication up to the caller (like split_info_numeric() in bcftools norm, which would need to add special handling for END tags) so may be something to do later once we've looked into the implications.

@daviesrob
Copy link
Member

Any other comments before this gets merged?

@jmarshall
Copy link
Member

Are there any test cases exercising all this? Perhaps some should be added to e.g. test-vcf-api.c — namely, to some C code so that the resulting rlen values can be easily checked.

@daviesrob
Copy link
Member

I'm having a go at adding some simple tests, and it has caused an issue to surface. If you read a file with a valid END tag, and then try to set it to an incorrect value with bcf_update_info(), the END tag will be replaced but rlen will be unchanged. If the record is then saved as VCF and read back in again, rlen will be set to the length of the REF allele, which is likely to be different to the value it had before. (Presumably this does not happen for BCF, which stores rlen directly).

This really goes back to the comment that bcf_update_info() should tell its caller that it did not work as expected. Otherwise, should it set rlen to the length of the REF allele if it's given a bad END tag, or should it simply refuse to change END (which might still be wrong if someone has already messed with POS)?

@jmarshall
Copy link
Member

Yup, that's what I was getting at in #1021 (comment) and my other comments 🤣

For the value of rlen in the event of an invalid END tag:

  • In bcf_update_info(), the existing rlen will be left unchanged. Hopefully after the changes to VCF and BCF reading, the initial value will make some sort of sense.

I didn't really understand your choice here. It seems to me that the invariant would be best maintained by resetting rlen to length(REF) in the case of bcf_update_info() removing INFO/END or setting it to an invalid value. And you're quite right that bcf_update_info() ought to be able to report the latter problem to its caller in some way.

@pd3
Copy link
Member Author

pd3 commented Mar 6, 2020

If bcf_update_info should be made this smart, also the order in which the BCF record is built would have to be enforced. Before knowing POS, the END cannot be checked. In my view, setting wrong END is a programming error, I find more sensible to enforce error checking at the input/output stage, not in bcf_update_info.

@daviesrob
Copy link
Member

You would need to set POS first, or fix-up rlen later, even without any smarts as bcf_update_info() calculates rlen based on the existing value of POS.

The basic problem is that in BCF (and therefore HTSlib's internal format) the data is denormalised between POS, rlen, the END tag, and to some extent REF. If one of them is updated then the rest need to be too, so what's really needed is a function that takes POS, rlen (and maybe REF), and sets all of them at the same time to give a consistent result.

Meanwhile, for bcf_update_into(), maybe we should drop the update to make it catch invalid END tags, let it set a negative rlen, and treat negative rlen as an error when the data is written? Note that this would cause some tools to break on the invalid files, for example bcftools norm which creates END tags as part of a general loop around INFO tags and doesn't take account of the END tag's special rlen-setting properties.

@daviesrob
Copy link
Member

I've pushed up my latest changes to this:

  • bcf_update_info() now sets rlen to the length of the REF allele when an attempt is made to set an invalid END tag.

  • I've added documentation on the rlen-setting side-effects that happen in bcf_update_info() when it's used to change END.

pd3 and others added 6 commits April 3, 2020 11:52
This is an extension of ea879e1 which added checks for reading
VCF.

Resolves samtools/bcftools#1154
Where END is less than or equal to the reference position, a
warning is printed and the length of the REF allele (which has
already been measured) will be used instead.
This can happen where VCF files with invalid END tag values (less
than the reference position) have been converted to BCF.

bcf_read1_core() is changed to treat rlen as a signed value (which
is what the BCF2 specification says).  This means the maximum
rlen now supported by HTSlib for BCF files is reduced to 2^31.

If rlen is negative, bcf_record_check() will now print out a
warning.  It will also attempt to fix up rlen by substituting
the length of the reference allele.

A call to bcf_record_check() is added to bcf_readrec() so
reads using iterators will be protected from bad input data.
Unfortunately checks involving the header have to be disabled
for this interface as it isn't available.
@daviesrob
Copy link
Member

Rebased and squashed a bit...

@daviesrob daviesrob merged commit f58a6f3 into samtools:develop Apr 3, 2020
@pd3 pd3 deleted the bt.1154 branch April 7, 2020 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcftools 1.10 fails to index files with wrong INFO/END values but bcftools 1.9 worked fine
3 participants