Prevent negative rlen on erroneous INFO/END when reading BCF #1021
The same comment applies to this as to the previous ea879e1 commit (see #1006 (comment)):
Happily this now is a PR, so there is an opportunity to consider the In particular, |
From what I can see the code in Arguably, also the minimal change to Maybe there are other situations I am unable to foresee; in that case it would be helpful to be more specific when raising objections, ideally with a test case highlighting the problem |
Doesn't Should we actually be keeping these bad As for |
Ah, that's right, As for fixing the END tag: possibly. I am not sure what the philosophy of the library should be in terms of sanitizing: recover from errors internally and leave the data as is, or fix the data when possible as well? We don't know the correct value of END, so the only sensible thing would be to aggressively set it to missing. This may not be a bad thing as it will force the users to fix the files. However, this change will not be transparent, one would expect I am happy to be convinced either way though. |
None of this code checks for missing values on END either, although the new checks will reject them with a somewhat misleading error message. We would need to either fix the value, remove the tag completely, or ensure every place that uses the value checks that it's valid (which isn't too many places as it turns out). |
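As a rough illustration of the third option (checking validity wherever END is used), the sketch below separates a missing END from one that is merely not after POS, so each case can get its own message. `classify_end` and the `end_status` codes are hypothetical names for this example and are not part of HTSlib; the sentinel is assumed to match HTSlib's `bcf_int32_missing` (INT32_MIN), POS is assumed to be 0-based and END 1-based.

```c
#include <stdint.h>

/* Assumed to mirror HTSlib's bcf_int32_missing sentinel (INT32_MIN). */
#define END_MISSING INT32_MIN

/* Hypothetical result codes for this sketch only. */
enum end_status {
    END_OK = 0,
    END_IS_MISSING = 1,
    END_NOT_AFTER_POS = 2
};

/* Classify an INFO/END value against a 0-based POS.
 * The missing check comes first: INT32_MIN would otherwise also
 * satisfy "end <= pos" and be reported with the wrong message. */
static enum end_status classify_end(int32_t end1, int64_t pos0)
{
    if (end1 == END_MISSING)
        return END_IS_MISSING;
    if ((int64_t)end1 <= pos0)
        return END_NOT_AFTER_POS;
    return END_OK;
}
```

A caller could then skip the tag, warn, or fall back to the REF length depending on which status comes back.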
Added a check in |
A few more fixes are needed to get these broken files to work properly, especially if you try to index them. I now have solutions for tabix and reading BCF files made with versions of HTSlib prior to this fix. I'll push them up here tomorrow, and maybe rebase if you don't mind a forced push. I've also noticed that |
The line is correct. It removes an existing INFO/END annotation and assigns rlen=0 in case there is no allele set. |
This is similar to what I was getting at in my previous comment. For #917 we considered 0 to be meaningless and invalid but (barely) acceptable. When there is no INFO/END field, |
Note this part of the code handles the case of a partially constructed BCF record. If there is no REF allele set, it is correct to set rlen to 0. |
As I’m sure you agree is obvious, my comment applies to the patches to |
No, not obvious. I have no idea what you are talking about. My comment was about the line @daviesrob highlighted in |
I've rebased and pushed up my extra commits. For the value of
This all works fine if you actually have a REF. If For |
Any other comments before this gets merged? |
Are there any test cases exercising all this? Perhaps some should be added to e.g. test-vcf-api.c — namely, to some C code so that the resulting |
I'm having a go at adding some simple tests, and it has caused an issue to surface. If you read a file with a valid This really goes back to the comment that |
Yup, that's what I was getting at in #1021 (comment) and my other comments 🤣
I didn't really understand your choice here. It seems to me that the invariant would be best maintained by resetting |
If |
You would need to set The basic problem is that in BCF (and therefore HTSlib's internal format) the data is denormalised between Meanwhile, for |
I've pushed up my latest changes to this:
|
This is an extension of ea879e1 which added checks for reading VCF. Resolves samtools/bcftools#1154
Where END is less than or equal to the reference position, a warning is printed and the length of the REF allele (which has already been measured) will be used instead.
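The fallback described here can be sketched in C roughly as follows. This is a minimal illustration, not the code in the patch: `rlen_from_end` is a hypothetical helper, POS is assumed to be stored 0-based and END to be 1-based inclusive, and the warning text is invented for the example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not an HTSlib function).
 * pos0:    0-based position of the record (assumption)
 * end1:    1-based inclusive INFO/END value (assumption)
 * ref_len: length of the REF allele, already measured
 * Returns the reference length (rlen) to store for the record. */
static int64_t rlen_from_end(int64_t pos0, int64_t end1, int64_t ref_len)
{
    if (end1 <= pos0) {
        /* END is unusable: warn and fall back to the REF allele length. */
        fprintf(stderr,
                "[W] INFO/END=%" PRId64 " is not after POS=%" PRId64
                "; using REF length %" PRId64 "\n",
                end1, pos0 + 1, ref_len);
        return ref_len;
    }
    return end1 - pos0;
}
```

With these conventions, a record at 0-based position 99 with END=150 gets rlen 51, while END=50 at the same position falls back to the REF length.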
This can happen where VCF files with invalid END tag values (less than the reference position) have been converted to BCF.

bcf_read1_core() is changed to treat rlen as a signed value (which is what the BCF2 specification says). This means the maximum rlen now supported by HTSlib for BCF files is reduced to 2^31.

If rlen is negative, bcf_record_check() will now print out a warning. It will also attempt to fix up rlen by substituting the length of the reference allele.

A call to bcf_record_check() is added to bcf_readrec() so reads using iterators will be protected from bad input data. Unfortunately checks involving the header have to be disabled for this interface as it isn't available.
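To make the signed interpretation and the fix-up concrete, here is a small sketch under stated assumptions. `u32_bits_to_i32` and `checked_rlen` are hypothetical names for this example; the real logic lives in bcf_read1_core() and bcf_record_check() and may differ in detail.

```c
#include <stdint.h>
#include <stdio.h>

/* Reinterpret a raw 32-bit word (as read from the BCF record) as a
 * signed two's-complement value without relying on
 * implementation-defined unsigned-to-signed conversion. */
static int32_t u32_bits_to_i32(uint32_t v)
{
    return v < 0x80000000U ? (int32_t)v
                           : -((int32_t)(0xffffffffU - v)) - 1;
}

/* Hypothetical check: if the stored rlen is negative, warn and
 * substitute the length of the REF allele. */
static int32_t checked_rlen(uint32_t raw_rlen, int32_t ref_len)
{
    int32_t rlen = u32_bits_to_i32(raw_rlen);
    if (rlen < 0) {
        fprintf(stderr, "[W] negative rlen %d; using REF length %d\n",
                (int)rlen, (int)ref_len);
        return ref_len;
    }
    return rlen;
}
```

Read as unsigned, a raw value of 0xfffffff6 would look like a very long record; read as signed it is -10, which the check can detect and replace.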
Rebased and squashed a bit... |