-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore colons in HLA contig names #705
Conversation
Thanks for the PR, but I'm not sure I like disallowing ranges in HLA contigs. Is it not possible to try parsing the range as normal and if the resulting reference doesn't exist, then try again with no range to see if the reference then does exist? If so this permits HLA contigs to work, with and without a range being used. |
For now I made this as a 'quick fix' because I cannot proceed with my project without it... (Note that I'm not actually familiar with the codebase, please correct me if necessary!)
Another problem is that there is no guarantee that it is always possible to disambiguate. It's truly unfortunate that the HLA contigs got these impossible names... Maybe you could consider changing the syntax for range specifiers slightly, as a compromise. |
Some background on this: 2014: Issue raised as a toolchain problem: samtools/samtools#149 So the problem is two fold; specs (VCF, ongoing) and tools (command line argument parsing). I'll think some more on this... |
So there was one nugget in all those discussions I didn't spot before, the offending HLA names have leading zeros after the colon. This isn't likely to occur when a user specifies a range, so as a minimal hack we could perhaps disallow leading zeros and treat it as a continuation of the reference. Edit: scratch that - it helps sometimes, but we have Ideally though I'd like a callback function to hts_parse_reg (hts_parse_reg2?) to validate if a reference name exists, so it can use this in deciding how to parse and "do the obvious thing". Finally if all hell breaks loose, then we could manually add quoting rules, which sadly has to use =name= as quoting due to quotes being legal. Gah! However the intelligent "it's obvious" approach is how it should work most of the time. |
Wow, I had not realized the problem is that deep, even extending to the VCF specs... I definitely feel uneasy that regions with ':' in the name are de facto unparseable without an associated lookup table of reference names. But whatever the solution will be, I will of course be happy if it 'just works' without unpleasant surprises! I would stress that some sort of solution is urgently needed. In my view, even an unelegant special case rule that only handles the most frequent use cases of ':' being in the name (i.e., HLA?) would be preferable to leaving the problem unattended. As it stands right now, unwitting users like myself are falling into this trap. |
As a workaround, any time you're querying an entire chromosome you can just append a colon. This will then make the current code work, even with HLA records. The problem comes when you need a command line that accepts both entire names and regions. |
Regarding needing a lookup table of reference names, this is essentially your bam_hdr_t struct or the analogous thing in vcf world. In terms of number of places in the code, yes hts_parse_reg is called from several places, but the actual bulk of calls will be via hts_itr_querys I think. This does already take a format-agnostic way (via callback function) of validating, so it's most of the way there already. Eg something along these lines (it needs reformatting and shuffling - this is just the minimal implementation):
Note that getid is a function pointer passed into hts_itr_querys (eg bam_name2id when called from sam_itr_querys). This is already enough to get samtools view and mpileup working. We could modify it this way, or we could have a variant of hts_parse_reg which accepts gettid (and client data for hdr) as an argument. I was thinking of the latter and then adding hts_parse_reg_bam (and hts_parse_reg_vcf) specialisms which do the right thing and don't need bam_name2id function name being passed around. |
As you can now see, I added a draft prototype of what I'm thinking in #708. Thanks though for the impetus to sort this out - your PR was most welcome even though I didn't agree with the methods. We'll discuss this next week when more of the core team are around. |
Thank you very much for addressing the issue! And it's great to have an elegant solution too. And thank you as well for pointing me toward the workaround of appending a colon to the name. In the short time that's a life saver for sure. Then, this PR can be closed, I think. |
Elsewhere you mention:
I noticed this behaviour too and it rather surprised me. Failure to find a reference is considered only a warning and it continues on with exit code zero. I do not know why it was coded this way, and I was very tempted to fix it. However I wonder if there are legitimate uses? Eg with a mix of split chromosome bams (eg foo1.bam to foo22.bam and foo[XY].bams) and merged bams we may end up doing something like "samtools mpileup -r 22 *.bam" and just let the ones containing no chr22 be skipped. This works fine in fact, provided all the non-chr22 bams have the full headers containing SQ records for 22 also (which in any sane pipeline they would). What do people think to the idea of making this a hard failure, perhaps with a command line option (although very tricky given the number of commands involved) to revert to the old behaviour? |
This is a fix for a situation where
hts_parse_reg()
is unable to recognize specified HLA regions, because the colons are interpreted as specifier for a range. This leads to errors and unexpected behavior:The fix I am proposing here disallows ':' for specifying a range, in the special case where the region name starts with 'HLA'. Of course that also means that specifying a range for contigs named HLA* is no longer possible at all. However, it is arguably more important that the HLA regions at least get recognized correctly.