Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore colons in HLA contig names #705

Closed
wants to merge 1 commit into from

Conversation

Falcury
Copy link

@Falcury Falcury commented May 29, 2018

This is a fix for a situation where hts_parse_reg() is unable to recognize specified HLA regions, because the colons are interpreted as specifier for a range. This leads to errors and unexpected behavior:

[main_samview] region "HLA-A*01:01:01:01" specifies an unknown reference name. Continue anyway.

The fix I am proposing here disallows ':' for specifying a range, in the special case where the region name starts with 'HLA'. Of course that also means that specifying a range for contigs named HLA* is no longer possible at all. However, it is arguably more important that the HLA regions at least get recognized correctly.

@jkbonfield
Copy link
Contributor

Thanks for the PR, but I'm not sure I like disallowing ranges in HLA contigs.

Is it not possible to try parsing the range as normal and if the resulting reference doesn't exist, then try again with no range to see if the reference then does exist? If so this permits HLA contigs to work, with and without a range being used.

@Falcury
Copy link
Author

Falcury commented May 30, 2018

For now I made this as a 'quick fix' because I cannot proceed with my project without it...
Yeah, I agree something like that would be better (although maybe I would then recommend checking for existence first, and only after that try to parse as a range). However, then there other issues to sort out.

(Note that I'm not actually familiar with the codebase, please correct me if necessary!)

hts_parse_reg() cannot know whether the reference exists or not, so the responsibility of checking that would fall onto the caller. But, there are a bunch of places where hts_parse_reg() gets called (e.g., I'm counting 6 calls in samtools alone). Basically, we would have to change the code in a multitude of places, which adds more possible points of failure. Basically, any caller of hts_parse_reg() that doesn't know about this problem might end up shooting themselves in the foot.

Another problem is that there is no guarantee that it is always possible to disambiguate.
A name like HLA-A*01:01:01:01 would still reduce to HLA-A*01:01:01 (who knows, that one might exist too!) with a range parsed as 1 to INT32_MAX.

It's truly unfortunate that the HLA contigs got these impossible names...

Maybe you could consider changing the syntax for range specifiers slightly, as a compromise.
For example, you could make the syntax for a range specifier invalid if there is no hyphen in it. Then, to specify a range from a certain coordinate until the end of the contig you would write something like chr2:1000000- instead of chr2:1000000. None of the HLA contig names have a hyphen after the last colon, so that trick should perfectly disambiguate without needing to check whether the contig actually exists.
To keep people's scripts from breaking, you could still allow gracefully falling back to the original syntax, with a special case for HLA contigs where the stricter syntax is always enforced by default.

@jkbonfield
Copy link
Contributor

Some background on this:

2014: Issue raised as a toolchain problem: samtools/samtools#149
2016: Issue spotted in the wild with GRCh38: samtools/hts-specs#124
2017: Issue with VCF disallowing colons: samtools/hts-specs#258, samtools/hts-specs#291

So the problem is two fold; specs (VCF, ongoing) and tools (command line argument parsing).

I'll think some more on this...

@jkbonfield
Copy link
Contributor

jkbonfield commented May 30, 2018

So there was one nugget in all those discussions I didn't spot before, the offending HLA names have leading zeros after the colon. This isn't likely to occur when a user specifies a range, so as a minimal hack we could perhaps disallow leading zeros and treat it as a continuation of the reference. Edit: scratch that - it helps sometimes, but we have HLA-DRB1*12:17 too in there.

Ideally though I'd like a callback function to hts_parse_reg (hts_parse_reg2?) to validate if a reference name exists, so it can use this in deciding how to parse and "do the obvious thing". Finally if all hell breaks loose, then we could manually add quoting rules, which sadly has to use =name= as quoting due to quotes being legal. Gah! However the intelligent "it's obvious" approach is how it should work most of the time.

@Falcury
Copy link
Author

Falcury commented May 30, 2018

Wow, I had not realized the problem is that deep, even extending to the VCF specs...

I definitely feel uneasy that regions with ':' in the name are de facto unparseable without an associated lookup table of reference names. But whatever the solution will be, I will of course be happy if it 'just works' without unpleasant surprises!

I would stress that some sort of solution is urgently needed. In my view, even an unelegant special case rule that only handles the most frequent use cases of ':' being in the name (i.e., HLA?) would be preferable to leaving the problem unattended. As it stands right now, unwitting users like myself are falling into this trap.

@jkbonfield
Copy link
Contributor

As a workaround, any time you're querying an entire chromosome you can just append a colon. This will then make the current code work, even with HLA records.

The problem comes when you need a command line that accepts both entire names and regions.

@jkbonfield
Copy link
Contributor

jkbonfield commented May 30, 2018

Regarding needing a lookup table of reference names, this is essentially your bam_hdr_t struct or the analogous thing in vcf world. In terms of number of places in the code, yes hts_parse_reg is called from several places, but the actual bulk of calls will be via hts_itr_querys I think. This does already take a format-agnostic way (via callback function) of validating, so it's most of the way there already.

Eg something along these lines (it needs reformatting and shuffling - this is just the minimal implementation):

diff --git a/hts.c b/hts.c
index 4bd285f..72c5627 100644
--- a/hts.c
+++ b/hts.c
@@ -2537,6 +2537,11 @@ hts_itr_t *hts_itr_querys(const hts_idx_t *idx, const char *reg, h
ts_name2id_f g
     else if (strcmp(reg, "*") == 0)
         return itr_query(idx, HTS_IDX_NOCOOR, 0, 0, readrec);
 
+    if ((tid = getid(hdr, reg)) >= 0) {
+        beg = 0; end = INT_MAX;
+        return itr_query(idx, tid, beg, end, readrec);
+    }
+
     q = hts_parse_reg(reg, &beg, &end);
     if (q) {
         char tmp_a[1024], *tmp = tmp_a;

Note that getid is a function pointer passed into hts_itr_querys (eg bam_name2id when called from sam_itr_querys).

This is already enough to get samtools view and mpileup working. We could modify it this way, or we could have a variant of hts_parse_reg which accepts gettid (and client data for hdr) as an argument. I was thinking of the latter and then adding hts_parse_reg_bam (and hts_parse_reg_vcf) specialisms which do the right thing and don't need bam_name2id function name being passed around.

@jkbonfield
Copy link
Contributor

As you can now see, I added a draft prototype of what I'm thinking in #708. Thanks though for the impetus to sort this out - your PR was most welcome even though I didn't agree with the methods.

We'll discuss this next week when more of the core team are around.

@Falcury
Copy link
Author

Falcury commented May 31, 2018

Thank you very much for addressing the issue! And it's great to have an elegant solution too.
I'm glad I could be of some help.

And thank you as well for pointing me toward the workaround of appending a colon to the name. In the short time that's a life saver for sure.

Then, this PR can be closed, I think.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 4, 2018

Elsewhere you mention:

In my project, samtools was (almost) silently failing and producing incomplete results. What saved me was the fact that I dug deeper and found the cause of the problem. Others may not be so lucky.

I noticed this behaviour too and it rather surprised me. Failure to find a reference is considered only a warning and it continues on with exit code zero. I do not know why it was coded this way, and I was very tempted to fix it. However I wonder if there are legitimate uses? Eg with a mix of split chromosome bams (eg foo1.bam to foo22.bam and foo[XY].bams) and merged bams we may end up doing something like "samtools mpileup -r 22 *.bam" and just let the ones containing no chr22 be skipped. This works fine in fact, provided all the non-chr22 bams have the full headers containing SQ records for 22 also (which in any sane pipeline they would).

What do people think to the idea of making this a hard failure, perhaps with a command line option (although very tricky given the number of commands involved) to revert to the old behaviour?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants