Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around colons in reference names (GRCh38 HLA). #708

Merged
merged 9 commits into from
May 31, 2019

Conversation

jkbonfield
Copy link
Contributor

This resolves #705 which is HLA specific with a more general workaround for badly named references.

Note this has some open questions that need resolving, either by deleting the FIXME in the code or actually implementing it. As such this PR is a work in progress for discussion.

Specifically:

  • Do we want to migrate our support for "*" (unmapped) and "." (all records) to this code too? If so tid = -1 is valid and we need to adjust the comment regarding return values.
  • I am unsure of cram_name2id, do we need to provide support for this too? It seems to work OK, so I assume that is only needed by the multi-region iterator.
  • I've no idea how this integrates with BCF world. I think bcftools has a totally different name parser as it accepts foo:100 to be base 100 only and foo:100- is needed for 100 onwards (better IMO). I recall work was done to unify tihs, but I can find no evidence of it in the code, but I haven't read through all of [DISCUSS] Add HTS_PARSE_* parsing flags #260 yet.

@jkbonfield jkbonfield changed the title Work around colons in reference names (GRCh38 HLA). WIP: Work around colons in reference names (GRCh38 HLA). May 31, 2018
hts.c Outdated
// FIXME: do we need to permit tid=-1 for reference "*" to indicate unmapped
// reads, and strictly have NULL as failure test?
int tid_, s_len = strlen(s); // int is sufficient given beg/end types
if (!tid) tid = &tid_; // simplifies code below
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check all the pointers at the beginning, especially as the function would now be part of the interface.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 1, 2018

Based on comments from @jmarshall in samtools/hts-specs#124 I've added brace quoting as an experiment, just to see how it feels code wise as well as usage wise. Seems workable IMO.

If this is accepted as a general way forward (irrespective of the usual tweaks in review) then I'll look at a samtools patch too. Eg bedidx.c calls hts_parse_reg directly and could benefit from being given the bam header.

We also need to figure out how this fits with bcftools, but that's a bigger task. Maybe before we finalise this we need to determine whether the new API needs an additional argument about how parsing is done (whether chr:1 is 1-1 or 1-end) so we can share the common code. Even if not done now, making provision for it in the API is key.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 5, 2018

Further points for discussion:

  • 64-bit reference positions - ABI/API placeholder #709 needs a new API to return 64-bit offsets. We may as well fold that prototype into this for hts_parse_ref2 (I called it hts_parse_ref64 there, but maybe hts_parse_region from [DISCUSS] Add HTS_PARSE_* parsing flags #260).

  • [DISCUSS] Add HTS_PARSE_* parsing flags #260 adds a flags argument describing alternative parsing via HTS_PARSE_* macros. At the moment this is purely whether or not commas are permitted, but we have more things to add:

  • HTS_PARSE_SINGLE. If set, chr1:555 means chr1:555-555 and not chr1:555-end. Similarly it enables chr1:555- to indicate that (and chr1:-555 for chr1:1-555). This syntax is used by Bcftools, which currently has its own parsing functions (regidx_parse_reg).

  • HTS_PARSE_FLOAT. If set, permit regions like 1:2.0e6-2.5e6 to indicate 2 to 2.5Mb region. This is also the syntax permitted by Bcftools.

  • HTS_PARSE_UNITS. A better (IMO) alternative to the above allowing k, m, g etc notation. Eg 1:2.0M-2.5M or 1:2000K-2500K.

Some of these could just be implicit. I don't think there is an issue parsing with strtod instead of strtol regardless, and spotting K, M, G etc is also upwardly compatible, so maybe they just need to be there always. Similarly chr1:-555 ought to work in Samtools as it's also compatible with chr1:1-555. The key distinction though is the bcftools vs samtools parsing for single value ranges with the different interpretation for region end.

Edit: one more for the list

  • HTS_PARSE_MULTI. If set, the region is a comma separated list of regions (in which case HTS_PARSE_THOUSANDS_SEP is ignored) and the return value is a pointer to the start of the next region or NULL on failure. It could be used in e.g.
while (*reg) {
    reg = hts_parse_region(reg, &tid, &beg, &end, bcf_name2id, hdr, HTS_PARSE_MULTI);
    if (!reg) break;

    /// ... do something
}

We'd need to supplement this with bam_parse_region, bcf_parse_region etc functions that supply appropriate parameters to ease usage for that header type and the expected set of parameters.

@jmarshall
Copy link
Member

👍 that's the sort of thing I had in mind back in #260. And something like HTS_PARSE_SINGLE so that bcftools could be brought into the fold was first on the list 😄

I'd suggest adding a new all-singing all-dancing hts_parse_region() with a sensible calling convention unrelated to that of hts_parse_reg(), and deprecating hts_parse_reg() and leaving it as it is. (Or removing it if the ABI is being broken anyway, but that'd be a bit mean.)

And as you say, adding sam_parse_region() (note sam_) that would take just hdr instead of bam_name2id and hdr, etc.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 6, 2018

Mutter mutter. Yet another nuance.

So, bcftools regidx_parse_reg has different numbering. Samtools mpileup foo.bam -r chr1:7-8 returns beg=6, end=8. Bcftools same command returns beg=6, end=7. Argh, one more parameter! (I've culled some of the others though as they can just be default on and be done with it. Yes you really can specify 1500000, 1500K, 1.5M, 1.5e3K or even 1500e-3M if you're warped!) Edit: actually with hindsight maybe the hts_parse_region will always return the same coords, but I make the calling regidx_parse_reg wrapper adjust instead. Less faff.

Also the bcftools version has chr_beg and chr_end so {HLA:1:2:3} can return chr_beg pointing to 'H' and chr_end pointing to '}'. This is useful as the single return value from hts_parse_reg doesn't cut it when we may, or may not, start with '{' or even white space (this is removed in the bcftools version). I'm thinking it has the saner interface here.

Finally bcftools mpileup foo.bam takes regions parsed according to bcftools conventions, not according to SAM conventions. Ie it is the tool that defines the parsing rules and not the file format. Hence our sam_parse_region and vcf_parse_region only hide the header checking function for us and we still need to specify a bunch of parsing rule codes. Maybe I'll make some HTS_PARSE_SAMTOOLS and HTS_PARSE_BCFTOOLS macros in the respective packages for simplicity.

What a quagmire. :-( It's asymptotically getting there though.

@jmarshall
Copy link
Member

jmarshall commented Jun 6, 2018

Edit: actually with hindsight maybe the hts_parse_region will always return the same coords, but I make the calling regidx_parse_reg wrapper adjust instead. Less faff.

Yes. What you're saying is that regidx_parse_reg returns begin/end represented as something other than a zero-based half-open interval. That's crazy — your super new function should only support one sensible representation!

Also the bcftools version has chr_beg and chr_end so {HLA:1:2:3} can return chr_beg pointing to 'H' and chr_end pointing to '}'. This is useful as the single return value from hts_parse_reg doesn't cut it when we may, or may not, start with '{' or even white space (this is removed in the bcftools version). I'm thinking it has the saner interface here.

All this returning pointers to the (start/)end of the chromosome name within the input string is a hack based not knowing the table of expected contig names and needing to return the chr substring in some C language-compatible way (and not wanting to malloc or copy it).

If your super new function is given the table of expected contig names (or, in general, a callback function that returns a tid-style code for each expected contig name), it can return a tid (via an output parameter) and there's no need to return the substring at all. So it can have a sensible int return value that just signals parsing failures.

This would mean it wouldn't support not being given a table of expected contig names, but in a way that would be a good thing. In all the current htslib/bcftools/samtools sites where hts_parse_reg-like functions are called, are there any where the calling code doesn't already know the expected contig names?

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 6, 2018

This would mean it wouldn't support not being given a table of expected contig names, but in a way that would be a good thing. In all the current htslib/bcftools/samtools sites where hts_parse_reg-like functions are called, are there any where the calling code doesn't already know the expected contig names?

Yes, because at present I wrap the old API in terms of calling the new API, and the old API isn't given this information. Naturally we can (and should) update all samtools commands to be using the new API (most are already due to iterators), but we cannot assume all third party library users will do this. Either we accept this is optional, or we duplicate the code and have both old and new functions in full. (For API sanity the latter may well be preferable.)

I was also considering having a returned structure. Eg:

typedef struct {
    const char *chr_beg, *chr_end;
    int tid;
    int start, end;
} hts_parsed_region;

const char *hts_parse_region_(const char *str, hts_parsed_region *reg,
                              hts_name2id_f getid, void *hdr, int flags);

I'd assumed that actually where chr_beg and chr_end (and the equivalent char* return value from htslib) were used was both for the header checking later on, but also for messages.

That said, the latter is easily resolved by the reverse bam_id2name equivalent function (it's not called that, but both sam and vcf headers have equivalents) and you're correct the external (samtools) usage of hts_parse_reg has header available. Even to the extent of almost writing the same function again - see bam_parse_region in samtools/bam_aux.c for example!

It's a little complex in bcftools though as the equivalent over there uses regidx_push with strings and then does the index lookup itself in a hash table. This is sort of equivalent to bam_name2id and co, but with the subtlety of being used elsewhere for UTR and Exon names plus the ability to dynamically grow the dictionary and assign new tids on the fly for unseen data. This still fits in with a getid function pointer usage though so it's all workable, just a bit more work. Certainly less complexity in the API is worth a bit more grunt work elsewhere.

As far as return values go, I think returning the end of the parsed string is sensible as it permits lists reg,reg,reg... to be parsed in a simple loop. (NULL for failure. *ret == '\0' for list end). It's more flexible than an integer error code and avoids the need for an additional argument

@jmarshall
Copy link
Member

Yes, because at present I wrap the old API in terms of calling the new API, and the old API isn't given this information

That wasn't my question. My question was: where the old API is called, do the callers have that information at their fingertips (even though the old API doesn't ask for it)? If so, the callers would be in a position to give it to a new API.

You can hack such that the old API can be written in terms of the new API, or you can design something good… 😄

I'd still suggest adding a new all-singing all-dancing hts_parse_region() with a sensible calling convention unrelated to that of hts_parse_reg(), and deprecating hts_parse_reg() and leaving it as it is.

This is sort of equivalent to bam_name2id and co, but with the subtlety of being used elsewhere for UTR and Exon names plus the ability to dynamically grow the dictionary and assign new tids on the fly for unseen data.

This is an interesting point, and also interesting that a callback function can be given something it hasn't seen before and say "yes that's a valid contig name" and return a newly-assigned tid if it wants to.

That wouldn't work too well with asking the callback whether "foo" or "foo:1-200" is the contig part of "foo:1-200" though. Ideally the callback would have a "confidence" argument to signal the context of the putative contig name…

@jkbonfield
Copy link
Contributor Author

I think if it's called with a callback function that just accepts new strings then that's fine, but it can no longer do the same level of sanity checking. We lose nothing though as it already has that issue right now too.

See bcftools csq.c and the regidx_push implementation.

Anyway I'm heeding your advice and making the new API more draconian. It'll have the various parameters as mandatory now, which simplifies the implementation somewhat too. Yes all the code I've seen calling the old API that I've found so far does have some way in getting the header.

I've also taken the liberty of making all parsers cope with chrX:100- to mean 100 to end (and -100 to mean 1-100). This is a change in behaviour as right now with samtools -100 means 0 to end (-100 is beginning and it's clipped to zero), but frankly I view that as a bug!

This can of worms is really quite wriggly. :-)

@jmarshall
Copy link
Member

jmarshall commented Jun 7, 2018

I've also taken the liberty of making all parsers cope with chrX:100- to mean 100 to end (and -100 to mean 1-100).

Totally agree with making sensible things happen when you omit either end of [START]-[END]. That's what I did in my old library (sigh…) — see interval::assign(), which also had [START]+[LENGTH] syntax as a shorthand for start .. start+length which might be of general interest.

I am coming around to your hts_parsed_region struct idea 😄. You could have s_end returned as a member of that struct along with the other pointers to parts of the input str and return to having the function's return type be int. This has the advantage over NULL/not-NULL of being able to signal different error codes.

@jmarshall
Copy link
Member

I am coming around to your hts_parsed_region struct idea …

…except what about keeping int *tid, int *beg, int *end as separate arguments, and having hts_parsed_region just contain all the pointers into the broken-up input string — so callers can pass NULL instead of &my_region_details if they're not interested in those details?

@jkbonfield
Copy link
Contributor Author

For now at least it's the simpler API. I'm testing whether I can get it working universally with samtools and bcftools now, before amending the htslib side from WIP to a PR for actual consideration.

There is one other benefit to using a structure though - the data types returned are under control of the API and not just &local_int. The latter, as we can see elsewhere, breaks if we change it to int64_t or hts_pos_t. Not that it matters mind as once we've done it I doubt we'll be changing them again!

Anyway, I'll see after my explorations with bcftools whether an extra argument actually buys us anything. If it does, then a single structure is the way to go rather the multiple chr_beg, chr_end etc args.

@jkbonfield
Copy link
Contributor Author

Regarding multiple return values, I see the usage at the moment of hts_parse_reg in samtools is by and large just for error reporting. Eg see samtools view:

                    if (hts_parse_reg(argv[i], &beg, &end))
                        fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference na
me. Continue anyway.\n", argv[i]);
                    else
                        fprintf(stderr, "[main_samview] region \"%s\" could not be parsed. Continue any
way.\n", argv[i]);

This is perhaps over the top. A separate return value could indeed allow us to distinguish these cases, but a joint "Malformed region or unknown reference" error is probably sufficient, especially given the old (and new) code already writes to stderr if it spots dodgy text in hts_parse_decimal.

If we really care about being specific, then it's possible we could use something like tid as the error anyway. Ie if NULL is returned, see tid for the response code. (-1 for unknown, as it was before, something else for parse error). Personally I think a single works or doesn't is sufficient. I don't want to spend too long of my life catering to documenting all the details of how it can go wrong! ;-)

@daviesrob
Copy link
Member

One advantage of an integer return value is that you can add __attribute__ ((__warn_unused_result__)) (or in htslib HTS_RESULT_USED) to ensure that the caller actually checks it...

@jkbonfield
Copy link
Contributor Author

It also turns out that the samtools bedidx code could benefit from having the chr start and end pointers available (or start + length). It's not strictly necessary as we could hash on tid instead, but it's substantially more code change (arguably an improvement though).

Anyway, comments on a postcard as usual...

@jmarshall
Copy link
Member

Re a joint "Malformed region or unknown reference" error is probably sufficient:

https://www.biostars.org/p/320129/ is a good example of the value of individual specific error messages. Quick, at a glance, what's wrong with the region in the question?

It turns out the two error messages in the code you quoted are there because previously there was a single (not general enough, hence misleading) error message; see samtools/samtools#353.

Just some data points… 😄

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 11, 2018

I saw 353 and agree it's an incorrect error message, but that wasn't quite my point. Incorrect errors are far worse than general non-specific errors. Hence "Unknown reference or failure to parse region string" at least gives you a clue it may be the latter. Same with the end<start scenario on biostars (I saw it instantly, but I've been looking at the code and know what makes it fail).

I think if we want more specific errors though, we should just use hts_log_warning/error in htslib itself as it's far easier than expecting every call to this to report correctly, plus we simply don't have a proper error reporting system for htslib. Ideally we'd have hts_errno and hts_perror style functionality, but that's a totally different PR. :-)

However I'm happy to concede to having an int return for err code and another way of returning the next pointer location, eg via a struct return as suggested before. This gives some flexibility. I'm abivalent as to whether we should have tid, beg, end explicit and not part of the struct. On one hand all-in-struct is tidier and also enforces correct use of data types (eg when we migrate to 64-bit integers), but on the other hand there are 3 essentially required arguments that everyone must deal with plus optional things for the rare case. (The rare case here being bcftools; rare because I really can't see myself tackling the complexity of the parsing interactions there with this round.)

@jkbonfield
Copy link
Contributor Author

Note this may not actually be working despite travis/appveyor claiming it does. As requested the new function now fills out int64_t, but I haven't upgraded everything else to handle 64-bit pos as that's a different PR already.

As a consequence, hts_itr_querys retrieves int64_t beg and end, which may be set to INT64_MAX, which are then passed into itr_query as int. That will probably get interpreted as -1 rather than INT_MAX. Somehow it passes htslib and samtools test harness, but I don't know if this is lack of decent tests or bad analysis of what's likely to be happening.

Either way, ignore any problems this PR makes as it is, now, expected to only work when used in conjunction with #709.

jkbonfield and others added 9 commits May 30, 2019 12:30
This extends the callback function used by the hts iterator code to
the ref:start-end range parser, permitting the parser to validate
whether the whole thing matches a reference name.

This works around parsing conflicts when given range queries like
"HLA-DRB1*12:17", which is either the whole of "HLA-DRB1*12:17" (this
exists in GRCh38) or base 17 onwards of ""HLA-DRB1*12" (which does
not).

Note there are still some undecided questions here.  Do we want to
handle the special names used in the iterator at this level to?  Eg
"*" meaning unmapped data and "." meaning whole file?
Also improved return values so NULL is primary way of detecting
failure rather than tid.  This is more in line with the old
hts_parse_reg code.

See samtools/hts-specs#124 (comment)
for heuristic suggestions.
Eg with contigs named "chr1" and "chr1:100-200" we can specify
"{chr1}:100-200" and "{chr1:100-200}" to disambiguate.
The old function (hts_parse_reg) has been put back, meaning the new
API can now have the getid func pointer as a mandatory requirement.

Added tests.

Tweaked sam_parse_region to have the flags parameter.  This is still
required as bcftools and samtools use different parameters (parsing
style is tool based rather than file format based).
Previously it had its own custom parser which nearly, but not quite,
matched the old hts_parse_reg code in functionality.

Note this changes the fasta coordinate type from long to int, which is
a backwards step.  However it is expected this will subsequently
change to hts_pos_t in another PR.
Mainly copied from the useful (but not so visible) comment
before the function definition in hts.c.
Hash table can now use tid (cast to khash32_t, which is unsigned)
as key instead of the region name, avoiding some string copying.
@daviesrob
Copy link
Member

Fixed bug which broke samtools, updated hts_reglist_create() to use the new interface and put a bit more documentation in the header files.

@daviesrob daviesrob merged commit d26300e into samtools:develop May 31, 2019
@jkbonfield jkbonfield deleted the HLA_fix branch June 5, 2019 17:53
@dmckean
Copy link

dmckean commented Jun 5, 2019

It looks like the work around mentioned in 0eac47d brings back the SIZE_MAX macro, which some glibc versions don't include for c++.

@daviesrob
Copy link
Member

Ah, yes, I'd forgotten about the c++ SIZE_MAX problem. Also, looking at the last fix for this (#772) the possible overflow should already be detected. We'll look into it to see if there's a better way of solving this.

@jkbonfield
Copy link
Contributor Author

I'm not sure if if (l+2 < l) would get optimised away by some compilers or not, but basically it's what it's attempting to detect - an overflow is going to happen.

@daviesrob
Copy link
Member

Fixed by 164d507. gcc had managed to find the one value of l that kputsn() couldn't handle (SIZE_MAX - 1), although it's still a mystery how it could ever be called with that value from the line where the warning was reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants